Today we had a visit from Dr. Turner Osler, who is a medical researcher at the UVM medical school.
Just to touch on several of the important topics. You will have read the Scientific American article by Efron and Morris. The point here is that a better estimate of the individual items in a collection of items (baseball batting averages, proportion of toxoplasmosis patients in a city, proportion of mortalities in a burn unit) can be obtained by combining them in a way that uses some of the information from all of the different items. So if we observe the batting averages of a number of batters after they have been at bat 40 times (early in the season), we can get a better overall estimate of the final batting averages of the players at the end of the season by using a formula like:
zi = Y + c(yi - Y)
where yi is the individual item (batting average, etc.), Y is the average of all of them, and zi is our best estimate of the true value. The number c is given by a formula, which is in the notes that Dr. Osler handed out. The important thing is that if c=1, then zi=yi, and if c=0 then zi=Y. The smaller c, the more the estimates are "shrunk" towards Y. So the ideal estimator is a combination of Y and yi, a weighted average of the two. (This is the so-called Efron-Morris estimator).
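As a concrete sketch, here is the shrinkage formula in Python. The early-season averages are made up, and the value c = 0.2 is purely illustrative; the real formula for c is in the handout.

```python
# Efron-Morris-style shrinkage with made-up early-season batting averages.
# c = 0.2 is illustrative only; the actual formula for c is in the handout.
early_averages = [0.400, 0.378, 0.356, 0.333, 0.311, 0.289, 0.267, 0.244, 0.222]

Y = sum(early_averages) / len(early_averages)   # grand mean of all batters
c = 0.2                                          # shrinkage factor, between 0 and 1

# zi = Y + c*(yi - Y): each estimate is pulled toward the grand mean
shrunk = [Y + c * (y - Y) for y in early_averages]

print([round(z, 3) for z in shrunk])
```

Notice that with c = 1 the estimates are unchanged, and with c = 0 every batter simply gets the grand mean Y.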
I remarked at the end of the class that the notion of "best" in this context is actually gotten by assuming a particular loss function that says that the farther away the true values are from the estimated value (in an "average" sense), the bigger the loss, so we try to choose c so as to minimize the overall loss. When you do this, the value of c in the handout pops out.
Dr. Osler is using these ideas to solve the following problem: we can estimate the proportion of "bad outcomes" in a burn unit, for example, by dividing the number of patients that die by the total number of patients. But this is like the batting average, it is based on a relatively small number of patients in many cases. But we can use the ideas of the Efron-Morris article to improve the estimates for the individual hospitals. And we learn that many of the hospitals that seemed at first glance to have excessively high mortality, probably do not. There was one hospital that looked suspicious.
Dr. Osler also discussed a "beta" prior, which we have been using without giving it that name. It is anything of the form p^a(1-p)^b for constants a and b. This will be a "bell-shaped curve," whose actual shape can be varied quite widely by choosing the constants a and b to match what we want. When you multiply this prior by the likelihood p^n(1-p)^(N-n), which you get if n patients out of N die, you get a posterior that is also beta: p^(a+n)(1-p)^(b+N-n). We've been doing this with spreadsheets, which are an approximation based on evaluating this formula at particular points between 0 and 1 (we used p=0.05, 0.15, 0.25, ..., 0.95 in class). As I remarked, the more points you use in the spreadsheet, the more accurate the results. We used 10 points because that's something easily done in class. But you could use 100 points, or 1000 of them, to get more accurate answers. With a spreadsheet, it's just copying the formula down. What Dr. Osler is doing is something we've been doing essentially for the entire semester. For example, when we used a non-uniform prior in the polling (voting) example, we were doing this.
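Here is a small sketch of that spreadsheet calculation in Python; the constants a, b and the data n, N are made up for illustration.

```python
# Grid ("spreadsheet") approximation to a beta posterior.
# Prior is proportional to p^a (1-p)^b; likelihood is p^n (1-p)^(N-n).
# The values of a, b, n, N below are illustrative only.
a, b = 1, 1        # prior constants
n, N = 3, 20       # 3 deaths out of 20 patients (made-up data)

points = [0.05 + 0.1 * i for i in range(10)]   # p = 0.05, 0.15, ..., 0.95

joint = [p ** (a + n) * (1 - p) ** (b + N - n) for p in points]
total = sum(joint)                              # marginal likelihood
posterior = [j / total for j in joint]          # normalize so it sums to 1

print([round(q, 3) for q in posterior])
```

Using 100 or 1000 points instead of 10 just means a longer `points` list; the rest of the calculation is identical.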
Have a happy Thanksgiving!
Friday, November 21, 2008
Wednesday, November 19, 2008
Class, 11/19
Today I told you about the every-four-years Valencia conferences of Bayesians and the other series of meetings that happens two years after each Valencia conference. This conference has been going since, I think, 1978. A discussion of how the series began, written by Jose Bernardo (I mentioned him today) can be found here.
Because they have gotten quite large, in recent years the meetings have been held at a Mediterranean resort (one was held on the Canary Islands, part of Spain). One of the songs I played, "Frequentists and Bayesians," referred to that (Click here for music and lyrics). The songs and other things I mentioned come from the "Cabaret" that follows each of the roughly 5-day long meetings, when everyone is pretty exhausted and ready for some fun. The Cabarets feature songs with Bayesian flavor, set to well-known tunes (see the handout), skits, juggling, and other frivolity. Pictures of several of the Cabarets can be found here.
The song refers to MCMC, which stands for Markov chain Monte Carlo, which has pretty much become the default method of Bayesian calculations for the past 20 years. As I described it in class, it uses a "random walk" method to stagger from one state of nature to another one, in such a way that more time is spent on states of nature that have a high posterior probability. Then we can use the sample so generated to make the inferences we need, by just counting how often each state of nature is visited by the MCMC program; the proportion of time spent on a particular state of nature becomes an estimate of the posterior probability of that state of nature.
Another song I played, "These are Bayes," featured two things of interest. One is the mention of Sir Harold Jeffreys (no relation), who lived to almost 100 years of age and played a very important role over the years in making Bayesian statistics come into the mainstream. He invented a way of deciding on priors which, in the problems we have been doing, amounts to a uniform prior. (I met Sir Harold once, on the only trip he made to the U.S., when I was a graduate student. And, when I go to Bayesian meetings, the similarity between our last names is always a source of amusement.) The second thing is the running joke that Bayesians make about posteriors (Click here for music and lyrics).
There is a lively debate about how to choose priors. There's the Jeffreys prior, mentioned above, and other exotic things like Maximum Entropy priors, Reference priors, Group-theoretic priors and so on. We didn't discuss these at all since they are beyond the scope of the course, but one of the songs, "Confusing Priors," is really referring to this debate. It's on YouTube here.
Two other songs that are in the handout but which I did not play are "The MCMC Saga" and "Bayesian Believer". More YouTube clips of other songs can be found here. The lyrics of some of the earlier songs can be found on the Bayesian Songbook, which I excerpted for today's handout.
Monday, November 17, 2008
Class, 11/17
Today we discussed financial security, investments, insurance and related issues. The handout I gave out is here.
Saturday, November 15, 2008
Class, 11/14
Today, after some special announcements, we discussed the O. J. Simpson case. See Calculated Risks, Chapter 8. The basic point here is that Alan Dershowitz, the Harvard lawyer who contributed to Simpson's defense, made a mistake when he argued that the fact that Simpson had battered his wife Nicole was not material, since the probability that a batterer would go on to murder his partner is only 1/2500 per year. This reasoning is wrong because it ignores the other piece of evidence: Nicole was in fact murdered. The probability that a woman would be murdered by a random person (not the batterer) in any year is 1/20000. This means that in any group of 100000 battered women, 40 of them would be killed by their batterer in a year, whereas only 5 would be murdered by someone who is not the batterer. So, given that Nicole was both battered and murdered, the probability is 40/45 that Simpson did it (on only this evidence). So the battery is relevant and should be admitted in evidence.
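The arithmetic, in the natural-frequency form that Gigerenzer recommends, is a one-liner (the 1/2500 and 1/20000 figures are the ones quoted above):

```python
# Natural-frequency version of the Simpson battery argument.
women = 100000                                 # a group of battered women
killed_by_batterer = women * (1 / 2500)        # 40 per year
killed_by_other = women * (1 / 20000)          # 5 per year

# Given that a battered woman was murdered, the probability that the
# batterer did it:
p_batterer = killed_by_batterer / (killed_by_batterer + killed_by_other)
print(round(p_batterer, 3))   # 40/45, about 0.889
```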
We then discussed the death penalty. Although it does not exist in Vermont state courts, it does exist in Federal courts, and at least one Vermont jury recently gave the death penalty to a person in Vermont who was tried in Federal court.
We used a decision tree approach. A decision box at the root of the tree has two branches corresponding to our two actions: Convict or Acquit. We put another decision box on the Convict branch, that is, Death or Life Imprisonment. Then each of these three branches has a probability circle with two branches: Guilty (probability p) and Innocent (probability 1-p).
For the losses, we put 0 on each of the two correct outcomes, Acquit Innocent and Convict Guilty with Life. We decided for various reasons (things like moral hazard) that Convict Guilty and Death had a slight loss. It turns out not to matter: we could pick 0 and the result would be the same. We decided that the loss of giving an innocent person the death penalty was huge (we picked 1000, I think), much larger than the loss of sending an innocent person to prison for life. We didn't assign an actual number to acquitting a guilty person, as we were running out of time. But the point was, we found that no matter what p is, so long as the loss for Innocent and Death is larger than the loss for Innocent and Life in Prison, we will never choose the death penalty.
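A sketch of the expected-loss comparison for the three branches. The loss of 1000 for executing an innocent person is the illustrative class value; the other numeric losses are placeholders we never settled on in class.

```python
# Expected losses for the three branches of the class decision tree,
# as a function of p, the probability of guilt. The values of
# innocent_life, guilty_death, and acquit_guilty are placeholders.
def expected_losses(p, innocent_death=1000, innocent_life=100,
                    guilty_death=1, acquit_guilty=50):
    death  = p * guilty_death  + (1 - p) * innocent_death
    life   = p * 0             + (1 - p) * innocent_life
    acquit = p * acquit_guilty + (1 - p) * 0
    return death, life, acquit

# So long as innocent_death > innocent_life, the Death branch never has
# a smaller expected loss than Life in Prison, no matter what p is:
for p in [0.5, 0.9, 0.99, 0.999]:
    death, life, acquit = expected_losses(p)
    assert death > life
```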
Class, 11/12
We discussed the Wuppertal, Germany case discussed in Calculated Risks, pp. 156-158. Our probability tree started with a 1 in 100000 chance that the person arrested was guilty, since there are about that many people in the area and, with no data, the prior has to depend only on general information like this. So the prior probability of innocence is, for all practical purposes, equal to 1.
We then decided that there was an 8 in 10 chance that there would be blood on the shoes, given guilt, but only a 1 in 1000 chance, given innocence.
We further used the 2.7% match probability, given innocence, that was mentioned in the book, and a match probability of 1, given guilt.
All of this data still did not make the probability up to the 99% probability of guilt that we discussed in earlier classes.
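In odds form, the whole update is short; the numbers are the ones above.

```python
# Bayes in odds form for the Wuppertal case, using the class numbers.
prior_odds = 1 / 99999          # prior probability of guilt about 1/100000

lr_blood = 0.8 / 0.001          # blood on shoes: 8/10 given guilt vs 1/1000 given innocence
lr_match = 1.0 / 0.027          # DNA match: certain given guilt vs 2.7% given innocence

posterior_odds = prior_odds * lr_blood * lr_match
posterior_prob = posterior_odds / (1 + posterior_odds)
print(round(posterior_prob, 2))   # about 0.23 -- well short of 0.99
```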
I then pointed out that the suspect turns out to have had an ironclad alibi. He was 100 km away at the time of the murders.
We started discussing the O. J. Simpson case. We'll look at it in more detail next time.
Wednesday, November 12, 2008
Tracking the flu
Here's a fascinating article about how Google is able to accurately determine the number of flu cases at a given time, based on search requests for things like "flu-like symptoms" that are fed into its search engine. People are using the web in truly ingenious ways! (There was a report a few months ago that a scientist, using Google Earth images that show the direction cows face but not which end is the front, was able to show that cows tend to line up along a north-south direction, which may indicate that cows, like many animals, have a direction-sensing magnetic organ.)
Tuesday, November 11, 2008
Class, 11/10
We went over the exam.
Problem 1 is similar to the second half of the drug-testing problem on the study sheet; in the first half you would have to compute the posterior probability of the cure rate for each drug, but here I just told you what the posterior probabilities of the predict rates of Tom and Joe were. The second half is to arrange Tom and Joe's true predict rates and posterior probabilities along the sides and top of a square table, multiply pairwise to get the probability that Tom's true rate is, say, 0.3 and Joe's 0.4, and then add up all of the entries above the "staircase" that separates equal predict rates from those where Joe is better than Tom. For equal, we add up the probabilities along the diagonal. This is not an inference problem: you do not need to list SON, priors, likelihoods, joints, and posterior probabilities. Some people didn't add up the correct boxes, and I couldn't understand the principle that they used.
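Here is the "staircase" calculation sketched in Python, with made-up posterior tables for Tom and Joe (the exam's actual numbers differ):

```python
# Square-table ("staircase") comparison of Tom's and Joe's predict rates.
# The posterior tables below are invented for illustration.
rates = [0.1, 0.3, 0.5, 0.7, 0.9]
post_tom = [0.10, 0.40, 0.30, 0.15, 0.05]
post_joe = [0.05, 0.15, 0.30, 0.35, 0.15]

# Multiply pairwise, then add up the cells where Joe's rate beats Tom's:
p_joe_better = sum(pt * pj
                   for rt, pt in zip(rates, post_tom)
                   for rj, pj in zip(rates, post_joe)
                   if rj > rt)

# "Equal" is the diagonal of the table:
p_equal = sum(pt * pj for pt, pj in zip(post_tom, post_joe))

print(round(p_joe_better, 3), round(p_equal, 3))
```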
The second problem was also not an inference problem, so no priors and no likelihoods. It is instead a prediction problem. Here's how you can tell the difference. In an inference problem, there are unknown states of nature that cannot be directly observed; you are trying to compute the probability of each of these unknown states of nature. Here, you are doing something different. You are trying to predict data that will be known for sure in the future. (That is, if the shuttle is destroyed, everyone will know it; if it is not destroyed after ten flights, everyone will know that too. That is data, not a state of nature.) One way to answer the question is to say that if a disaster takes place, it will take place in the first flight or in the second flight or in the third flight ... or in the tenth flight, so the probability of one of these happening is the sum of the individual probabilities. So, the answer would be 10/80. This isn't quite right, although I gave 18 points credit for this answer. It is approximately correct, though. The correct answer is to compute the probability of (no disaster in flight 1 and no disaster in flight 2 and ... and no disaster in flight 10) to get the probability that no disaster will happen, then subtract that from 1 to get the probability of disaster. The result is 1-(79/80)^10=0.118, which is pretty close to the approximate answer of 0.125. We can know that the approximate answer is not really right (although it's a good approximation when both the individual probability and the number of flights are small), because if there had been 100 flights it would give a probability of disaster of 100/80, which is greater than 1 and impossible, since probabilities have to be between 0 and 1.
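The two calculations side by side:

```python
# Exact vs. approximate probability of at least one disaster in ten
# flights, with a 1/80 chance of disaster on each flight.
p_flight = 1 / 80
flights = 10

approx = flights * p_flight                # sum of the individual probabilities
exact = 1 - (1 - p_flight) ** flights      # 1 - (79/80)^10

print(round(approx, 3))   # 0.125
print(round(exact, 3))    # 0.118

# With 100 flights the approximation produces an impossible "probability":
print(100 * p_flight)     # 1.25
```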
I drew a fairly elaborate decision tree for the second part of this problem. I did it in terms of losses, and assigned 0 loss if the Hubble was fixed and the ISS was serviced. Since the ISS can be serviced in any case by Russian rockets, there's no loss from that regardless of which branch we choose. With the (approximate) probability of catastrophe of 1/8 if all ten flights are made, and assuming that if a catastrophe happens there is a 50% chance that it affects the Hubble mission, I got a loss of (1/8)(C + H/2) for that branch, where H is the loss if the Hubble isn't serviced, and C is the loss if a catastrophe takes place. For the "Hubble only" branch, the loss is (1/80)(C + H). That is always less than the ten-missions branch, so we cut that one off. For the "Don't fly" branch, the loss is H, since the Hubble isn't serviced. The loss is the same on the "Don't fly" branch as on the "Service Hubble only" branch if H = (1/80)(C + H), i.e., 79H = C. This is as far as we can go to help the NASA Administrator. Whether to fly or not depends upon which has the larger loss. If C is greater than 79H in the Administrator's mind, then the mission should not be flown.
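A sketch of the branch comparison; the numeric values of C and H below are illustrative, chosen to sit at the breakeven point.

```python
# Expected losses for the three branches, where C is the loss from a
# catastrophe and H the loss if the Hubble isn't serviced.
def branch_losses(C, H):
    ten_missions = (1 / 8) * (C + H / 2)
    hubble_only  = (1 / 80) * (C + H)
    dont_fly     = H
    return ten_missions, hubble_only, dont_fly

# Hubble-only always beats ten missions, and the breakeven between
# flying the Hubble mission and not flying at all is at C = 79 H:
C, H = 79.0, 1.0                     # illustrative values at the breakeven
ten, one, none = branch_losses(C, H)
print(one < ten)                     # True
print(abs(one - none) < 1e-12)       # True: equal losses at C = 79 H
```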
I don't envy Michael Griffin.
Problem 3 is about drugs, but it is not similar to the drug problem on the study sheet, in that I told you that the cure rate of one of the drugs is exactly 0.2 (based on a large amount of data). The discussion of the experimental drug is like the one on the study sheet, in that you need to set up states of nature for the cure rate (0.05, 0.15, 0.25, ..., 0.95 for example), put a prior on each (flat for example), compute the likelihood for each (r^15(1-r)^35 for each cure rate r), compute the joints, compute the marginal likelihood, and divide the joints by the marginal likelihood to get the posterior. Then to get the probability that the new drug is better than the old one, just add up the posteriors for cure rates bigger than 0.2.
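The whole calculation fits in a few lines; here is a sketch assuming 15 cures out of 50 patients and a flat prior, matching the likelihood above.

```python
# Grid posterior for the experimental drug's cure rate r, then the
# probability that it beats the established cure rate of 0.2.
rates = [0.05 + 0.1 * i for i in range(10)]             # 0.05, 0.15, ..., 0.95

prior = [1 / len(rates)] * len(rates)                   # flat prior
likelihood = [r ** 15 * (1 - r) ** 35 for r in rates]   # r^15 (1-r)^35
joint = [p * l for p, l in zip(prior, likelihood)]
marginal = sum(joint)                                   # marginal likelihood
posterior = [j / marginal for j in joint]

p_better = sum(q for r, q in zip(rates, posterior) if r > 0.2)
print(round(p_better, 2))
```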
Problem 4 is like the "diagnose" stage of a spam filter or a medical diagnosis system. The first half would be gathering information, for example on emails, and determining the probability of observing a given word given that the email is spam/not spam. The simple way would be just by frequency in each of the two categories. The second, or "diagnose" half looks for the words, and forms a likelihood by raising the probability for each word to a power equal to the number of times the word appears, and multiplying them all together. Note that the biggest mistake here was not raising to the power, or adding the terms in the likelihood instead of multiplying them.
Never, ever, add the terms in the likelihood! It's the probability of (data A and data B and data C), so the probabilities must be multiplied.
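A minimal sketch of the "diagnose" step, with made-up word probabilities:

```python
# Forming the likelihood by multiplying word probabilities raised to
# their counts. The probability tables are invented for illustration.
p_word_given_spam = {"free": 0.20, "winner": 0.10, "meeting": 0.01}
p_word_given_ham  = {"free": 0.02, "winner": 0.01, "meeting": 0.10}

def likelihood(word_counts, p_word):
    # P(data | category) = product over words of P(word | category)^count
    result = 1.0
    for word, count in word_counts.items():
        result *= p_word[word] ** count
    return result

email = {"free": 2, "winner": 1}     # "free" twice, "winner" once

l_spam = likelihood(email, p_word_given_spam)
l_ham = likelihood(email, p_word_given_ham)
print(l_spam > l_ham)   # True: this email looks like spam
```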
The final problem constructs a probability tree (not a decision tree). At the root of the tree, as usual, are the unknown (to the investigator) states of nature: HH, HT, TT. Some people wrote those correctly but didn't realize that HT is twice as likely as either HH or TT. Some others put the data at the root of the tree, which is never to be done. So, if a person answers "yes," it may be because he tossed HH, or it may be because he tossed HT and is telling the truth. Since 25% of the time HH gets tossed, a total of 15% = 40% - 25% of the responses are due to HT being tossed. Since HT happens half the time, it must be that a total of 2*15% = 30% of the subjects would have answered "yes" if they were all telling the truth.
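The unwinding step, assuming (as in the exam setup) that HH forces a "yes," TT forces a "no," and HT or TH means the subject answers truthfully:

```python
# Unwinding the randomized-response survey.
observed_yes = 0.40    # fraction of subjects answering "yes"

p_forced_yes = 0.25    # HH: always answer "yes"
p_truthful = 0.50      # HT or TH: answer truthfully

# observed_yes = p_forced_yes + p_truthful * true_rate, so:
true_rate = (observed_yes - p_forced_yes) / p_truthful
print(round(true_rate, 2))   # 0.3
```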
To summarize the difficulties that people had: 1) States of nature, which are not observable, always go at the root of a probability tree. Data never goes there. The sub-branches of a probability tree will always be conditional probabilities of data given the state of nature. 2) Distinguish between prediction of data that will be observed in the future, and inference of states of nature (unobservable). 3) A decision box belongs at the root of a decision tree, and the branches coming out of it are the actions being contemplated (e.g., don't fly, fly just Hubble, fly ten missions). Then there will be probability branches attached to each branch (sometimes only one, as in "don't fly"). Losses go at the tips of the probability branches, and are then propagated backwards down the tree. Choose the branch with the smallest loss. You can use utilities instead of losses, in which case you choose the branch with the greatest utility.
I finished my discussion by pointing out that the tables that we have been calculating for rates r=0.05, 0.15, 0.25,..., 0.95 are approximations to continuous functions. Then the summation of them is like a Riemann sum, and when you make the divisions finer and finer you are actually getting better and better approximations to an integral. Since integration is in general hard, statisticians often resort to various approximation techniques to get the answers they need.
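You can see the Riemann-sum convergence directly by refining the grid; here, for the problem-3 probability that the cure rate r exceeds 0.2:

```python
# P(r > 0.2) computed on finer and finer grids (flat prior, likelihood
# r^15 (1-r)^35 as in problem 3). The answers settle down as the grid
# is refined, approximating the exact integral.
def p_better(num_points):
    step = 1 / num_points
    rates = [step / 2 + step * i for i in range(num_points)]   # midpoints
    joint = [r ** 15 * (1 - r) ** 35 for r in rates]
    total = sum(joint)
    return sum(j for r, j in zip(rates, joint) if r > 0.2) / total

for n in [10, 100, 1000]:
    print(n, round(p_better(n), 4))
```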
We'll go back to crime on Wednesday. We need to decide on dates for the presentations, so I hope everyone will be in class on Wednesday.
There will be a guest talk by a research physician at UVM on Friday, November 21. We'll also discuss investing and Bayesian jokes/songs in the next week.
Monday, November 10, 2008
Nate Silver
The New York Times had an interesting article about Nate Silver, the proprietor of the very successful fivethirtyeight.com website that I pointed you to earlier about making predictions of the election outcome. As anyone who was following this website knows, he was quite close to the actual results. Mr. Silver started out predicting baseball outcomes, also with great success.
Saturday, November 8, 2008
Occam's Razor
Here is an article that Jim Berger and I wrote on how Occam's Razor naturally falls out of Bayesian reasoning. I promised to post this after the test.
Wednesday, November 5, 2008
Class, 11/5
We continued our discussion of the General's Dilemma problem. Here, the situation is that there is a desperate battle to be won, and the results will depend on how many soldiers the general can get to the battle. The Objective (in the PrOACT agenda) is to win the battle. If 600 soldiers get through, then there is a 90% probability that the battle will be won. If 200 make it through, then the probability is 50%, but if none make it through, then the probability is only 20%.
We assigned a utility of 100 to winning and 0 to losing. We decided that the expected utility would be maximized if the "200 for sure" choice was made. The decision tree had at the bottom the decision between routes; then, for "200 for sure," a probability branch with utility 0 if the battle is lost (50% prob) and 100 if it is won (50% prob). On the other branch we had a 2/3 chance of every soldier being lost versus a 1/3 chance that all would get through. Then (given that fact) we had to put another probability branch with the appropriate probability of winning the battle (if none get through or if all get through).
When we ran the utilities through the tree backwards to the left, we found that the general should go with the "200 for sure" route.
We also experimented with the idea that the utilities might be different for the general if he valued lives saved more than lives lost (that is, if the premises of the problem were not just "win the battle" but also "save our soldiers' lives.") That would be done by decreasing the utilities when more soldiers didn't make it through. But fiddling with the numbers didn't seem to make a difference as to the decision. This doesn't mean that some choice of utilities wouldn't change the result, just that it isn't obvious how to do this.
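Running the utilities backwards through the tree can be sketched in a few lines, using the win probabilities stated above (the variable names are mine):

```python
U_WIN, U_LOSE = 100, 0

# "200 for sure" route: 50% chance of winning the battle
eu_sure = 0.5 * U_WIN + 0.5 * U_LOSE

# risky route: 2/3 chance nobody gets through (20% win),
# 1/3 chance all 600 get through (90% win)
eu_risky = (2/3) * (0.2 * U_WIN + 0.8 * U_LOSE) \
         + (1/3) * (0.9 * U_WIN + 0.1 * U_LOSE)

best = "200 for sure" if eu_sure > eu_risky else "risky route"
```

The sure route gives expected utility 50, against about 43.3 for the gamble, which is why the tree favors "200 for sure."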
I then opened the discussion to questions.
We discussed priors, especially when the priors depend on the state of nature. In particular, is it necessary to "normalize" the prior (make it add up to 1)? The answer is that in most cases this is not necessary. The reason is that (in our spreadsheet scheme) if you multiply the prior by a constant number then the joint will also be multiplied by the same number, and the sum of the joints (the marginal likelihood) will also be multiplied by the same number, so when you divide the joints by the marginal likelihood, they will have the same constant multiplier, which will cancel out.
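Here is a tiny numerical demonstration of that cancellation, a sketch with made-up data (2 successes and 1 failure) on a three-point grid:

```python
rates = [0.05, 0.15, 0.25]
likelihood = [r**2 * (1 - r) for r in rates]   # e.g. 2 successes, 1 failure

def posterior(prior):
    joint = [p * l for p, l in zip(prior, likelihood)]
    marginal = sum(joint)                      # picks up the same constant
    return [j / marginal for j in joint]       # ...which cancels here

p1 = posterior([1, 1, 1])    # unnormalized "uniform" prior
p2 = posterior([7, 7, 7])    # same prior scaled by an arbitrary constant
# p1 and p2 agree: the constant multiplier cancels in the division
```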
The exception is when you are testing a precise hypothesis (a coin is fair, prob = 1/2; a special die is fair, prob = 1/3) against a vague alternative (the coin is not fair, so its probability has to have its own prior, etc.). In that case it's important to normalize the priors, and careful attention is required to do it right.
We briefly discussed the medical example, when we have tests on an old and new drug. The whole idea here is to compute the probability that the new drug is better than the old one. We've been using an approximation that sets the cure rate of a drug to particular values, 0.05, 0.15, 0.25, ..., 0.95. We know that if we make the division finer, we'll get a better approximation, but here we're trying to learn the principles. So, if we test the old drug (A) and get a certain set of posterior probabilities on the cure rate of the drug, and test the new drug (B) and get a different set of posterior probabilities, then we can set up a table of the joint posterior probabilities of the cure rates of each drug by simply multiplying the probs of the cure rates of each drug. We can do this because the two experiments were done on different and randomly selected individual patients, so the probs are independent. Then, it's just a matter of identifying which slots belong to (cure rate of B is greater than cure rate of A), adding them up (we add when we have (cure rate of A=0.05 AND cure rate of B=0.15) OR (cure rate of A=0.05 AND cure rate of B=0.25) OR ....) This gives us the posterior probability that drug B is better than drug A.
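A sketch of that computation (the cure counts here, 3 of 10 for drug A and 6 of 10 for drug B, are hypothetical, not the numbers from class):

```python
rates = [0.05 + 0.1 * i for i in range(10)]    # 0.05, 0.15, ..., 0.95

def posterior(cures, failures):
    # uniform prior, so the prior constant cancels when we normalize
    joint = [r**cures * (1 - r)**failures for r in rates]
    marginal = sum(joint)
    return [j / marginal for j in joint]

post_A = posterior(3, 7)   # hypothetical: old drug cured 3 of 10 patients
post_B = posterior(6, 4)   # hypothetical: new drug cured 6 of 10 patients

# independent experiments, so the joint table is the product of the two;
# add up the slots where B's cure rate exceeds A's
p_B_better = sum(post_A[i] * post_B[j]
                 for i in range(10) for j in range(10)
                 if rates[j] > rates[i])
```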
I think this is all we discussed, but if I have left something out important and you want me to address it before the test, post now and I'll respond.
Tuesday, November 4, 2008
The Election and Bayes
Jim Albert, a statistics professor at Bowling Green State University, has posted a short article on his statistics blog showing how a simple Bayesian approach can be used to predict the electoral vote outcome in today's election. You should be able to understand his idea quite clearly, as it's not significantly different from what we have been doing. The main difference is that he takes into account people who will vote for a third-party candidate, so there is a third term in the likelihood, equal to the probability of a third-party vote raised to the power of the number of people in the poll who said they would vote for a third-party candidate. The other thing we've already discussed briefly, but here's the idea: after computing the posterior probability (as a formula, not as a table as we have been doing it), he uses a computer to draw a sample of 5000 values of the three proportions (McCain, Obama, Other), each representing the proportion of people voting for that candidate. For each sampled value he calls the state for whichever candidate has the larger proportion; the fraction of samples won by each candidate then gives a win probability for each state, and he's listed them in the blog.
The second part is even simpler. He uses the computer to flip a biased coin with the appropriate probability (from his table) for each state, assigns the electoral votes from that state to the winner, and adds them up over all states. This gives him a prediction for the electoral vote for Obama and McCain. He does this 5000 times to get 5000 predictions of the electoral vote, and plots them in the graph at the bottom of the page. His prediction is that Obama will get at least 300 electoral votes, and my eye indicates that the actual outcome is likely to be between 340 and 380, give or take.
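The coin-flipping step can be sketched like this (the state names, win probabilities, and electoral-vote counts below are made up for illustration; Albert's table has the real ones):

```python
import random

# hypothetical per-state (win probability, electoral votes) pairs
states = {"A": (0.90, 20), "B": (0.55, 15), "C": (0.30, 10), "D": (0.70, 25)}

rng = random.Random(0)    # fixed seed so the sketch is reproducible
totals = []
for _ in range(5000):
    # flip a biased coin per state; the winner collects that state's votes
    ev = sum(votes for p, votes in states.values() if rng.random() < p)
    totals.append(ev)
# the spread of `totals` is the predicted electoral-vote distribution
```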
This is pretty close to the method being used at fivethirtyeight.com.
This is a method called posterior simulation. What he's doing is drawing a large sample from the posterior distribution, and using that as a proxy for the actual posterior distribution. It is a method that is widely used by professional Bayesian statisticians. We can discuss it in more detail after the test, if you like.
Jim Albert, by the way, is the author of the textbook we will be using next semester in our statistics course on Bayesian statistics.
Monday, November 3, 2008
Class, 11/3
We spent some time discussing the catch-and-release problem and similar ones. I made several points:
1) Since the likelihood is the probability of data point 1 AND data point 2 AND ...., you must always MULTIPLY the probabilities of the individual data points to get the likelihood of the entire data set. NEVER add them!
2) In this problem, as we draw samples (fish), the total number of fish in the lake and the total number of fish of each type (tagged, untagged) decrease by 1 each time a fish is caught. This means we are sampling without replacement. It also means that each time we catch a tagged fish, the next time we catch a tagged fish we have to use a numerator and a denominator that are decreased by 1. So, for example, with 100 fish in the lake, 10 of them tagged, the first tagged fish we catch has a probability of 10/100, the second a probability of 9/99, the third 8/98, etc. After catching the five tagged fish, then there are 95 fish left, 90 of them untagged. So the first untagged fish we catch has probability 90/95, the second 89/94, and so forth.
3) It doesn't matter what order the fish are caught, the likelihood will be the same. So, you might as well treat all of the first kind first and then handle those of the remaining kind.
4) After computing the rest of the table to get the posterior, you can add up the posterior probabilities for intervals in the number of fish, e.g., to get the probability that the total number of fish is between 15 and 25 (inclusive), just add the posterior probabilities for each of those numbers of fish.
5) If the number of items (fish, voters) is very large, you can approximate all of the successive ratios by the same number: just pretend that the number of fish in the lake and the number of each kind don't change as more fish are caught. The error committed will be quite small in this case. What you are doing is approximating sampling without replacement by sampling with replacement.
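Putting the points above together, here is a sketch of the full posterior table for the catch-and-release numbers used in these notes (10 tagged fish, a catch of 5 tagged and 5 untagged, states of nature N = 1 to 100):

```python
def likelihood(N, tagged=10, catch_tagged=5, catch_untagged=5):
    # probability of the observed catch, sampling without replacement
    if N < tagged + catch_untagged:     # fewer than 15 fish is impossible
        return 0.0
    p, total, t = 1.0, N, tagged
    for _ in range(catch_tagged):       # tagged: 10/N, 9/(N-1), 8/(N-2), ...
        p *= t / total; t -= 1; total -= 1
    u = N - tagged
    for _ in range(catch_untagged):     # untagged: (N-10)/(N-5), ...
        p *= u / total; u -= 1; total -= 1
    return p

prior = {N: 0.01 for N in range(1, 101)}              # uniform prior
joint = {N: prior[N] * likelihood(N) for N in prior}
marginal = sum(joint.values())
post = {N: j / marginal for N, j in joint.items()}

# point 4: interval probabilities are just sums of posterior entries
p_15_to_25 = sum(post[N] for N in range(15, 26))
```

Note that the likelihood, not the prior, is what zeroes out the impossible values N < 15.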
We talked about the astrology problem. The states of nature are the numbers p=0.05, 0.15, ..., 0.95 the way we are setting up the problem. The prior could be uniform, but if your experience is that astrology is probably bunk, you might want to skew the prior toward smaller numbers; or if your experience is that it works, you might want to skew it toward larger numbers. This is not cheating; it is using information from your past experience.
The likelihood is p^4(1-p)^7 for each of the values of p. The rest of the table is filled out as usual. Then, the answer to the question (the probability that the astrologer is able to predict the future at least 85% of the time) is the sum of the two posterior probabilities for the values p=0.85 and p=0.95.
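As a sketch of that table (uniform prior assumed):

```python
rates = [0.05 + 0.1 * i for i in range(10)]      # 0.05, 0.15, ..., 0.95
prior = [0.1] * 10                               # uniform prior
likelihood = [p**4 * (1 - p)**7 for p in rates]  # 4 hits, 7 misses

joint = [pr * l for pr, l in zip(prior, likelihood)]
marginal = sum(joint)
post = [j / marginal for j in joint]

# P(astrologer is right at least 85% of the time): entries at 0.85 and 0.95
p_at_least_85 = post[8] + post[9]
```

With 4 hits in 11 tries, this probability comes out tiny, as you would expect.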
We discussed the expert systems problem. The basic idea is that you can train a Bayesian system by, for example, telling it the symptoms observed and the diagnosis for a number of patients. This allows the system to estimate the conditional probabilities
p(symptom|diagnosis)
for a lot of symptoms and diagnoses. These can then be used as terms in the likelihood for a new patient who comes in and whose symptoms are determined and put into the system. What we've described in class is known as a naive Bayes classifier. It's also the basis of Bayesian spam filters and many other useful practical applications of Bayesian methods (including artificial intelligence systems).
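A minimal naive Bayes sketch along these lines (the symptoms, diagnoses, and training records below are made up for illustration):

```python
from collections import defaultdict

training = [  # hypothetical (symptoms, diagnosis) records
    ({"fever", "cough"}, "flu"),
    ({"fever", "rash"}, "measles"),
    ({"cough"}, "cold"),
    ({"fever", "cough", "ache"}, "flu"),
]
symptoms = {"fever", "cough", "rash", "ache"}

# count cases per diagnosis and symptom occurrences within each diagnosis
diag_count = defaultdict(int)
sym_count = defaultdict(lambda: defaultdict(int))
for syms, d in training:
    diag_count[d] += 1
    for s in syms:
        sym_count[d][s] += 1

def classify(observed):
    scores = {}
    for d, n in diag_count.items():
        score = n / len(training)                 # prior p(diagnosis)
        for s in symptoms:                        # likelihood p(symptom|diagnosis)
            p = (sym_count[d][s] + 1) / (n + 2)   # Laplace smoothing
            score *= p if s in observed else (1 - p)
        scores[d] = score
    total = sum(scores.values())                  # normalize to a posterior
    return {d: s / total for d, s in scores.items()}

classify({"fever", "cough"})
```

The "naive" part is multiplying the p(symptom|diagnosis) terms as if the symptoms were independent given the diagnosis.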
We started discussing basic decision problems by drawing the tree for the "general with two routes" problem we had discussed earlier; this time we postulated that the general might be risk averse, in which case he would choose the 200 soldiers for certain rather than the gamble between no soldiers surviving with probability 2/3 and all of them surviving with probability 1/3. Alternatively, a risk seeking general would choose the gamble. We'll pick up on this on Wednesday.
Saturday, November 1, 2008
Class, 10/31
We finished discussing the drug testing problem. From the data we have posterior probabilities for various cure rates for each drug. By multiplying them, we obtain the posterior probability that, for example, drug A has cure rate 0.25 and drug B has cure rate 0.35. We can then add up all those joint probabilities for which drug B has the greater cure rate, to get the probability that drug B is better than drug A.
We then discussed the marketing problem. We identified two stages, the first of which was to spend $20 million testing the drug to FDA standards. We recognized that the drug might not pass this test; most experimental drugs do not. From our preliminary test, there is a 75% chance that the drug will pass. We discussed "sunk costs," that is, costs that cannot be recovered. Even though getting to the point where we are (with 100 subjects tested) did cost some money, we can never get it back, so we may as well call our loss or utility exactly zero at this point.
You can calculate using either losses or utilities. It's purely a matter of convenience. Since the drug company is very wealthy, its utility or loss function will be linear or nearly so.
We illustrated the process by imagining that if the company decides to continue the development of the drug, it will pass through a toll gate worth $20M. Then there will be a 75% probability that we will go on to develop the drug to marketing stage, which will cost an additional $80M and require another toll gate. We guessed that the drug had a 20% chance of commercial success (revenues of 20 years at $1B per year) and an 80% chance of "failure" (20 years at $10M/year, if I recall). These figures may not be what I wrote on the board, but you should have the correct numbers and the tree in your notes. We decided that the company should go ahead with the plan, given the figures we used.
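A sketch of the rollback with these figures (which, as noted above, may not match the board exactly), working in millions of dollars:

```python
COST_FDA = 20             # toll gate 1: testing to FDA standards
P_PASS = 0.75
COST_MARKET = 80          # toll gate 2: development to marketing stage
P_SUCCESS = 0.2
REV_SUCCESS = 20 * 1000   # 20 years at $1B/year
REV_FAILURE = 20 * 10     # 20 years at $10M/year

# expected value at the node after passing the FDA gate
after_pass = (-COST_MARKET
              + P_SUCCESS * REV_SUCCESS
              + (1 - P_SUCCESS) * REV_FAILURE)

# expected value of continuing, with sunk costs already set to zero;
# max(..., 0) reflects the option to stop at the second gate instead
ev_continue = -COST_FDA + P_PASS * max(after_pass, 0)
decision = "continue" if ev_continue > 0 else "stop"
```

With these numbers the continuation value is strongly positive, matching the conclusion that the company should go ahead.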
We then discussed the fish problem. The states of nature are the numbers from 1 to 100. The prior we took to be uniform (0.01 on each SON). One student asked, since we know there have to be at least 15 fish in the lake, why not set the prior to zero for those values; another student pointed out that doing so requires looking at the data, and the prior is supposed to reflect what you know before you look at the data, so this would be "cheating." We had a false start on the likelihood, which was my fault, as I should have steered us to the correct solution more quickly. Several students were puzzled, and we started over. If there are N fish in the lake, and 10 of them are tagged, the probability of picking 5 tagged and 5 untagged is as follows:
For N=15, it's (10*9*8*7*6)*(5*4*3*2*1)/((15*14*13*12*11)*(10*9*8*7*6)). We get this because the probability of picking one tagged fish is 10/15, the probability of picking the second tagged fish is 9/14, and so on through the 5 tagged fish; then for the untagged fish it is 5/10 for the first one, 4/9 for the second one, and so on through the 5 untagged fish. As each fish is caught, the number of fish of that kind (tagged or untagged) decreases by 1, as does the total number of fish in the lake.
Similarly, if N=20, it's (10*9*8*7*6)*(10*9*8*7*6)/((20*19*18*17*16)*(15*14*13*12*11)), with the same kind of reasoning.
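The same bookkeeping can be done once for general N; here is a sketch using exact fractions, so the two hand computations above can be checked directly:

```python
from fractions import Fraction

def catch_probability(N, tagged=10, k_tagged=5, k_untagged=5):
    # multiply the ratios, decrementing the counts as each fish is caught
    p, total, t = Fraction(1), N, tagged
    for _ in range(k_tagged):      # tagged: 10/N, 9/(N-1), ...
        p *= Fraction(t, total); t -= 1; total -= 1
    u = N - tagged
    for _ in range(k_untagged):    # untagged: (N-10)/(N-5), ...
        p *= u / 1 if False else Fraction(u, total); u -= 1; total -= 1
    return p
```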
We'll discuss this more on Monday and then go on to the remaining problems on the study sheet.