Friday, November 21, 2008

Class, 11/21

Today we had a visit from Dr. Turner Osler, who is a medical researcher at the UVM medical school.

Just to touch on several of the important topics: you will have read the Scientific American article by Efron and Morris. The point there is that we can get a better estimate of each individual item in a collection of items (baseball batting averages, proportions of toxoplasmosis patients in different cities, mortality proportions in burn units) by combining them in a way that uses some of the information from all of the items. So if we observe the batting averages of a number of batters after they have each been at bat 40 times (early in the season), we can get a better estimate of each player's final batting average at the end of the season by using a formula like:


z_i = Y + c(y_i - Y)

where y_i is the individual item (batting average, etc.), Y is the average of all of them, and z_i is our best estimate of the true value. The number c is given by a formula, which is in the notes that Dr. Osler handed out. The important thing is that if c = 1, then z_i = y_i, and if c = 0 then z_i = Y. The smaller c is, the more the estimates are "shrunk" towards Y. So the ideal estimator is a combination of Y and y_i, a weighted average of the two. (This is the so-called Efron-Morris estimator.)

I remarked at the end of the class that the notion of "best" in this context is actually gotten by assuming a particular loss function that says that the farther away the true values are from the estimated value (in an "average" sense), the bigger the loss, so we try to choose c so as to minimize the overall loss. When you do this, the value of c in the handout pops out.
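The shrinkage formula is easy to sketch in code. The function below just applies z_i = Y + c(y_i - Y); the value of c passed in is illustrative, not the optimal value from Dr. Osler's handout.

```python
def shrink(ys, c):
    """Shrink each observed value y_i toward the grand mean Y."""
    Y = sum(ys) / len(ys)                 # grand mean of all items
    return [Y + c * (y - Y) for y in ys]  # z_i = Y + c*(y_i - Y)

# Early-season batting averages for five hypothetical players:
averages = [0.400, 0.350, 0.300, 0.250, 0.200]

print(shrink(averages, 1.0))  # c = 1: estimates equal the raw data
print(shrink(averages, 0.0))  # c = 0: every estimate is the grand mean Y
print(shrink(averages, 0.5))  # in between: partial shrinkage toward Y
```

With c = 0.5 the top player's estimate moves from 0.400 halfway toward the grand mean of 0.300, to 0.350, and so on for the others.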

Dr. Osler is using these ideas to solve the following problem: we can estimate the proportion of "bad outcomes" in a burn unit, for example, by dividing the number of patients that die by the total number of patients. But this is like the batting average: in many cases it is based on a relatively small number of patients. We can use the ideas of the Efron-Morris article to improve the estimates for the individual hospitals. And we learn that many of the hospitals that seemed at first glance to have excessively high mortality probably do not. There was one hospital that looked suspicious.

Dr. Osler also discussed a "beta" prior, which we have been using without giving it that name. It is anything of the form p^a(1-p)^b for constants a and b. This will be a "bell-shaped curve," whose actual shape can be varied quite widely by choosing the constants a and b to match what we want. When you multiply this prior by the likelihood p^n(1-p)^(N-n), which you get if n patients out of N die, you get a posterior that is also beta: p^(a+n)(1-p)^(b+N-n). We've been doing this with spreadsheets, which give an approximation based on evaluating this formula at particular points between 0 and 1 (we used p = 0.05, 0.15, 0.25, ..., 0.95 in class). As I remarked, the more points you use in the spreadsheet, the more accurate the results. We used 10 points because that's easily done in class, but you could use 100 points, or 1000, to get more accurate answers; with a spreadsheet, it's just a matter of copying the formula down. What Dr. Osler is doing is essentially what we've been doing for the entire semester. For example, when we used a non-uniform prior in the polling (voting) example, we were doing this.
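The spreadsheet calculation can be sketched in a few lines of code, using the same 10-point grid we used in class. The constants a, b and the data (n deaths out of N patients) here are made-up numbers for illustration.

```python
a, b = 1, 1   # beta prior p^a * (1-p)^b, with illustrative constants
n, N = 3, 20  # made-up data: 3 deaths observed out of 20 patients

grid  = [0.05 + 0.10 * i for i in range(10)]     # p = 0.05, 0.15, ..., 0.95
prior = [p**a * (1 - p)**b for p in grid]        # beta prior at each point
like  = [p**n * (1 - p)**(N - n) for p in grid]  # likelihood at each point
joint = [pr * lk for pr, lk in zip(prior, like)]
total = sum(joint)                               # marginal likelihood
post  = [j / total for j in joint]               # posterior, sums to 1

for p, q in zip(grid, post):
    print(f"p = {p:.2f}: posterior = {q:.4f}")
```

Note that the posterior computed this way is proportional to p^(a+n)(1-p)^(b+N-n) evaluated on the grid, exactly the beta form above; "copying the formula down" in a spreadsheet does the same arithmetic.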

Have a happy Thanksgiving!

Wednesday, November 19, 2008

Class, 11/19

Today I told you about the every-four-years Valencia conferences of Bayesians and the other series of meetings that happens two years after each Valencia conference. This conference has been going since, I think, 1978. A discussion of how the series began, written by Jose Bernardo (I mentioned him today) can be found here.

Because they have gotten quite large, in recent years the meetings have been held at a Mediterranean resort (one was held on the Canary Islands, part of Spain). One of the songs I played, "Frequentists and Bayesians," referred to that (Click here for music and lyrics). The songs and other things I mentioned come from the "Cabaret" that follows each of the roughly 5-day long meetings, when everyone is pretty exhausted and ready for some fun. The Cabarets feature songs with Bayesian flavor, set to well-known tunes (see the handout), skits, juggling, and other frivolity. Pictures of several of the Cabarets can be found here.

The song refers to MCMC, which stands for Markov chain Monte Carlo, which has pretty much become the default method of Bayesian calculations for the past 20 years. As I described it in class, it uses a "random walk" method to stagger from one state of nature to another one, in such a way that more time is spent on states of nature that have a high posterior probability. Then we can use the sample so generated to make the inferences we need, by just counting how often each state of nature is visited by the MCMC program; the proportion of time spent on a particular state of nature becomes an estimate of the posterior probability of that state of nature.
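The random-walk idea can be sketched with a toy example. Here there are just five discrete states of nature with made-up (unnormalized) posterior weights; the chain proposes a step to a neighboring state and accepts it with probability min(1, new weight / old weight), which is the standard Metropolis rule. The fraction of time spent in each state then estimates its posterior probability.

```python
import random

random.seed(1)
weights = [1.0, 2.0, 4.0, 2.0, 1.0]  # made-up unnormalized posterior weights

state = 0
visits = [0] * len(weights)
for _ in range(200_000):
    proposal = state + random.choice([-1, 1])  # random-walk proposal
    if 0 <= proposal < len(weights):
        # Metropolis acceptance: always move uphill, sometimes downhill
        if random.random() < min(1.0, weights[proposal] / weights[state]):
            state = proposal
    visits[state] += 1  # count time spent in each state of nature

estimates = [v / sum(visits) for v in visits]
print(estimates)  # should be close to [0.1, 0.2, 0.4, 0.2, 0.1]
```

The true posterior probabilities are the weights divided by their sum (10), and the visit proportions converge to them as the chain runs longer. Real MCMC programs do the same thing over enormously larger spaces of states.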

Another song I played, "These are Bayes," featured two things of interest. One is the mention of Sir Harold Jeffreys (no relation), who lived to almost 100 years of age and played a very important role over the years in bringing Bayesian statistics into the mainstream. He invented a way of deciding on priors which, in the problems we have been doing, amounts to something very close to a uniform prior. (I met Sir Harold once, on the only trip he made to the U.S., when I was a graduate student. And, when I go to Bayesian meetings, the similarity between our last names is always a source of amusement.) The second thing is the running joke that Bayesians make about posteriors (Click here for music and lyrics).

There is a lively debate about how to choose priors. There's the Jeffreys prior, mentioned above, and other exotic things like Maximum Entropy priors, Reference priors, Group-theoretic priors and so on. We didn't discuss these at all since they are beyond the scope of the course, but one of the songs, "Confusing Priors," is really referring to this debate. It's on YouTube here.

Two other songs that are in the handout but which I did not play are "The MCMC Saga" and "Bayesian Believer". More YouTube clips of other songs can be found here. The lyrics of some of the earlier songs can be found on the Bayesian Songbook, which I excerpted for today's handout.

Monday, November 17, 2008

Class, 11/17

Today we discussed financial security, investments, insurance and related issues. The handout I gave out is here.

Saturday, November 15, 2008

Class, 11/14

Today, after some special announcements, we discussed the O. J. Simpson case. See Calculated Risks, Chapter 8. The basic thing here is that Alan Dershowitz, the Harvard lawyer who contributed to Simpson's defense, made a mistake when he argued that the fact that Simpson had battered his wife Nicole was not material, since the probability that a batterer would go on to murder his partner is only 1/2500 per year. This reasoning is wrong, because the probability that a woman would be murdered by a random person (not the batterer) in any year is 1/20000. This means that of any group of 100000 battered women, 40 of them would be killed by their batterer in any year, whereas only 5 would be murdered by a random person who is not the batterer. So, when we look at the fact that Nicole was murdered and battered, the probability is 40/45 that Simpson did it (on only this evidence). So the battery is relevant and should be admitted in evidence.
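The arithmetic behind the 40/45 figure is short enough to write out as a sketch:

```python
# Of a group of 100,000 battered women, in any given year:
group = 100_000
killed_by_batterer = group * (1 / 2500)   # Dershowitz's 1/2500 rate -> 40
killed_by_other    = group * (1 / 20000)  # background murder rate   -> 5

# Given that a battered woman was murdered, the probability that
# her batterer did it:
p = killed_by_batterer / (killed_by_batterer + killed_by_other)
print(p)  # 40/45, about 0.889
```

The point is that conditioning on the murder having happened changes everything: the relevant comparison is 40 against 5, not 1/2500 against anything.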

We then discussed the death penalty, which although it does not exist in Vermont state courts, does exist in Federal courts and at least one Vermont jury recently gave the death penalty to a person in Vermont who was tried in Federal court.

We used a decision tree approach. A decision box at the root of the tree has two branches corresponding to our two actions: Convict or Acquit. We put another decision box on the Convict branch, that is, Death or Life Imprisonment. Then each of these three branches has a probability circle with two branches: Guilty (probability p) and Innocent (probability 1-p).

For the losses, we put 0 on each of the two correct outcomes, Acquit Innocent and Convict Guilty with Life. We decided for various reasons that Convict Guilty and Death had a slight loss (reasons being things like moral hazard). It turns out not to matter, we could pick 0 and the result would be the same. We decided that the loss of giving an innocent person the death penalty was huge (we picked 1000, I think), much larger than the loss of sending an innocent person to prison for life. We didn't assign an actual number to acquitting a guilty person as we were running out of time. But the point was, we found that no matter what p is, so long as the loss for Innocent and Death is larger than the loss for Innocent and Life in Prison, we will never choose the death penalty.
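The expected-loss comparison can be sketched with illustrative numbers. The 1000 for executing an innocent person is the number we picked in class; the 100 for imprisoning an innocent person for life is a made-up placeholder, and the small loss on Convict-Guilty-Death is set to 0 since, as noted, it doesn't change the conclusion.

```python
def expected_loss(p_guilty, loss_if_guilty, loss_if_innocent):
    """Propagate losses back through a probability circle."""
    return p_guilty * loss_if_guilty + (1 - p_guilty) * loss_if_innocent

L_death_innocent = 1000  # huge loss: executing an innocent person
L_life_innocent  = 100   # made-up smaller loss: life in prison, innocent

# No matter what p is (as long as p < 1), Death has the larger
# expected loss, because its loss on the Innocent branch is larger:
for p in [0.5, 0.9, 0.99, 0.999]:
    death = expected_loss(p, 0, L_death_innocent)
    life  = expected_loss(p, 0, L_life_innocent)
    print(p, death, life, death > life)
```

This makes the class's conclusion concrete: whenever the Innocent-and-Death loss exceeds the Innocent-and-Life loss, the Death branch loses the comparison for every probability of guilt short of certainty.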

Class, 11/12

We discussed the Wuppertal, Germany case described in Calculated Risks, pp. 156-158. Our discussion of the probability tree started with a 1 in 100,000 chance that the person arrested was guilty, since there are about that many people in the area, and with no data the prior has to depend only on general information like this. So the probability of innocence is, for all practical purposes, equal to 1.

We then decided that there was an 8 in 10 chance that there would be blood on the shoes, given guilt, but only a 1 in 1000 chance, given innocence.

We further used the 2.7% match probability, given innocence, that was mentioned in the book, and a match probability of 1, given guilt.

All of this data still did not bring the probability of guilt up to the 99% level that we discussed in earlier classes.
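The updating can be sketched compactly using odds and likelihood ratios, with the numbers above:

```python
# Prior: 1 guilty person among about 100,000 in the area.
prior_odds = 1 / 99_999  # odds of guilt, essentially 1/100,000

# Likelihood ratios, P(evidence | guilty) / P(evidence | innocent):
lr_blood = 0.8 / 0.001  # blood on shoes: 8 in 10 vs. 1 in 1000
lr_match = 1.0 / 0.027  # match: certain if guilty vs. 2.7% if innocent

posterior_odds = prior_odds * lr_blood * lr_match
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)  # about 0.23 -- well short of 99%
```

Even two strong pieces of evidence, each with a large likelihood ratio, cannot overcome a prior this small; that is exactly why the probability stays far below 99%.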

I then pointed out that the suspect turned out to have had an ironclad alibi: he was 100 km away at the time of the murders.

We started discussing the O. J. Simpson case. We'll look at it in more detail next time.

Wednesday, November 12, 2008

Tracking the flu

Here's a fascinating article about how Google is able to accurately determine the number of flu cases at a given time based on search requests for things like "flu-like symptoms" that are fed into its search engine. People are using the web in truly ingenious ways! (There was a report a few months ago that a scientist was able to show, using Google Earth images that reveal the orientation of cows (but not which end is the front), that cows tend to line up along a north-south direction, which may indicate that cows, like many animals, have a direction-sensing magnetic organ.)

Tuesday, November 11, 2008

Class, 11/10

We went over the exam.

Problem 1 is similar to the second half of the drug-testing problem on the study sheet. In the first half of that problem you would have had to compute the posterior probability of the cure rate for each drug, but here I just told you the posterior probabilities of Tom's and Joe's predict rates. The second half is to arrange Tom's and Joe's true predict rates and posterior probabilities along the sides and top of a square table, multiply pairwise to get the probability that Tom's true rate is, say, 0.3 and Joe's 0.4, and then add up all of the probabilities above the "staircase" that separates equal predict rates from those where Joe is better than Tom. For the probability that they are equal, we add up the probabilities along the diagonal. This is not an inference problem; you do not need to list SON, priors, likelihoods, joints, and posterior probabilities. Some people didn't add up the correct boxes, and I couldn't understand the principle that they used.
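The square-table calculation can be sketched as follows, with made-up posterior distributions for Tom's and Joe's true predict rates (not the numbers from the exam):

```python
rates = [0.3, 0.4, 0.5]  # possible true predict rates
tom = [0.5, 0.3, 0.2]    # made-up posterior probabilities for Tom's rate
joe = [0.2, 0.3, 0.5]    # made-up posterior probabilities for Joe's rate

p_joe_better = p_equal = 0.0
for i, pt in enumerate(tom):
    for j, pj in enumerate(joe):
        cell = pt * pj              # multiply pairwise to fill the table
        if rates[j] > rates[i]:     # cells above the "staircase"
            p_joe_better += cell
        elif rates[j] == rates[i]:  # cells along the diagonal
            p_equal += cell

print(p_joe_better, p_equal)
```

Every cell in the table lands in exactly one of the three regions (Joe better, equal, Tom better), so the three sums add to 1.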

The second problem was also not an inference problem, so no priors and no likelihoods. It is instead a prediction problem. Here's how you can tell the difference. In an inference problem, there are unknown states of nature that cannot be directly observed; you are trying to compute the probability of each of these unknown states of nature. Here, you are doing something different. You are trying to predict data that will be known for sure in the future (that is, if the shuttle is destroyed, everyone will know it; if it is not destroyed after ten flights, everyone will know that too. That is data, not a state of nature.) One way to answer the question is to say that if a disaster takes place, it will take place in the first flight or in the second flight or in the third flight...or in the tenth flight, so the probability of one of these happening is the sum of the individual probabilities. So, the answer would be 10/80. This isn't quite right, although I gave 18 points credit for this answer. It is approximately correct, though. The correct answer is to compute the probability of (no disaster in flight 1 and no disaster in flight 2 and...and no disaster in flight 10) to get the probability that no disaster will happen, and then subtract that from 1 to get the probability of disaster. The result is 1 - (79/80)^10 = 0.118, which is pretty close to the approximate answer of 0.125. We can know that the approximate answer is not really right (although it's a good approximation when both the individual probability and the number of flights are small), because if there had been 100 flights, it would give a probability of disaster of 100/80, which is greater than 1 and impossible, since probabilities have to be between 0 and 1.
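The exact and approximate calculations are one line each:

```python
n_flights = 10
p_disaster = 1 / 80  # chance of disaster on any single flight

approx = n_flights * p_disaster            # sum of individual probabilities
exact = 1 - (1 - p_disaster) ** n_flights  # 1 - P(no disaster in all 10)
print(approx, exact)  # 0.125 vs. about 0.118

# The approximation breaks down when there are many flights:
print(100 * p_disaster)  # 1.25 -- impossible as a probability
```

For small probabilities and few flights the two answers are close, which is why the sum is a reasonable quick estimate; but only the product form is a genuine probability for any number of flights.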

I drew a fairly elaborate decision tree for the second part of this problem. I did it in terms of losses, and assigned 0 loss if the Hubble was fixed and the ISS was serviced. Since the ISS can be serviced in any case by Russian rockets, there's no loss from that regardless of which branch we choose. With the (approximate) probability of catastrophe of 1/8 if all ten flights are made, and assuming that if a catastrophe happens there is a 50% chance that it affects the Hubble mission, I got a loss of (1/8)(C + H/2) for that branch, where H is the loss if the Hubble isn't serviced, and C is the loss if a catastrophe takes place. For the "Hubble only" branch, the loss is (1/80)(C + H). That is always less than the ten-missions branch, so we cut that one off. For the "Don't fly" branch, the loss is H, since the Hubble isn't serviced. The loss is the same on the "Don't fly" branch as the "Service Hubble only" branch if H = (1/80)(C + H), or 79H = C. This is as far as we can go to help the NASA Administrator. Whether to fly or not depends upon which has the larger loss. If C is greater than 79H in the Administrator's mind, then the mission should not be flown.

I don't envy Michael Griffin.

Problem 3 is about drugs, but it is not similar to the drug problem on the study sheet, in that I told you that the cure rate of one of the drugs is exactly 0.2 (based on a large amount of data). The discussion of the experimental drug is like the one on the study sheet, in that you need to set up states of nature for the cure rate (0.05, 0.15, 0.25, ..., 0.95 for example), put a prior on each (flat, for example), compute the likelihood for each (r^15(1-r)^35 for each cure rate r), compute the joints, compute the marginal likelihood, and divide the joints by the marginal likelihood to get the posterior. Then, to get the probability that the new drug is better than the old one, just add up the posteriors for cure rates bigger than 0.2.

Problem 4 is like the "diagnose" stage of a spam filter or a medical diagnosis system. The first half would be gathering information, for example on emails, and determining the probability of observing a given word given that the email is spam or not spam. The simple way would be just by frequency in each of the two categories. The second, or "diagnose," half looks for the words and forms a likelihood by raising the probability for each word to a power equal to the number of times the word appears, and multiplying them all together. The biggest mistake here was not raising to the power, or adding the terms in the likelihood instead of multiplying them.

Never, ever, add the terms in the likelihood! It's the probability of (data A and data B and data C), so the probabilities must be multiplied.
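The multiply-never-add rule can be sketched as follows, with made-up word probabilities (the words and numbers here are purely illustrative):

```python
def likelihood(word_probs, counts):
    """P(data | category): the PRODUCT of p_word ** count over the words.
    Adding these terms instead would be wrong -- the data is
    (word A appeared AND word B appeared AND ...)."""
    result = 1.0
    for word, count in counts.items():
        result *= word_probs[word] ** count
    return result

# Hypothetical per-word probabilities in each category:
spam_probs = {"viagra": 0.05, "meeting": 0.001}
ham_probs  = {"viagra": 0.0001, "meeting": 0.01}

counts = {"viagra": 2, "meeting": 1}  # "viagra" appears twice in the email

print(likelihood(spam_probs, counts))  # 0.05**2 * 0.001
print(likelihood(ham_probs, counts))   # 0.0001**2 * 0.01
```

Multiplying these likelihoods by the prior probabilities of spam and not-spam, and normalizing, gives the posterior probability that the email is spam.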

The final problem constructs a probability tree (not a decision tree). At the base of the tree, as usual, are the unknown (to the investigator) states of nature: HH, HT, TT. Some people wrote those correctly but didn't realize that HT is twice as likely as either HH or TT. Some others put the data at the base of the tree, which is never to be done. So, if a person answers "yes," it may be because he tossed HH, or it may be because he tossed HT and is telling the truth. Since 25% of the time HH gets tossed, a total of 40% - 25% = 15% of the responses are truthful "yes" answers from people who tossed HT. Since HT happens half the time, it must be that 2 × 15% = 30% of the subjects would have answered "yes" if they were all telling the truth.
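The arithmetic of that last step can be sketched as follows, assuming (as in the problem) that HH forces a "yes," TT forces a "no," and HT (or TH) means the subject answers truthfully:

```python
observed_yes = 0.40      # 40% of subjects answered "yes"
forced_yes = 0.25        # HH tossers (probability 1/4) always say "yes"
truthful_fraction = 0.50 # HT/TH tossers (probability 1/2) tell the truth

# Truthful "yes" answers make up 40% - 25% = 15% of all subjects,
# and they come only from the half who answered truthfully:
true_rate = (observed_yes - forced_yes) / truthful_fraction
print(true_rate)  # 0.30
```

This is the whole point of the randomized-response design: no individual's answer reveals anything, yet the population rate is recoverable.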

To summarize the difficulties that people had: 1) States of nature, which are not observable, always go at the root of a probability tree. Data never goes there. The sub-branches of a probability tree will always be conditional probabilities of data given the state of nature. 2) Distinguish between prediction of data that will be observed in the future, and inference about states of nature (unobservable). 3) A decision box belongs at the root of a decision tree, and the branches coming out of it are the actions being contemplated (e.g., don't fly, fly just Hubble, fly ten missions). Then there will be probability branches attached to each branch (sometimes only one, as in "don't fly"). Losses go at the tips of the probability branches, and are then propagated back toward the root of the tree. Choose the branch with the smallest loss. You can use utilities instead of losses, in which case you choose the branch with the greatest utility.

I finished my discussion by pointing out that the tables that we have been calculating for rates r=0.05, 0.15, 0.25,..., 0.95 are approximations to continuous functions. Then the summation of them is like a Riemann sum, and when you make the divisions finer and finer you are actually getting better and better approximations to an integral. Since integration is in general hard, statisticians often resort to various approximation techniques to get the answers they need.
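The Riemann-sum picture can be made concrete. With a flat prior and 15 cures out of 50 (the likelihood r^15(1-r)^35 from Problem 3), the exact continuous posterior happens to have a known mean, 16/52; the grid calculation converges to it as the grid gets finer.

```python
def grid_posterior_mean(num_points, n=15, N=50):
    """Posterior mean of the cure rate r, computed on a grid of
    midpoints between 0 and 1 (a Riemann-sum approximation)."""
    grid = [(i + 0.5) / num_points for i in range(num_points)]
    joint = [r**n * (1 - r)**(N - n) for r in grid]  # flat prior x likelihood
    total = sum(joint)                               # marginal likelihood
    return sum(r * j for r, j in zip(grid, joint)) / total

print(grid_posterior_mean(10))    # the 10-point grid we used in class
print(grid_posterior_mean(1000))  # much closer to the exact 16/52 = 0.3077...
```

With 10 points the answer is already close; with 1000 points it agrees with the exact integral to several decimal places, which is the Riemann-sum convergence described above.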

We'll go back to crime on Wednesday. We need to decide on dates for the presentations, so I hope everyone will be in class on Wednesday.

There will be a guest talk by a research physician at UVM on Friday, November 21. We'll also discuss investing and Bayesian jokes/songs in the next week.