In class today, one of the things we discussed was Nassim Taleb's ideas, and I also mentioned Mark Thoma's blog. As it turns out, Mark Thoma put an article by Taleb on his blog just today. As you will see when you read it, it generated a lot of comments.
I also mentioned my friend and mentor (a person from whom I have learned much and who helped me greatly in my career), Jim Berger.
Mentors are very important for anyone's career. I have had several; Jim Berger is the most recent, but there were also (in temporal order) Heinz Eichhorn (image on screen behind me), Harlan Smith, and Jürgen Moser. These were people who taught me much and with whom I had very close professional and personal relationships. They helped me build my career. So, you should find mentors who can do for your career what mine have done for me.
Monday, December 8, 2008
Friday, November 21, 2008
Class, 11/21
Today we had a visit from Dr. Turner Osler, who is a medical researcher at the UVM medical school.
Just to touch on several of the important topics. You will have read the Scientific American article by Efron and Morris. The point is that we can get better estimates of the individual items in a collection of items (baseball batting averages, proportions of toxoplasmosis patients in different cities, mortality rates in burn units) by combining them in a way that uses some of the information from all of the items. So if we observe the batting averages of a number of batters after they have each been at bat 40 times (early in the season), we can get better estimates of their final batting averages at the end of the season by using a formula like:
z_i = Y + c(y_i - Y)
where y_i is the individual item (batting average, etc.), Y is the average of all of them, and z_i is our best estimate of the true value. The number c is given by a formula, which is in the notes that Dr. Osler handed out. The important thing is that if c = 1, then z_i = y_i, and if c = 0, then z_i = Y. The smaller c is, the more the estimates are "shrunk" toward Y. So the estimator is a combination of y_i and Y, a weighted average of the two. (This is the so-called Efron-Morris estimator.)
I remarked at the end of the class that the notion of "best" in this context is actually gotten by assuming a particular loss function that says that the farther away the true values are from the estimated value (in an "average" sense), the bigger the loss, so we try to choose c so as to minimize the overall loss. When you do this, the value of c in the handout pops out.
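The shrinkage formula can be sketched in a few lines of code. This is only an illustration: the batting averages below are made up, and the values of c are arbitrary (the actual c comes from the formula in Dr. Osler's handout).

```python
# A minimal sketch of the shrinkage formula z_i = Y + c*(y_i - Y).
# The data and the values of c are made up for illustration.

def shrink(averages, c):
    """Shrink each observed average toward the grand mean Y by factor c."""
    grand_mean = sum(averages) / len(averages)
    return [grand_mean + c * (y - grand_mean) for y in averages]

# Hypothetical early-season batting averages after 40 at-bats:
early = [0.400, 0.350, 0.300, 0.250, 0.200]

print(shrink(early, 1.0))  # c = 1: estimates are just the raw averages
print(shrink(early, 0.0))  # c = 0: every estimate equals the grand mean Y
print(shrink(early, 0.2))  # 0 < c < 1: partial shrinkage toward Y
```

Notice that any c between 0 and 1 pulls the extreme averages toward the middle, which is exactly the "shrinking" the article describes.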
Dr. Osler is using these ideas to solve the following problem: we can estimate the proportion of "bad outcomes" in a burn unit, for example, by dividing the number of patients who die by the total number of patients. But this is like the batting average: in many cases it is based on a relatively small number of patients. We can use the ideas of the Efron-Morris article to improve the estimates for the individual hospitals, and we learn that many of the hospitals that seemed at first glance to have excessively high mortality probably do not. There was one hospital that looked suspicious.
Dr. Osler also discussed a "beta" prior, which we have been using without giving it that name. It is anything of the form p^a(1-p)^b for constants a and b. This will be a "bell-shaped curve" whose actual shape can be varied quite widely by choosing the constants a and b to match what we want. When you multiply this prior by the likelihood p^n(1-p)^(N-n), which you get if n patients out of N die, you get a posterior that is also beta: p^(a+n)(1-p)^(b+N-n). We've been doing this with spreadsheets, which give an approximation based on evaluating this formula at particular points between 0 and 1 (we used p = 0.05, 0.15, 0.25, ..., 0.95 in class). As I remarked, the more points you use in the spreadsheet, the more accurate the results. We used 10 points because that's easily done in class, but you could use 100 points, or 1000, to get more accurate answers; with a spreadsheet, it's just a matter of copying the formula down. What Dr. Osler is doing is something we've been doing essentially all semester. For example, when we used a non-uniform prior in the polling (voting) example, we were doing this.
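The spreadsheet calculation can also be written out directly. Here is a sketch using the same 10-point grid we used in class; the constants a, b and the data n, N are made-up numbers for illustration.

```python
# Evaluate the beta posterior p^(a+n) * (1-p)^(b+N-n) on a grid of
# points between 0 and 1, then normalize so the probabilities sum to 1.
# The constants and data below are hypothetical.

a, b = 2, 2        # prior constants (a gentle bell shape)
n, N = 3, 10       # suppose 3 patients out of 10 die

grid = [0.05 + 0.10 * i for i in range(10)]   # p = 0.05, 0.15, ..., 0.95

unnormalized = [p**(a + n) * (1 - p)**(b + N - n) for p in grid]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

for p, prob in zip(grid, posterior):
    print(f"p = {p:.2f}: posterior probability {prob:.4f}")
```

Using 100 or 1000 grid points instead of 10 is just a matter of changing the `range` and the spacing, which is the analogue of copying the spreadsheet formula further down.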
Have a happy Thanksgiving!
Wednesday, November 19, 2008
Class, 11/19
Today I told you about the Valencia conferences of Bayesians, held every four years, and the other series of meetings that takes place two years after each Valencia conference. The series has been going since, I think, 1978. A discussion of how it began, written by Jose Bernardo (I mentioned him today), can be found here.
Because they have gotten quite large, in recent years the meetings have been held at a Mediterranean resort (one was held on the Canary Islands, part of Spain). One of the songs I played, "Frequentists and Bayesians," referred to that (Click here for music and lyrics). The songs and other things I mentioned come from the "Cabaret" that follows each of the roughly 5-day long meetings, when everyone is pretty exhausted and ready for some fun. The Cabarets feature songs with Bayesian flavor, set to well-known tunes (see the handout), skits, juggling, and other frivolity. Pictures of several of the Cabarets can be found here.
The song refers to MCMC (Markov chain Monte Carlo), which has pretty much been the default method of Bayesian calculation for the past 20 years. As I described it in class, it uses a "random walk" to stagger from one state of nature to another, in such a way that more time is spent on states of nature that have high posterior probability. Then we can use the sample so generated to make the inferences we need, by just counting how often each state of nature is visited by the MCMC program; the proportion of time spent in a particular state of nature becomes an estimate of the posterior probability of that state of nature.
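The random-walk idea can be shown in a toy sketch. Here I assume a handful of states of nature whose posterior probabilities we pretend to know, just so we can check that the walk really does visit each state in proportion to its probability; in real use the walk only needs the unnormalized posterior.

```python
# A toy version of the random-walk (Metropolis) idea behind MCMC.
# The "target" probabilities are assumed known here purely so we can
# verify the walk; real Bayesian software is far more elaborate.
import random

random.seed(0)
target = [0.1, 0.2, 0.4, 0.2, 0.1]   # posterior over 5 states of nature
n_states = len(target)

state = 0
visits = [0] * n_states
for _ in range(100_000):
    # Propose a random-walk step to a neighboring state:
    proposal = (state + random.choice([-1, 1])) % n_states
    # Accept with probability min(1, target[proposal] / target[state]):
    if random.random() < target[proposal] / target[state]:
        state = proposal
    visits[state] += 1

estimates = [v / sum(visits) for v in visits]
print(estimates)   # close to the target probabilities
```

Counting visit frequencies, as in the last two lines, is exactly the "proportion of time spent in each state" estimate described above.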
Another song I played, "These are Bayes," featured two things of interest. One is the mention of Sir Harold Jeffreys (no relation), who lived to almost 100 years of age and played a very important role over the years in making Bayesian statistics come into the mainstream. He invented a way of deciding on priors which, in the problems we have been doing, amounts to a uniform prior. (I met Sir Harold once, on the only trip he made to the U.S., when I was a graduate student. And, when I go to Bayesian meetings, the similarity between our last names is always a source of amusement.) The second thing is the running joke that Bayesians make about posteriors (Click here for music and lyrics).
There is a lively debate about how to choose priors. There's the Jeffreys prior, mentioned above, and other exotic things like Maximum Entropy priors, Reference priors, Group-theoretic priors and so on. We didn't discuss these at all since they are beyond the scope of the course, but one of the songs, "Confusing Priors," is really referring to this debate. It's on YouTube here.
Two other songs that are in the handout but which I did not play are "The MCMC Saga" and "Bayesian Believer". More YouTube clips of other songs can be found here. The lyrics of some of the earlier songs can be found on the Bayesian Songbook, which I excerpted for today's handout.
Monday, November 17, 2008
Class, 11/17
Today we discussed financial security, investments, insurance and related issues. The handout I gave out is here.
Saturday, November 15, 2008
Class, 11/14
Today, after some special announcements, we discussed the O. J. Simpson case. See Calculated Risks, Chapter 8. The basic point is that Alan Dershowitz, the Harvard lawyer who contributed to Simpson's defense, made a mistake when he argued that the fact that Simpson had battered his wife Nicole was not material, since the probability that a batterer goes on to murder his partner is only 1/2500 per year. This reasoning is wrong, because the probability that a woman is murdered in any year by someone other than her batterer is about 1/20000. This means that in any group of 100000 battered women, 40 would be killed by their batterer in a given year, whereas only 5 would be murdered by someone who is not the batterer. So, given that Nicole was both battered and murdered, the probability is 40/45 (about 89%) that Simpson did it, on this evidence alone. So the battery is relevant and should be admitted in evidence.
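The arithmetic behind the 40/45 figure can be checked directly, using the round numbers from Calculated Risks:

```python
# Checking the arithmetic in the Dershowitz argument.
women = 100_000                        # a group of battered women
killed_by_batterer = women / 2_500     # 1 in 2,500 per year -> 40
killed_by_other = women / 20_000       # 1 in 20,000 per year -> 5

# Among battered women who were murdered, the fraction killed
# by their batterer:
p_batterer = killed_by_batterer / (killed_by_batterer + killed_by_other)
print(p_batterer)   # 40/45, about 0.89
```

The key move is conditioning on what actually happened (Nicole was battered and murdered), rather than quoting the unconditional 1/2500 rate as Dershowitz did.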
We then discussed the death penalty, which, although it does not exist in Vermont state courts, does exist in Federal court; at least one Vermont jury recently imposed the death penalty on a person tried in Federal court in Vermont.
We used a decision tree approach. A decision box at the root of the tree has two branches corresponding to our two actions: Convict or Acquit. We put another decision box on the Convict branch, that is, Death or Life Imprisonment. Then each of these three branches has a probability circle with two branches: Guilty (probability p) and Innocent (probability 1-p).
For the losses, we put 0 on each of the two correct outcomes: Acquit an Innocent person, and Convict a Guilty person with Life imprisonment. We decided, for various reasons (such as moral hazard), that Convict Guilty with Death had a slight loss. It turns out not to matter; we could pick 0 and the result would be the same. We decided that the loss of giving an innocent person the death penalty was huge (we picked 1000, I think), much larger than the loss of sending an innocent person to prison for life. We didn't assign an actual number to acquitting a guilty person, as we were running out of time. But the point was this: we found that no matter what p is, as long as the loss for Innocent-and-Death is larger than the loss for Innocent-and-Life, we will never choose the death penalty.
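The expected-loss comparison from the decision tree can be sketched as follows. The 1000 is the number we picked in class; the loss of 100 for imprisoning an innocent person for life and the slight loss of 1 for executing a guilty one are made-up stand-ins, since what matters is only their ordering.

```python
# Expected loss of the two Convict branches of the decision tree,
# as a function of the probability of guilt p. The loss numbers other
# than 1000 are hypothetical; only their ordering matters.

def expected_losses(p, loss_innocent_death=1000, loss_innocent_life=100,
                    loss_guilty_death=1, loss_guilty_life=0):
    """Return (E[loss | Death], E[loss | Life]) given probability of guilt p."""
    death = p * loss_guilty_death + (1 - p) * loss_innocent_death
    life = p * loss_guilty_life + (1 - p) * loss_innocent_life
    return death, life

# As long as the Innocent-and-Death loss exceeds the Innocent-and-Life
# loss, Death is never the lower-loss choice, however large p is:
for p in [0.5, 0.9, 0.99, 0.999]:
    death, life = expected_losses(p)
    print(f"p = {p}: E[loss | Death] = {death:.2f}, E[loss | Life] = {life:.2f}")
```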
Class, 11/12
We discussed the Wuppertal, Germany case from Calculated Risks, pp. 156-158. Our probability tree started with a 1 in 100000 chance that the person arrested was guilty, since there are about that many people in the area, and with no data the prior has to depend only on general information like this. So the prior probability of innocence is, for all practical purposes, equal to 1.
We then decided that there was an 8 in 10 chance that there would be blood on the shoes, given guilt, but only a 1 in 1000 chance, given innocence.
We further used the 2.7% match probability, given innocence, that was mentioned in the book, and a match probability of 1, given guilt.
All of this data still did not make the probability up to the 99% probability of guilt that we discussed in earlier classes.
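The updating we did on the probability tree can be sketched compactly in terms of odds: each piece of evidence multiplies the odds of guilt by its likelihood ratio, P(evidence | guilty) / P(evidence | innocent).

```python
# Odds-form Bayesian updating for the Wuppertal case, using the
# numbers from class.

prior_guilt = 1 / 100_000            # one person among ~100,000 in the area
odds = prior_guilt / (1 - prior_guilt)

odds *= 0.8 / 0.001     # blood on the shoes: 8/10 given guilt, 1/1000 given innocence
odds *= 1.0 / 0.027     # the match: certain given guilt, 2.7% given innocence

posterior_guilt = odds / (1 + odds)
print(posterior_guilt)   # roughly 0.23 -- nowhere near 99%
```

Even two strong pieces of evidence cannot overcome such a tiny prior, which is why the posterior probability of guilt stays well below the 99% standard we discussed.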
I then pointed out that the suspect turned out to have an ironclad alibi: he was 100 km away at the time of the murders.
We started discussing the O. J. Simpson case. We'll look at it in more detail next time.
Wednesday, November 12, 2008
Tracking the flu
Here's a fascinating article about how Google is able to accurately estimate the number of flu cases at a given time from search requests, such as "flu-like symptoms," that are fed into its search engine. People are using the web in truly ingenious ways! (There was a report a few months ago that a scientist was able to show, using Google Earth images, which reveal the direction a cow is facing but not which end is the front, that cows tend to line up along a north-south direction, which may indicate that cows, like many animals, have a direction-sensing magnetic organ.)