For now, here are the recently posted charts for Thursday.
Jeff started the lecture by displaying the axioms of probability. Jeff prefers not to call the definition of conditional probability an axiom; I like to call it one. Regardless, we need to introduce the concept of conditional probability, and this formula is central to everything that we'll be doing.
Jeff mentioned that you should try to finish the proof that P(A∨B)=P(A)+P(B)-P(A,B). He also showed how a partition of the space into mutually exclusive and exhaustive alternatives A_1, A_2, ... can be used to get the marginalization formula P(B)=Σ_i P(B|A_i)P(A_i).
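Here is a quick Python sketch (my own illustration, not from the charts) that checks both formulas by enumeration on a toy sample space; the events A and B are made up.

```python
from fractions import Fraction

# Toy sample space: one roll of a fair die, all outcomes equally likely.
omega = {1, 2, 3, 4, 5, 6}
P = lambda event: Fraction(len(event), len(omega))

A = {1, 2, 3}   # "roll is small"
B = {2, 4, 6}   # "roll is even"

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A, B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Marginalization over a partition into mutually exclusive,
# exhaustive alternatives: P(B) = sum_i P(B|A_i) P(A_i)
A1, A2 = {1, 2, 3}, {4, 5, 6}
P_cond = lambda event, given: Fraction(len(event & given), len(given))
assert P(B) == sum(P_cond(B, Ai) * P(Ai) for Ai in (A1, A2))
```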
(I have now learned how to put a limited amount of mathematics into this blog. I'll also leave a link here to a useful LaTeX equation translator: paste a LaTeX equation into the box and hit "Render Expression" to see what it looks like in "real" mathematical notation. I recommend bookmarking that page.)
I apologize for the poor rendering of the tables in the projected version; this seems to be an incompatibility between the Mac version of PowerPoint and the Windows version used on the classroom computer. I will try to fix this in the future. In the meantime, the downloadable PDFs should be fine.
An example of a probability distribution comes from the astronomical HD (Henry Draper) catalog, an early-20th-century effort to classify stars according to their spectra, that is, the pattern of light and dark regions you get when the light from a star is spread out into its component colors. Stars of different temperatures have different patterns. In the early days the classes were labeled A, B, ..., O, but it was later found that some classes were duplicates, so they were merged. It was also discovered that the different patterns are due to the temperature of the star, which puts the classes in a different order from the original one, namely OBAFGKM, which many generations of mostly male astronomy students memorized using the vaguely sexist "Oh, be a fine girl, kiss me!" Later generations have used more neutral mnemonics.
I pointed out that this distribution, which is bimodal, is for the stars that appear brightest in the sky, and that the intrinsic distribution has the number of stars in each class increasing from left to right as the temperatures decrease. There are more M stars than any other class, but they are underrepresented in this distribution because M stars are intrinsically faint and thus not in the catalog. At the same time, the peak at A is due to the fact that these stars are very bright, and thus seen to very great distances, and so are overrepresented in this survey, which cut off at a faint limit.
We described the joint distribution of meteorite finds (you just walk along and spy a meteorite on the ground) versus falls (where the object is seen to fall and people go out and look for it). The other classifier was stones versus irons: stones look like ordinary rocks, whereas iron meteorites are much denser and are magnetic; they also have a distinctive appearance. We talked about the marginal distributions, obtained by summing across a row or down a column. I noted that amongst falls, stones predominate, whereas amongst finds, irons are the most common. This is because stones weather over time and eventually look just like terrestrial rocks, so most people overlook them. The irons, on the other hand, don't weather and continue to look different from terrestrial rocks. Presumably, the actual proportions of stones and irons in the solar system are closer to those amongst the falls.
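To make the bookkeeping concrete, here is a small Python sketch of how marginals and conditionals come out of a joint table; the counts are invented for illustration and are not the real catalog numbers.

```python
import numpy as np

# Hypothetical counts (NOT the actual meteorite data):
# rows = stones, irons; columns = falls, finds
counts = np.array([[900, 300],    # stones
                   [100, 700]])   # irons

joint = counts / counts.sum()     # joint distribution P(type, mode)
P_type = joint.sum(axis=1)        # marginal: sum across each row
P_mode = joint.sum(axis=0)        # marginal: sum down each column

# Conditional distributions within each column:
print(joint[:, 0] / P_mode[0])    # P(type | fall): stones predominate
print(joint[:, 1] / P_mode[1])    # P(type | find): irons predominate
```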
Jeff introduced the question of tossing a HH, a HT and a TT coin (using the notation on the charts). The question was: if you pick one of the three at random, toss it without looking at it, and see that it comes up heads, what is the probability that the coin is HH? The (surprising to some) answer is 2/3, not 1/2 as some surmised. Jeff then showed how this comes about, first by drawing a probability tree, putting the P(HH)=P(HT)=P(TT)=1/3 branches at the root of the tree and the conditional probabilities P(S_H|HH)=1, P(S_H|HT)=1/2, etc., on the subbranches, and multiplying to get the joint probabilities at the leaves of the tree. Adding together the joint probabilities that belong to S_H, we see that P(S_H)=1/2, and since P(S_H,HH)=1/3, dividing gives P(HH|S_H)=2/3.
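Here is the same tree calculation as a few lines of Python (my own sketch, using exact fractions):

```python
from fractions import Fraction

# P(S_H | coin) for each coin; each coin has prior probability 1/3.
p_heads = {"HH": Fraction(1), "HT": Fraction(1, 2), "TT": Fraction(0)}
prior = Fraction(1, 3)

# Joint probabilities at the leaves: P(S_H, coin) = P(S_H | coin) P(coin)
joint = {coin: p * prior for coin, p in p_heads.items()}

P_SH = sum(joint.values())           # P(S_H) = 1/2
print(joint["HH"] / P_SH)            # P(HH | S_H) = 2/3
```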
Another chart showed what happens when you flip the tree, branching first on the observed outcome and then on which coin was tossed.
Finally, the same thing was done using "natural frequencies." The idea here is to imagine a large number of identical cases, 3000 on the chart. Of these, 1000 will be a flip of the HH coin, and all of these will be seen as heads. 1000 will be a flip of the HT coin, but only 500 of these will be seen as heads. And of course, the 1000 flips of the TT coin will result in no heads being seen. But now there are twice as many cases where the HH coin was flipped and we see a head than when the HT coin is flipped. Again, we get 2/3.
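The natural-frequency bookkeeping is easy to write out explicitly; a minimal sketch:

```python
N = 3000                        # imagine 3000 equally likely cases
heads_from_HH = 1000            # all 1000 flips of the HH coin show heads
heads_from_HT = 500             # only half of the 1000 HT flips show heads
heads_from_TT = 0               # no TT flip shows heads

heads_total = heads_from_HH + heads_from_HT + heads_from_TT
print(heads_from_HH / heads_total)   # 1000/1500 = 2/3, as before
```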
The natural frequency method is especially good when trying to explain probability concepts to people who are mathematically naive. Keep it in your toolkit; it is a very useful thing for a professional statistician to know. It's described in great detail in this book by Gerd Gigerenzer.
Jeff then described the cognitive dissonance experiment, which has been done by psychologists in various forms for many years. What is found is that if you give a monkey, for example, a choice between two items and it picks one, and you then give the same monkey a choice between the rejected item and a third item, the monkey will tend to pick the third item over the one it initially rejected. Psychologists had been explaining this by saying that the monkey rationalizes its initial rejection of the item ("I really don't like that one."), but in 2007 it was discovered that a purely statistical explanation exists. The fact is that there are only three ways to rank three items such that the rejected item ranks below the one that was initially selected, and in two of those three ways the third item ranks above the rejected one. So a lot of experiments on "cognitive dissonance" will have to be reexamined.
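The counting argument is easy to check by enumerating the six possible rankings; here is a short sketch (the labels are mine: A was chosen over B in the first trial, and C is the new item):

```python
from itertools import permutations

# Rankings consistent with the first trial: A must rank above B.
consistent = [r for r in permutations("ABC") if r.index("A") < r.index("B")]
print(consistent)    # 3 of the 6 rankings: ABC, ACB, CAB

# In how many of these does the new item C rank above the rejected B?
prefers_C = [r for r in consistent if r.index("C") < r.index("B")]
print(len(prefers_C), "of", len(consistent))   # 2 of 3
```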
The example is related to the notorious Monty Hall Problem.
We discussed the Hubble Space Telescope. After the first launch, it was found that the solar panels had the unfortunate property of "creaking" when they went from the sun side of the Earth into the Earth's shadow. This caused "Hubble quakes," where the telescope would shake, producing unacceptably blurred images. We computed P(good data)=P(good data|no quake)P(no quake)+P(good data|quake)(1-P(no quake)).
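Plugging in numbers makes the formula concrete; the probabilities below are invented for illustration and are not the values used in class.

```python
# Hypothetical inputs (not the actual Hubble figures):
P_no_quake = 0.7               # fraction of the time with no "Hubble quake"
P_good_given_no_quake = 0.95   # good data is likely when there is no quake
P_good_given_quake = 0.10      # a quake usually blurs the image

P_good = (P_good_given_no_quake * P_no_quake
          + P_good_given_quake * (1 - P_no_quake))
print(P_good)                  # 0.695 with these made-up inputs
```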
Jeff then introduced marginalization as an important tool in inference, and in particular in Bayesian inference where it is used all the time. By summing or integrating over parameters we are not interested in, we can obtain a distribution over the remaining parameters. We did this for the distribution of luminosity and temperature of stars, and for the genetics of coat color in rats. We skipped over the bridge example.
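To show the mechanics of marginalizing on a grid, here is a sketch with a made-up joint density over luminosity L and temperature T (not the actual star data from class): integrating out T leaves a proper distribution over L alone.

```python
import numpy as np

L = np.linspace(0.1, 10, 200)        # luminosity grid (arbitrary units)
T = np.linspace(3000, 30000, 300)    # temperature grid (kelvin)
LL, TT = np.meshgrid(L, T, indexing="ij")

# An invented joint density p(L, T), normalized on the grid:
joint = np.exp(-0.5 * (LL - 3) ** 2 - 0.5 * (TT - 8000) ** 2 / 1e7)
joint /= np.trapz(np.trapz(joint, T, axis=1), L)

# Marginal over temperature: p(L) = integral of p(L, T) dT
p_L = np.trapz(joint, T, axis=1)
print(np.trapz(p_L, L))              # ~1.0: a proper density in L alone
```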
Jeff defined independence: A is independent of B iff P(A|B)=P(A). Equivalently, if A is independent of B then B is independent of A; and if A and B are independent then P(A,B)=P(A)P(B). Any of these may be taken as the definition of independence, and the other relationships can then be derived. We asked you to do this, i.e., show that the first definition implies the second and third, and that the third implies the first. (Homework, not to be turned in.)
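As a numerical illustration (not the homework proof), here is a quick check that these characterizations agree on a joint table that is independent by construction:

```python
import numpy as np

# Build a joint table as an outer product, so A and B are independent.
P_A = np.array([0.2, 0.8])
P_B = np.array([0.5, 0.3, 0.2])
joint = np.outer(P_A, P_B)          # P(A,B) = P(A)P(B) by construction

# First definition: P(A|B) = P(A) for every value of B.
P_A_given_B = joint / joint.sum(axis=0)
assert np.allclose(P_A_given_B, P_A[:, None])

# Third definition: the joint factorizes into its two marginals.
assert np.allclose(joint, np.outer(joint.sum(axis=1), joint.sum(axis=0)))
```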
We discussed the fact that higher 5-year survival rates for lung cancers detected by CT scans versus those discovered by X-rays can be a statistical artifact and need not mean that mortality (the age at which a patient would die of the disease) has been reduced. For one thing, CT scans will detect a tumor at a smaller size and hence at an earlier time; the patient might still die at the same age. Also, CT scans will detect more indolent tumors, which would never have killed the patient. Turner Osler, who is familiar with the research, pointed out that it was funded by the tobacco industry, a fact that was not made clear when the paper was published. The subtext was: go ahead and smoke, just get CT scans. This is, of course, quite unethical.
Finally, Jeff showed two ways of computing the sample mean, and showed how the second one, in the limit, goes over to the usual way of computing expectations. He wanted me to add the term "relative" to one of the lines, thus "observed relative frequency of x".
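Here are the two computations side by side in Python (my own sketch); the second weights each distinct value by its observed relative frequency, which in the limit becomes the expectation E[X] = Σ_x x P(x).

```python
from collections import Counter

data = [2, 3, 3, 5, 2, 3, 5, 2]     # any sample of discrete values
n = len(data)

# Way 1: sum the observations and divide by n.
mean1 = sum(data) / n

# Way 2: weight each distinct value x by its observed relative frequency.
counts = Counter(data)
mean2 = sum(x * (k / n) for x, k in counts.items())

print(mean1, mean2)                 # both 3.125
```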
Tuesday, January 20, 2009
2 comments:
I have been reading E. T. Jaynes's book and found it very interesting how he defined probabilities (an old and free version can be found at http://omega.math.albany.edu:8008/JaynesBook.html). He started with some "desiderata" about what "plausible reasoning" should be, and the probability axioms and properties came from there. It helped me, as I then saw how the axioms of probability can come from principles one would like to have.
Yes, the late E. T. Jaynes was an important commentator on Bayesian methods (we discussed him in class on Thursday). The derivation of probability theory from "desiderata" is very useful, and it contrasts with the Dutch book approach I mentioned earlier. Essentially it says that if you want probability theory to be consistent with logic in the limits of p=0 or 1, and if you want it to be consistent with some pretty basic notions of reasoning, then you end up with something that is isomorphic to standard probability theory.