Tuesday, November 11, 2008

Class, 11/10

We went over the exam.

Problem 1 is similar to the second half of the drug-testing problem on the study sheet; in the first half of that problem you would have to compute the posterior probability of the cure rate for each drug, but here I simply told you the posterior probabilities of Tom's and Joe's prediction rates. The second half is to arrange Tom's and Joe's possible true prediction rates and their posterior probabilities along the side and top of a square table, multiply pairwise to get the probability that Tom's true rate is, say, 0.3 and Joe's is 0.4, and then add up all of the probabilities above the "staircase" that separates equal prediction rates from those where Joe is better than Tom. For the probability that they are equal, add up the probabilities along the diagonal. This is not an inference problem, so you do not need to list SON, priors, likelihoods, joints, and posterior probabilities. Some people added up the wrong boxes, and I could not work out what principle they had used.

The second problem was also not an inference problem, so no priors and no likelihoods. It is instead a prediction problem. Here's how you can tell the difference. In an inference problem, there are unknown states of nature that cannot be directly observed; you are trying to compute the probability of each of these unknown states of nature. Here, you are doing something different. You are trying to predict data that will be known for sure in the future (that is, if the shuttle is destroyed, everyone will know it; if it is not destroyed after ten flights, everyone will know that too. That is data, not a state of nature.) One way to answer the question is to say that if a disaster takes place, it will take place in the first flight or in the second flight or in the third flight...or in the tenth flight, so the probability of one of these happening is the sum of the individual probabilities. So, the answer would be 10/80. This isn't quite right, although I gave 18 points of credit for this answer. It is approximately correct, though. The correct answer is to compute the probability of (no disaster in flight 1 and no disaster in flight 2 and...and no disaster in flight 10) to get the probability that no disaster will happen, then subtract that from 1 to get the probability of disaster. The result is 1 − (79/80)^10 = 0.118, which is pretty close to the approximate answer of 0.125. We can tell that the approximate answer is not really right (although it's a good approximation when both the individual probability and the number of flights are small), because if there had been 100 flights, it would give a probability of disaster of 100/80, which is greater than 1 and impossible, since probabilities have to be between 0 and 1.
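If you want to check the two calculations yourself, here is a minimal sketch using the numbers from the problem:

```python
# Probability of at least one shuttle disaster in a series of flights,
# assuming each flight independently has a 1-in-80 chance of disaster.

p_disaster = 1 / 80   # per-flight probability of disaster
n_flights = 10

# Exact: 1 minus the probability that every flight is disaster-free.
exact = 1 - (1 - p_disaster) ** n_flights

# Approximation: just add the individual probabilities.
approx = n_flights * p_disaster

print(f"exact  = {exact:.3f}")   # about 0.118
print(f"approx = {approx:.3f}")  # 0.125; breaks down for large n (100 flights would give 1.25)
```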

I drew a fairly elaborate decision tree for the second part of this problem. I did it in terms of losses, and assigned 0 loss if the Hubble was fixed and the ISS was serviced. Since the ISS can be serviced in any case by Russian rockets, there's no loss from that regardless of which branch we choose. With the (approximate) probability of catastrophe of 1/8 if all ten flights are made, and assuming that if a catastrophe happens there is a 50% chance it affects the Hubble mission, I got a loss of (1/8)(C + H/2) for that branch, where H is the loss if the Hubble isn't serviced, and C is the loss if a catastrophe takes place. For the "Hubble only" branch, the loss is (1/80)(C + H). That is always less than the loss on the ten-missions branch, so we cut that one off. For the "Don't fly" branch, the loss is H, since the Hubble isn't serviced. The loss is the same on the "Don't fly" branch as on the "Service Hubble only" branch if H = (1/80)(C + H), that is, if C = 79H. This is as far as we can go to help the NASA Administrator. Whether to fly or not depends upon which branch has the larger loss. If C is greater than 79H in the Administrator's mind, then the mission should not be flown.
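Here is a small sketch of that loss comparison. The numerical values of C and H below are placeholders, since only the decision maker can supply them:

```python
# Compare expected losses for the three branches of the shuttle decision tree.
# C = loss if a catastrophe occurs, H = loss if the Hubble is not serviced.
# The values below are placeholders; the real ones come from the Administrator.
C = 1000.0
H = 10.0

loss_ten_missions = (1 / 8) * (C + H / 2)   # ~1/8 chance of catastrophe; half the time it hits the Hubble flight
loss_hubble_only  = (1 / 80) * (C + H)      # one flight, 1/80 chance of catastrophe (Hubble then lost too)
loss_dont_fly     = H                       # Hubble is never serviced

losses = {"fly all ten": loss_ten_missions,
          "fly Hubble only": loss_hubble_only,
          "don't fly": loss_dont_fly}
best = min(losses, key=losses.get)
print(losses, "-> choose:", best)
# With these placeholder numbers C > 79*H (1000 > 790), so "don't fly" comes out best,
# illustrating the C = 79H break-even point derived above.
```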

I don't envy Michael Griffin.

Problem 3 is about drugs, but it differs from the drug problem on the study sheet in that I told you that the cure rate of one of the drugs is exactly 0.2 (based on a large amount of data). The treatment of the experimental drug is like the one on the study sheet: you need to set up states of nature for the cure rate (0.05, 0.15, 0.25, ..., 0.95, for example), put a prior on each (flat, for example), compute the likelihood for each (r^15 (1−r)^35 for each cure rate r), compute the joints, compute the marginal likelihood, and divide the joints by the marginal likelihood to get the posterior. Then, to get the probability that the new drug is better than the old one, just add up the posteriors for cure rates bigger than 0.2.
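If you want to check your table, here is a sketch of that calculation:

```python
# Posterior for the experimental drug's cure rate r on a grid of states of nature,
# with a flat prior and data of 15 cures in 50 patients (likelihood r^15 * (1-r)^35).
rates = [0.05 + 0.1 * i for i in range(10)]          # 0.05, 0.15, ..., 0.95
prior = [1 / len(rates)] * len(rates)                # flat prior

likelihood = [r**15 * (1 - r)**35 for r in rates]
joint = [p * l for p, l in zip(prior, likelihood)]
marginal = sum(joint)                                # marginal likelihood
posterior = [j / marginal for j in joint]

# Probability that the new drug beats the old cure rate of 0.2:
p_better = sum(post for r, post in zip(rates, posterior) if r > 0.2)
print(round(p_better, 3))
```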

Problem 4 is like the "diagnose" stage of a spam filter or a medical diagnosis system. The first half would be gathering information, for example on emails, and determining the probability of seeing a given word given that the email is spam or not spam. The simplest way is just to use the word's frequency in each of the two categories. The second, or "diagnose", half looks for the words and forms a likelihood by raising the probability for each word to a power equal to the number of times the word appears, and multiplying all of these factors together. The biggest mistakes here were failing to raise to the power, and adding the terms in the likelihood instead of multiplying them.

Never, ever, add the terms in the likelihood! It's the probability of (data A and data B and data C), so the probabilities must be multiplied.
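A small sketch of the "diagnose" step for spam, with made-up word probabilities and word counts:

```python
# Likelihood of an email's word counts under the "spam" hypothesis:
# each word's learned probability is raised to the power of its count, and the
# factors are MULTIPLIED (never added). The word probabilities here are made up.
p_word_given_spam = {"viagra": 0.02, "meeting": 0.001, "free": 0.05}
counts = {"viagra": 2, "free": 3, "meeting": 0}      # counts observed in the new email

likelihood_spam = 1.0
for word, count in counts.items():
    likelihood_spam *= p_word_given_spam[word] ** count

print(likelihood_spam)
# The same is done with p(word | not spam); the two likelihoods then feed
# into Bayes' rule along with the prior probability of spam.
```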

The final problem constructs a probability tree (not a decision tree). At the base of the tree, as usual, are the unknown (to the investigator) states of nature: HH, HT, TT. Some people wrote those correctly but didn't realize that HT is twice as likely as either HH or TT. Some others put the data at the base of the tree, which is never to be done. So, if a person answers "yes", it may be because he tossed HH, or it may be because he tossed HT and is telling the truth. Since HH gets tossed 25% of the time, and 40% of the subjects answered "yes", the remaining 15% = 40% − 25% of the responses are due to HT being tossed. Since HT happens half the time, it must be that 2 × 15% = 30% of the subjects would have answered "yes" if they were all telling the truth.
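The arithmetic in one place (assuming, as I read the problem, that an HH toss produces an automatic "yes" and an HT toss means the subject answers truthfully):

```python
# Randomized-response arithmetic: recover the fraction who would truthfully say "yes".
# Assumption: HH (prob 0.25) forces an automatic "yes"; HT (prob 0.50) means answer truthfully.
observed_yes = 0.40          # fraction of subjects who answered "yes" (from the problem)
p_hh = 0.25
p_ht = 0.50

truthful_yes_among_all = observed_yes - p_hh    # 0.15: the "yes" answers that came from HT tosses
true_rate = truthful_yes_among_all / p_ht       # 0.30: rate among those answering truthfully
print(true_rate)
```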

To summarize the difficulties that people had:

1) States of nature, which are not observable, always go at the root of a probability tree. Data never goes there. The sub-branches of a probability tree will always be conditional probabilities of data given the state of nature.

2) Distinguish between prediction of data that will be observed in the future, and inference of states of nature (which are unobservable).

3) A decision box belongs at the root of a decision tree, and the branches coming out of it are the actions being contemplated (e.g., don't fly, fly just Hubble, fly ten missions). Then there will be probability branches attached to each branch (sometimes only one, as in "don't fly"). Losses go at the tips of the probability branches, and are then propagated backwards down the tree. Choose the branch with the smallest loss. You can use utilities instead of losses, in which case you choose the branch with the greatest utility.

I finished my discussion by pointing out that the tables we have been calculating for rates r = 0.05, 0.15, 0.25, ..., 0.95 are approximations to continuous functions. Summing them is then like computing a Riemann sum, and as you make the divisions finer and finer you get better and better approximations to an integral. Since integration is in general hard, statisticians often resort to various approximation techniques to get the answers they need.
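A sketch of that idea, using the cure-rate example: as the grid is refined, the answer settles down toward the value of the corresponding integral.

```python
# The grid tables are Riemann-sum approximations to an integral over the cure rate.
# Refining the grid makes the approximation better; the answers converge.
def prob_rate_above(threshold, n_cells, successes=15, failures=35):
    width = 1.0 / n_cells
    rates = [(i + 0.5) * width for i in range(n_cells)]          # midpoints: 0.05, 0.15, ... when n_cells = 10
    joint = [r**successes * (1 - r)**failures for r in rates]    # flat prior cancels in the ratio
    total = sum(joint)
    return sum(j for r, j in zip(rates, joint) if r > threshold) / total

for n in (10, 100, 1000):
    print(n, round(prob_rate_above(0.2, n), 4))
# The values approach the exact integral of the continuous posterior above 0.2.
```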

We'll go back to crime on Wednesday. We need to decide on dates for the presentations, so I hope everyone will be in class on Wednesday.

There will be a guest talk by a research physician at UVM on Friday, November 21. We'll also discuss investing and Bayesian jokes/songs in the next week.

Monday, November 10, 2008

Nate Silver

The New York Times had an interesting article about Nate Silver, the proprietor of the very successful fivethirtyeight.com website that I pointed you to earlier for predictions of the election outcome. As anyone who was following this website knows, he came quite close to the actual results. Mr. Silver started out predicting baseball outcomes, also with great success.

Saturday, November 8, 2008

Occam's Razor

Here is an article that Jim Berger and I wrote on how Occam's Razor naturally falls out of Bayesian reasoning. I promised to post this after the test.

Wednesday, November 5, 2008

Class, 11/5

We continued our discussion of the General's Dilemma problem. Here, the situation is that there is a desperate battle to be won, and the result will depend on how many soldiers the general can get to the battle. The Objective (in the PrOACT agenda) is to win the battle. If 600 soldiers get through, then there is a 90% probability that the battle will be won. If 200 make it through, the probability is 50%, but if none make it through, the probability is only 20%.

We assigned a utility of 100 to winning, and 0 to losing. We decided that the expected utility would be maximized if the "200 for sure" choice was made. The decision tree had at the bottom the decision between routes; then, on the "200 for sure" branch, a probability branch with utility 0 if the battle is lost (50% probability) and 100 if it is won (50% probability). On the other branch we had a 2/3 chance of every soldier being lost versus a 1/3 chance that all would get through. Then (given that outcome) we had to put in another probability branch with the appropriate probability of winning the battle (depending on whether none get through or all get through).

When we ran the utilities through the tree backwards to the left, we found that the general should go with the "200 for sure" route.
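The expected-utility arithmetic for the two routes, in a few lines:

```python
# Expected utilities for the general's two routes (utility 100 for winning, 0 for losing).
u_win, u_lose = 100, 0

# Route 1: 200 soldiers get through for sure; battle won with probability 0.5.
eu_safe = 0.5 * u_win + 0.5 * u_lose

# Route 2: with probability 1/3 all 600 get through (win prob 0.9),
#          with probability 2/3 none get through (win prob 0.2).
eu_gamble = (1/3) * (0.9 * u_win + 0.1 * u_lose) + (2/3) * (0.2 * u_win + 0.8 * u_lose)

print(eu_safe, round(eu_gamble, 1))   # 50 vs about 43.3 -> take the "200 for sure" route
```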

We also experimented with the idea that the utilities might be different for the general if he valued lives saved more than lives lost (that is, if the premises of the problem were not just "win the battle" but also "save our soldiers' lives.") That would be done by decreasing the utilities when more soldiers didn't make it through. But fiddling with the numbers didn't seem to make a difference as to the decision. This doesn't mean that some choice of utilities wouldn't change the result, just that it isn't obvious how to do this.

I then opened the discussion to questions.

We discussed priors, especially when the priors depend on the state of nature. In particular, is it necessary to "normalize" the prior (make it add up to 1)? The answer is that in most cases this is not necessary. The reason is that (in our spreadsheet scheme) if you multiply the prior by a constant number then the joint will also be multiplied by the same number, and the sum of the joints (the marginal likelihood) will also be multiplied by the same number, so when you divide the joints by the marginal likelihood, they will have the same constant multiplier, which will cancel out.
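A quick sketch demonstrating that the constant cancels (the likelihood used here is just an arbitrary example):

```python
# An unnormalized prior gives the same posterior: any constant multiplier cancels
# when the joints are divided by the marginal likelihood.
rates = [0.05 + 0.1 * i for i in range(10)]
likelihood = [r**4 * (1 - r)**7 for r in rates]   # any likelihood works for the demonstration

def posterior(prior):
    joint = [p * l for p, l in zip(prior, likelihood)]
    marginal = sum(joint)                         # marginal likelihood
    return [j / marginal for j in joint]

normalized   = posterior([0.1] * 10)    # prior sums to 1
unnormalized = posterior([7.0] * 10)    # same shape, sums to 70
print(all(abs(a - b) < 1e-12 for a, b in zip(normalized, unnormalized)))   # True
```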

The exception is when you are testing a precise hypothesis (a coin is fair prob=1/2, a special die is fair, prob=1/3) against a vague alternative (a coin is not fair, prob has to have its own prior, etc.) In that case, it's important to normalize the priors and careful attention is required to do it right.

We briefly discussed the medical example, where we have tests on an old and a new drug. The whole idea here is to compute the probability that the new drug is better than the old one. We've been using an approximation that restricts the cure rate of a drug to particular values, 0.05, 0.15, 0.25, ..., 0.95. We know that if we make the division finer, we'll get a better approximation, but here we're trying to learn the principles. So, if we test the old drug (A) and get a certain set of posterior probabilities on its cure rate, and test the new drug (B) and get a different set of posterior probabilities, then we can set up a table of the joint posterior probabilities of the two cure rates by simply multiplying the probabilities for each pair of cure rates. We can do this because the two experiments were done on different, randomly selected patients, so the probabilities are independent. Then it's just a matter of identifying which slots belong to (cure rate of B is greater than cure rate of A) and adding them up: we add when we have (cure rate of A = 0.05 AND cure rate of B = 0.15) OR (cure rate of A = 0.05 AND cure rate of B = 0.25) OR .... This gives us the posterior probability that drug B is better than drug A.
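A sketch of the joint-table calculation; the data counts for the two drugs below are made up just to have something to compute with:

```python
# Probability that drug B's cure rate is greater than drug A's, given the two
# (independent) posterior tables. The cure/failure counts are placeholders.
rates = [0.05 + 0.1 * i for i in range(10)]

def grid_posterior(cures, failures):
    joint = [r**cures * (1 - r)**failures for r in rates]   # flat prior
    total = sum(joint)
    return [j / total for j in joint]

post_A = grid_posterior(10, 40)    # placeholder data for drug A
post_B = grid_posterior(15, 35)    # placeholder data for drug B

# Joint table: multiply because the two experiments used independent patients;
# then add up every slot where B's rate exceeds A's.
p_B_better = sum(pa * pb
                 for ra, pa in zip(rates, post_A)
                 for rb, pb in zip(rates, post_B)
                 if rb > ra)
print(round(p_B_better, 3))
```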

I think this is all we discussed, but if I have left out something important and you want me to address it before the test, post now and I'll respond.

Tuesday, November 4, 2008

The Election and Bayes

Jim Albert, a statistics professor at Bowling Green State University, has posted a short article on his statistics blog showing how a simple Bayesian approach can be used to predict the electoral vote outcome in today's election. You should be able to understand his idea quite clearly, as it's not significantly different from what we have been doing. The main difference is that he takes into account people who will vote for a third-party candidate, so there is a third term in the likelihood, equal to the probability of a vote for a third-party candidate raised to a power equal to the number of people in the poll who said they would vote for one. The other idea we've already discussed briefly, but here it is: after computing the posterior probability (as a formula, not as a table as we have been doing it), he uses a computer to draw a sample of 5000 values of the three proportions (McCain, Obama, Other), each sample representing a plausible set of proportions of people voting for each candidate. He calls the state for whichever candidate is favored in each sample, and the fraction of samples each candidate wins gives a win probability for each state; he's listed them in the blog.

The second part is even simpler. He uses the computer to flip a biased coin with the appropriate probability (from his table) for each state, assigns the electoral votes from that state to the winner, and adds them up over all states. This gives him a prediction for the electoral vote for Obama and McCain. He does this 5000 times to get 5000 predictions of the electoral vote, and plots them in the graph at the bottom of the page. His prediction is that Obama will get at least 300 electoral votes, and my eye indicates that the actual outcome is likely to be between 340 and 380, give or take.
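This is not Albert's actual code, but here is a rough sketch of the two steps with made-up polls for just two states. Drawing the vote-share proportions from a Dirichlet distribution (assuming a flat prior) is one standard way to do the posterior sampling:

```python
# Rough sketch of posterior simulation for the electoral vote, with made-up polls
# for two hypothetical states (the real analysis uses every state).
import random

# Step 1: per-state win probability by sampling the posterior of the vote shares.
def dirichlet_sample(counts):
    # Normalized gamma draws give a Dirichlet sample; the +1 comes from the flat prior.
    draws = [random.gammavariate(c + 1, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def win_probability(obama, mccain, other, n_sims=5000):
    wins = 0
    for _ in range(n_sims):
        p_obama, p_mccain, p_other = dirichlet_sample([obama, mccain, other])
        wins += p_obama > p_mccain
    return wins / n_sims

# Made-up polls: (Obama, McCain, Other) respondents, and each state's electoral votes.
states = {"State X": ((520, 460, 20), 20), "State Y": ((480, 500, 20), 11)}
win_p = {s: win_probability(*poll) for s, (poll, _) in states.items()}

# Step 2: flip a biased coin for each state and add up Obama's electoral votes.
def simulate_electoral_votes():
    return sum(ev for s, (_, ev) in states.items() if random.random() < win_p[s])

results = [simulate_electoral_votes() for _ in range(5000)]
print(win_p, sum(results) / len(results))
```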

This is pretty close to the method being used at fivethirtyeight.com.

This is a method called posterior simulation. What he's doing is drawing a large sample from the posterior distribution, and using that as a proxy for the actual posterior distribution. It is a method that is widely used by professional Bayesian statisticians. We can discuss it in more detail after the test, if you like.

Jim Albert, by the way, is the author of the textbook we will be using next semester in our statistics course on Bayesian statistics.

Monday, November 3, 2008

Class, 11/3

We spent some time discussing the catch-and-release problem and similar ones. I made several points:

1) Since the likelihood is the probability of data point 1 AND data point 2 AND ...., you must always MULTIPLY the probabilities of the individual data points to get the likelihood of the entire data set. NEVER add them!

2) In this problem, as we draw samples (fish), the total number of fish in the lake decreases by 1 each time a fish is caught, and so does the number of fish of whichever type (tagged or untagged) was caught. This means we are sampling without replacement. It also means that each time we catch a tagged fish, the probability for the next tagged fish has a numerator and a denominator that are each decreased by 1. So, for example, with 100 fish in the lake, 10 of them tagged, the first tagged fish we catch has a probability of 10/100, the second a probability of 9/99, the third 8/98, etc. After catching the five tagged fish, there are 95 fish left, 90 of them untagged. So the first untagged fish we catch has probability 90/95, the second 89/94, and so forth. (There is a short sketch of this calculation after the list.)

3) It doesn't matter in what order the fish are caught; the likelihood will be the same. So, you might as well treat all of the first kind first and then handle those of the remaining kind.

4) After computing the rest of the table to get the posterior, you can add up the posterior probabilities for intervals in the number of fish, e.g., to get the probability that the total number of fish is between 15 and 25 (inclusive), just add the posterior probabilities for each of those numbers of fish.

5) If the number of items (fish, voters) is very large, you can approximate the ratios by the same number: just pretend that the number of fish in the lake and the number of each kind don't change as more fish are caught. The error committed will be quite small in this case. What you are doing is approximating sampling without replacement by sampling with replacement.
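Here is the sketch promised above: the likelihood for each possible number of fish N, and the resulting posterior over N, for the tagging numbers used in class (10 tagged, then a catch of 5 tagged and 5 untagged):

```python
# Posterior for the number of fish N in the lake, given that 10 were tagged and a
# later catch of 10 fish contained 5 tagged and 5 untagged (sampling without replacement).
def likelihood(N, tagged=10, caught_tagged=5, caught_untagged=5):
    if N < tagged + caught_untagged:
        return 0.0           # impossible: not enough fish of the right kinds
    prob = 1.0
    # Catch the tagged fish "first" (the order doesn't change the likelihood).
    for i in range(caught_tagged):
        prob *= (tagged - i) / (N - i)
    # Then the untagged fish; the lake has already lost caught_tagged fish.
    for i in range(caught_untagged):
        prob *= (N - tagged - i) / (N - caught_tagged - i)
    return prob

Ns = list(range(1, 101))                       # states of nature
prior = [0.01] * len(Ns)                       # flat prior
joint = [p * likelihood(N) for p, N in zip(prior, Ns)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]

# Posterior probability that the lake holds between 15 and 25 fish (inclusive):
print(round(sum(post for N, post in zip(Ns, posterior) if 15 <= N <= 25), 3))
```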

We talked about the astrology problem. The states of nature are the numbers p = 0.05, 0.15, ..., 0.95, the way we are setting up the problem. The prior could be uniform, but if your experience is that astrology is probably bunk, you might want to skew the prior toward smaller numbers; or if your experience is that it works, you might want to skew it toward larger numbers. This is not cheating; it is using information from your past experience.

The likelihood is p^4 (1 − p)^7 for each of the values of p. The rest of the table is filled out as usual. Then, the answer to the question (the probability that the astrologer is able to predict the future at least 85% of the time) is the sum of the two posterior probabilities for the values p = 0.85 and p = 0.95.

We discussed the expert systems problem. The basic idea is that you can train a Bayesian system by, for example, telling it the symptoms observed and the diagnosis for a number of patients. This allows the system to estimate the conditional probabilities

p(symptom|diagnosis)

for a lot of symptoms and diagnoses. These can then be used as terms in the likelihood for a new patient who comes in and whose symptoms are determined and put into the system. What we've described in class is known as a naive Bayes classifier. It's also the basis of Bayesian spam filters and many other useful practical applications of Bayesian methods (including artificial intelligence systems).

We started discussing basic decision problems by drawing the tree for the "general with two routes" problem we had discussed earlier; this time we postulated that the general might be risk averse, in which case he would choose the 200 soldiers for certain rather than the gamble between no soldiers surviving with probability 2/3 and all of them surviving with probability 1/3. Alternatively, a risk seeking general would choose the gamble. We'll pick up on this on Wednesday.

Saturday, November 1, 2008

Class, 10/31

We finished discussing the drug testing problem. From the data we have posterior probabilities for various cure rates for each drug. By multiplying them, we obtain the posterior probability that, for example, drug A has cure rate 0.25 and drug B has cure rate 0.35. We can then add up all those joint probabilities for which drug B has the greater cure rate, to get the probability that drug B is better than drug A.

We then discussed the marketing problem. We identified two stages, the first of which was to spend $20 million testing the drug to FDA standards. We recognized that the drug might not pass this test; most experimental drugs do not. From our preliminary test there is a 75% chance that the drug will pass. We discussed "sunk costs," that is, costs that cannot be recovered. Even though getting to the point where we are now (with 100 subjects tested) did cost some money, we can never get it back, so we may as well call our loss or utility exactly zero at this point.

You can calculate using either losses or utilities. It's purely a matter of convenience. Since the drug company is very wealthy, its utility or loss function will be linear or nearly so.

We illustrated the process by imagining that if the company decides to continue the development of the drug, it will pass through a toll gate worth $20M. Then there will be a 75% probability that we will go on to develop the drug to marketing stage, which will cost an additional $80M and require another toll gate. We guessed that the drug had a 20% chance of commercial success (revenues of 20 years at $1B per year) and an 80% chance of "failure" (20 years at $10M/year, if I recall). These figures may not be what I wrote on the board, but you should have the correct numbers and the tree in your notes. We decided that the company should go ahead with the plan, given the figures we used.
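A rough expected-value check of that tree; the dollar figures below are the ones quoted above and may not match what was on the board exactly:

```python
# Rough expected-value check for the drug-development decision.
cost_fda_trial   = 20e6          # first "toll gate"
cost_to_market   = 80e6          # second "toll gate"
p_pass_fda       = 0.75
p_commercial_hit = 0.20
revenue_hit      = 20 * 1e9      # 20 years at $1B/year
revenue_flop     = 20 * 10e6     # 20 years at $10M/year

# Work backwards: value after passing the FDA trial...
value_if_passed = -cost_to_market + p_commercial_hit * revenue_hit \
                  + (1 - p_commercial_hit) * revenue_flop
# ...then the value of deciding to continue right now (sunk costs ignored).
value_continue = -cost_fda_trial + p_pass_fda * value_if_passed
value_stop = 0.0

print(f"continue: ${value_continue/1e6:,.0f}M vs stop: ${value_stop/1e6:,.0f}M")
# With these figures, continuing has a much higher expected value, matching the class decision.
```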

We then discussed the fish problem. The states of nature are the numbers from 1 to 100. The prior we took to be equal (0.01 on each SON). One student asked, since we know there have to be at least 15 fish in the lake, why not set those to zero, but another student pointed out that that requires looking at the data, and the prior is supposed to reflect what you know before you look at the data, so to do this would be "cheating." We had a false start on the likelihood, which was my fault as I should have steered us to the correct solution more quickly. But several students were puzzled and we started over. Since there are N fish in the lake, and 10 of them are tagged, the probability of picking 5 tagged and 5 untagged is as follows:

For N=15, it's (10*9*8*7*6)*(5*4*3*2*1)/((15*14*13*12*11)*(10*9*8*7*6)). We get this because the probability of picking the first tagged fish is 10/15, the probability of picking the second tagged fish is 9/14, and so on through the 5 tagged fish; then for the untagged fish it is 5/10 for the first one, 4/9 for the second one, and so on through the 5 untagged fish. As each fish is caught, the number of fish of that kind (tagged or untagged) decreases by 1, as does the total number of fish in the lake.

Similarly, if N=20, it's (10*9*8*7*6)*(10*9*8*7*6)/((20*19*18*17*16)*(15*14*13*12*11)), with the same kind of reasoning.

We'll discuss this more on Monday and then go on to the remaining problems on the study sheet.