We spent some time discussing the catch-and-release problem and similar ones. I made several points:
1) Since the likelihood is the probability of data point 1 AND data point 2 AND ...., you must always MULTIPLY the probabilities of the individual data points to get the likelihood of the entire data set. NEVER add them!
2) In this problem, as we draw samples (fish), the total number of fish in the lake and the number of fish of the kind just caught (tagged or untagged) each decrease by 1 every time a fish is caught. This means we are sampling without replacement. It also means that each successive tagged fish we catch gets a numerator and a denominator decreased by 1. So, for example, with 100 fish in the lake, 10 of them tagged, the first tagged fish we catch has a probability of 10/100, the second a probability of 9/99, the third 8/98, etc. After catching the five tagged fish, there are 95 fish left, 90 of them untagged. So the first untagged fish we catch has probability 90/95, the second 89/94, and so forth.
3) It doesn't matter what order the fish are caught, the likelihood will be the same. So, you might as well treat all of the first kind first and then handle those of the remaining kind.
4) After computing the rest of the table to get the posterior, you can add up the posterior probabilities for intervals in the number of fish, e.g., to get the probability that the total number of fish is between 15 and 25 (inclusive), just add the posterior probabilities for each of those numbers of fish.
5) If the number of items (fish, voters) is very large, you can approximate all the ratios by the same numbers: just pretend that the number of fish in the lake and the number of each kind don't change as more fish are caught. The error committed will be quite small in this case. What you are doing is approximating sampling without replacement by sampling with replacement.
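Points 2 and 5 can be checked numerically. Here is a small Python sketch (the class did its tables in Excel) comparing the exact without-replacement probability to the with-replacement approximation; the million-item case is a made-up stand-in for something large like a voter population:

```python
def exact(N, K, a=5, b=5):
    """P(a items of the K-kind, then b of the other), without replacement:
    numerators and denominators drop by 1 as each item is drawn."""
    p, k, u, n = 1.0, K, N - K, N
    for _ in range(a):
        p *= k / n; k -= 1; n -= 1
    for _ in range(b):
        p *= u / n; u -= 1; n -= 1
    return p

def approx(N, K, a=5, b=5):
    """With-replacement approximation: pretend the counts never change."""
    return (K / N) ** a * ((N - K) / N) ** b

# 100 fish, 10 tagged (the example in point 2): the error is noticeable.
small = abs(exact(100, 10) - approx(100, 10)) / exact(100, 10)
# A million items, 100,000 of one kind: the error is tiny (point 5).
large = abs(exact(10**6, 10**5) - approx(10**6, 10**5)) / exact(10**6, 10**5)
```

The contrast shows why the shortcut in point 5 needs a large population to be safe.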
We talked about the astrology problem. The states of nature are the numbers p = 0.05, 0.15, ..., 0.95, the way we are setting up the problem. The prior could be uniform, but if your experience is that astrology is probably bunk, you might want to skew the prior to smaller numbers; or if your experience is that it works, you might want to skew it to larger numbers. This is not cheating, it is using information in your past experience.
The likelihood is p^4(1-p)^7 for each of the values of p. The rest of the table is filled out as usual. Then, the answer to the question (the probability that the astrologer is able to predict the future at least 85% of the time) is the sum of the two posterior probabilities for the values of p=0.85 and 0.95.
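The whole table fits in a few lines of Python (the same arithmetic the class does in Excel), with a uniform prior assumed for the sketch:

```python
# States of nature: p = 0.05, 0.15, ..., 0.95, uniform prior.
ps = [0.05 + 0.1 * i for i in range(10)]
prior = [1 / 10] * 10
# Likelihood: p^4 (1-p)^7 (four correct predictions, seven incorrect).
like = [p**4 * (1 - p)**7 for p in ps]
joint = [pr * l for pr, l in zip(prior, like)]          # prior * likelihood
marg = sum(joint)                                       # marginal likelihood
post = [j / marg for j in joint]                        # posterior
# P(astrologer predicts at least 85% of the time) = posterior on 0.85 and 0.95:
answer = sum(po for p, po in zip(ps, post) if p >= 0.85)
```

With these data the answer comes out very small, consistent with a likelihood that peaks well below 0.85.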
We discussed the expert systems problem. The basic idea is that you can train a Bayesian system by, for example, telling it the symptoms observed and the diagnosis for a number of patients. This allows the system to estimate the conditional probabilities
p(symptom|diagnosis)
for a lot of symptoms and diagnoses. These can then be used as terms in the likelihood for a new patient who comes in and whose symptoms are determined and put into the system. What we've described in class is known as a naive Bayes classifier. It's also the basis of Bayesian spam filters and many other useful practical applications of Bayesian methods (including artificial intelligence systems).
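A toy version of the training and classification steps described above, with entirely made-up symptom and diagnosis names (and a crude floor for unseen symptoms standing in for proper smoothing):

```python
import math
from collections import defaultdict

def train(records):
    """records: list of (diagnosis, set_of_symptoms) pairs.
    Returns priors p(diagnosis) and conditionals p(symptom | diagnosis),
    both estimated by simple counting."""
    diag_counts = defaultdict(int)
    sym_counts = defaultdict(lambda: defaultdict(int))
    for d, syms in records:
        diag_counts[d] += 1
        for s in syms:
            sym_counts[d][s] += 1
    n = len(records)
    prior = {d: c / n for d, c in diag_counts.items()}
    cond = {d: {s: c / diag_counts[d] for s, c in sd.items()}
            for d, sd in sym_counts.items()}
    return prior, cond

def classify(prior, cond, symptoms):
    """Naive Bayes: prior times the product of p(symptom|diagnosis) terms,
    normalized. Unseen symptoms get a tiny floor (an ad hoc choice here)."""
    joint = {d: prior[d] * math.prod(cond[d].get(s, 1e-6) for s in symptoms)
             for d in prior}
    z = sum(joint.values())
    return {d: j / z for d, j in joint.items()}

prior, cond = train([("flu", {"fever", "cough"}),
                     ("flu", {"fever"}),
                     ("cold", {"cough"})])
posterior = classify(prior, cond, {"fever"})
```

The "naive" part is treating each symptom term as independent given the diagnosis, which is exactly how the likelihood is built as a product.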
We started discussing basic decision problems by drawing the tree for the "general with two routes" problem we had discussed earlier; this time we postulated that the general might be risk averse, in which case he would choose the 200 soldiers for certain rather than the gamble between no soldiers surviving with probability 2/3 and all of them surviving with probability 1/3. Alternatively, a risk seeking general would choose the gamble. We'll pick up on this on Wednesday.
Monday, November 3, 2008
Saturday, November 1, 2008
Class, 10/31
We finished discussing the drug testing problem. From the data we have posterior probabilities for various cure rates for each drug. By multiplying them, we obtain the posterior probability that, for example, drug A has cure rate 0.25 and drug B has cure rate 0.35. We can then add up all those joint probabilities for which drug B has the greater cure rate, to get the probability that drug B is better than drug A.
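The multiply-then-add step can be sketched in Python; the cure rates and posterior probabilities below are made up for illustration, not the class's numbers:

```python
# Hypothetical posteriors over cure rates for the two drugs.
rates = [0.25, 0.35, 0.45]
post_a = [0.5, 0.3, 0.2]    # posterior for drug A (made up)
post_b = [0.2, 0.3, 0.5]    # posterior for drug B (made up)

# Multiply to get the joint probability of each (rate_A, rate_B) pair,
# then add up the cells where B's cure rate exceeds A's.
p_b_better = sum(pa * pb
                 for ra, pa in zip(rates, post_a)
                 for rb, pb in zip(rates, post_b)
                 if rb > ra)
```

Multiplying the two posteriors is justified because the data for the two drugs are independent, so the joint posterior factors.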
We then discussed the marketing problem. We identified two stages, the first of which was to spend $20 million in testing the drug to FDA standards. We recognized that the drug might not pass this test; most experimental drugs do not. From our preliminary test, there is a 75% chance that the drug will pass. We discussed "sunk costs," that is, costs that cannot be recovered. Even though getting to the point where we are (with 100 subjects tested) did cost some money, we can never get it back, so we may as well call our loss or utility exactly zero at this point.
You can calculate using either losses or utilities. It's purely a matter of convenience. Since the drug company is very wealthy, its utility or loss function will be linear or nearly so.
We illustrated the process by imagining that if the company decides to continue the development of the drug, it will pass through a toll gate worth $20M. Then there will be a 75% probability that we will go on to develop the drug to marketing stage, which will cost an additional $80M and require another toll gate. We guessed that the drug had a 20% chance of commercial success (revenues of 20 years at $1B per year) and an 80% chance of "failure" (20 years at $10M/year, if I recall). These figures may not be what I wrote on the board, but you should have the correct numbers and the tree in your notes. We decided that the company should go ahead with the plan, given the figures we used.
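The expected-value arithmetic for the tree can be laid out in Python. Note the post's own caveat that these figures may not match what was on the board, so treat the numbers as placeholders:

```python
# All amounts in $M; probabilities and figures as discussed in class
# (the post warns they may differ from the board).
test_cost = 20            # first toll gate: FDA-standard testing
p_pass = 0.75             # chance the drug passes that test
dev_cost = 80             # second toll gate: development to marketing stage
p_success = 0.20          # chance of commercial success
rev_success = 20 * 1000   # 20 years at $1B/year
rev_failure = 20 * 10     # 20 years at $10M/year ("if I recall")

# Work backward from the leaves (linear utility, since the company is wealthy):
ev_market = p_success * rev_success + (1 - p_success) * rev_failure
ev_continue = -test_cost + p_pass * (-dev_cost + ev_market)
# Stopping now is worth 0, since past costs are sunk.
decision = "continue" if ev_continue > 0 else "stop"
```

With these figures the expected value of continuing is strongly positive, matching the class's conclusion to go ahead.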
We then discussed the fish problem. The states of nature are the numbers from 1 to 100. The prior we took to be equal (0.01 on each SON). One student asked, since we know there have to be at least 15 fish in the lake, why not set those to zero, but another student pointed out that that requires looking at the data, and the prior is supposed to reflect what you know before you look at the data, so to do this would be "cheating." We had a false start on the likelihood, which was my fault as I should have steered us to the correct solution more quickly. But several students were puzzled and we started over. Since there are N fish in the lake, and 10 of them are tagged, the probability of picking 5 tagged and 5 untagged is as follows:
For N=15, it's (10*9*8*7*6)*(5*4*3*2*1)/((15*14*13*12*11)*(10*9*8*7*6)). We get this because the probability of picking one tagged fish is 10/15, the probability of picking the second tagged fish is 9/14, and so on through the 5 tagged fish; then for the untagged fish it is 5/10 for the first one, 4/9 for the second one, and so on through the 5 untagged fish. As each fish is caught, the number of fish of that kind (tagged or untagged) decreases by 1, as does the total number of fish in the lake.
Similarly, if N=20, it's (10*9*8*7*6)*(10*9*8*7*6)/((20*19*18*17*16)*(15*14*13*12*11)), with the same kind of reasoning.
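The same products can be generated for every N in a short Python sketch (the class builds this table in Excel); exact fractions avoid rounding, and the interval sum at the end illustrates adding posterior probabilities over a range, as in point 4 of the Monday notes:

```python
from fractions import Fraction

def likelihood(n):
    """P(catch 5 tagged then 5 untagged | N = n fish, 10 of them tagged),
    sampling without replacement; zero when n < 15."""
    if n < 15:
        return Fraction(0)
    p = Fraction(1)
    tagged, untagged, total = 10, n - 10, n
    for _ in range(5):                  # e.g. 10/15, 9/14, ... when n = 15
        p *= Fraction(tagged, total)
        tagged -= 1
        total -= 1
    for _ in range(5):                  # then 5/10, 4/9, ... when n = 15
        p *= Fraction(untagged, total)
        untagged -= 1
        total -= 1
    return p

prior = Fraction(1, 100)                # uniform over N = 1..100
joint = {n: prior * likelihood(n) for n in range(1, 101)}
marg = sum(joint.values())
post = {n: j / marg for n, j in joint.items()}
p_15_to_25 = sum(post[n] for n in range(15, 26))   # P(15 <= N <= 25)
```

For n = 15 and n = 20 the function reproduces exactly the products written out above.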
We'll discuss this more on Monday and then go on to the remaining problems on the study sheet.
Friday, October 31, 2008
Bayes and the election
Andrew Gelman has posted an interesting article on the use of Bayesian methods to predict the election outcome. It discusses the website fivethirtyeight.com, maintained by Nate Silver.
Wednesday, October 29, 2008
Class, 10/29
We started by discussing the homework. I emphasized that there will never be '+' signs separating the probabilities of the individual events in the likelihood. You will always get the likelihood by multiplying the probabilities of the individual events together (whatever they are). It was evident from several of the homeworks that the likelihood had been calculated incorrectly. In one case, enough Excel code had been given to me to know that a '+' sign had been used instead of a '*' sign. I do not know what happened in the other cases. In any case, any group whose total score was less than 36/40 may resubmit on Friday for partial additional credit.
The other problems were minor.
I did note that there is another, and maybe better way (other than in the problem statement) to get the answer to the first problem. That is simply to put prior probabilities on the states of nature, with half on the "null hypothesis" that the die is unbiased, and distributing the remainder among the alternatives. Then do the usual thing: prior*likelihood = joint, sum joint to get marginal likelihood under all hypotheses, divide that into the joint to get the posterior. And then, just look at the posterior probability p of the "null hypothesis". The odds on the null hypothesis are then p/(1-p). At least one group actually used this method, which made me proud!
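The mechanics of this method can be sketched in Python. The actual states of nature and data for the die problem are in the problem statement, so the alternative biases and roll counts below are made up purely to show the steps:

```python
# States of nature: probability q that the die shows a six.
# Null hypothesis (fair die) first; the alternatives here are hypothetical.
sons = [1/6, 0.25, 0.30]
prior = [0.5, 0.25, 0.25]        # half on the null, remainder spread out

# Hypothetical data: 30 rolls, 8 sixes. Likelihood is q^8 (1-q)^22.
like = [q**8 * (1 - q)**22 for q in sons]

joint = [pr * l for pr, l in zip(prior, like)]   # prior * likelihood
marg = sum(joint)                                # marginal likelihood
post = [j / marg for j in joint]                 # posterior

p_null = post[0]                  # posterior probability of the null
odds_null = p_null / (1 - p_null) # odds on the null hypothesis
```

Whatever the real alternatives are, the recipe is the same: prior times likelihood, normalize, then read off p/(1-p) for the null.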
We then started on the practice problems.
Problem #1 is similar to the copyright problem we discussed in class. We expect a student to get answers right, because they are supposed to know the material. No one would suspect students, even if they got the answers right, because that's what's supposed to happen. But the mistakes (just like in the copyright problem) are the key. Mistakes should be made at random. So if one student copies another, s/he will copy the mistakes perfectly; if not, each mistake will match only with probability 1/5 (since there are five possible answers). The two states of nature are Cheat and No Cheat. We discussed the prior and decided that on Cheat it might be 1/10. It could be larger or smaller, and arguments were given for each. The likelihood is 1 for Cheat and 0.2^7 for No Cheat, since each coincidence has probability 0.2 and there are seven coincidences. We found that with our prior, the posterior probability of cheating is nearly 1, and the professor ought to take appropriate action.
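With only two states of nature the whole table is a few lines of Python:

```python
# Prior 1/10 on Cheat; likelihood 1 for Cheat (all seven mistakes match),
# 0.2^7 for No Cheat (each match independently has probability 1/5).
prior_cheat = 0.10
like_cheat = 1.0
like_no_cheat = 0.2 ** 7

joint_cheat = prior_cheat * like_cheat
joint_no_cheat = (1 - prior_cheat) * like_no_cheat
post_cheat = joint_cheat / (joint_cheat + joint_no_cheat)
```

Since 0.2^7 is about 1.3e-5, the posterior probability of cheating comes out extremely close to 1, as stated above.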
In Problem #2, the aim was not to actually calculate anything, but to explain how a calculation could be arranged. So the SON are the possible number of taxis, from 1 to (we decided) not more than 50,000. We discussed ways to set a prior. One would be to pick a probability, say 0.9 or 0.99, and raise it to the power of the number of taxis in the SON. A second was to use a straight-line ramp from 1 (highest) to 50,000 (lowest). A third was to use something like 1/N where N is the number of taxis in the SON. Whichever method we use, we just write the numbers in an Excel spreadsheet, add them up, divide each by the sum, and enter the results as the normalized prior.
The likelihood is 0 if the number of taxis is less than 150 (you can't see taxi number 37, for example, if there are only 36 taxis), and is (1/N)^7 for each SON where N is greater than 149, because this is "sampling with replacement": each observed taxi number has probability 1/N, and the likelihood is the product of these probabilities over the 7 taxis observed.
Then the usual: prior*likelihood= joint, sum the joint, etc....
Then, you can decide what probability you want for the number of taxis. If you want the probability to be at least 0.99, just sum down the posterior until the running total reaches 0.99. The number of taxis on that last line is a bound that holds with that probability.
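Putting the pieces together in Python, using the 1/N prior option (one of the three discussed; the others would just change the `raw_prior` line):

```python
# States of nature: N = 1..50,000 taxis; prior proportional to 1/N.
N_MAX = 50_000
raw_prior = [1.0 / n for n in range(1, N_MAX + 1)]
z = sum(raw_prior)
prior = [r / z for r in raw_prior]

# Likelihood: 0 below 150 (the highest observed number rules those out),
# (1/N)^7 otherwise, since 7 taxis were observed with replacement.
like = [0.0 if n < 150 else (1.0 / n) ** 7 for n in range(1, N_MAX + 1)]

joint = [p * l for p, l in zip(prior, like)]
marg = sum(joint)
post = [j / marg for j in joint]

# Sum down the posterior until the running total reaches 0.99.
total, n99 = 0.0, None
for n, p in enumerate(post, start=1):
    total += p
    if total >= 0.99:
        n99 = n
        break
```

Because the posterior falls off like 1/N^8, the 0.99 point is reached not far above 150, long before the 50,000 cap matters.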
For the third problem, you have to start by keeping the two cases (standard drug and new drug) separate. For each of these, you want to compute the posterior probability that the particular drug cures the disease (r or s). This is just a standard calculation, like the one for the homework on Monday. The trick is what to do with this information to decide what the probability is that the new drug is better than the old one. We'll talk about this next time.
Tuesday, October 28, 2008
Another interesting article
The New York Times has an interesting article today on decision-making and how bad people are at it.
Monday, October 27, 2008
Class, 10/27
Today we discussed criminal trials from the juror's point of view. We decided, after some discussion, that the worst thing would be to convict someone who was actually innocent. We know from the Innocence Project that an unacceptably high proportion of people in prison are probably innocent. We set up a decision tree with branches Convict Innocent and Acquit Innocent (the worst and best outcomes) in a probability fork, with u being the probability of CI and (1-u) the probability of AI, and the "for certain" branch of the tree being Acquit Guilty. After some discussion we decided on something like u = 0.01, which would mean that 99% of people sent to prison would actually be guilty (assuming we can evaluate that probability as a juror). With a loss of 0 for AI and 1000 for CI, we found that the loss for AG would have to be 10 to make us indifferent between the two branches.
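The indifference calculation is one line of expected-loss arithmetic:

```python
# Losses: 0 for Acquit Innocent, 1000 for Convict Innocent;
# u is the probability of CI on the gamble branch.
u = 0.01
loss_CI, loss_AI = 1000, 0

# Expected loss of the probability fork:
expected_loss_gamble = u * loss_CI + (1 - u) * loss_AI   # 0.01 * 1000 = 10

# Setting the certain branch (Acquit Guilty) to this value makes the
# two branches indifferent:
loss_AG = expected_loss_gamble
```

Any AG loss above 10 would push the juror toward the gamble; below 10, toward acquitting.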
We also discussed the case of Convict Guilty, and although some thought that AI would personally be better than CG (both correct decisions), this didn't seem to hold up when we replaced AG with CG in the decision tree we drew. CG for certain seemed better than CI with probability even as small as 0.001.
We also discussed whether the seriousness of the case and the harshness of the punishment should not also change our losses. Surely, some thought, the penalty for a traffic ticket is not as onerous a penalty as 20 years in prison for a serious crime, if the person accused were actually innocent, and the death penalty is even more unacceptable if the accused were actually innocent (even though Vermont doesn't have the death penalty, a recent Vermont jury did give the death penalty in a federal case, so it's not entirely moot even for Vermonters). So, the loss for CI ought to be larger if the penalty is more serious, some said. One student would never give the death penalty...for that student, the loss is effectively infinite.
The next several classes will be devoted to discussing the practice problems for the second test. We will pick up the juror discussion again after the test.