In class today, one of the things we discussed was Nassim Taleb's ideas, and I also mentioned Mark Thoma's blog. As it turns out, Mark Thoma posted an article by Taleb on his blog just today. As you will see when you read it, it generated a lot of comments.
I also mentioned my friend and mentor (a person from whom I have learned much and who helped me greatly in my career), Jim Berger.
Mentors are very important for anyone's career. I have had several; Jim Berger was the latest, but there were also (in temporal order) Heinz Eichhorn (image on screen behind me), Harlan Smith, and Jürgen Moser. These were people who taught me much and with whom I had very close professional and personal relationships. They helped me build my career. So, you should find mentors who can do for your career what mine have done for me.
Monday, December 8, 2008
Friday, November 21, 2008
Class, 11/21
Today we had a visit from Dr. Turner Osler, who is a medical researcher at the UVM medical school.
Just to touch on several of the important topics. You will have read the Scientific American article by Efron and Morris. The point is that we can get better estimates of the individual items in a collection (baseball batting averages, proportions of toxoplasmosis patients in cities, mortality rates in burn units) by combining them in a way that uses some of the information from all of the different items. So if we observe the batting averages of a number of batters after they have each been at bat 40 times (early in the season), we can get better estimates of the players' final batting averages at the end of the season by using a formula like:
z_i = Y + c(y_i - Y)
where y_i is the individual item (batting average, etc.), Y is the average of all of them, and z_i is our best estimate of the true value. The number c is given by a formula in the notes that Dr. Osler handed out. The important thing is that if c = 1, then z_i = y_i, and if c = 0, then z_i = Y. The smaller c is, the more the estimates are "shrunk" toward Y. So the estimator is a combination of Y and y_i, a weighted average of the two. (This is the so-called Efron-Morris estimator.)
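As a quick illustration, here is the shrinkage formula in a few lines of Python. The early-season averages and the values of c below are made up; the real c comes from the formula in Dr. Osler's handout.

```python
# A sketch of the shrinkage formula z_i = Y + c*(y_i - Y). The averages
# and values of c here are invented; the real c comes from the handout.
def shrink(averages, c):
    """Pull each individual average toward the grand mean Y by the factor c."""
    Y = sum(averages) / len(averages)      # average of all the items
    return [Y + c * (y - Y) for y in averages]

early = [0.400, 0.378, 0.276, 0.222]       # hypothetical early-season batting averages
print(shrink(early, 0.3))                  # small c: estimates shrunk strongly toward Y
print(shrink(early, 1.0))                  # c = 1: just the raw averages back
```

With c = 0 every estimate collapses to the grand mean, and with c = 1 nothing is shrunk at all; the interesting cases are in between.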
I remarked at the end of the class that the notion of "best" in this context actually comes from assuming a particular loss function, one that says that the farther the estimated values are from the true values (in an "average" sense), the bigger the loss; we then choose c so as to minimize the overall loss. When you do this, the value of c in the handout pops out.
Dr. Osler is using these ideas to solve the following problem: we can estimate the proportion of "bad outcomes" in a burn unit, for example, by dividing the number of patients who die by the total number of patients. But this is like the batting averages: in many cases it is based on a relatively small number of patients. We can use the ideas of the Efron-Morris article to improve the estimates for the individual hospitals. And we learn that many of the hospitals that seemed at first glance to have excessively high mortality probably do not. There was one hospital that looked suspicious.
Dr. Osler also discussed a "beta" prior, which we have been using without giving it that name. It is anything of the form p^a(1-p)^b for constants a and b. This will be a "bell-shaped curve" whose actual shape can be varied quite widely by choosing the constants a and b to match what we want. When you multiply this prior by the likelihood p^n(1-p)^(N-n), which you get if n patients out of N die, you get a posterior that is also beta: p^(a+n)(1-p)^(b+N-n). We've been doing this with spreadsheets, which give an approximation based on evaluating this formula at particular points between 0 and 1 (we used p = 0.05, 0.15, 0.25, ..., 0.95 in class). As I remarked, the more points you use in the spreadsheet, the more accurate the results. We used 10 points because that's easily done in class, but you could use 100 points, or 1000, to get more accurate answers. With a spreadsheet, it's just copying the formula down. What Dr. Osler is doing is something we've been doing essentially for the entire semester; for example, when we used a non-uniform prior in the polling (voting) example, we were doing this.
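The spreadsheet calculation can be sketched in a few lines of Python. The grid of p values matches the one we used in class; the constants a, b and the counts n, N below are just illustrative.

```python
# A sketch of the spreadsheet calculation: evaluate prior * likelihood,
# p^(a+n) * (1-p)^(b+N-n), on a grid of p values and normalize. The
# constants a, b and the data n, N are made up for illustration.
def grid_posterior(a, b, n, N, points=10):
    ps = [(i + 0.5) / points for i in range(points)]          # 0.05, 0.15, ..., 0.95
    joint = [p ** (a + n) * (1 - p) ** (b + N - n) for p in ps]
    total = sum(joint)                                         # plays the role of the marginal likelihood
    return ps, [j / total for j in joint]

ps, post = grid_posterior(a=1, b=1, n=3, N=20)                 # e.g., 3 deaths out of 20 patients
for p, q in zip(ps, post):
    print(p, round(q, 3))
```

Raising `points` to 100 or 1000 is the "copying the formula down" step, and gives the finer approximations mentioned above.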
Have a happy Thanksgiving!
Wednesday, November 19, 2008
Class, 11/19
Today I told you about the every-four-years Valencia conferences of Bayesians and the other series of meetings that happens two years after each Valencia conference. This conference has been going since, I think, 1978. A discussion of how the series began, written by Jose Bernardo (I mentioned him today) can be found here.
Because they have gotten quite large, in recent years the meetings have been held at a Mediterranean resort (one was held on the Canary Islands, part of Spain). One of the songs I played, "Frequentists and Bayesians," referred to that (Click here for music and lyrics). The songs and other things I mentioned come from the "Cabaret" that follows each of the roughly 5-day long meetings, when everyone is pretty exhausted and ready for some fun. The Cabarets feature songs with Bayesian flavor, set to well-known tunes (see the handout), skits, juggling, and other frivolity. Pictures of several of the Cabarets can be found here.
The song refers to MCMC, which stands for Markov chain Monte Carlo, which has pretty much become the default method of Bayesian calculations for the past 20 years. As I described it in class, it uses a "random walk" method to stagger from one state of nature to another one, in such a way that more time is spent on states of nature that have a high posterior probability. Then we can use the sample so generated to make the inferences we need, by just counting how often each state of nature is visited by the MCMC program; the proportion of time spent on a particular state of nature becomes an estimate of the posterior probability of that state of nature.
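Here is a toy version of that random-walk idea in Python, for a made-up problem with just three states of nature and invented (unnormalized) posterior weights; real MCMC implementations are more sophisticated, but the counting principle is the same.

```python
import random

# A toy MCMC "random walk" over three states of nature with invented
# unnormalized posterior weights. The chain spends more time in the
# high-probability states; visit frequencies estimate the posterior.
weights = {"A": 1.0, "B": 2.0, "C": 7.0}       # hypothetical unnormalized posterior
states = list(weights)

random.seed(0)
current = "A"
counts = {s: 0 for s in states}
for _ in range(100_000):
    proposal = random.choice(states)            # stagger to a randomly proposed state
    if random.random() < weights[proposal] / weights[current]:
        current = proposal                      # always accepted if the proposal is more probable
    counts[current] += 1

total = sum(counts.values())
print({s: round(counts[s] / total, 2) for s in states})   # close to 0.1, 0.2, 0.7
```

The acceptance rule is what steers the walk: moves toward higher-probability states are always taken, while moves toward lower-probability states are taken only occasionally.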
Another song I played, "These are Bayes," featured two things of interest. One is the mention of Sir Harold Jeffreys (no relation), who lived to almost 100 years of age and played a very important role over the years in making Bayesian statistics come into the mainstream. He invented a way of deciding on priors which, in the problems we have been doing, amounts to a uniform prior. (I met Sir Harold once, on the only trip he made to the U.S., when I was a graduate student. And, when I go to Bayesian meetings, the similarity between our last names is always a source of amusement.) The second thing is the running joke that Bayesians make about posteriors (Click here for music and lyrics).
There is a lively debate about how to choose priors. There's the Jeffreys prior, mentioned above, and other exotic things like Maximum Entropy priors, Reference priors, Group-theoretic priors and so on. We didn't discuss these at all since they are beyond the scope of the course, but one of the songs, "Confusing Priors," is really referring to this debate. It's on YouTube here.
Two other songs that are in the handout but which I did not play are "The MCMC Saga" and "Bayesian Believer". More YouTube clips of other songs can be found here. The lyrics of some of the earlier songs can be found on the Bayesian Songbook, which I excerpted for today's handout.
Monday, November 17, 2008
Class, 11/17
Today we discussed financial security, investments, insurance and related issues. The handout I gave out is here.
Saturday, November 15, 2008
Class, 11/14
Today, after some special announcements, we discussed the O. J. Simpson case. See Calculated Risks, Chapter 8. The basic point is that Alan Dershowitz, the Harvard lawyer who contributed to Simpson's defense, made a mistake when he argued that the fact that Simpson had battered his wife Nicole was not material, since the probability that a batterer goes on to murder his partner is only 1/2500 per year. This reasoning is wrong because it ignores the fact that Nicole actually was murdered; the relevant question is the probability that the batterer did it, given that the murder occurred. The probability that a woman is murdered by someone other than her batterer in any year is 1/20000. This means that in any group of 100,000 battered women, 40 would be killed by their batterer in a year, whereas only 5 would be murdered by someone who is not the batterer. So, given that Nicole was both battered and murdered, the probability is 40/45 that Simpson did it (on this evidence alone). So the battery is relevant and should be admitted in evidence.
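The arithmetic can be checked directly:

```python
# Checking the numbers in the argument: out of 100,000 battered women,
# how many are murdered in a year by the batterer vs. by someone else?
women = 100_000
by_batterer = women / 2500        # 1/2500 per year -> 40
by_other = women / 20_000         # 1/20000 per year -> 5
p_batterer = by_batterer / (by_batterer + by_other)
print(by_batterer, by_other)      # 40 and 5
print(p_batterer)                 # 40/45, about 0.89
```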
We then discussed the death penalty. Although it does not exist in Vermont state courts, it does exist in Federal courts, and at least one Vermont jury recently handed down a death sentence to a person tried in Federal court in Vermont.
We used a decision tree approach. A decision box at the root of the tree has two branches corresponding to our two actions: Convict or Acquit. We put another decision box on the Convict branch, that is, Death or Life Imprisonment. Then each of these three branches has a probability circle with two branches: Guilty (probability p) and Innocent (probability 1-p).
For the losses, we put 0 on each of the two correct outcomes, Acquit Innocent and Convict Guilty with Life. We decided for various reasons that Convict Guilty and Death had a slight loss (reasons being things like moral hazard). It turns out not to matter, we could pick 0 and the result would be the same. We decided that the loss of giving an innocent person the death penalty was huge (we picked 1000, I think), much larger than the loss of sending an innocent person to prison for life. We didn't assign an actual number to acquitting a guilty person as we were running out of time. But the point was, we found that no matter what p is, so long as the loss for Innocent and Death is larger than the loss for Innocent and Life in Prison, we will never choose the death penalty.
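The tree's expected-loss calculation can be sketched in Python. The structure follows the class discussion, but the loss numbers themselves (1000 for executing an innocent person, and placeholder values for wrongful life imprisonment and for acquitting the guilty, which we never actually fixed in class) are illustrative assumptions.

```python
# Expected losses for the death-penalty decision tree. The numbers 1000,
# 100, and 50 are assumptions for demonstration, not values from class.
def expected_losses(p, L_innocent_death=1000, L_innocent_life=100, L_acquit_guilty=50):
    """Expected loss of each action when the probability of guilt is p."""
    return {
        "convict, death": (1 - p) * L_innocent_death,   # 0 loss for guilty + death, as discussed
        "convict, life":  (1 - p) * L_innocent_life,    # 0 loss for guilty + life
        "acquit":         p * L_acquit_guilty,          # 0 loss for acquitting the innocent
    }

for p in (0.5, 0.9, 0.99, 0.999):
    losses = expected_losses(p)
    print(p, min(losses, key=losses.get))   # the death branch is never the minimum
```

As the text says: as long as the loss for (innocent, death) exceeds the loss for (innocent, life), the death branch is dominated for every p < 1.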
Class, 11/12
We discussed the Wuppertal, Germany case described in Calculated Risks, pp. 156-158. Our probability tree started with a 1 in 100,000 chance that the person arrested was guilty, since there are about that many people in the area, and with no data the prior has to depend only on general information like this. So the prior probability of innocence is, for all practical purposes, equal to 1.
We then decided that there was an 8 in 10 chance that there would be blood on the shoes, given guilt, but only a 1 in 1000 chance, given innocence.
We further used the 2.7% match probability, given innocence, that was mentioned in the book, and a match probability of 1, given guilt.
All of this data still did not make the probability up to the 99% probability of guilt that we discussed in earlier classes.
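The sequence of updates can be sketched in Python; the function below is just Bayes' theorem applied one piece of evidence at a time, using the numbers above.

```python
# Bayes' theorem applied sequentially, with the Wuppertal numbers.
def update(prior, p_evidence_if_guilty, p_evidence_if_innocent):
    """Posterior probability of guilt after seeing one piece of evidence."""
    num = prior * p_evidence_if_guilty
    return num / (num + (1 - prior) * p_evidence_if_innocent)

p = 1 / 100_000              # prior: one person out of roughly 100,000
p = update(p, 0.8, 0.001)    # blood on the shoes
p = update(p, 1.0, 0.027)    # the 2.7% match probability
print(round(p, 2))           # still far short of the 99% standard
```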
I then pointed out that the suspect turns out to have had an ironclad alibi. He was 100 km away at the time of the murders.
We started discussing the O. J. Simpson case. We'll look at it in more detail next time.
Wednesday, November 12, 2008
Tracking the flu
Here's a fascinating article about how Google is able to accurately determine the number of flu cases at a given time based on search requests for things like "flu-like symptoms" that are fed into its search engine. People are using the web in truly ingenious ways! (There was a report a few months ago that a scientist was able to show, using Google Earth images that could show the direction that cows faced (but not where the front end of the cow was), that cows tend to line up along a north-south direction, which may indicate that cows, like many animals, have a direction-sensing magnetic organ.)
Tuesday, November 11, 2008
Class, 11/10
We went over the exam.
Problem 1 is similar to the second half of the drug-testing problem on the study sheet. In the first half you would have to compute the posterior probability of the cure rate for each drug, but here I just told you the posterior probabilities of Tom's and Joe's predict rates. The second half is to arrange Tom's and Joe's true predict rates and posterior probabilities along the side and top of a square table, multiply pairwise to get the probability that Tom's true rate is, say, 0.3 and Joe's 0.4, and then add up all of the entries above the "staircase" that separates equal predict rates from those where Joe is better than Tom. For "equal," we add up the probabilities along the diagonal. This is not an inference problem; you do not need to list states of nature, priors, likelihoods, joints, and posterior probabilities. Some people didn't add up the correct boxes, and I couldn't tell what principle they had used.
The second problem was also not an inference problem, so no priors and no likelihoods. It is instead a prediction problem. Here's how you can tell the difference. In an inference problem, there are unknown states of nature that cannot be directly observed; you are trying to compute the probability of each of these unknown states of nature. Here, you are doing something different. You are trying to predict data that will be known for sure in the future (that is, if the shuttle is destroyed, everyone will know it; if it is not destroyed after ten flights, everyone will know that too. That is data, not a state of nature.) One way to answer the question is to say that if a disaster takes place, it will take place in the first flight or in the second flight or in the third flight...or in the tenth flight, so the probability of one of these happening is the sum of the individual probabilities. So, the answer would be 10/80. This isn't quite right, although I gave 18 points credit for this answer. It is approximately correct, though. The correct answer is to compute the probability of (no disaster in flight 1 and no disaster in flight 2 and...and no disaster in flight 10) to get the probability that no disaster will happen, then subtract that from 1 to get the probability of disaster. The result is 1-(79/80)^10 = 0.118, which is pretty close to the approximate answer of 0.125. We can tell that the approximate answer is not really right (although it's a good approximation when both the individual probability and the number of flights are small), because if there had been 100 flights, it would give a probability of disaster of 100/80, which is greater than 1 and impossible because probabilities have to be between 0 and 1.
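The comparison is easy to verify:

```python
# The exact probability of at least one disaster in ten flights, at 1/80
# per flight, versus the additive approximation 10/80.
per_flight = 1 / 80
exact = 1 - (1 - per_flight) ** 10
approx = 10 * per_flight
print(round(exact, 3), approx)       # the exact answer, about 0.118, vs. 0.125
print(100 * per_flight)              # the approximation fails for 100 flights: above 1
```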
I drew a fairly elaborate decision tree for the second part of this problem. I did it in terms of losses, and assigned 0 loss if the Hubble was fixed and the ISS was serviced. Since the ISS can be serviced in any case by Russian rockets, there's no loss from that regardless of which branch we choose. With the (approximate) probability of catastrophe of 1/8 if all ten flights are made, and assuming that if a catastrophe happens there is a 50% chance that it affects the Hubble mission, I got a loss of (1/8)(C + H/2) for that branch, where H is the loss if the Hubble isn't serviced and C is the loss if a catastrophe takes place. For the "Hubble only" branch, the loss is (1/80)(C + H). That is always less than the loss on the ten-missions branch, so we cut that one off. For the "Don't fly" branch, the loss is H, since the Hubble isn't serviced. The losses on the "Don't fly" and "Service Hubble only" branches are equal when H = (1/80)(C + H), that is, when C = 79H. This is as far as we can go to help the NASA Administrator. Whether to fly or not depends on which branch has the larger loss: if C is greater than 79H in the Administrator's mind, then the mission should not be flown.
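A quick check of those branch losses, using illustrative units where H = 1:

```python
# The three branch losses from the decision tree, as functions of C (loss
# from a catastrophe) and H (loss from not servicing the Hubble).
def branch_losses(C, H):
    return {
        "ten missions": (1 / 8) * (C + H / 2),
        "Hubble only":  (1 / 80) * (C + H),
        "don't fly":    H,
    }

# The break-even point found above: "Hubble only" and "don't fly" tie at C = 79H.
print(branch_losses(C=79, H=1))
```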
I don't envy Michael Griffin.
Problem 3 is about drugs, but it is not similar to the drug problem on the study sheet in that I told you that the cure rate of one of the drugs is exactly 0.2 (based on a large amount of data). The treatment of the experimental drug is like the one on the study sheet, in that you need to set up states of nature for the cure rate (0.05, 0.15, 0.25, ..., 0.95, for example), put a prior on each (flat, for example), compute the likelihood for each (r^15(1-r)^35 for each cure rate r), compute the joints, compute the marginal likelihood, and divide the joints by the marginal likelihood to get the posterior. Then to get the probability that the new drug is better than the old one, just add up the posteriors for cure rates bigger than 0.2.
Problem 4 is like the "diagnose" stage of a spam filter or a medical diagnosis system. The first half would be gathering information, for example on emails, and determining the probability of obtaining a given word given that the message is spam or not spam. The simple way is just by frequency in each of the two categories. The second, or "diagnose," half looks for the words and forms a likelihood by raising the probability for each word to a power equal to the number of times the word appears, then multiplying them all together. The biggest mistake here was not raising to the power, or adding the terms in the likelihood instead of multiplying them.
Never, ever, add the terms in the likelihood! It's the probability of (data A and data B and data C), so the probabilities must be multiplied.
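A minimal sketch of the "diagnose" step, with invented word probabilities:

```python
# The "diagnose" step: raise each word's probability to the power of its
# count and multiply. The word probabilities below are invented.
def likelihood(word_probs, counts):
    """P(these word counts | category): a product, never a sum."""
    result = 1.0
    for word, count in counts.items():
        result *= word_probs[word] ** count
    return result

p_word_given_spam = {"winner": 0.05, "free": 0.10}
p_word_given_ok   = {"winner": 0.001, "free": 0.01}
counts = {"winner": 2, "free": 1}            # "winner" appears twice, "free" once

L_spam = likelihood(p_word_given_spam, counts)
L_ok   = likelihood(p_word_given_ok, counts)
print(L_spam > L_ok)                         # the spam likelihood is far larger here
```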
The final problem constructs a probability tree (not a decision tree). At the base of the tree, as usual, are the unknown (to the investigator) states of nature: HH, HT, TT. Some people wrote those correctly but didn't realize that HT is twice as likely as either HH or TT. Some others put the data at the base of the tree, which is never to be done. So, if a person answers "yes," it may be because he tossed HH, or it may be because he tossed HT and is telling the truth. Since HH gets tossed 25% of the time, a total of 15% = 40% - 25% of the responses are truthful "yes" answers from HT tossers. Since HT happens only half the time, it must be that a total of 2*15% = 30% of the subjects would have answered "yes" if they were all telling the truth.
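The arithmetic from the problem, spelled out:

```python
# The coin-toss arithmetic from the final problem: HH (probability 25%)
# forces a "yes," TT forces a "no," and HT (probability 50%) means the
# subject answers truthfully.
observed_yes = 0.40                      # fraction answering "yes"
forced_yes = 0.25                        # the HH tossers
truthful_fraction = 0.50                 # the HT tossers

truthful_yes = (observed_yes - forced_yes) / truthful_fraction
print(round(truthful_yes, 2))            # 0.3: 30% would have said "yes" truthfully
```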
To summarize the difficulties that people had:
1) States of nature, which are not observable, always go at the root of a probability tree. Data never goes there. The sub-branches of a probability tree are always conditional probabilities of data given the state of nature.
2) Distinguish between prediction of data that will be observed in the future, and inference about states of nature (unobservable).
3) A decision box belongs at the root of a decision tree, and the branches coming out of it are the actions being contemplated (e.g., don't fly, fly just Hubble, fly ten missions). Then there will be probability branches attached to each branch (sometimes only one, as in "don't fly"). Losses go at the tips of the probability branches and are then propagated backwards down the tree. Choose the branch with the smallest loss. You can use utilities instead of losses, in which case you choose the branch with the greatest utility.
I finished my discussion by pointing out that the tables that we have been calculating for rates r=0.05, 0.15, 0.25,..., 0.95 are approximations to continuous functions. Then the summation of them is like a Riemann sum, and when you make the divisions finer and finer you are actually getting better and better approximations to an integral. Since integration is in general hard, statisticians often resort to various approximation techniques to get the answers they need.
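Here's a small Python illustration of that point, using the likelihood from Problem 3 with a flat prior: the probability that the cure rate exceeds 0.2 barely changes once the grid is reasonably fine.

```python
# Refining the grid: the posterior probability that the cure rate exceeds
# 0.2, computed with 10, 100, and 1000 grid points. The answers settle
# down as the Riemann sums approach the underlying integral.
def p_rate_above(threshold, n, N, points):
    ps = [(i + 0.5) / points for i in range(points)]
    joint = [r ** n * (1 - r) ** (N - n) for r in ps]   # flat prior times likelihood
    total = sum(joint)
    return sum(j for r, j in zip(ps, joint) if r > threshold) / total

for points in (10, 100, 1000):
    print(points, round(p_rate_above(0.2, 15, 50, points), 4))
```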
We'll go back to crime on Wednesday. We need to decide on dates for the presentations, so I hope everyone will be in class on Wednesday.
There will be a guest talk by a research physician at UVM on Friday, November 21. We'll also discuss investing and Bayesian jokes/songs in the next week.
Problem 1 is similar to the second half drug-testing problem on the study sheet; In the first half you would have to compute the posterior probability of the cure rate for each drug, but here I just told you what the posterior probability of the predict rate of Tom and Joe was. The second half is to arrange Tom and Joe's true predict rates and posterior probabilities along the sides and top of a square table, multiply pairwise to get the probability that Tom's true rate is, say 0.3 and Joe's 0.4, and then add up all of the rates above the "staircase" that separates equal predict rates from those where Joe is better than Tom. For equal, we add up the probabilities along the diagonal. This is not an inference problem, you do not need to list SON, priors, likelihoods, joint, and posterior probabilities. Some people didn't add up the correct boxes, but I didn't understand the principle that they used.
The second problem was also not an inference problem, so no priors and no likelihoods. It is instead a prediction problem. Here's how you can tell the difference. In an inference problem, there are unknown states of nature that can not be directly observed; you are trying to compute the probability of each of these unknown states of nature. Here, you are doing something different. You are trying to predict data that will be known for sure in the future (that is, if the shuttle is destroyed, everyone will know it; if it is not destroyed after ten flights, everyone will know that too. That is data, not a state of nature.) One way to answer the question is to say that if a disaster takes place, it will take place in the first flight or in the second flight or in the third flight...or in the tenth flight, so the probability of one of these happening is the sum of the individual probabilities. So, the answer would be 10/80. This isn't quite right, although I gave 18 points credit for this answer. It is approximately correct, though. The correct answer is to compute the probability of (no disaster in flight 1 and no disaster in flight 2 and...and no disaster in flight 10) to get the probability that no disaster will happen. Then subtract that from 1 to get the probability of disaster. The result is 1-(79/80)10=0.118, which is pretty close to the approximate answer of 0.125. We can know that the approximate answer is not really right (although it's a good approximation of both the individual probability and the number of flights is small), because if there had been 100 flights, it gives a probability of disaster of 100/80, which is greater than 1 and impossible because probabilities have to be between 0 and 1.
I drew a fairly elaborate decision tree for the second part of this problem. I did it in terms of losses, and assigned 0 loss if the Hubble was fixed and the ISS was serviced. Since the ISS can be serviced in any case by Russian rockets, there's no loss from that regardless of which branch we chose. With the (approximate) probability of catastrophe of 1/8 if all ten flights are made, and assuming that the catastrophe happens so that there is a 50% chance that it affects the Hubble mission, I got a loss of 1/8(C+H/2) for that branch, where H is the loss if the Hubble isn't serviced, and C is the loss if a catastrophe takes place. For the "Hubble only" branch, the loss is 1/80(C+H). That is always less than the ten missions branch, so we cut that one off. For the "Don't fly" branch, the loss is H since the Hubble isn't serviced. The loss is the same on the "Don't fly" branch as the "Service Hubble only" branch if H=1/80(C+H) or 79H=C. This is as far as we can go to help the NASA Administrator. Whether to fly or not depends upon which has the larger loss. If C is greater than 79H in the Administrator's mind, then the mission should not be flown.
I don't envy Michael Griffin.
Problem 3 is about drugs, but it is not similar to the drug problem on the study sheet in that I told you that the cure rate of one of the drugs is exactly 0.2 (based on a large amount of data). The discussion of the experimental drug is like the one on the study sheet, in that you need to set up states of nature for the cure rate (0.05, 0.15, 0.25,..., 0.95 for example), put a prior on each (flat for example), compute the likelihood for each (r15(1-r)35 for each cure rate r), compute the joints, compute the marginal likelihood, divide the joints by the marginal likelihood to get the posterior. Then to get the probability that the new drug is better than the old one, just add up the posteriors for cure rates bigger than 0.2.
Problem 4 is like the "diagnose" stage of a spam filter or a medical diagnosis system. The first half would be gathering information, for example on emails, and determining the probability of obtaining a given word given that its spam/not spam. The simple way would be just by frequency in each of the two categories. The second, or "diagnose" half looks for the words, and forms a likelihood by raising the probability for each word to a power equal to the number of times the word appears, and multiplying them all together. Note that the biggest mistake here was not raising to the power, or adding the terms in the likelihood instead of multiplying them.
Never, ever, add the terms in the likelihood! It's the probability of (data A and data B and data C), so the probabilities must be multiplied.
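Here is a tiny sketch of the multiply-never-add rule; the word probabilities below are invented purely for illustration:

```python
# Likelihood of an email's words: multiply each word's probability raised to
# the number of times the word appears. The probabilities here are made up.
p_word_given_spam = {"free": 0.05, "meeting": 0.001}
p_word_given_ham = {"free": 0.002, "meeting": 0.03}
counts = {"free": 3, "meeting": 1}                 # word counts in the email

def likelihood(word_probs, counts):
    L = 1.0
    for word, n in counts.items():
        L *= word_probs[word] ** n                 # MULTIPLY, never add
    return L

L_spam = likelihood(p_word_given_spam, counts)     # 0.05^3 * 0.001
L_ham = likelihood(p_word_given_ham, counts)       # 0.002^3 * 0.03
```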
The final problem constructs a probability tree (not a decision tree). At the base of the tree, as usual, are the unknown (to the investigator) states of nature: HH, HT, TT. Some people wrote those correctly but didn't realize that HT is twice as likely as either HH or TT. Some others put the data at the base of the tree, which is never to be done. So, if a person answers "yes", it may be because he tossed HH, or it may be because he tossed HT and is telling the truth. In the data, 40% answered "yes". Since HH gets tossed 25% of the time, a total of 15% = 40% - 25% of the responses are due to HT being tossed. Since HT happens half the time, it must be that a total of 2*15% = 30% of the subjects would have answered "yes" if they were all telling the truth.
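The arithmetic in the last step can be checked in two lines (assuming, as above, that HH forces a "yes" and HT elicits a truthful answer):

```python
# 40% observed "yes"; HH (prob 1/4) forces a "yes"; HT (prob 1/2) is truthful.
def true_yes_rate(observed_yes, p_forced_yes=0.25, p_truthful=0.50):
    excess = observed_yes - p_forced_yes   # "yes" answers from truthful HT tosses
    return excess / p_truthful             # scale up: HT is only half the subjects

rate = true_yes_rate(0.40)                 # 2 * 15% = 30%
```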
To summarize the difficulties that people had: 1) States of nature, which are not observable, always go at the root of a probability tree. Data never goes there. The sub-branches of a probability tree will always be conditional probabilities of data given the state of nature. 2) Distinguish between prediction of data that will be observed in the future, and inference of states of nature (which are unobservable). 3) A decision box belongs at the root of a decision tree, and the branches coming out of it are the actions being contemplated (e.g., don't fly, fly just Hubble, fly ten missions). Then there will be probability branches attached to each branch (sometimes only one, as in "don't fly"). Losses go at the tips of the probability branches, and are then propagated backwards down the tree. Choose the branch with the smallest loss. You can use utilities instead of losses, in which case you choose the branch with the greatest utility.
I finished my discussion by pointing out that the tables that we have been calculating for rates r=0.05, 0.15, 0.25,..., 0.95 are approximations to continuous functions. Then the summation of them is like a Riemann sum, and when you make the divisions finer and finer you are actually getting better and better approximations to an integral. Since integration is in general hard, statisticians often resort to various approximation techniques to get the answers they need.
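As a sketch of the Riemann-sum idea: refining the grid for a likelihood such as r^4(1-r)^7 approaches the exact integral, which in this case is the Beta function value 4!·7!/12!.

```python
# Midpoint Riemann sums for the integral of r^4 (1-r)^7 on [0, 1]; the exact
# value is B(5, 8) = 4! * 7! / 12!.
import math

def riemann_sum(n):
    width = 1.0 / n
    midpoints = [(i + 0.5) * width for i in range(n)]
    return width * sum(r**4 * (1 - r)**7 for r in midpoints)

exact = math.factorial(4) * math.factorial(7) / math.factorial(12)
coarse = riemann_sum(10)       # like our 10-row tables
fine = riemann_sum(1000)       # much closer to the true integral
```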
We'll go back to crime on Wednesday. We need to decide on dates for the presentations, so I hope everyone will be in class on Wednesday.
There will be a guest talk by a research physician at UVM on Friday, November 21. We'll also discuss investing and Bayesian jokes/songs in the next week.
Monday, November 10, 2008
Nate Silver
The New York Times had an interesting article about Nate Silver, the proprietor of the very successful fivethirtyeight.com website that I pointed you to earlier about making predictions of the election outcome. As anyone who was following this website knows, he was quite close to the actual results. Mr. Silver started out predicting baseball outcomes, also with great success.
Saturday, November 8, 2008
Occam's Razor
Here is an article that Jim Berger and I wrote on how Occam's Razor naturally falls out of Bayesian reasoning. I promised to post this after the test.
Wednesday, November 5, 2008
Class, 11/5
We continued our discussion of the General's Dilemma problem. Here, the situation is that there is a desperate battle to be won, and the results will depend on how many soldiers the general can get to the battle. The Objective (in the PrOACT agenda) is to win the battle. If 600 soldiers get through, then there is a 90% probability that the battle will be won. If 200 make it through, then the probability is 50%, but if none make it through, then the probability is only 20%.
We assigned a utility of 100 to winning, and 0 to losing. We decided that the expected utility would be maximized if the "200 for sure" choice was made. The decision tree had at the bottom the decision between routes; then, on the "200 for sure" branch, a probability choice of 0 if lost (50% prob) and 100 if won (50% prob). On the other branch we had a 2/3 chance of every soldier being lost versus a 1/3 chance that all would get through. Then (given that fact) we had to put another probability branch with the appropriate probability of winning the battle (20% if none get through, 90% if all get through).
When we ran the utilities through the tree backwards to the left, we found that the general should go with the "200 for sure" route.
We also experimented with the idea that the utilities might be different for the general if he valued lives saved more than lives lost (that is, if the premises of the problem were not just "win the battle" but also "save our soldiers' lives.") That would be done by decreasing the utilities when more soldiers didn't make it through. But fiddling with the numbers didn't seem to make a difference as to the decision. This doesn't mean that some choice of utilities wouldn't change the result, just that it isn't obvious how to do this.
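Here is a sketch of the expected-utility calculation for the tree, using the win probabilities from the problem (20% if none get through, 50% for 200, 90% for 600), with the utilities as parameters so you can fiddle with them:

```python
# Expected utilities for the two routes in the General's Dilemma.
def expected_utilities(u_win=100.0, u_lose=0.0):
    sure = 0.5 * u_win + 0.5 * u_lose                  # 200 through -> 50% win
    risky = (2 / 3) * (0.2 * u_win + 0.8 * u_lose) \
          + (1 / 3) * (0.9 * u_win + 0.1 * u_lose)     # none vs. all 600 through
    return sure, risky

sure, risky = expected_utilities()   # 50.0 vs. about 43.3: take "200 for sure"
```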
I then opened the discussion to questions.
We discussed priors, especially when the priors depend on the state of nature. In particular, is it necessary to "normalize" the prior (make it add up to 1)? The answer is that in most cases this is not necessary. The reason is that (in our spreadsheet scheme) if you multiply the prior by a constant number then the joint will also be multiplied by the same number, and the sum of the joints (the marginal likelihood) will also be multiplied by the same number, so when you divide the joints by the marginal likelihood, they will have the same constant multiplier, which will cancel out.
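A quick sketch of why the constant cancels:

```python
# Multiplying the prior by any constant cancels when the joints are divided
# by the marginal likelihood, so the posterior is unchanged.
def posterior(prior, likelihood):
    joint = [p * L for p, L in zip(prior, likelihood)]
    marginal = sum(joint)
    return [j / marginal for j in joint]

like = [0.1, 0.3, 0.6]
normalized = posterior([0.2, 0.3, 0.5], like)
scaled = posterior([2.0, 3.0, 5.0], like)   # same prior, times 10
# normalized and scaled agree term by term
```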
The exception is when you are testing a precise hypothesis (a coin is fair prob=1/2, a special die is fair, prob=1/3) against a vague alternative (a coin is not fair, prob has to have its own prior, etc.) In that case, it's important to normalize the priors and careful attention is required to do it right.
We briefly discussed the medical example, when we have tests on an old and new drug. The whole idea here is to compute the probability that the new drug is better than the old one. We've been using an approximation that sets the cure rate of a drug to particular values, 0.05, 0.15, 0.25, ..., 0.95. We know that if we make the division finer, we'll get a better approximation, but here we're trying to learn the principles. So, if we test the old drug (A) and get a certain set of posterior probabilities on the cure rate of the drug, and test the new drug (B) and get a different set of posterior probabilities, then we can set up a table of the joint posterior probabilities of the cure rates of each drug by simply multiplying the probs of the cure rates of each drug. We can do this because the two experiments were done on different and randomly selected individual patients, so the probs are independent. Then, it's just a matter of identifying which slots belong to (cure rate of B is greater than cure rate of A), adding them up (we add when we have (cure rate of A=0.05 AND cure rate of B=0.15) OR (cure rate of A=0.05 AND cure rate of B=0.25) OR ....) This gives us the posterior probability that drug B is better than drug A.
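The joint table can be sketched as follows; the two posterior vectors for drugs A and B below are invented for illustration:

```python
# Multiply the independent posteriors into a joint table, then add the cells
# where B's cure rate exceeds A's. The posterior vectors here are made up.
rates = [0.05 + 0.10 * i for i in range(10)]
post_A = [0.0, 0.1, 0.4, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
post_B = [0.0, 0.0, 0.1, 0.3, 0.4, 0.2, 0.0, 0.0, 0.0, 0.0]

p_B_better = sum(pa * pb
                 for ra, pa in zip(rates, post_A)
                 for rb, pb in zip(rates, post_B)
                 if rb > ra)
```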
I think this is all we discussed, but if I have left something out important and you want me to address it before the test, post now and I'll respond.
Tuesday, November 4, 2008
The Election and Bayes
Jim Albert, a statistics professor at Bowling Green State University, has posted a short article on his statistics blog showing how a simple Bayesian approach can be used to predict the electoral vote outcome in today's election. You should be able to understand his idea quite clearly, as it's not significantly different from what we have been doing. The main difference is that he takes into account people who will vote for a third-party candidate, so there is a third term in the likelihood: the probability of voting third party, raised to the power of the number of people in the poll who said they would vote for a third-party candidate. The other idea we've already discussed briefly, but here it is again: after computing the posterior probability (as a formula, not as a table as we have been doing it), he uses a computer to draw a sample of 5000 values of the three proportions (McCain, Obama, Other), each sample representing the proportion of people voting for each. For each state, he then counts the fraction of samples in which Obama beats McCain; that gives a win probability for each state, and he's listed them in the blog.
The second part is even simpler. He uses the computer to flip a biased coin with the appropriate probability (from his table) for each state, assigns the electoral votes from that state to the winner, and adds them up over all states. This gives him a prediction for the electoral vote for Obama and McCain. He does this 5000 times to get 5000 predictions of the electoral vote, and plots them in the graph at the bottom of the page. His prediction is that Obama will get at least 300 electoral votes, and my eye indicates that the actual outcome is likely to be between 340 and 380, give or take.
This is pretty close to the method being used at fivethirtyeight.com.
This is a method called posterior simulation. What he's doing is drawing a large sample from the posterior distribution, and using that as a proxy for the actual posterior distribution. It is a method that is widely used by professional Bayesian statisticians. We can discuss it in more detail after the test, if you like.
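The electoral-vote half of the simulation can be sketched with a tiny made-up table of three states; the real calculation uses all the states and Albert's per-state win probabilities:

```python
import random

# (Obama win probability, electoral votes) per state; these values are made up.
states = {"A": (0.9, 20), "B": (0.5, 10), "C": (0.2, 15)}

def simulate_electoral_votes(states, n_sims=5000, seed=0):
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        # Flip a biased coin for each state and add up the winner's votes.
        ev = sum(votes for p, votes in states.values() if rng.random() < p)
        totals.append(ev)
    return totals

totals = simulate_electoral_votes(states)
mean_ev = sum(totals) / len(totals)   # near 0.9*20 + 0.5*10 + 0.2*15 = 26
```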
Jim Albert, by the way, is the author of the textbook we will be using next semester in our statistics course on Bayesian statistics.
Monday, November 3, 2008
Class, 11/3
We spent some time discussing the catch-and-release problem and similar ones. I made several points:
1) Since the likelihood is the probability of data point 1 AND data point 2 AND ...., you must always MULTIPLY the probabilities of the individual data points to get the likelihood of the entire data set. NEVER add them!
2) In this problem, as we draw samples (fish), the total number of fish in the lake and the total number of fish of each type (tagged, untagged) decrease by 1 each time a fish is caught. This means we are sampling without replacement. It also means that each time we catch a tagged fish, the next time we catch a tagged fish we have to use a numerator and a denominator that are decreased by 1. So, for example, with 100 fish in the lake, 10 of them tagged, the first tagged fish we catch has a probability of 10/100, the second a probability of 9/99, the third 8/98, etc. After catching the five tagged fish, then there are 95 fish left, 90 of them untagged. So the first untagged fish we catch has probability 90/95, the second 89/94, and so forth.
3) It doesn't matter what order the fish are caught, the likelihood will be the same. So, you might as well treat all of the first kind first and then handle those of the remaining kind.
4) After computing the rest of the table to get the posterior, you can add up the posterior probabilities for intervals in the number of fish, e.g., to get the probability that the total number of fish is between 15 and 25 (inclusive), just add the posterior probabilities for each of those numbers of fish.
5) If the number of items (fish, voters) is very large, you can approximate the ratios by the same number, just pretend that the number of fish in the lake and the number of each kind don't change as more fish are caught. The error committed will be quite small in this case. What you are doing is approximating sampling without replacement by sampling with replacement.
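Points 1) through 3) can be sketched as a small likelihood function for the catch data (5 tagged then 5 untagged fish, 10 tagged fish in a lake of N):

```python
# Without-replacement likelihood: each catch shrinks both the count of that
# kind of fish and the total number in the lake by 1.
def catch_likelihood(N, tagged=10, caught_tagged=5, caught_untagged=5):
    L = 1.0
    total, tag = N, tagged
    for _ in range(caught_tagged):
        L *= tag / total               # e.g. 10/15, then 9/14, ...
        tag -= 1
        total -= 1
    untagged = total - tag             # untagged fish still in the lake
    for _ in range(caught_untagged):
        L *= untagged / total
        untagged -= 1
        total -= 1
    return L

L15 = catch_likelihood(15)             # order of the catches doesn't matter
```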
We talked about the astrology problem. The states of nature are the numbers p = 0.05, 0.15, ..., 0.95, the way we are setting up the problem. The prior could be uniform, but if your experience is that astrology is probably bunk, you might want to skew the prior toward smaller numbers; or if your experience is that it works, you might want to skew it toward larger numbers. This is not cheating; it is using information in your past experience.
The likelihood is p^4(1-p)^7 for each of the values of p. The rest of the table is filled out as usual. Then, the answer to the question (the probability that the astrologer is able to predict the future at least 85% of the time) is the sum of the two posterior probabilities for the values p=0.85 and p=0.95.
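The whole astrology table fits in a few lines (uniform prior; 4 hits and 7 misses, as implied by the likelihood):

```python
# Posterior for the astrologer's success rate p, then the probability that
# p is at least 0.85 (the last two rows of the table).
ps = [0.05 + 0.10 * i for i in range(10)]
prior = [0.1] * len(ps)
like = [p**4 * (1 - p)**7 for p in ps]        # likelihood p^4 (1-p)^7
joint = [pr * L for pr, L in zip(prior, like)]
post = [j / sum(joint) for j in joint]
p_at_least_085 = post[-2] + post[-1]           # p = 0.85 and p = 0.95
```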
We discussed the expert systems problem. The basic idea is that you can train a Bayesian system by, for example, telling it the symptoms observed and the diagnosis for a number of patients. This allows the system to estimate the conditional probabilities
p(symptom|diagnosis)
for a lot of symptoms and diagnoses. These can then be used as terms in the likelihood for a new patient who comes in and whose symptoms are determined and put into the system. What we've described in class is known as a naive Bayes classifier. It's also the basis of Bayesian spam filters and many other useful practical applications of Bayesian methods (including artificial intelligence systems).
We started discussing basic decision problems by drawing the tree for the "general with two routes" problem we had discussed earlier; this time we postulated that the general might be risk averse, in which case he would choose the 200 soldiers for certain rather than the gamble between no soldiers surviving with probability 2/3 and all of them surviving with probability 1/3. Alternatively, a risk seeking general would choose the gamble. We'll pick up on this on Wednesday.
Saturday, November 1, 2008
Class, 10/31
We finished discussing the drug testing problem. From the data we have posterior probabilities for various cure rates for each drug. By multiplying them, we obtain the posterior probability that, for example, drug A has cure rate 0.25 and drug B has cure rate 0.35. We can then add up all those joint probabilities for which drug B has the greater cure rate, to get the probability that drug B is better than drug A.
We then discussed the marketing problem. We identified two stages, the first of which was to spend $20 million in testing the drug to FDA standards. We recognized that the drug might not pass this test; most experimental drugs do not. There is, from our preliminary test, a 75% chance that the drug will pass the test. We discussed "sunk costs," that is, costs that cannot be recovered. Even though getting to the point where we are (with 100 subjects tested) did cost some money, we can never get it back, so we may as well call our loss or utility exactly zero at this point.
You can calculate using either losses or utilities. It's purely a matter of convenience. Since the drug company is very wealthy, its utility or loss function will be linear or nearly so.
We illustrated the process by imagining that if the company decides to continue the development of the drug, it will pass through a toll gate worth $20M. Then there will be a 75% probability that we will go on to develop the drug to marketing stage, which will cost an additional $80M and require another toll gate. We guessed that the drug had a 20% chance of commercial success (revenues of 20 years at $1B per year) and an 80% chance of "failure" (20 years at $10M/year, if I recall). These figures may not be what I wrote on the board, but you should have the correct numbers and the tree in your notes. We decided that the company should go ahead with the plan, given the figures we used.
We then discussed the fish problem. The states of nature are the numbers from 1 to 100. The prior we took to be equal (0.01 on each SON). One student asked, since we know there have to be at least 15 fish in the lake, why not set those to zero, but another student pointed out that that requires looking at the data, and the prior is supposed to reflect what you know before you look at the data, so to do this would be "cheating." We had a false start on the likelihood, which was my fault as I should have steered us to the correct solution more quickly. But several students were puzzled and we started over. Since there are N fish in the lake, and 10 of them are tagged, the probability of picking 5 tagged and 5 untagged is as follows:
For N=15, it's (10*9*8*7*6)*(5*4*3*2*1)/((15*14*13*12*11)*(10*9*8*7*6)). We get this because the probability of picking one tagged fish is 10/15, the probability of picking the second tagged fish is 9/14, and so on through the 5 tagged fish; then for the untagged fish it is 5/10 for the first one, 4/9 for the second one, and so on through the 5 untagged fish. As each fish is caught, the number of fish of that kind (tagged or untagged) decreases by 1, as does the total number of fish in the lake.
Similarly, if N=20, it's (10*9*8*7*6)*(10*9*8*7*6)/((20*19*18*17*16)*(15*14*13*12*11)), with the same kind of reasoning.
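These two products can be checked quickly (a sketch; the helper just multiplies a sequence of fractions):

```python
# Multiply a sequence of catch probabilities (numerator/denominator pairs).
def product_of_fractions(numerators, denominators):
    L = 1.0
    for n, d in zip(numerators, denominators):
        L *= n / d
    return L

# N = 15: tagged 10/15 ... 6/11, then untagged 5/10 ... 1/6
L15 = product_of_fractions([10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
                           [15, 14, 13, 12, 11, 10, 9, 8, 7, 6])
# N = 20: tagged 10/20 ... 6/16, then untagged 10/15 ... 6/11
L20 = product_of_fractions([10, 9, 8, 7, 6, 10, 9, 8, 7, 6],
                           [20, 19, 18, 17, 16, 15, 14, 13, 12, 11])
```

Note that L20 comes out larger than L15, which makes sense: with 5 of 10 caught fish tagged, N = 20 is the most plausible lake size.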
We'll discuss this more on Monday and then go on to the remaining problems on the study sheet.
Friday, October 31, 2008
Bayes and the election
Andrew Gelman has posted an interesting article on the use of Bayesian methods to predict the election outcome here. It discusses the website fivethirtyeight.com, maintained by Nate Silver.
Wednesday, October 29, 2008
Class, 10/29
We started by discussing the homework. I emphasized that there will never be '+' signs separating the probabilities of the individual events in the likelihood. You will always get the likelihood by multiplying the probabilities of the individual events together (whatever they are). It was evident from several of the homeworks that the likelihood had been calculated incorrectly. In one case, enough Excel code had been given to me to know that a '+' sign had been used instead of a '*' sign. I do not know what happened in the other cases. In any case, any group whose total score was less than 36/40 may resubmit on Friday for partial additional credit.
The other problems were minor.
I did note that there is another, and maybe better way (other than in the problem statement) to get the answer to the first problem. That is simply to put prior probabilities on the states of nature, with half on the "null hypothesis" that the die is unbiased, and distributing the remainder among the alternatives. Then do the usual thing: prior*likelihood = joint, sum joint to get marginal likelihood under all hypotheses, divide that into the joint to get the posterior. And then, just look at the posterior probability p of the "null hypothesis". The odds on the null hypothesis are then p/(1-p). At least one group actually used this method, which made me proud!
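This method can be sketched with invented likelihood values (the real ones would come from the die data):

```python
# Half the prior on the "fair die" null, the rest split evenly among the
# alternative bias hypotheses; posterior odds on the null are p/(1-p).
def null_posterior_odds(like_null, like_alternatives):
    k = len(like_alternatives)
    prior = [0.5] + [0.5 / k] * k
    like = [like_null] + list(like_alternatives)
    joint = [p * L for p, L in zip(prior, like)]
    p_null = joint[0] / sum(joint)
    return p_null / (1 - p_null)

odds = null_posterior_odds(2.0e-4, [1.0e-4, 0.5e-4])   # invented likelihoods
```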
We then started on the practice problems.
Problem #1 is similar to the copyright problem we discussed in class. We expect a student to get answers right, because they are supposed to know the material. No one would suspect students, even if they got the answers right, because that's what's supposed to happen. But the mistakes (just like in the copyright problem) are the key. Mistakes should be made at random. So if one student copies another, then s/he will copy the mistakes perfectly, but if not, then with probability 1/5 (since there are five possible answers). The two states of nature are Cheat, and No Cheat. We discussed the prior and decided that on Cheat it might be 1/10. It could be larger or smaller and arguments were given for each. The likelihood is 1 for Cheat and 0.2^7 for No Cheat, since each coincidence has probability 0.2 and there are seven coincidences. We found that with our prior, the posterior probability of cheating is nearly 1, and the professor ought to take appropriate action.
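The numbers work out as follows (with the prior of 1/10 on Cheat that we chose in class):

```python
# Posterior probability of cheating: likelihood 1 under Cheat, 0.2^7 under
# No Cheat (seven coincidences, each matching by chance with probability 1/5).
prior_cheat = 0.10
like_cheat = 1.0
like_no_cheat = 0.2**7
joint_cheat = prior_cheat * like_cheat
joint_no_cheat = (1 - prior_cheat) * like_no_cheat
post_cheat = joint_cheat / (joint_cheat + joint_no_cheat)   # very nearly 1
```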
In Problem #2, the aim was not to actually calculate anything, but to explain how a calculation could be arranged. So the SON are the possible numbers of taxis, from 1 to (we decided) not more than 50,000. We discussed ways to set a prior. One would be to pick a probability, say 0.9 or 0.99, and raise it to the power of the number of taxis in the SON. A second was to use a straight-line ramp from 1 (highest) to 50,000 (lowest). A third was to use something like 1/N, where N is the number of taxis in the SON. Whichever method we use, we just write down the numbers in an Excel spreadsheet, add them up, divide each by the sum, and use the results as the normalized prior.
The likelihood is 0 if the number of taxis is less than 150 (because you can't see taxi number 37, for example, if there are only 36 taxis), and is (1/N)^7 for each SON where N is greater than 149. This is "sampling with replacement," so each taxi seen has probability 1/N of its number being observed, and the likelihood is the product of these for each observed taxi (there are 7).
Then the usual: prior*likelihood= joint, sum the joint, etc....
Then, you can decide on what probability you want for the number of taxis. If you want the probability to be greater than 0.99, just sum down the posterior until you get 0.99. The number corresponding to that last line is your best guess.
For the third problem, you have to start by keeping the two cases (standard drug and new drug) separate. For each of these, you want to compute the posterior probability that the particular drug cures the disease (r or s). This is just a standard calculation, like the one for the homework on Monday. The trick is what to do with this information to decide what the probability is that the new drug is better than the old one. We'll talk about this next time.
The other problems were minor.
I did note that there is another, and maybe better way (other than in the problem statement) to get the answer to the first problem. That is simply to put prior probabilities on the states of nature, with half on the "null hypothesis" that the die is unbiased, and distributing the remainder among the alternatives. Then do the usual thing: prior*likelihood = joint, sum joint to get marginal likelihood under all hypotheses, divide that into the joint to get the posterior. And then, just look at the posterior probability p of the "null hypothesis". The odds on the null hypothesis are then p/(1-p). At least one group actually used this method, which made me proud!
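For those who would rather check this outside of Excel, here is a sketch of that spreadsheet logic in Python. The data and the alternative biases are made up purely for illustration (the real numbers come from the problem statement): I'm pretending the die was rolled 60 times, a six came up 15 times, and we allow a few hypothetical biased values for P(six).

```python
from math import comb

# Hypothetical data for illustration (not the numbers from the problem):
n, k = 60, 15                            # 60 rolls, 15 sixes observed

# States of nature: the "null hypothesis" p = 1/6, plus illustrative biases.
states = [1/6, 0.25, 1/3, 0.45]
priors = [0.5, 0.5/3, 0.5/3, 0.5/3]      # half on the null, rest spread evenly

# The usual thing: prior * likelihood = joint, sum joint, divide.
likelihood = [comb(n, k) * p**k * (1 - p)**(n - k) for p in states]
joint = [pr * lk for pr, lk in zip(priors, likelihood)]
marginal = sum(joint)                    # marginal likelihood over all hypotheses
posterior = [j / marginal for j in joint]

p_null = posterior[0]                    # posterior probability the die is unbiased
odds_null = p_null / (1 - p_null)        # posterior odds on the null hypothesis
```

With real data you would just swap in the observed counts and whatever alternative biases you put prior probability on.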
We then started on the practice problems.
Problem #1 is similar to the copyright problem we discussed in class. We expect a student to get answers right, because they are supposed to know the material. No one would suspect students, even if they got the answers right, because that's what's supposed to happen. But the mistakes (just like in the copyright problem) are the key. Mistakes should be made at random. So if one student copies another, then s/he will copy the mistakes perfectly, but if not, each wrong answer will coincide only with probability 1/5 (since there are five possible answers). The two states of nature are Cheat and No Cheat. We discussed the prior and decided that on Cheat it might be 1/10. It could be larger or smaller and arguments were given for each. The likelihood is 1 for Cheat and 0.2^7 for No Cheat, since each coincidence has probability 0.2 and there are seven coincidences. We found that with our prior, the posterior probability of cheating is nearly 1, and the professor ought to take appropriate action.
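The whole calculation fits in a few lines; here it is as a Python sketch using the numbers from our class discussion:

```python
# States of nature: Cheat, No Cheat. Prior on Cheat from class: 1/10.
prior_cheat, prior_no = 0.10, 0.90

# Seven identical wrong answers: certain if copied, (1/5)^7 by pure chance.
lik_cheat, lik_no = 1.0, 0.2 ** 7

# prior * likelihood = joint; divide by the total to get the posterior.
joint_cheat = prior_cheat * lik_cheat
joint_no = prior_no * lik_no
post_cheat = joint_cheat / (joint_cheat + joint_no)   # very nearly 1
```

Even if you shrink the prior on Cheat well below 1/10, the seven coincidences overwhelm it.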
In Problem #2, the aim was not to actually calculate anything, but to explain how a calculation could be arranged. So the SON are the possible numbers of taxis, from 1 to (we decided) not more than 50,000. We discussed ways to set a prior. One would be to pick a probability, say 0.9 or 0.99, and raise it to the power of the number of taxis in the SON. A second was to use a straight-line ramp from 1 (highest) to 50,000 (lowest). A third was to use something like 1/N, where N is the number of taxis in the SON. Whichever method we use, we just write down the numbers in an Excel spreadsheet, add them up, divide each by the sum, and enter the results as the normalized prior.
The likelihood is 0 if the number of taxis is less than 150, the highest taxi number observed (you can't see taxi number 150 if there are only 149 taxis, just as you couldn't see taxi number 37 if there were only 36), and is (1/N)^7 for each SON where N is greater than 149. This is "sampling with replacement," so each taxi seen has probability 1/N of its number being observed, and the likelihood is the product of these over the observed taxis (there are 7).
Then the usual: prior*likelihood= joint, sum the joint, etc....
Then, you can decide on what probability you want for the number of taxis. If you want the probability to be greater than 0.99, just sum down the posterior until you get 0.99. The number corresponding to that last line is your best guess.
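If you wanted to do the taxi spreadsheet in Python rather than Excel, it might look like the sketch below. I'm assuming the 1/N prior, the 50,000 cap, and that the highest taxi number seen was 150 (which is why the likelihood is 0 below 150):

```python
MAX_N = 50_000   # our agreed upper bound on the number of taxis

# The 1/N prior, normalized so it sums to 1.
prior = {n: 1 / n for n in range(1, MAX_N + 1)}
total = sum(prior.values())
prior = {n: p / total for n, p in prior.items()}

# Likelihood: 0 below the highest observed number (150), else (1/N)^7.
likelihood = {n: (1 / n) ** 7 if n >= 150 else 0.0 for n in prior}

# The usual: prior * likelihood = joint, sum the joint, divide.
joint = {n: prior[n] * likelihood[n] for n in prior}
marginal = sum(joint.values())
posterior = {n: j / marginal for n, j in joint.items()}

# Sum down the posterior until it reaches 0.99; that N is the 99% bound.
cum, bound = 0.0, None
for n in sorted(posterior):
    cum += posterior[n]
    if cum >= 0.99:
        bound = n
        break
```

With these assumptions the posterior falls off very fast above 150, so the 99% bound lands only a couple hundred taxis above the highest number seen.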
For the third problem, you have to start by keeping the two cases (standard drug and new drug) separate. For each of these, you want to compute the posterior probability that the particular drug cures the disease (r or s). This is just a standard calculation, like the one for the homework on Monday. The trick is what to do with this information to decide what the probability is that the new drug is better than the old one. We'll talk about this next time.
Tuesday, October 28, 2008
Another interesting article
The New York Times has an interesting article today on decision-making and how bad people are at it.
Monday, October 27, 2008
Class, 10/27
Today we discussed criminal trials from the juror's point of view. We decided, after some discussion, that the worst thing would be to convict someone who was actually innocent. We know from the Innocence Project that an unacceptably high proportion of people in prison are probably innocent. We set up a decision tree with branches Convict Innocent and Acquit Innocent (the worst and best outcomes) in a probability fork, with u being the probability of CI and (1-u) the probability of AI, and the "for certain" branch of the tree being Acquit Guilty. After some discussion we decided on something like u=0.01, which would mean that 99% of people sent to prison would actually be guilty (assuming we can evaluate that probability as a juror). With a loss of 0 for AI and 1000 for CI, we found that the loss for AG would be 10 to make us indifferent between the two branches.
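The arithmetic behind that indifference point is tiny, but it's worth seeing laid out, since this is exactly how you read losses off a decision tree:

```python
# Probability fork: Convict-Innocent (loss 1000) with probability u,
# Acquit-Innocent (loss 0) with probability 1-u.
u = 0.01                        # class's estimate of P(convicted person is innocent)
loss_CI, loss_AI = 1000.0, 0.0

expected_loss_fork = u * loss_CI + (1 - u) * loss_AI   # = 10

# Indifference means the sure branch, Acquit-Guilty, carries the same loss.
loss_AG = expected_loss_fork
```

If you think u is larger, or that convicting the innocent is worse than 1000 on this scale, the implied loss for letting the guilty go free rises proportionally.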
We also discussed the case of Convict Guilty, and although some thought that AI would personally be better than CG (both correct decisions), this didn't seem to hold up when we replaced AG with CG in the decision tree we drew. CG for certain seemed better than CI with probability even as small as 0.001.
We also discussed whether the seriousness of the case and the harshness of the punishment should not also change our losses. Surely, some thought, the penalty for a traffic ticket is not as onerous a penalty as 20 years in prison for a serious crime, if the person accused were actually innocent, and the death penalty is even more unacceptable if the accused were actually innocent (even though Vermont doesn't have the death penalty, a recent Vermont jury did give the death penalty in a federal case, so it's not entirely moot even for Vermonters). So, the loss for CI ought to be larger if the penalty is more serious, some said. One student would never give the death penalty...for that student, the loss is effectively infinite.
The next several classes will be devoted to discussing the practice problems for the second test. We will pick up the juror discussion again after the test.
Sunday, October 26, 2008
Class, 10/24
We decided on November 7 as the date for the next test. A study guide can be found here, which we will be discussing next week and the following week. You'll get a paper copy on Monday.
Today we talked about juror's decisions. We didn't do any math, but we discussed the various options available to a juror, including the consequences of making the wrong decision (convict someone who is innocent, acquit someone who is guilty). We'll pick the discussion up again on Monday and try to quantify the results of our conversation on Friday.
Friday, October 24, 2008
An interesting short article
The New York Times published this interesting article on statistics, baseball and health care today.
Wednesday, October 22, 2008
Class, 10/22
OK, so today I revisited the insurance problem, but from the point of view of expected loss rather than expected utility. It's not any different, but I drew a picture on the blackboard showing that there is a range of insurance premiums between the minimum the insurance company would sell the policy for (where p=m/h; see the previous posting for definitions) and the maximum the homeowner would pay for it (where p=loss(m)/loss(h)). The homeowner hopes that between these two limits, there will be competition between insurance companies that will give him or her a good deal on insurance.
We then discussed testing various hypotheses based on data observed.
First, we discussed testing a coin which may be fair or unfair.
We decided that a reasonable prior would be P(fair)=0.5 and P(unfair)=0.5.
But then, what does P(unfair) mean? If the coin is fair, it is supposed to come up Heads with probability 0.5. But if it is unfair, what is the probability of Heads? We decided to split the remaining probability up equally amongst the possibilities, which we chose to be 0.05, 0.15, 0.25, ..., 0.85, 0.95, with each possibility having prior probability 0.05 (after some discussion that reminded us that the total probability has to be 1, and we had already expended 0.5 on the "null hypothesis" that the coin is fair).
So we set up a "spreadsheet" calculation. We discussed how to actually do it if we were doing it with Excel.
We didn't actually do the calculation, but I will tell you the result: with 60 heads and 40 tails, the probability of obtaining a result that extreme or more extreme (60 or more heads, or 40 or fewer heads) is about 0.05, but the probability that the coin is fair, given that we have observed the data (60 heads and 40 tails), is about 0.5.
This is very interesting. The standard statistical test of statistical significance, how extreme the result is, is very different from what the Bayesian result is.
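You can verify both numbers with a short Python version of the spreadsheet. The sketch below computes the classical two-sided tail probability and the Bayesian posterior on "fair" with exactly the prior grid we chose in class (the posterior comes out close to 0.5, an order of magnitude above the 0.05 tail probability):

```python
from math import comb

n, heads = 100, 60

def binom_pmf(k, p=0.5):
    """Probability of exactly k heads in n flips with heads-probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Classical two-sided tail: 60 or more heads, or 40 or fewer heads.
p_value = sum(binom_pmf(k) for k in range(heads, n + 1)) + \
          sum(binom_pmf(k) for k in range(0, n - heads + 1))

# Bayesian spreadsheet: P(fair) = 0.5 on p = 0.5; alternatives
# 0.05, 0.15, ..., 0.95 with prior probability 0.05 each.
states = [0.5] + [0.05 + 0.1 * i for i in range(10)]
priors = [0.5] + [0.05] * 10
joint = [pr * binom_pmf(heads, p) for pr, p in zip(priors, states)]
post_fair = joint[0] / sum(joint)
```

The exact posterior depends on which alternatives you put prior mass on, but the qualitative contrast with the tail probability is robust.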
We also discussed the problem of estimating the probability that an unknown proportion (here, the bias of the coin, or in the problem set, the cure rate of the new drug) is greater than some fixed value (say 0.2). The spreadsheet is the same except there is no special picking out of 0.5; this means we crossed out that line and used prior probability 0.1 for each of the alternatives 0.05, 0.15, ..., 0.95. To determine the probability that the new drug is better than the old one, we just add the posterior probabilities for the states of nature that are greater than 0.2. (Again, we didn't do the actual calculation.)
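Here is what that spreadsheet would look like in Python. The data are hypothetical, just to make the sketch runnable (say 4 cures in 10 patients); the grid and the flat prior are the ones from class:

```python
from math import comb

# Hypothetical data for illustration only: 4 cures in 10 patients.
n, cures = 10, 4

states = [0.05 + 0.1 * i for i in range(10)]   # 0.05, 0.15, ..., 0.95
priors = [0.1] * 10                            # flat prior, no lump on 0.5

lik = [comb(n, cures) * p**cures * (1 - p)**(n - cures) for p in states]
joint = [pr * lk for pr, lk in zip(priors, lik)]
marginal = sum(joint)
posterior = [j / marginal for j in joint]

# Probability the cure rate exceeds 0.2: add the mass on states above 0.2.
p_better = sum(post for p, post in zip(states, posterior) if p > 0.2)
```

With 4 cures in 10, essentially all the posterior mass sits above 0.2, so p_better is close to 1; with fewer cures it would drop accordingly.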
We finished with a challenge: How to decide, if you are on a jury, whether to convict or acquit a defendant in a criminal case. More clearly, what does "beyond a reasonable doubt" mean?
Tuesday, October 21, 2008
Class, 10/20
We talked about utilities. First we looked at the shapes of the curves that you all derived over the weekend. We learned that a straight line is neutral, a utility curve that curves up is risk-seeking and one that curves down is risk-averse.
We then discussed insurance on a house. We found that if h is the value of the house and m is the premium we pay for the insurance, and p is the probability of disaster (e.g., a fire burns the house down), then the insurance company will demand that p be less than m/h. On the other hand, the owner of the house (if her utility is neutral) will demand that p be greater than m/h, and no transaction can take place. But insurance is bought and sold, so there has to be an explanation for this. And there is an explanation, because although insurance companies have a nearly neutral utility curve except for truly huge amounts, people do not, and most people have risk-averse utility curves. They will demand that p be greater than u(-m)/u(-h), where u(-) means the value of the curve at the point in question (the quantities are negative because in both the case of the premium and the potential catastrophe, the person ends up with fewer assets). But, if you have a utility curve that curves down, that means that the ratio u(-m)/u(-h) will be less than m/h, so the person will be willing to buy the insurance after all. Therefore, the insurance company can now set a value of m that the consumer will be willing to pay and which will also give a profit to the company, thus keeping the stockholders happy.
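A small numerical sketch makes the window visible. All the numbers here are assumptions for illustration: a $200,000 house, a $1,000 premium, and a hypothetical risk-averse loss curve loss(x) = x^1.2 (losses grow faster than linearly, i.e., the utility curve curves down):

```python
h, m = 200_000.0, 1_000.0      # assumed house value and premium

company_max_p = m / h          # company sells only if p < m/h

def loss(x):
    """Hypothetical risk-averse loss curve: big losses hurt disproportionately."""
    return x ** 1.2

owner_min_p = loss(m) / loss(h)   # owner buys only if p > loss(m)/loss(h)

# Risk aversion opens a window of disaster probabilities where both
# sides are willing to transact: owner_min_p < p < company_max_p.
window = (owner_min_p, company_max_p)
```

With a neutral curve (exponent 1.0) the two thresholds coincide and the window closes, which is exactly the no-transaction case from the start of the discussion.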
I remarked that this is actually how all commerce works. There are two parties, a seller and a buyer. They are willing to make a transaction because their utility curves are different, and so it is a "win-win" situation where everyone, both the buyer and the seller, feel themselves better off (in terms of utilities) than they did before the transaction took place.
One student had remarked in class and in journals that this approach (decision theory) might not be adequate when considering lotteries, where there is a huge payoff of very low probability. Should someone wager to win the lottery, even if taxes and annuitization made it a positive payoff on expected return basis? My answer is, "Not Really." The reason is that we don't (or shouldn't) make decisions based on expected return. We should make decisions based on expected utility or expected loss. I posed the question, would you rather have $280 million with probability 1/2, or $10 million for sure. The overwhelming choice of the class was, take the $10 million. This means, that to most of the people in the class, having $280 million isn't that much better than having $10 million. This means that in the lottery problem, you probably won't want to use $280 million as the leaves on the ends of the decision tree. You probably will make as good a decision if you just put $10 million there. And if you did this, your decision would be just as rational, and would tell you that the lottery is not really a good place to invest your money (unless your only reward is the thrill of entering the lottery!) Final comment is that the student who raised this issue initially agreed that when utilities or losses were used as the payoff, then it would not be a problem.
I finally drew on the board a generally useful way to estimate utilities for any events whatsoever. I used the Monty Hall example of a car, a goat, and a trip to Hawaii. Presumably the car is the best and the goat is the worst, with the Hawaii trip in between. Draw a decision tree, put the car and the goat on the probability branches and the Hawaii trip on the "get for certain" branch. Then, pick a probability for getting the car that makes you neutral between the two branches of the decision tree. There should be a point where you are neutral, for if the probability of getting the car is 1, you'd take the car for sure, but if the probability of getting the car is 0, you'd pick the Hawaii trip for sure. One student volunteered p=0.8. That means that her utility for the Hawaii trip is 0.8, since at that point, both branches of the decision tree have exactly the same value.
I remarked finally that if you use this method for evaluating utilities, then the utilities so calculated are actually probabilities!
Saturday, October 18, 2008
Class, 10/17
Today we spent most of the class discussing the lottery problem I left you with last time.
What we need to compute is the probability that no one wins the lottery, the probability that exactly one person wins the lottery, exactly two people, and so forth.
After some discussion we decided that the probability that no one wins is the probability that the first person loses AND that the second person loses AND ... AND that the last person loses. Since these are independent events, we need to multiply the probabilities of each of these events (AND always means multiply the probabilities). If I write w=1/80,000,000, the probability that a given person wins the lottery, then the probability that that person loses is (1-w). The probability that everyone loses is (1-w)^N, where N=200,000,000 is the number of tickets sold. Although that looks terrible to compute, actually a hand calculator correctly computed this number to be 0.082.
To get the probability that exactly one person wins, we decided that it is equal to the probability that (the first person wins AND all the others lose) OR (the second person wins AND all the others lose) OR ... OR (the last person wins AND all the others lose). The AND means multiplication, and the OR means adding probabilities. The probability that one specified person wins AND all the others lose is w*(1-w)^(N-1), which is hardly different from w*(1-w)^N since the extra factor of (1-w) is very, very close to 1. But there are N tickets, so the OR means we add this number to itself N times, and the probability that exactly one person wins, one of the N tickets, is (Nw)*0.082 or 0.205.
For two people we follow the same principle: We compute the probability that (the first person wins AND exactly one of the other people wins AND all the other people lose) OR (the second person wins AND exactly one of the other people wins AND all the others lose) OR ... OR (the last person wins AND exactly one of the other people wins AND all the others lose). We just computed the probability that one of the other people wins AND all the others lose; it's 0.205. And the probability that a particular person wins is still w. And, there are N identical numbers that are OR'ed together, so we have to multiply by N again, getting (Nw)*0.205. However, there is a little complication, because if you look at the first two terms above, in both of them there is a piece that comes from the first person winning AND the second person winning. And a similar thing can be said about any pair of terms above. So what has happened is that each pair of people appears twice in the sum and is therefore counted twice as much as it should be. So the probability we want has to be divided by 2, and the answer we need is: The probability that exactly two people win is (Nw)*0.205/2 ≈ 0.257.
In a similar way, the probability that exactly three people win can be computed as (Nw)*0.257/3=0.214; similarly to the case of two people, we note that each triple of tickets gets counted three times, so we have to divide by 3. And so forth for the case of exactly four, five, six and so on. Once you get to seven people, there's less than a 1% chance that that many people will win.
Now we can compute the expected value of a ticket. Your probability of winning is w. If no one else wins (probability 0.082) then you would win $280M. If exactly one other person wins (probability 0.205) then you would win $140M. And so forth. Adding it all up, the expected value of a ticket is $1.29. Compare this with last time's rough figure of $1.41, obtained by dividing the $280M jackpot among 2.5 expected winners (2.5 = 200M tickets × a 1/80M chance per ticket).
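The whole computation above can be written out in a few lines of Python, using exactly the recursion we derived in class (each P(k) is the previous one times Nw, divided by k to fix the double counting):

```python
w = 1 / 80_000_000        # chance any one ticket wins
N = 200_000_000           # tickets sold
jackpot = 280_000_000
lam = N * w               # expected number of winners = 2.5

# Class recursion: P(0 winners) = (1-w)^N, then P(k) = (lam/k) * P(k-1).
probs = [(1 - w) ** N]    # about 0.082
for k in range(1, 15):    # beyond ~7 winners the probabilities are negligible
    probs.append(lam / k * probs[k - 1])

# Your ticket wins with probability w and splits the jackpot with however
# many other winners there are (probs approximates that count).
ev = sum(w * jackpot / (k + 1) * pk for k, pk in enumerate(probs))  # about $1.29
```

Truncating at 15 winners is harmless: the probabilities fall off so fast that everything past seven contributes less than 1%.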
But still, taxes and the fact that you can't take home the entire jackpot if you want it all at once means that it is not worth it (from an expected value point of view). The only reason to buy a ticket is for the fun of it.
I gave out some worksheets for estimating your utility function for money. You should work them out this weekend. We'll discuss them on Monday.
Thursday, October 16, 2008
Decision tree diagram
Here is the photo I took on Wednesday. The quality isn't great, but you should be able to copy it to your clipboard and look at it in more detail. In fact, I just clicked on it (double click on a Mac, I don't know what you do with a PC) and Firefox presented it in a separate window, and the numbers were easily readable.

In addition to the decision tree we discussed the PowerBall lottery. p=1/80,000,000 to win, 200,000,000 tickets sold, $280,000,000 jackpot. The question is, does it pay (in an expected return sense) to enter? We immediately noticed that there might be more than one winner, and we estimated roughly 2.5 winners on average. This makes a ticket worth $1.41, so at first sight it appears to be a good idea to enter. But there are several problems, which we uncovered on further discussion. One is taxes: You would be taxed at the highest bracket, which is in the 35-39% range (depending on the tax law), as well as Vermont income tax. Also, you don't get the money all at once, but in installments over 20 years. To get the money at once, you have to take a discount of about 50% (since the way the lottery works, the state buys you an annuity that pays out over 20 years, and you would only get the amount that they would pay the insurance company to buy the annuity). Thus, it seems that it isn't worthwhile after all.
I left you with the problem of trying to get a more precise estimate of the expected return, considering the probability that 1, 2, 3,... more winners will win the lottery other than you.

Tuesday, October 14, 2008
Class, 10/13
We went through the test, and I won't repeat what we talked about except to note several things that I want to emphasize.
1) On the Fermi problems, several tips. Don't try to be too fancy, as in estimating low, middle and high income populations and housing prices and rolling them together to get an average. It is better to estimate something close to the median cost of a house and just multiply by the number of houses. Very few people have expensive houses, and trying to factor that information in isn't going to increase your accuracy. Also, don't forget to divide the population of the U.S. (300 million) by the average family size (around 4) to get the number of households.
2) On the coin problem, the easiest way is to recognize that the method chosen (pick coin at random and flip) has an equal probability of seeing any particular side. Cross off the tails (5 instances) and there are 7 ways to get a head. Of these, 4 will have a head on the other side. If you use the "spreadsheet" method, recognize that there are three states of nature, HH, HT and TT, with prior probability of 2/6, 3/6 and 1/6, respectively. The same idea works if you use a tree...recall that the branches at the base of any probability tree will always be the states of nature and their prior probabilities. This is why it is important to start any analysis by identifying the distinct states of nature, and then their prior probabilities, no matter what method you use. Then the likelihoods are branches off the base branches, identified as the data observed and the probability of observing that data (H in this case) given that the corresponding state of nature is true.
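The spreadsheet method for the coin problem is short enough to write out in Python; the priors 2/6, 3/6, 1/6 and the likelihoods come straight from the discussion above:

```python
# Spreadsheet method: states of nature are the three coin types
priors = {'HH': 2/6, 'HT': 3/6, 'TT': 1/6}
# probability of flipping a head under each state of nature
likelihood = {'HH': 1.0, 'HT': 0.5, 'TT': 0.0}

joint = {s: priors[s] * likelihood[s] for s in priors}
total = sum(joint.values())                  # 7/12: seven of twelve faces are heads
posterior = {s: joint[s] / total for s in joint}
print(round(posterior['HH'], 4))   # 0.5714, i.e. 4/7 chance the other side is a head
```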
3) Here the important thing to recognize is that the taxis continue to drive around, so it is sampling with replacement, and the three factors in the likelihood will not change from observation to observation since the number of taxis available does not change.
4) This one is probably best solved with the natural frequencies method. Take 1500 students in the group. 150 will have taken the drug and 1350 will not have taken it. Of those that took it, 147 will be caught by the test and 3 will escape detection. Of those that did not take it, 40.5 will be falsely caught and 1309.5 will correctly be identified as not taking the drug. This gives the answer to the last part of the question (40.5), and by computing the ratio 147/(147+40.5)=0.78 we get the probability that a student has taken the drug, given that he tests positive.
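The natural-frequencies arithmetic for this problem can be checked in a few lines; the 10% usage rate, 98% detection rate and 3% false-positive rate are inferred from the counts quoted above:

```python
students = 1500
took = students * 0.10            # 150 students took the drug (10% assumed)
did_not = students - took         # 1350 did not

true_pos = took * 0.98            # 147 caught by the test (98% sensitivity)
false_neg = took - true_pos       # 3 escape detection
false_pos = did_not * 0.03        # 40.5 falsely caught (3% false-positive rate)

p_drug_given_pos = true_pos / (true_pos + false_pos)
print(round(false_pos, 1), round(p_drug_given_pos, 2))   # 40.5 0.78
```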
5) The table is dependent, since the entries in at least one cell do not equal the product of the marginal probabilities in the corresponding row and column. To get an independent table, just multiply those marginals and enter the product into the corresponding row and column.
6) The easiest way to do this one is to focus on the gains and losses, rather than the absolute amount that you get back at the end. Thus, the gain is $500 for the bond, and for the mutual fund it is 0.7*$9700*0.15-0.3*$9700*0.1=$727.50. However, you have to pay a commission out of this (-$300 tollgate), so you'll actually have an expected profit of $427.50. Since this is less than $500, you'll prefer the bond.
If you focus on how much you get back, you have to be careful, because on the mutual fund branch you will already have subtracted the commission so you don't want a tollgate or you'd be paying the commission twice.
Generally speaking, it's a lot easier to do these problems by focusing on the gain or loss rather than the amount you get back after a year.
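The gains-and-losses bookkeeping for problem 6 looks like this (all figures from the discussion above):

```python
bond_gain = 500                    # the bond's gain is certain

# Mutual fund: invest $9700; 70% chance of gaining 15%, 30% chance of losing 10%
fund_gain = 0.7 * 9700 * 0.15 - 0.3 * 9700 * 0.10   # expected gain: 727.50
fund_gain -= 300                   # the commission "tollgate"

print(round(fund_gain, 2))         # 427.5, so the bond's sure $500 is preferred
```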
Wednesday, October 8, 2008
Class, 10/8
We finished discussing the study guide.
First we finished the "take balls out, mark them, and put them back" scenario for estimating the number of balls in the urn. The basic idea is to picture the number of marked balls that must be in each candidate urn at every sampling event. The probability of picking a particular ball (marked or unmarked) is the fraction of balls of that type in that urn. Calculate that fraction for each urn (state of nature) and multiply it into the likelihood. Then go on to the next sampling event, remembering to mark the ball so that the probabilities change at the next sampling event. Then you know the routine: multiply prior times likelihood to get the joint; sum the joint; divide each joint probability by the sum to get the posterior; sum the posterior to verify that the sum is 1.
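That routine can be sketched in Python. The observation sequence here (unmarked, unmarked, then marked) and the range of candidate urn sizes are made up for illustration:

```python
# Catch-and-release urn: how many balls n are in the urn?
states = range(3, 11)                 # candidate urn sizes (assumed prior range)
prior = {n: 1 / len(states) for n in states}

def likelihood(n):
    # draw 1: all n balls unmarked  -> P(unmarked) = n/n
    # draw 2: one ball now marked   -> P(unmarked) = (n-1)/n
    # draw 3: two balls now marked  -> P(marked)   = 2/n
    return (n / n) * ((n - 1) / n) * (2 / n)

joint = {n: prior[n] * likelihood(n) for n in states}
total = sum(joint.values())
posterior = {n: joint[n] / total for n in states}
assert abs(sum(posterior.values()) - 1) < 1e-12   # the routine's final check
print(max(posterior, key=posterior.get))          # small urns are most likely here
```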
The same idea is operative for the balls marked by numbers.
As we discussed in class, there are really two possibilities for this. The problem in the study sheet is for sampling without replacement. This means that the number of balls changes after every sampling event. This would be appropriate if, for example, we were interested in figuring how many German tanks had been produced from the serial numbers of captured tanks (which are out of commission after capture).
We could have put the balls back in the urn after sampling them; that leads to a slightly different problem, where it may be possible to sample the same object more than once. For example, you might be an airplane-spotter: You want to estimate the number of airplanes owned by an airline, and you might know that the numbers on their tailfins are sequential. Since you might see the same airplane more than once, this is sampling with replacement.
The difference between the two scenarios is this: With sampling with replacement, the number of items in the urn being sampled (airplanes, for example) doesn't change, so the denominator remains constant at the number of items originally in the urn. With sampling without replacement, the denominator decreases by 1 each time an item is sampled (each captured tank is out of circulation). Otherwise, the two cases are the same.
But these problems have basically the same structure as the "catch-and-release" problem with the unmarked balls that we mark. Identify the states of nature (the unknown number of items in the original set-up), put a prior on them, calculate the likelihood, by multiplying the probability of sampling each item in turn together (considering the particular method of sampling/marking/number displayed), compute the joint, calculate the sum, divide to get the posterior, check that the posterior sums to 1.
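The serial-number variant follows the same recipe. Here is a sketch with made-up captured serial numbers and an arbitrary prior range; since this is sampling without replacement, each capture removes one tank and the denominator drops by one at each step:

```python
# Hypothetical tank-counting example: we capture tanks with serial numbers
# 7, 3 and 9, and ask how many tanks N were produced in total.
observed = [7, 3, 9]
states = range(max(observed), 21)        # N is at least the largest serial seen
prior = {n: 1 / len(states) for n in states}

def likelihood(n):
    like, remaining = 1.0, n
    for _ in observed:                   # each capture removes one tank
        like *= 1 / remaining            # any particular serial is equally likely
        remaining -= 1
    return like

joint = {n: prior[n] * likelihood(n) for n in states}
total = sum(joint.values())
posterior = {n: joint[n] / total for n in states}
print(max(posterior, key=posterior.get))   # the smallest admissible N, here 9
```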
The next problem, ants and beetles, is exactly like the polling problem we discussed earlier. Instead of voters who say they will vote for candidate A or B, we have insects that we identify as ants or beetles. In both cases, the states of nature are all of the true proportions of each kind that exist in the population being sampled (voters, insects). Both of these assume that the number of items in the population is very large.
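The ants-and-beetles (or polling) calculation fits the same template, on a grid of candidate proportions; the sample of 7 ants and 3 beetles is made up for illustration:

```python
# Candidate true proportions of ants in the (large) population: 0.0 ... 1.0
states = [i / 10 for i in range(11)]
prior = {p: 1 / len(states) for p in states}

ants, beetles = 7, 3               # made-up sample counts
joint = {p: prior[p] * p**ants * (1 - p)**beetles for p in states}
total = sum(joint.values())
posterior = {p: joint[p] / total for p in states}
print(max(posterior, key=posterior.get))   # 0.7, matching the sample proportion
```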
Finally, on the Decision Problem: The way we attacked this problem was to look at a simpler problem than the one in the study sheet: Should you just produce parts, or should you sacrifice one part, pay $150, and produce the remaining parts knowing that the machine would be in a "good" state and would produce a much higher proportion of "good" parts.
We set up a decision tree: first, a square box representing the decision we had to make. The options are: fix the machine first and produce 23 parts; or just produce 24 parts, regardless.
On the first scenario, we set up a "toll gate" of -$150 on that branch. Since we then knew the machine to be in a "good" state to the right of that, the expected return on that branch would be 0.95*23*$2000, since bad parts aren't worth anything.
On the second scenario, there is no initial cost of $150 so no "toll gate." But we now have two branches, one with probability 0.9 in which the machine is "good", and one with probability 0.1 in which the machine is "bad". If the machine is "good," 0.95 of the parts will be useful. If it is "bad," only 0.7 of the parts will be useful. By tracing the probability tree backwards, the expected number of parts that will be useful is (0.9*0.95+0.1*0.7). This is multiplied by 24*$2000 to get the expected profit.
We found that the best scenario under these rules was to abandon caution and just produce parts. The extra part (24 instead of 23) produces more expected profit than the more reliable machine minus the cost of $150 to make sure the machine is OK.
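The two branches of the decision tree reduce to two expected-value calculations (all numbers from the problem as discussed above):

```python
part_value = 2000

# Branch 1: sacrifice a part and pay $150; machine then known good (95% yield
# on the remaining 23 parts)
fix_first = -150 + 0.95 * 23 * part_value

# Branch 2: just produce all 24; machine good with probability 0.9 (95% yield)
# or bad with probability 0.1 (70% yield)
just_produce = (0.9 * 0.95 + 0.1 * 0.70) * 24 * part_value

print(round(fix_first), round(just_produce))   # 43550 44400: just producing wins
```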
One student mentioned that there might be other circumstances, like the need to make a minimum profit. This is quite true, but it wasn't part of the assumptions of the problem. An example might be that if you don't make the minimum profit, someone might come over and break your legs. That's a different problem, but it can be analyzed by the tools we are developing. You just have to build that into the decision tree you build.
Don't forget: Bring your calculators, come early if you can, stay a little late if you can (but not later than 10:05 for the next class) and be sure to attempt to answer as many questions as you can. Budget your time. Make sure you convince me that you know how to answer a question even if you don't have time to do the complete calculation.
Monday, October 6, 2008
Interesting things in this week's journals
One person remarked on the volatility of the stock market, particularly as we are experiencing now. Generally speaking, the stock market is a risky proposition in the sense that over short periods of time it can be quite volatile; over the past several months it is down close to 30%. The first point is that you should not put money into the stock market that you're going to need soon, say within the next five years; the market is where you should put money you won't need for a decade or more. The second point is diversification, which is best achieved by investing in mutual funds that represent a broad cross-section of the market. The third point is to have a mix of stocks and fixed-income investments like bonds and money market funds, whose volatility is much less, even though their long-term potential for return is lower than with stocks. (Historically, stocks have returned on the order of 10% per year over long periods, although they can be down sharply in any given year. Bonds typically return a few percent over inflation.)
Another person also talked about the stock market, and mentioned recent volatility. I did mention that recent volatility, although bad, is by no means a percentage record. That is, a 700 point drop in one day is about 7%. There have been much larger percentage drops in history, although 700 points may be a point record. But really, only percentage changes actually reflect what's really happening.
Several people reported that they'd like more clarification of various points. There are many places where you can get this kind of response. Your journal is one; class is another, and you know by now that I'm happy to get your questions. Or you can talk to me out of class. Or you can ask questions by posting them as comments to this blog. Or you can send me email. I welcome all of these.
One person had read Lewis' book and asked about the Prisoner's Dilemma problem. This is of course a problem in game theory, which isn't really part of this course. But it is an interesting problem nonetheless, as it raises the question of whether there is a way for the prisoners to avoid falling into the jailors' trap, thus ending up with sentences that are more favorable to both. There are approaches that can do this, by embedding this particular game into a larger one. You might find information about this on WikiPedia or on the web.
One person did an analysis of all of the possible shooting orders for the Trewel problem, not just ABC but also all other permutations such as CBA, CAB, BAC, etc. In all of these situations, the result is that the best shooter has the best probability of surviving, and the worst shooter has the second best probability of surviving. It is Bob that has the lowest probability of making it out alive. Very interesting!
One person asked about Fermi problems, "If you are possibly starting with a completely wrong number, what's the point?" The point is that you often are in a situation where a decision must be made on imperfect knowledge, so you have to make such estimates. So, it is a good skill to perfect, and practice makes perfect. The more practice you have, the more skilled and confident you become, the better you'll do.
Another person asked about polls taken over a period of time: does each poll have its own bell-shaped curve? The answer is yes, each poll has some uncertainty and therefore its own bell-shaped curve. But people's opinions change over time, so we can't just average polls taken over several months to figure out what is going to happen next. The average of several polls is more likely to reflect what opinion was halfway through the polling period. More sophisticated approaches (something statisticians call "regression", which is the subject of a Burack lecture next Monday afternoon) would be required.
One person made a mathematical mistake, which I want to point out. If we have several probabilities expressed as percentages, e.g., 3% and 2%, you cannot multiply them to get the probability of the joint event as 6%. Expressed as probabilities these are 0.03 and 0.02, respectively, so the probability of the joint event is 0.0006, or 0.06%.
One person mentioned an interest in political science and polls, and I mentioned that Prof. Andrew Gelman at Columbia University has a blog to which he posts daily. He is a Bayesian statistician and a political scientist, author of an interesting book, "Red State, Blue State." Some of what he posts is advanced, but much is quite accessible to nonstatisticians. His blog can be found here. I read it every day.
A very interesting problem was posed by one person, who mentioned hanging out with a friend and finding, within a short distance of each other, two four-leaf clovers. If the probability of finding one four-leaf clover is 10^-4 = 1/10,000, does this mean that the probability of finding two near each other is 10^-8? Actually, it probably isn't, for several reasons. The first is basically a fallacy: That figure may be correct for any two people sitting at random places around the earth, but if you find one four-leaf clover, your attention is suddenly drawn to a low-probability event that has already happened. So the probability that is really relevant, once you have found one, is P(find a second | found one), and that is at least 10^-4. If you hadn't found a second one, the first one probably wouldn't have been written about. The same fallacy underlies the occasional news story about someone who has already won the lottery winning again. The probability that you win a second time, given that you won once, is the same as the probability that you win once (assuming independence). But the only reason the event made the news is the second win. It is a mistake to be very surprised that occasionally someone wins twice.
The other reason is that four-leaf clovers are (as the person mentioned) due to genetics, or to soil conditions, or to other external factors. That means it probably isn't the case that P(find a second one | found one) = 10^-4. Because of these factors the two finds probably aren't independent, and P(find a second one | found one) may be much larger than P(find one). It's not unlikely that four-leaf clovers grow in proximity, that is, in clusters.
Another person asked about the formula square root of N*p*(1-p) for the expected uncertainty in the number of heads in N coin flips, or of voters voting for a candidate, where p is the true probability in the entire population. I pointed out that this formula isn't part of the course, but was brought up to answer a question asked in class. You are not responsible for this formula, and I will not derive it. But one thing puzzled this person: the uncertainty is smaller the farther p is from 0.5. It really is true. One way to see this is to consider the case p=0. In that case, the voters are unanimous in favoring candidate B, or the coin has two tails. There is no variation at all, so the formula evaluates to 0, as it should. As you move away from zero the variation increases, and by symmetry (after all, heads and tails are symmetric states; voting for A and voting for B are similarly symmetric) it decreases again as p moves past 0.5 toward 1.
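You can see both the symmetry about p = 0.5 and the vanishing spread at the extremes by evaluating the formula for a few values of p:

```python
import math

# Spread of the count of heads (or A-voters) out of N, for several values of p.
# Note the symmetry about p = 0.5 and the zeros at p = 0 and p = 1.
N = 1000
for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(p, round(math.sqrt(N * p * (1 - p)), 1))
```

The spread peaks at p = 0.5 (about 15.8 for N = 1000) and falls to 0 at both ends.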
Class, 10/6
We continued studying the study guide for the quiz on Friday.
I had left you with the "three cards" problem. I brought in my trick coins and we determined that there are three states of nature, HH, HT and TT. We did the calculation in a spreadsheet format, taking a prior of 1/3 on each SON, and recognized that if we observe H, the likelihood of observing H is 1 if it is the HH coin but only 1/2 if it is the HT coin. So this yields a spreadsheet that is similar to the Monty Hall (standard) problem, and if we see a H, the posterior probability is 2/3 that the other side is also H.
We then finished the cancer problem. The probability that a member of the general population who tests positive does not have the gene is 495/593 or about 5/6, and 1/6 that this individual does have the gene. Part (3) of the question asks first, what's the probability that someone with a positive test gets the disease: That is
1/6*0.2 + 5/6*0.0002, or about 0.03333 + 0.00017 = 0.0335. Of these future disease cases, most are people who have the gene; only 0.00017/0.0335, or about 0.005, come from people without it. That's only half of one percent.
We discussed the galaxy problem. As many of you pointed out, it is exactly like the Shakespeare and Marlowe problem. The problem sets the prior at P(E)=0.8, P(S)=0.2; the likelihood is gotten by raising the probability for each type of object to a power equal to the number of that type of object that the machine found, and multiplying the values together for the three types of object. We found (using a spreadsheet calculation) that the posterior probability was about equal for E and S after this evidence.
On the plagiarism problem, a question was asked about the meaning of the code. We discussed how mathematical tables are constructed: you calculate each number to more significant digits than you plan to publish, and round up or down according to the digit that follows the last one you plan to publish. If that following digit is a '5', you have a choice to round up or down. By flipping a coin for each such case, you round in a random pattern that is unlikely to be duplicated by someone else independently putting together a table. In this way you embed a secret code, known only to you, into the table through the rounding pattern. We calculated that if the prior probability is 1/2 for plagiarism vs. accidental agreement (no cheating), then the posterior probability is about 10^-30 that no cheating was involved if the code is duplicated exactly.
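The actual count of coin-flip roundings comes from the handout; just to illustrate the mechanics, here is the calculation in Python with a guessed count of k = 100 such roundings (chosen so the answer lands near the 10^-30 we computed):

```python
# Bayes for the rounding-code problem. Illustrative numbers only: the real
# count of coin-flip roundings, k, is in the handout; k = 100 is a guess.
k = 100
prior_cheat = 0.5
like_cheat = 1.0            # a copier reproduces the code exactly
like_innocent = 0.5 ** k    # an independent table matches every flip by luck

joint_cheat = prior_cheat * like_cheat
joint_innocent = (1 - prior_cheat) * like_innocent
post_innocent = joint_innocent / (joint_cheat + joint_innocent)
print(post_innocent)        # on the order of 10^-30
```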
We discussed the reason for choosing equal priors: the law says that in civil cases the side with the preponderance of evidence wins the case, that is, anything more than 50%.
I also mentioned that this technique is used to prevent plagiarism in other cases, e.g., map making, by putting small but innocuous mistakes in a map. Also, mistakes in the genome from generation to generation can be used as a "clock" to tell how far back in time two present-day organisms had a common ancestor, as well as the degree of relationship between a number of organisms, for example, how closely related human beings from various parts of the world are when traced back in time.
Finally, we got most of the way through the first urn problem. We decided that there are 10 states of nature, corresponding to there being 1, 2, ..., 10 balls in the urn. The probability that the first ball is unmarked is of course 1, independent of the SON. The second ball is also unmarked, but here the probability given the SON is 0 if the urn contains only one ball (because then the only ball in the urn is marked), 1/2 in the case of 2 balls, 2/3 in the case of 3 balls, and so forth. The third ball was marked, and by then there are two marked balls in the urn (each earlier ball was marked and returned), so for SONs 2, 3, ..., 10 the probability of picking a marked ball is 2/2, 2/3, 2/4, ..., 2/10. That is where we left it. We'll return to this problem on Wednesday.
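Here is the Bayes table as far as we took it, written as a short Python sketch:

```python
# Bayes table for the urn problem as far as we took it in class.
# States of nature: n = 1, ..., 10 balls in the urn. The data so far:
# first ball unmarked (certain), second unmarked, third marked.
states = range(1, 11)
prior = {n: 1 / 10 for n in states}

def likelihood(n):
    p_second_unmarked = (n - 1) / n   # one marked ball is in the urn by then
    p_third_marked = 2 / n            # two marked balls by the third draw
    return 1.0 * p_second_unmarked * p_third_marked

joint = {n: prior[n] * likelihood(n) for n in states}
total = sum(joint.values())
posterior = {n: joint[n] / total for n in states}
for n in states:
    print(n, round(posterior[n], 4))
```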
I mentioned that this sort of thing is used, e.g., by biologists who catch fish, tag them, and release them, then after the population has had a chance to mix up, catching a sample again and seeing what proportion of the fish caught the second time are tagged. This can be used to estimate the size of a population of fish in a lake.
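The standard point estimate biologists use for this is the Lincoln-Petersen formula; the numbers below are made up just to show the idea:

```python
# Lincoln-Petersen mark-recapture estimate (the numbers here are made up).
# Tag M fish; later catch C fish and find R of them tagged. If the tagged
# fraction of the catch mirrors the lake, M/N is about R/C, so N ~ M*C/R.
def lincoln_petersen(M, C, R):
    return M * C / R

print(lincoln_petersen(M=100, C=60, R=12))   # estimates 500 fish in the lake
```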
Friday, October 3, 2008
Class, 10/3
We passed over the Fermi problem bullet after I re-explained the geometric mean method.
We reminded everyone of the basic equation of conditional probability that underlies everything we are doing: P(A,B)=P(A|B)P(B)=P(B|A)P(A). We talked about the three equivalent ways of saying that a distribution is independent: it is independent if and only if P(A|B)=P(A) for every A and B; equivalently, if and only if P(A,B)=P(A)P(B) for every A and B; and if P(A|B)=P(A) for every A and B, then necessarily P(B|A)=P(B) for every A and B as well.
We then showed how to construct the unique independent table of joint probabilities when we are given the marginal probabilities: just multiply the marginal for a row by the marginal for a column and put the product in the corresponding cell.
We then took a table of independent probabilities and changed four cells forming a square, adding an arbitrary number to the two cells on one diagonal and subtracting the same number from the two cells on the other diagonal. This leaves the marginals unchanged but gives a table where the probabilities are not independent.
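A short Python sketch of both steps (the marginals are just example numbers):

```python
# Step 1: build the unique independent joint table from the marginals.
row = [0.3, 0.7]   # example marginals for A
col = [0.4, 0.6]   # example marginals for B
joint = [[r * c for c in col] for r in row]

# Step 2: break independence without touching the marginals, by adding
# eps on one diagonal of a square of cells and subtracting it on the other.
eps = 0.05
joint[0][0] += eps; joint[1][1] += eps
joint[0][1] -= eps; joint[1][0] -= eps

row_sums = [sum(r) for r in joint]
col_sums = [sum(c) for c in zip(*joint)]
print(row_sums, col_sums)            # marginals unchanged
print(joint[0][0], row[0] * col[0])  # but P(A,B) != P(A)P(B) any more
```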
We discussed the Monty Hall problem and variants. We found that if there are four doors, and Regular Monty opens two of them, each showing a goat, then the probability of getting the prize goes from 1/4 to 3/4 if we switch. We found that it goes from 1/4 to 3/8 if Monty opens one door and we switch to one of the others. We did this by a spreadsheet calculation. We then thought of a simpler way: Since your probability of initially picking the right door is 1/4, the probability that one of the other doors has the prize is 3/4. That doesn't change when Monty opens one of them and shows you a goat. So, since there are two doors left, the probability that you'll get the right one if you switch is 1/2 times 3/4, or 3/8.
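If you'd like to check the 3/4 and 3/8 answers yourself, here is a small Python simulation of the four-door game (the door numbering is just my own bookkeeping):

```python
import random

def switch_wins(doors_opened, n_doors=4, trials=100_000, seed=1):
    """Fraction of games won by switching, when Regular Monty opens
    `doors_opened` goat doors (other than your pick) out of n_doors."""
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        prize = random.randrange(n_doors)
        pick = random.randrange(n_doors)
        goats = [d for d in range(n_doors) if d != pick and d != prize]
        opened = random.sample(goats, doors_opened)
        remaining = [d for d in range(n_doors) if d != pick and d not in opened]
        if random.choice(remaining) == prize:   # switch to a random remaining door
            wins += 1
    return wins / trials

print(switch_wins(doors_opened=2))   # close to 3/4
print(switch_wins(doors_opened=1))   # close to 3/8
```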
I left you with another related problem to think about: There are three cards which have been made by pasting together two cards so that the backs are visible. One has two red backs, one has a red back and a blue back, and one has two blue backs. The cards are put in a hat and shaken, and you pick one out, looking at only one back. It is red. What is the probability that the other side is red?
We'll discuss that next time.
We went on to the cancer problem. We took a population of 10,000 individuals. The problem statement says that 1% of the population has the gene, so that's 100 who have the gene and 9,900 who don't. Of those who have the gene, 98 will be detected by the test and 2 missed (false negatives). Of those who don't have the gene, the test will falsely identify 5% as having the gene, or 495 in all (false positives), and will correctly say that the remaining 9,405 do not have the gene. Looking at just the positives, there are 98 true positives and 495 false positives, so the probability that a person has the gene if they test positive is only 98/593, or about 1/6; the remaining 495/593, or about 5/6, do not have the gene.
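The whole count-based calculation fits in a few lines of Python, if you want to check the arithmetic:

```python
# The cancer-gene test, done with counts out of 10,000 people.
population = 10_000
carriers = population // 100              # 1% have the gene: 100
non_carriers = population - carriers      # 9,900 do not

true_pos = 98                             # 98% of the 100 carriers detected
false_pos = non_carriers * 5 // 100       # 5% of 9,900: 495 false positives
positives = true_pos + false_pos          # 593 positive tests in all

p_gene_given_pos = true_pos / positives
print(p_gene_given_pos)                   # 98/593, about 1/6
```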
We ran out of time here and will continue on Monday, finishing this problem and then going on in the study guide.
In answer to a question, I stated that if there is an item that we don't get to in our review, then similar items will not appear on the test. I also stated that I expected there to be five questions on the test, and that there should be enough time for everyone to do all of them. I pointed out that usual test-taking strategy says to go for the easy ones first and save the bulk of the time for the harder ones. I also said that it is very important to at least try to answer every question, since I cannot give credit if an item goes completely unanswered. We agreed that people who come early (not earlier than 10 AM, please) could start early, and that you may be able to stay an extra 5 minutes (but not more, because of the class that meets next in this room) to finish.
Be sure to bring your calculators. I do not have a loaner calculator!
Class, 10/1
We discussed the fourth problem set, which was done pretty well by you all. I pointed out several errors that were made:
One group forgot, on the second problem, that three different widgets were sampled independently, so that the likelihood had three factors in it, not one.
One group didn't recognize that the third problem had only two states of nature, namely whether it is Urn #1 or Urn #2. Somehow this group ended up with five states of nature: 1R, 1W, 2R, 2W and 2B, where the number is the urn number and the letter the color. The point is that the states of nature are always determined by the thing you don't know and want to learn. Here, what we don't know is which urn we've picked, so that tells us what the states of nature are.
One group got the states of nature right, but in the third ball selection forgot that it is the number of balls in the urn when the ball is picked that gives the denominator. True, this is a step made without replacement, but since the first two steps all involved returning the ball (that is, with replacement), there are still ten balls in the urn when the third ball is picked.
On the last problem, one group correctly computed the contribution to the likelihood for each word, but then added them instead of multiplying to get the likelihood. Since the likelihood is the probability that we got 3 of the first word AND 5 of the second AND 3 of the third, you have to multiply. When you compute the probability of one thing AND another thing, you always multiply. Addition is for when you want the probability of one thing OR another thing. For example, when you add the joint probabilities in a spreadsheet calculation, you are computing the probability of (data, SON1) OR (data, SON2) OR ..., to get the probability of the data regardless of which SON is true.
We then finished the "Trewel" problem and calculated that the best thing for Alan to do is to shoot in the air, letting Bob and Charlie duke it out, and then, with one of them dead for sure, to come in on his second try and try to kill the survivor. We also recalculated the probability of Alan eventually killing Bob when Alan goes first. See Class, 9/29 for a calculation.
Finally, we discussed the first item on the study sheet, Fermi problems. We discussed the length of the Nile river...and I reminded everyone about the geometric mean trick. If you can put a reasonable lower bound on a quantity and a reasonable upper bound, so that you are pretty sure that the true value is between those bounds, then a decent guess at the correct value is to multiply the lower bound by the upper bound and take the square root of that number. For the Nile, a lower bound might be 100 miles and an upper bound 10,000 miles, which would give an estimate of 1000 miles. Wikipedia says 4100 miles, so this is not a great estimate. For the Mississippi, we imagined the U.S. as a box 3000 miles wide and 2000 miles high, so a length of 2000 miles. The actual length is 2340 miles, so that worked better.
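The geometric mean trick in a couple of lines of Python, with the Nile bounds from class:

```python
import math

def geometric_mean_guess(lower, upper):
    """Fermi estimate: the geometric mean of a confident lower and upper bound."""
    return math.sqrt(lower * upper)

print(geometric_mean_guess(100, 10_000))   # Nile guess: 1000 (actual ~4100 mi)
```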
I pointed out that the important thing with regard to Fermi problems is how you got the answer, not the actual value of the answer.
Monday, September 29, 2008
Class, 9/29
Today we discussed further the polling example and actually calculated a simple result. I'll try to post a copy of the calculation later. We found that with 6 favoring candidate A out of a sample of 10, the posterior probability that candidate A wins (has over 50% of the vote) is about 71%. The posterior distribution is closely bell-shaped and peaked at r=0.6=6/10. That's a general rule.
We also asked what would happen if we used a more realistic prior that put more weight near 1/2. In response to a question, I pointed out that you can't put it near 0.6 because that would be "cheating," using the same data twice. You have to do it without looking at the data. We found that a prior that rises to a maximum at 0.5 and then falls again will do two things: It will narrow the posterior distribution somewhat, and will also shift the peak closer to 0.5. If there is a whole lot of data, then the effect of the prior will be negligible, but in our example it can be significant.
The following webpage has election predictions with a chart (lower right hand) that shows a similar posterior distribution, based on calculating the electoral vote in a simulation (this is a modern computational technique, even more powerful than the spreadsheet method we discussed). The left-hand chart has other information that summarizes the posterior probability in several ways: Where the maximum of the posterior probability is, what the win probability is, and so forth.
We discussed a situation where two people enter into a consecutive duel: Alan and Bob will take shots at each other in turn. Alan's probability of hitting Bob and putting him out of commission on one shot is 1/3; Bob's probability of putting Alan out of commission is 2/3. We asked, if they keep taking turns until one hits the other, what's the probability of Alan eventually hitting Bob if he goes first? If he goes second?
Although this could be calculated (as one student suggested) by multiplying and adding probabilities until the numbers were very small, I suggested another way.
If Alan goes first, he'll win outright on his first shot 1/3 of the time. He'll also eventually win with probability (2/3)*P(Alan wins eventually | Bob goes first). So
P(Alan wins eventually | Alan goes first) = 1/3 + (2/3)*P(Alan wins eventually | Bob goes first).
But P(Alan wins eventually | Bob goes first) can be calculated in terms of P(Alan wins eventually | Alan goes first), because the only way Alan can eventually win if Bob goes first is if Bob misses on his first try (P=1/3). Then Alan has a second chance, going first, so
P(Alan wins eventually | Bob goes first) = (1/3)*P(Alan wins eventually | Alan goes first).
Substituting, P(Alan wins eventually | Alan goes first) = 1/3 + (1/3)*(2/3)*P(Alan wins eventually | Alan goes first). This can be solved for P(Alan wins eventually | Alan goes first); it turns out to be 3/7, and P(Alan wins eventually | Bob goes first) = (1/3)*(3/7) = 1/7.
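If you like, you can check this bit of algebra exactly (no roundoff) with fractions in Python:

```python
from fractions import Fraction

# Solve the two equations from class exactly:
#   PA = 1/3 + (2/3) * PB    (Alan goes first)
#   PB = (1/3) * PA          (Bob goes first and must miss, P = 1/3)
# Substituting the second into the first gives PA = 1/3 + (2/9) * PA.
PA = Fraction(1, 3) / (1 - Fraction(2, 3) * Fraction(1, 3))
PB = Fraction(1, 3) * PA
print(PA, PB)   # 3/7 1/7
```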
Finally, I brought in Charlie, who is a crack shot and never misses. We decided that if Charlie goes first, he'll knock off Bob, since Alan is the poorer shot, and if Alan then misses, Charlie will knock him off the second time around. If Bob went first, to be followed by Charlie, Bob would try to knock off Charlie first, since if he knocked off Alan he'd be a goner. But if Alan goes first, the first guess, that he'd go after Charlie, seemed to be wrong, as one student pointed out. Actually, Alan has a better chance of survival if he shoots his gun into the air! More on this later.
Study Sheet for First Quiz
The first test will be on October 10 (Friday). There is a study sheet here. We will start discussing this on Wednesday or Friday, so get together with your group and prepare yourselves for our discussion.
Sunday, September 28, 2008
Class, 9/26
We talked about Problem #4. I mentioned first that it was modelled after a book by Mosteller and Wallace (Mosteller is a famous statistician), in which they tried to determine the authorship of several of the disputed articles in the famous Federalist Papers, published to try to convince Americans to adopt the Constitution.
The idea is that each time a word is used, that may reflect on the author, since different authors tend to use words with different frequencies. So, one author might use "while" and another might use "whilst." ("Whilst" was still in common use in this part of the world in 1789.) So, we can form the likelihood in this problem by multiplying the probability of a word, given the author, for each time a word appears in the text. There are two authors, so we will get a product of many numbers, one number for each word in the sample text.
This requires computing quantities like (0.002)^5. Unfortunately, this results in some very small numbers. I recommended using powers-of-ten notation, so that you would have, for example, (0.002)^5 = (2x10^-3)^5 = 32x10^-15. You'll get a small integer times a very small number written in scientific notation. The good news is that the power of ten will cancel out of the final answer.
At the end of class, I discussed polls a bit. We determined that the states of nature are the various proportions r of voters who favor candidate A over candidate B. There are infinitely many such numbers. We also discussed how the error in the result will theoretically go down as the size of the sample goes up: the error (plus or minus) in the number of voters in the sample favoring either candidate is roughly (N*r*(1-r))^(1/2). So, if N is 1000 and r=1/2, the expected error in the number of voters is about 15, and the error in r is about 15/1000, or 0.015; double that gives what those of you who took statistics before would call the 95% error bar. That is, we expect an error larger than +/-0.03 in only 5% of cases.
In real life, sampling difficulties make the real error bigger than this, so it's normal for pollsters to quote a somewhat larger number, for example, +/-0.05.
We talked about a basic Bayesian way to do this in practice, namely, in a spreadsheet. We could list a sequence of equally spaced center-points for intervals of r, for example, 0.05, 0.15, 0.25, ..., 0.95, representing intervals of length 0.1. We assign each value of r a prior. One suggestion was 1/10 for each, but one student pointed out that a more realistic prior would be larger for values of r around 1/2 and smaller or near zero for values of r that deviate significantly from 1/2. Then we can compute the likelihood, which we determined was r^n * (1-r)^(N-n) for each value of r, where N is the total number of voters in the sample and n is the number favoring candidate A. (We ignored non-responses.) Then in a few mouse clicks we can calculate the joint probability column, compute its sum, and divide it into each joint probability to get the posterior probability.
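Here is that spreadsheet written out in Python, with the flat prior and the 6-out-of-10 data from class (the exact win probability depends a little on how coarsely you slice up r):

```python
# The polling spreadsheet in Python: flat prior on interval midpoints
# r = 0.05, 0.15, ..., 0.95, with n = 6 of N = 10 sampled voters favoring A.
N, n = 10, 6
rs = [0.05 + 0.1 * i for i in range(10)]
prior = [1 / len(rs)] * len(rs)
likelihood = [r**n * (1 - r)**(N - n) for r in rs]

joint = [p * L for p, L in zip(prior, likelihood)]
total = sum(joint)
posterior = [j / total for j in joint]

# Posterior probability that candidate A wins, i.e. r > 1/2.
p_win = sum(post for r, post in zip(rs, posterior) if r > 0.5)
print(round(p_win, 3))   # in the low 70s percent on this coarse grid
```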
Finally, we set the date of the first test for Friday, October 10.
The idea is that each time a word is used, that may reflect on the author, since different authors tend to use words with different frequencies. So, one author might use "while" and another might use "whilst." ("Whilst" was still in common use in this part of the world in 1789.) So, we can form the likelihood in this problem by multiplying the probability of a word, given the author, for each time a word appears in the text. There are two authors, so we will get a product of many numbers, one number for each word in the sample text.
This requires computing quantities like (0.002)5. Unfortunately, this results in some very small numbers. I recommended using powers of ten notation, so that you would have, for example, (0.002)5=(2x10-3)5=32x10-15. You'll get a small integer times some very small number written in scientific notation. The good news is that the power of ten will cancel out of the final answer.
At the end of class, I discussed polls a bit. We determined that the states of nature are the various proportions r of voters who favor candidate A over candidate B. There are infinitely many such numbers. We also discussed how the error in the result will theoretically go down as the size of the sample goes up; for example, the expected error (plus or minus) in the number of voters in the sample favoring either candidate is roughly (N*r*(1-r))^(1/2). So, if N is 1000 and r=1/2, the expected error in the number of voters is about 15, and the error in r is about 15/1000 or 0.015; double that gives what those of you who took statistics before would call the 95% error bar, that is, we expect to have an error larger than +/-0.03 in only 5% of cases.
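The arithmetic in that example is short enough to check directly:

```python
import math

def poll_count_error(N, r):
    """One-sigma error in the COUNT of voters favoring a candidate: sqrt(N*r*(1-r))."""
    return math.sqrt(N * r * (1 - r))

N, r = 1000, 0.5
count_error = poll_count_error(N, r)   # about 15.8 voters
proportion_error = count_error / N     # about 0.016 in r
error_bar_95 = 2 * proportion_error    # about 0.03, the usual "95%" error bar
```

With N = 1000 the two-sigma bar comes out near three percentage points, which is why polls of roughly a thousand people are so common.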
In real life, sampling difficulties will make the real error bigger than this, so it is more common for pollsters to quote a somewhat larger number, for example, +/-0.05.
We talked about a basic Bayesian way to do this in practice, namely, in a spreadsheet. We could list a sequence of equally spaced center-points for an interval of r, for example, 0.05, 0.15, 0.25, ..., 0.95, representing intervals of length 0.1. We assign each value of r a prior. One suggestion was 1/10 for each, but one student pointed out that a more realistic prior would be larger for values of r around 1/2 and smaller or near-zero for values of r that deviate significantly from 1/2. Then we can compute the likelihood, which we determined was given by r^n(1-r)^(N-n), for each value of r, where N is the total number of voters in the sample and n is the number of voters favoring candidate A. (We ignored non-responses.) Then in a few mouse strokes we can calculate the joint probability column, compute its sum, and divide it into each joint probability to get the posterior probability.
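The spreadsheet columns translate directly into a few lines of code. This is a sketch of the same calculation; the sample counts N = 100 and n = 55 are hypothetical, and the prior is the flat 1/10 suggested in class.

```python
# Discretized states of nature: center-points 0.05, 0.15, ..., 0.95
centers = [0.05 + 0.1 * i for i in range(10)]
prior = [1 / 10] * 10                 # the flat prior suggested in class

N, n = 100, 55                        # hypothetical sample: 55 of 100 favor A

# Likelihood column: r^n (1-r)^(N-n) for each state of nature
likelihood = [r**n * (1 - r)**(N - n) for r in centers]

# Joint column, its sum, and the normalized posterior column
joint = [p * L for p, L in zip(prior, likelihood)]
total = sum(joint)
posterior = [j / total for j in joint]
```

As in the spreadsheet, the posterior column sums to 1, and with these counts its mass concentrates on the grid point nearest n/N = 0.55.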
Finally, we set the date of the first test for Friday, October 10.
Wednesday, September 24, 2008
Class, 9/24
Today we first went through the problem set.
On problem 1, I stressed that the data is that the "expert" said that it is a Super Growth Stock, not that it is a SGS (that is something we don't know, so it can't be data). Something that you know to be true is data; something that you want to know but don't is a state of nature. Because of this confusion, at least one group got their tree wrong by ignoring the branch where SGS was false. This means that they missed counting in their calculation the probability that the "expert" said that the stock was a SGS, when it was not (false positive). If the false positive rate is very large, then you cannot ignore this part of the problem!
The only other problem was that the statement says that your friend is no better than a monkey throwing darts at the stock pages in picking stocks, so the prior probability that he actually picked a SGS is only 1/1000. Some people entertained the notion that it was 1/2, but that contradicts the statement of the problem.
Problem 2 and Problem 1 are basically the same, the only difference is that the false positive rate and the false negative rate are different in this problem (they were the same in problem 1).
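The point about not ignoring the false-positive branch can be made concrete. The prior 1/1000 is from the problem statement, but the true- and false-positive rates below are hypothetical stand-ins, since the exact numbers are on the problem sheet.

```python
prior_sgs = 1 / 1000        # monkey-throwing-darts prior from the problem
p_say_given_sgs = 0.95      # hypothetical rate: expert says "SGS" when it is one
p_say_given_not = 0.10      # hypothetical false-positive rate; it cannot be ignored

# Total probability the expert says "SGS" counts BOTH branches of the tree:
p_say = p_say_given_sgs * prior_sgs + p_say_given_not * (1 - prior_sgs)

# Bayes' theorem: posterior probability the stock really is a SGS
posterior_sgs = p_say_given_sgs * prior_sgs / p_say
```

With a false-positive rate this large relative to the tiny prior, the posterior stays under 1%: the false-positive branch dominates the denominator, which is exactly what the groups who dropped that branch missed.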
In problem 3, the main difficulty was that it has two parts. First, after picking out four chocolate chip cookies in a row, the posterior probability that you have the all chocolate chip cookie box is now 42/43, not 1/2, as some assumed. This changes the probability that the next cookie chosen from the box is a CC cookie quite significantly: It is very close to 1 after you get 4 CC cookies out and no raisin oatmeal cookies.
In problem 5, one group tried to solve it by displaying a particular example of independence in a table and showing that the three relationships were satisfied. The problem is that this only shows it for that particular table, but not in general. To use this method, you'd have to do it for every single one of the infinite number of tables that could be conjured up. This is obviously impossible. What I was looking for was something like this (for one of the things requested):
If P(A|B)=P(A), show that P(A,B)=P(A)P(B).
Because P(A,B)=P(A|B)P(B) no matter what, by substitution from the assumed condition that P(A|B)=P(A) we find that P(A,B)=P(A)P(B).
The More Independence problem was no problem!
We finished the class by discussing a slightly more complicated example of a decision tree, where you are personally deciding whether to install an airbag into your car after, for example, it had deployed in an accident. We showed how the expected value or loss (loss in this case) would propagate backwards in the decision tree, allowing us to cut off a more costly branch at the square (decision) box to choose the best outcome.
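The backward propagation through the tree can be sketched in a few lines. All the costs and probabilities here are hypothetical illustrations, not the numbers used in class.

```python
def expected_loss(branches):
    """Chance node: probability-weighted average of the losses below it."""
    return sum(p * loss for p, loss in branches)

# Action 1: reinstall the airbag (fixed cost) plus the chance-node loss of a crash.
loss_install = 1000 + expected_loss([(0.05, 500), (0.95, 0)])
# Action 2: skip the airbag; a crash is now far more costly.
loss_skip = expected_loss([(0.05, 100000), (0.95, 0)])

# Decision (square) node: keep the branch with the smaller expected loss,
# "cutting off" the costlier one.
best = min(("install", loss_install), ("skip", loss_skip), key=lambda t: t[1])
```

Each chance node is collapsed to its expected loss first, and only then does the decision node compare the actions, which is exactly the backward order we used on the board.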
Comments on Journals
Some brief comments on this week's journals.
1) It is important to understand that the numbers in the "likelihood" column (the probability of what was actually observed, given each of the states of nature) do not have to add to 1. That is because the states of nature are on the right hand side of the bar; the probabilities in this column are not the probabilities of the states of nature, but the probabilities of the data, given the states of nature.
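A small illustration with hypothetical numbers: take the data to be "3 heads in 5 tosses" and the states of nature to be three possible values of the heads probability r. The likelihood column sums to less than 1.

```python
from math import comb

n_heads, n_tosses = 3, 5        # hypothetical observed data
states = [0.3, 0.5, 0.7]        # hypothetical states of nature for r

# Likelihood column: probability of the DATA given each state of nature
likelihood = [comb(n_tosses, n_heads) * r**n_heads * (1 - r)**(n_tosses - n_heads)
              for r in states]

column_sum = sum(likelihood)    # not 1: these are probabilities of the data,
                                # not probabilities of the states of nature
```

Each entry is a perfectly good probability on its own; it is only across the column, where r varies on the right of the bar, that the sum-to-1 rule does not apply.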
2) One student wrote about HIV testing, and about the effects of false positives (the book gives an example). He pointed out that a false positive that is due to mislabeling of the sample affects two people, not only the one who gets the false positive report, but also the one whose sample was actually positive, but who probably got a negative report because the wrong sample was assigned to him/her (switched at the lab).
3) Another student wrote from personal experience about a friend whose mother got a false positive mammogram and had to undergo significant psychological and physical pain before cancer was ruled out. On the other hand, failure to follow up on a positive test could be catastrophic, given that in the general population almost 10% of women who get a positive test actually have the disease. What should a doctor do? Certainly, tell his patient that there is a better than 90% probability that she does not have cancer, but on the other hand, that it needs to be followed up.
4) I talked in class about alternative King/Brother scenarios. If you know that the King's sibling is older, it has to be a sister. If you know that the King is the older sibling, then there's a 50-50 chance that the younger sibling is a brother.
5) I also discussed what another student brought up, that we professors have to grade you students on a linear scale, which ignores everyone's particular strengths and weaknesses. But, I pointed out, we can and will write letters of recommendation that deal with all of the issues that we know about you, so as to give a potential employer/school a more three-dimensional understanding of you as an individual.
6) One student wrote that we must consider the consequences of actions that we take as well as the probabilities. This was wonderful to read, as that is what we talked about a little on Wednesday and is a major part of the course. Another student talked about the fact that wagers look different when they are for small change than they do when they are about major bucks. We'll talk about that too.