One person remarked on the volatility of the stock market, particularly as we are experiencing now. Generally speaking, the stock market is a risky proposition in the sense that it can be quite volatile over short periods; over the past several months it is down close to 30%. You should not put money into the stock market that you will need soon, even within the next five years; the market is where you put money you won't need for a decade or more. The second thing is diversification, which is best achieved by investing in mutual funds that represent a broad cross-section of the market. The third thing is to hold a mix of stocks and fixed-income investments like bonds and money market funds, whose volatility is much less, even though their long-term potential for return is lower than that of stocks. (Historically, stocks have returned on the order of 10% per year over long periods, although they can be down sharply in any given year. Bonds typically return a few percent over inflation.)
Another person also talked about the stock market and mentioned recent volatility. I did mention that the recent volatility, although bad, is by no means a percentage record: a 700-point drop in one day is about 7%, and there have been much larger percentage drops in history, although 700 points may be a record in points. Only percentage changes reflect what is really happening.
Several people reported that they'd like more clarification of various points. There are many places where you can get this kind of response. Your journal is one; class is another, and you know by now that I'm happy to get your questions. Or you can talk to me out of class. Or you can ask questions by posting them as comments to this blog. Or you can send me email. I welcome all of these.
One person had read Lewis' book and asked about the Prisoner's Dilemma problem. This is of course a problem in game theory, which isn't really part of this course. But it is an interesting problem nonetheless, as it raises the question of whether there is a way for the prisoners to avoid falling into the jailors' trap, thus ending up with sentences that are more favorable to both. There are approaches that can do this, by embedding this particular game into a larger one. You might find information about this on Wikipedia or elsewhere on the web.
One person did an analysis of all of the possible shooting orders for the Trewel problem, not just ABC but also all other permutations such as CBA, CAB, BAC, etc. In all of these situations, the result is that the best shooter has the best probability of surviving, and the worst shooter has the second-best probability of surviving. It is Bob who has the lowest probability of making it out alive. Very interesting!
One person asked about Fermi problems: "If you are possibly starting with a completely wrong number, what's the point?" The point is that you are often in a situation where a decision must be made on imperfect knowledge, so you have to make such estimates. It is therefore a good skill to develop, and practice makes perfect: the more practice you have, the more skilled and confident you become, and the better you'll do.
Another person asked about polls taken over a period of time: does each poll have its own bell-shaped curve? The answer is yes; each poll has some uncertainty and therefore its own bell-shaped curve. But people's opinions change over time, so we can't just average polls taken over several months to figure out what is going to happen next: the average of several polls is more likely to reflect what opinion was halfway through the polling period. More sophisticated approaches (something statisticians call "regression," which is the subject of a Burack lecture next Monday afternoon) would be required.
One person made a mathematical mistake, which I want to point out. If we have several probabilities expressed as percentages, e.g., 3% and 2%, you cannot simply multiply the percentage figures and call the result 6%. Expressed as probabilities, these are 0.03 and 0.02, respectively, so the probability of the joint event (assuming independence) is 0.0006, or 0.06%.
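The arithmetic above can be checked in a couple of lines: convert the percentages to probabilities, multiply, and convert back.

```python
# Joint probability of two independent events given as percentages.
# Convert each percentage to a probability before multiplying.
p1 = 0.03           # 3%
p2 = 0.02           # 2%
joint = p1 * p2     # 0.0006, i.e., 0.06%, not 6%
print(f"{joint} = {joint * 100:.2f}%")
```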
One person mentioned an interest in political science and polls, and I mentioned that Prof. Andrew Gelman at Columbia University has a blog to which he posts daily. He is a Bayesian statistician and a political scientist, author of an interesting book, "Red State, Blue State." Some of what he posts is advanced, but much is quite accessible to nonstatisticians. His blog can be found here. I read it every day.
A very interesting problem was posed by one person, who mentioned hanging out with a friend and finding, within a short distance of each other, two four-leaf clovers. If the probability of finding one four-leaf clover is 10^-4 = 1/10,000, does this mean that the probability of finding two near each other is 10^-8? Actually, it probably isn't, for several reasons. The first is basically a fallacy: it may be that 10^-8 is correct for any two people sitting together at random places around the earth, but if you find one four-leaf clover, your attention is suddenly drawn to this low-probability event, and it is an event that has already happened. So the probability that is really relevant, once you have found one, is P(find a second | found one), and that is at least 10^-4. If you hadn't found a second one, the first one probably wouldn't have been written about. The same fallacy underlies the occasional news story about someone who has already won the lottery winning again. The probability that you win a second time, given that you won once, is the same as the probability that you win once (assuming independence), but the only reason the event made the news is the second win. It is a mistake to be very surprised that occasionally someone wins twice.
The other reason is that four-leaf clovers are (as the person mentioned) due to genetics, or to soil conditions, or to other external factors. That means it probably isn't the case that P(find a second one | found one) = 10^-4. Because of these factors the two finds are probably not independent, and P(find a second one | found one) might be much larger than P(find one). It's not unlikely that four-leaf clovers grow in proximity, that is, in clusters.
Another person asked about the formula square root of N*p*(1-p) for the expected uncertainty of the number of coin flips or voters voting for a candidate, where p is the true probability in the entire population. I pointed out that this formula isn't a part of the course, but was brought up to answer a question that was asked in class. You are not responsible for this formula, and I will not derive it. But one thing puzzled this person: the farther p is from 0.5, the smaller the uncertainty. But it really is true. One way to see this is to consider the case p=0. In that case, the voters are unanimous in favoring candidate B, or the coin has two tails. You have no variation at all, so the formula evaluates to 0, as it should. By extension, as you move away from zero, the variation increases, and because of symmetry (after all, heads and tails are symmetric states; voting for A and voting for B are similarly symmetric), it decreases again for values of p greater than 0.5.
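You can see the behavior described above by just evaluating the formula for a few values of p:

```python
# sqrt(N * p * (1 - p)) is zero at p = 0 and p = 1, largest at p = 0.5,
# and symmetric about 0.5.
import math

N = 1000
for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    sigma = math.sqrt(N * p * (1 - p))
    print(f"p = {p:.1f}  uncertainty = {sigma:5.1f}")
```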
Monday, October 6, 2008
Class, 10/6
We continued working through the study guide for the quiz on Friday.
I had left you with the "three cards" problem. I brought in my trick coins and we determined that there are three states of nature, HH, HT and TT. We did the calculation in a spreadsheet format, taking a prior of 1/3 on each SON, and recognized that the likelihood of observing H is 1 for the HH coin, only 1/2 for the HT coin, and 0 for the TT coin. So this yields a spreadsheet that is similar to the Monty Hall (standard) problem, and if we see an H, the posterior probability is 2/3 that the other side is also H.
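The spreadsheet calculation for the trick coins can be written out directly: prior times likelihood, then normalize.

```python
# Bayes-table ("spreadsheet") calculation for the trick-coin version
# of the three-cards problem, having observed an H.
priors = {"HH": 1/3, "HT": 1/3, "TT": 1/3}
likelihood_H = {"HH": 1.0, "HT": 0.5, "TT": 0.0}   # P(see H | coin)

joint = {c: priors[c] * likelihood_H[c] for c in priors}
total = sum(joint.values())                         # P(see H)
posterior = {c: joint[c] / total for c in joint}
print(posterior)   # HH: 2/3, HT: 1/3, TT: 0
```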
We then finished the cancer problem. The probability that a member of the general population who tests positive does not have the gene is 495/593, or about 5/6, and the probability that this individual does have the gene is about 1/6. Part (3) of the question asks, first, what's the probability that someone with a positive test gets the disease? That is
1/6*0.2 + 5/6*0.0002, or about 0.03333 + 0.00017 = 0.0335. Of these disease cases, most come from people who have the gene; the fraction coming from people without it is only 0.00017/0.0335, or about 0.005. That's only half of one percent.
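Using the exact fractions from the test-result calculation, the disease probability works out as follows:

```python
# Disease probability for someone who tests positive, using the
# course's numbers: P(disease | gene) = 0.2, P(disease | no gene) = 0.0002.
p_gene_pos = 98 / 593        # ≈ 1/6
p_nogene_pos = 495 / 593     # ≈ 5/6
p_disease = p_gene_pos * 0.2 + p_nogene_pos * 0.0002
frac_from_nogene = (p_nogene_pos * 0.0002) / p_disease
print(f"P(disease | positive) ≈ {p_disease:.4f}")          # about 0.033
print(f"fraction without the gene ≈ {frac_from_nogene:.4f}")  # about 0.005
```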
We discussed the galaxy problem. As many of you pointed out, it is exactly like the Shakespeare and Marlowe problem. The problem sets the prior at P(E)=0.8, P(S)=0.2; the likelihood is gotten by raising the probability for each type of object to a power equal to the number of that type of object that the machine found, and multiplying the values together for the three types of object. We found (using a spreadsheet calculation) that the posterior probability was about equal for E and S after this evidence.
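The likelihood construction described above can be sketched in code. The per-object-type probabilities and counts below are made up for illustration; the actual numbers come from the problem statement.

```python
# Posterior for two states of nature, E and S, where the likelihood is
# each type's probability raised to the observed count, multiplied
# across the three object types.  probs and counts are hypothetical.
priors = {"E": 0.8, "S": 0.2}
probs = {"E": [0.6, 0.3, 0.1], "S": [0.2, 0.5, 0.3]}
counts = [4, 3, 2]   # how many objects of each type the machine found

likelihood = {}
for state in priors:
    L = 1.0
    for p, n in zip(probs[state], counts):
        L *= p ** n
    likelihood[state] = L

joint = {s: priors[s] * likelihood[s] for s in priors}
total = sum(joint.values())
posterior = {s: joint[s] / total for s in joint}
print(posterior)
```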
On the plagiarism problem, a question was asked about what the meaning of the code is. We discussed how mathematical tables are constructed: You calculate the number to more significant digits than you plan to publish, and round up or down according to the digit that follows the one you plan to publish. If that following digit is a '5', you have a choice to round up or down. By flipping a coin you can round up or down in a random way such that it is unlikely to be duplicated by someone else independently putting together a table. So in this way you embed a secret code, known only to you, into the table by the rounding pattern. We calculated that if the prior probability is 1/2 for plagiarism vs. accidental agreement (no cheating), then the posterior probability is about 10^-30 that no cheating was involved if the code is duplicated exactly.
We discussed the reason for choosing equal priors: The law says in civil cases that the side with the preponderance of evidence wins the case. That is, anything more than 50%.
I also mentioned that this technique is used to prevent plagiarism in other cases, e.g., map making, by putting small but innocuous mistakes in a map. Also, mistakes in the genome from generation to generation can be used as a "clock" to tell how far back in time two present-day organisms had a common ancestor, as well as the degree of relationship between a number of organisms, for example, how closely related human beings from various parts of the world are when traced back in time.
Finally, we got most of the way through the first urn problem. We decided that there are 10 states of nature, corresponding to there being 1, 2, ..., 10 balls in the urn. The probability that the first ball drawn is unmarked is of course 1, independent of the SON. The second ball drawn was also unmarked, and here the probability of that given the SON is 0 if the urn contains only one ball (because then the only ball in the urn is the one we marked), 1/2 in the case of 2 balls in the urn, 2/3 in the case of 3 balls in the urn, and so forth. The third ball drawn was marked, and at that point there are two marked balls in the urn, so we're picking one that we'd already popped back in. For SONs 2, 3, ..., 10 the probability of picking a marked ball is therefore 2/2, 2/3, 2/4, ..., 2/10. That is where we left it. We'll return to this problem on Wednesday.
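The partial likelihoods worked out so far (unmarked, unmarked, marked, with each drawn ball marked and returned) can be tabulated exactly:

```python
# Likelihood of the observed sequence (unmarked, unmarked, marked)
# for each state of nature n = number of balls in the urn.
from fractions import Fraction

likelihoods = {}
for n in range(1, 11):
    p1 = Fraction(1)                      # first ball: always unmarked
    p2 = Fraction(n - 1, n)               # second: one ball marked so far
    p3 = Fraction(2, n) if n >= 2 else Fraction(1)  # third: two marked now
    likelihoods[n] = p1 * p2 * p3
print(likelihoods)   # n = 1 gives 0, as it must
```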
I mentioned that this sort of thing is used, e.g., by biologists who catch fish, tag them, and release them, then after the population has had a chance to mix up, catching a sample again and seeing what proportion of the fish caught the second time are tagged. This can be used to estimate the size of a population of fish in a lake.
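The fish example is the classic tag-recapture (Lincoln-Petersen) estimate; the numbers below are hypothetical.

```python
# Tag-recapture estimate of a fish population: if the tagged fraction
# in the second catch matches the tagged fraction in the lake, then
# N ≈ (number tagged) * (second catch size) / (tagged in second catch).
def estimate_population(tagged, second_catch, tagged_in_catch):
    return tagged * second_catch / tagged_in_catch

# Hypothetical numbers: tag 100 fish; later catch 60, of which 12 are tagged.
print(estimate_population(100, 60, 12))   # 500.0
```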
Friday, October 3, 2008
Class, 10/3
We passed over the Fermi problem bullet after I re-explained the geometric mean method.
We reminded everyone of the basic equation of conditional probability that underlies everything we are doing: P(A,B)=P(A|B)P(B)=P(B|A)P(A). We talked about three equivalent ways to state that a distribution is independent: it is independent if and only if P(A|B)=P(A) for every A and B; equivalently, if P(A,B)=P(A)P(B) for every A and B; and if P(A|B)=P(A) for every A and B, then necessarily P(B|A)=P(B) for every A and B as well.
We then showed how to construct the unique table of joint probabilities when we are given the marginal probabilities: Just multiply the marginal in a row with the marginal in a column and put the result in the corresponding cell.
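That construction is a one-liner in code; the marginals below are hypothetical.

```python
# Build the joint table of two independent variables from their
# marginals: cell(i, j) = row_marginal[i] * col_marginal[j].
rows = [0.2, 0.8]            # hypothetical row marginals
cols = [0.5, 0.3, 0.2]       # hypothetical column marginals
table = [[r * c for c in cols] for r in rows]
for row in table:
    print(["%.3f" % cell for cell in row])
total = sum(sum(row) for row in table)
print(f"cells sum to {total:.6f}")   # 1.000000, as any joint table must
```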
We then took a table of independent probabilities and changed four cells forming a square: we added an arbitrary number to the two cells on one diagonal and subtracted the same number from the two cells on the other diagonal. This leaves all the marginals unchanged but produces a table where the probabilities are not independent.
We discussed the Monty Hall problem and variants. We found that if there are four doors, and Regular Monty opens two of them, each showing a goat, then the probability of getting the prize goes from 1/4 to 3/4 if we switch. We found that it goes from 1/4 to 3/8 if Monty opens one door and we switch to one of the others. We did this by a spreadsheet calculation. We then thought of a simpler way: Since your probability of initially picking the right door is 1/4, the probability that one of the other doors has the prize is 3/4. That doesn't change when Monty opens one of them and shows you a goat. So, since there are two doors left, the probability that you'll get the right one if you switch is 1/2 times 3/4, or 3/8.
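The 3/8 result for the four-door variant (Monty opens one goat door, you switch to one of the remaining two at random) can be checked by simulation:

```python
import random

# Four-door Monty Hall: Monty opens one goat door (never yours, never
# the prize); you then switch to one of the two remaining doors at random.
def trial():
    doors = [0, 1, 2, 3]
    prize = random.choice(doors)
    pick = random.choice(doors)
    goats = [d for d in doors if d != pick and d != prize]
    opened = random.choice(goats)
    remaining = [d for d in doors if d not in (pick, opened)]
    return random.choice(remaining) == prize

n = 100_000
wins = sum(trial() for _ in range(n))
print(wins / n)   # should be close to 3/8 = 0.375
```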
I left you with another related problem to think about: There are three cards which have been made by pasting together two cards so that the backs are visible. One has two red backs, one has a red back and a blue back, and one has two blue backs. The cards are put in a hat and shaken, and you pick one out, looking at only one back. It is red. What is the probability that the other side is red?
We'll discuss that next time.
We went on to the cancer problem. We took a population of 10,000 individuals. The problem statement says that 1% of the population has the gene, so that's 100 who have the gene and 9,900 who don't. Of those who have the gene, 98 will be detected by the test and 2 missed (false negatives). Of those who don't have the gene, the test will falsely identify 5% as having the gene, or 495 in all (false positives), and will correctly say that the remaining 9,405 do not have the gene. Looking at just the positives, there are 98 true positives and 495 false positives, so the probability that a person has the gene if they test positive is only 98/593, or about 1/6; the remaining 495/593, or about 5/6, do not have the gene.
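The natural-frequency bookkeeping above is easy to verify:

```python
# Natural-frequency version of the cancer problem, population 10,000.
population = 10_000
gene = int(0.01 * population)          # 100 have the gene
no_gene = population - gene            # 9,900 do not
true_pos = 98                          # test detects 98 of the 100
false_pos = int(0.05 * no_gene)        # 495 false positives
p_gene_given_pos = true_pos / (true_pos + false_pos)
print(p_gene_given_pos)                # 98/593, about 1/6
```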
We ran out of time here and will continue on Monday, finishing this problem and then going on in the study guide.
In answer to a question, I stated that if there is an item that we don't get to in our review, then similar items will not appear on the test. I also stated that I expected there to be five questions on the test, and that there should be enough time for everyone to do all of them. I pointed out that the usual test-taking strategy says to go for the easy ones first and save the bulk of the time for the harder ones. I also said that it is very important to at least try to answer every question, since I cannot give credit if an item goes completely unanswered. We agreed that people who come early (not earlier than 10 AM, please) could start early, and that you may be able to stay an extra 5 minutes (but not more, because of the class that meets next in this room) to finish.
Be sure to bring your calculators. I do not have a loaner calculator!
Wednesday, October 1, 2008
Class, 10/1
We discussed the fourth problem set, which was done pretty well by you all. I pointed out several errors that were made:
One group forgot, on the second problem, that three different widgets were sampled independently, so that the likelihood had three factors in it, not one.
One group didn't recognize that the third problem had only two states of nature, that is, whether it is Urn #1 or Urn #2. Somehow this group ended up with five states of nature: 1R, 1W, 2R, 2W and 2B, where the number is the urn number and the letter the color. The point is that the thing you don't know and want to learn always determines what the states of nature are. Here, what we don't know is which urn we've picked, so that tells us what the states of nature are.
One group got the states of nature right, but in the third ball selection forgot that it is the number of balls in the urn when the ball is picked that gives the denominator. True, this is a step made without replacement, but since the first two steps all involved returning the ball (that is, with replacement), there are still ten balls in the urn when the third ball is picked.
On the last problem, one group correctly computed the contribution to the likelihood for each word, but then added them instead of multiplying to get the likelihood. Since the likelihood is the probability that we got 3 of the first word AND 5 of the second AND 3 of the third, you have to multiply. When you compute the probability of one thing AND another thing, you multiply. Addition is for when you want the probability of one thing OR another thing. For example, when you add the joint probabilities in a spreadsheet calculation, you are computing the probability of (data,SON1) OR (data,SON2) OR ..., to get the probability of the data, regardless of which SON is true.
We then finished the "Trewel" problem and calculated that the best thing for Alan to do is to shoot in the air, letting Bob and Charlie duke it out, and then, with one of them dead for sure, to come in on his next turn and try to kill the survivor. We also recalculated the probability of Alan eventually killing Bob when Alan goes first. See Class, 9/29 for a calculation.
Finally, we discussed the first item on the study sheet, Fermi problems. We discussed the length of the Nile river...and I reminded everyone about the geometric mean trick. If you can put a reasonable lower bound on a quantity and a reasonable upper bound, so that you are pretty sure that the true value is between those bounds, then a decent guess at the correct value is to multiply the lower bound by the upper bound and take the square root of that number. For the Nile, a lower bound might be 100 miles and an upper bound 10,000 miles, which would give an estimate of 1000 miles. Wikipedia says 4100 miles, so this is not a great estimate. For the Mississippi, we imagined the U.S. as a box 3000 miles wide and 2000 miles high, so a length of 2000 miles. The actual length is 2340 miles, so that worked better.
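The geometric mean trick is a one-line function:

```python
import math

# Geometric-mean trick for Fermi estimates: if you're confident the
# true value lies between a lower and an upper bound, a decent guess
# is sqrt(lower * upper).
def fermi_estimate(lower, upper):
    return math.sqrt(lower * upper)

# Nile bounds from class: 100 to 10,000 miles.
print(fermi_estimate(100, 10_000))   # 1000.0 (Wikipedia says ~4100)
```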
I pointed out that the important thing with regard to Fermi problems is how you got the answer, not the actual value of the answer.
Monday, September 29, 2008
Class, 9/29
Today we discussed further the polling example and actually calculated a simple result. I'll try to post a copy of the calculation later. We found that with 6 favoring candidate A out of a sample of 10, the posterior probability that candidate A wins (has over 50% of the vote) is about 71%. The posterior distribution is closely bell-shaped and peaked at r=0.6=6/10; peaking at the observed fraction is a general rule.
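The spreadsheet calculation can be sketched with a flat prior over a grid of possible values of r:

```python
# Posterior for the fraction r favoring candidate A, with a uniform
# prior over a grid of r values and 6 of 10 sampled favoring A.
from math import comb

rs = [i / 100 for i in range(1, 100)]      # grid of possible r values
prior = [1 / len(rs)] * len(rs)            # flat prior
like = [comb(10, 6) * r**6 * (1 - r)**4 for r in rs]
joint = [p * L for p, L in zip(prior, like)]
total = sum(joint)
post = [j / total for j in joint]

p_win = sum(p for r, p in zip(rs, post) if r > 0.5)
print(f"P(A wins) ≈ {p_win:.2f}")   # close to the ~71% computed in class
```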
We also asked what would happen if we used a more realistic prior that put more weight near 1/2. In response to a question, I pointed out that you can't put it near 0.6 because that would be "cheating," using the same data twice. You have to do it without looking at the data. We found that a prior that rises to a maximum at 0.5 and then falls again will do two things: It will narrow the posterior distribution somewhat, and will also shift the peak closer to 0.5. If there is a whole lot of data, then the effect of the prior will be negligible, but in our example it can be significant.
The following webpage has election predictions with a chart (lower right hand) that shows a similar posterior distribution, based on calculating the electoral vote in a simulation (this is a modern computational technique, even more powerful than the spreadsheet method we discussed). The left-hand chart has other information that summarizes the posterior probability in several ways: Where the maximum of the posterior probability is, what the win probability is, and so forth.
We discussed a situation where two people enter into a consecutive duel: Alan and Bob will take shots at each other in turn. Alan's probability of hitting Bob and putting him out of commission on one shot is 1/3; Bob's probability of putting Alan out of commission is 2/3. We asked, if they keep taking turns until one hits the other, what's the probability of Alan eventually hitting Bob if he goes first? If he goes second?
Although this could be calculated (as one student suggested) by multiplying and adding probabilities until the numbers were very small, I suggested another way.
If Alan goes first, he'll win outright on his first shot 1/3 of the time. He'll also eventually win with probability (2/3)*P(Alan wins eventually | Bob goes first). So P(Alan wins | Alan first) = 1/3 + (2/3)*P(Alan wins | Bob first). Similarly, P(Alan wins | Bob first) = (1/3)*P(Alan wins | Alan first), since Bob must miss (probability 1/3) before Alan gets a turn. Solving these two equations together gives P(Alan wins | Alan first) = 3/7 and P(Alan wins | Bob first) = 1/7.
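The duel probabilities can also be checked by brute-force simulation:

```python
import random

# Simulate the Alan-vs-Bob duel: Alan hits with probability 1/3,
# Bob with probability 2/3; they alternate shots until someone hits.
def alan_wins(alan_first):
    alan_turn = alan_first
    while True:
        if alan_turn and random.random() < 1/3:
            return True      # Alan hit Bob
        if not alan_turn and random.random() < 2/3:
            return False     # Bob hit Alan
        alan_turn = not alan_turn

n = 100_000
p_first = sum(alan_wins(True) for _ in range(n)) / n
p_second = sum(alan_wins(False) for _ in range(n)) / n
print(p_first, p_second)   # near 3/7 ≈ 0.43 and 1/7 ≈ 0.14
```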
P(Alan wins eventually | Alan goes first) = 1/3 + (2/3)*P(Alan wins eventually | Bob goes first).

But P(Alan wins eventually | Bob goes first) can be calculated in terms of P(Alan wins eventually | Alan goes first), because the only way Alan can win eventually if Bob goes first is if Bob misses on his first try (probability 1/3). Then it is as though Alan goes first, so

P(Alan wins eventually | Bob goes first) = (1/3)*P(Alan wins eventually | Alan goes first).

Substituting, P(Alan wins eventually | Alan goes first) = 1/3 + (2/3)*(1/3)*P(Alan wins eventually | Alan goes first). This can be solved for P(Alan wins eventually | Alan goes first); it turns out to be 3/7, and then P(Alan wins eventually | Bob goes first) = (1/3)*(3/7) = 1/7.
Finally, I brought in Charlie, who is a crack shot and never misses. We decided that if Charlie goes first, he'll knock off Bob, the better shot, and then, if Alan misses, Charlie will knock off Alan the second time around. Likewise, if Bob went first (to be followed by Charlie), Bob would try to knock off Charlie first, since if he knocked off Alan instead, he'd be a goner on Charlie's turn. But if Alan goes first, the initial guess that he'd go after Charlie seemed to be wrong, as one student pointed out. Actually, Alan has a better chance of survival if he shoots his gun into the air! More on this later.
Study Sheet for First Quiz
The first test will be on October 10 (Friday). There is a study sheet here. We will start discussing this on Wednesday or Friday, so get together with your group and prepare yourselves for our discussion.
Sunday, September 28, 2008
Class, 9/26
We talked about Problem #4. I mentioned first that it was modelled after a book by Mosteller and Wallace (Mosteller is a famous statistician), in which they tried to determine the authorship of several disputed articles in the famous Federalist Papers, which were published to persuade Americans to adopt the Constitution.
The idea is that each use of a word may reflect on the author, since different authors tend to use words with different frequencies. So one author might use "while" where another would use "whilst." ("Whilst" was still in common use in this part of the world in 1789.) We can therefore form the likelihood in this problem by multiplying the probability of each word, given the author, for each time a word appears in the text. There are two authors, so we get two such products of many numbers, one factor for each word in the sample text.
This requires computing quantities like (0.002)^5. Unfortunately, this results in some very small numbers. I recommended using powers-of-ten notation, so that you would have, for example, (0.002)^5 = (2x10^-3)^5 = 32x10^-15. You'll get a small integer times a very small power of ten written in scientific notation. The good news is that the powers of ten will cancel out of the final answer.
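As an illustration of how these tiny products behave, here is a sketch of the word-frequency likelihood idea. The per-word probabilities below are invented for the example (they are not the numbers from Problem #4); Hamilton and Madison are the disputed Federalist authors. Note that the posterior ratio is perfectly well-behaved even though both likelihoods are around 10^-11.

```python
# Hypothetical per-word probabilities for two candidate authors.
# The likelihood for each author is the product of the probabilities
# of the observed words; the common powers of ten cancel in the ratio.
text = ["whilst", "upon", "whilst", "the", "whilst"]
p_given_hamilton = {"whilst": 0.002, "upon": 0.01, "the": 0.06}
p_given_madison = {"whilst": 0.005, "upon": 0.002, "the": 0.06}

like_h = 1.0
like_m = 1.0
for word in text:
    like_h *= p_given_hamilton[word]
    like_m *= p_given_madison[word]

# With equal priors (1/2 each), the priors cancel and the posterior
# probability for Hamilton is his share of the two likelihoods.
posterior_h = like_h / (like_h + like_m)
```

Here the three occurrences of "whilst" favor the author who uses it more often, so the posterior comes out below 1/2 for Hamilton.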
At the end of class, I discussed polls a bit. We determined that the states of nature are the various proportions r of voters who favor candidate A over candidate B; there are infinitely many such numbers. We also discussed how the error in the result theoretically goes down as the size of the sample goes up: the error (plus or minus) in the number of voters in the sample favoring either candidate is roughly (N*r*(1-r))^(1/2). So if N is 1000 and r = 1/2, the expected error in the number of voters is about 15, and the error in r is about 15/1000, or 0.015. Doubling that gives what those of you who have taken statistics before would call the 95% error bar; that is, we expect an error larger than +/-0.03 in only about 5% of cases.
In real life, sampling difficulties make the real error bigger than this, so it's normal for pollsters to quote a somewhat larger number, for example, +/-0.05.
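The square-root rule above takes one line to check (a quick sketch with the numbers from class):

```python
from math import sqrt

# Expected error in the raw count is (N*r*(1-r))^(1/2).
N, r = 1000, 0.5
count_error = sqrt(N * r * (1 - r))   # about 15.8 voters
r_error = count_error / N             # about 0.016 in the proportion
two_sigma = 2 * r_error               # the 95% error bar, about 0.03
```

Notice that quadrupling the sample size only halves the error bar, since the error grows like the square root of N while the proportion divides by N itself.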
We talked about a basic Bayesian way to do this in practice, namely in a spreadsheet. We list a sequence of equally spaced center-points for intervals of r, for example 0.05, 0.15, 0.25, ..., 0.95, representing intervals of length 0.1. We assign each value of r a prior. One suggestion was 1/10 for each, but one student pointed out that a more realistic prior would be larger for values of r around 1/2 and smaller or near zero for values of r that deviate significantly from 1/2. Then we compute the likelihood, which we determined is r^n*(1-r)^(N-n) for each value of r, where N is the total number of voters in the sample and n is the number favoring candidate A. (We ignored non-responses.) Then in a few mouse clicks we can calculate the joint probability column, compute its sum, and divide the sum into each joint probability to get the posterior probability.
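The spreadsheet columns translate line-for-line into code. This sketch uses a prior peaked at 1/2, as the student suggested; the triangular shape is my own choice of a simple prior that is largest near 1/2, not the one we used in class.

```python
# Each grid value of r is one spreadsheet row.
# Columns: prior, likelihood, joint, posterior.
N, n = 10, 6
midpoints = [0.05 + 0.1 * i for i in range(10)]   # interval center-points

# Triangular prior peaked at r = 0.5 (one way to put more weight near 1/2).
raw = [1 - 2 * abs(r - 0.5) for r in midpoints]
prior = [w / sum(raw) for w in raw]

likelihood = [r**n * (1 - r)**(N - n) for r in midpoints]
joint = [p * L for p, L in zip(prior, likelihood)]
posterior = [j / sum(joint) for j in joint]       # divide by the column sum

# The posterior peak sits between 0.5 and the sample proportion 0.6,
# illustrating how the peaked prior pulls the answer toward 1/2.
peak_r = midpoints[posterior.index(max(posterior))]
```

With the flat prior the peak lands on the interval containing 0.6; with this peaked prior it moves to the interval centered at 0.55, exactly the shift toward 1/2 we discussed.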
Finally, we set the date of the first test for Friday, October 10.