Monday, January 26, 2009
Chart Set #4: Interpretation
Here's the next chart set, on ways to interpret the posterior distribution. See you on Tuesday.
Thursday, January 22, 2009
STAT 295 1/22/09
In class I mentioned advice on programming style. In particular I recommend Software Carpentry, which, although based on the Python language, has information that applies to any programming project in a modern computer language. The website has mp3 files of a number of lectures on various aspects of programming, along with charts that go with the lectures. Besides programming style, it covers other important topics, such as version control software, which allows multiple programmers to work on the same project without stepping on each other's toes, and allows earlier versions to be resurrected in case a later version is "broken" in ways that are hard to trace.
More to come...
Sorry for the delay. I had to perform a complete backup on my computer, which took a long time.
Assignment: Learn R syntax as described by Jeff in his discussion.
With regard to the homework: you noticed that the confidence interval coverage was not always as advertised. In fact, coverage will fall short when np or n(1-p) is small; as a rule of thumb, you need both np and n(1-p) to be at least 10 for the interval to work as advertised.
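As a quick check, here is a short R simulation of this breakdown. The choice of the usual Wald interval and the particular values of n and p are my own illustrative choices, not taken from the homework:

```r
# Simulate coverage of the usual Wald 95% confidence interval for a
# binomial proportion: phat +/- 1.96*sqrt(phat*(1-phat)/n).
# It should cover the true p about 95% of the time -- but not when
# np or n(1-p) is small.
coverage <- function(n, p, nsim = 10000) {
  x <- rbinom(nsim, n, p)          # nsim simulated counts
  phat <- x / n
  se <- sqrt(phat * (1 - phat) / n)
  lo <- phat - 1.96 * se
  hi <- phat + 1.96 * se
  mean(lo <= p & p <= hi)          # fraction of intervals covering p
}
set.seed(1)
coverage(1000, 0.5)   # np large: close to the advertised 0.95
coverage(10, 0.1)     # np = 1: well below 0.95
```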
We then reviewed some basic statistical notions: the sample mean and its relationship to expectations, how these are used in simple (frequentist) statistical estimation, the meaning of variance and sample variance, and the fact that the denominator (n-1) in the sample variance makes it unbiased. (But note: the usual formula for the standard deviation, obtained by taking the square root of the sample variance, is not unbiased!)
If x is a discrete random variable that takes values x_1, ..., x_q with probabilities p(x_1), ..., p(x_q) respectively, then we define the expectation of x as E[x] = Σ_{i=1}^{q} x_i p(x_i).
More generally, for any function g(x) of x, we define the expectation of g(x) as E[g(x)] = Σ_{i=1}^{q} g(x_i) p(x_i).
For continuous random variables, the sum becomes an integral.
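In R, the defining sums can be computed directly; the three-point distribution below is just a toy example:

```r
# Expectation of a discrete random variable and of g(x), computed
# directly from the definition, for a toy distribution.
xs <- c(1, 2, 3)
ps <- c(0.2, 0.5, 0.3)   # probabilities, summing to 1
sum(xs * ps)             # E[x] = 2.1
g <- function(x) x^2
sum(g(xs) * ps)          # E[g(x)] = E[x^2] = 4.9
```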
Repeated samples: How does the sample mean behave? Leads to confidence intervals, etc. Based on thought experiment, "what happens if I take repeated samples?" but in reality we only take one sample. This is one feature that distinguishes Bayesian reasoning from frequentist reasoning...the Bayesian conditions on the one unique sample that we have, and doesn't ask what happens if we sample repeatedly in this thought experiment.
Tried some R code. We discussed in particular the invisible() function, accessing lists, and the fact that one must use print() within a function in order to get R to display something on the screen.
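A small example of the invisible()/print() behavior we discussed (the function here is my own illustration):

```r
# Inside a function body, evaluating an expression displays nothing;
# print() forces output to the screen. A value returned via invisible()
# is usable but is not auto-printed at top level.
f <- function(x) {
  print(x^2)        # displayed: print() forces output
  x^2               # evaluated, but displays nothing inside a function
  invisible(x^2)    # returned, but not auto-printed by the caller
}
y <- f(3)           # prints 9 exactly once (from the print() call)
y                   # 9: the value was returned, just not auto-printed
```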
Probability densities are used for continuous random variables. We briefly set out the rules.
Note that a density can be zero over part of the interval, and that, unlike a probability (or a distribution function), a density can be larger than 1.
In the definition of the beta density, define Γ(a) = ∫_0^∞ x^{a-1} e^{-x} dx. For positive integers this gives a factorial: Γ(n) = (n-1)!. The gamma function is the standard smooth extension of the factorial function and allows us to interpolate it.
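R's gamma() function lets you check the factorial connection directly:

```r
# Check Gamma(n) = (n-1)! for small integers, using R's built-in gamma().
n <- 2:8
gamma(n)              # 1 2 6 24 120 720 5040
factorial(n - 1)      # the same values
# Gamma also interpolates between the factorials:
gamma(1.5)            # sqrt(pi)/2, about 0.8862
```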
We started on the third chart set.
Bayes' picture, although widely seen on the web, is probably not of him.
Jeff gave his rant about naming theorems for the person instead of what it does. I have some sympathy for this viewpoint.
The proof of Bayes' theorem is trivial.
We regard Bayes' theorem as a model of learning. At any point in time, we have an opinion about hypotheses, parameters, etc., which is given by our prior probability distribution. When new data comes in, Bayes' theorem can be used to update our opinion, giving our posterior probability distribution.
Sometimes (as in the simulation approach we will use extensively in this course) you can bypass the computation of the denominator P(D) in Bayes' theorem, which is why we so often see it written as a proportionality. Until about 20 years ago, before simulation methods were introduced, we could not bypass the computation of P(D), which was a major pain, since it usually requires integrating over a high-dimensional space. So, despite their elegance, Bayesian calculations were difficult and available only for very simple situations (e.g., normal regression). But that has all changed.
In particular, the denominator is given in the discrete case by Σ_i P(D|A_i) P(A_i) and in the continuous case by ∫ p(x|θ) p(θ) dθ.
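To make the proportionality idea concrete, here is a small R sketch that evaluates an unnormalized posterior on a grid and normalizes only at the end by a sum standing in for the integral. The Beta(1,1) prior and the data (7 successes in 10 trials) are my own illustrative choices, not numbers from the lecture:

```r
# Posterior ∝ likelihood x prior on a grid; normalize at the end.
theta <- seq(0.001, 0.999, length.out = 999)  # grid over the parameter
prior <- dbeta(theta, 1, 1)                   # flat Beta(1,1) prior
like  <- dbinom(7, size = 10, prob = theta)   # likelihood of the data
post  <- prior * like                         # unnormalized posterior
post  <- post / sum(post)                     # normalize only at the end
theta[which.max(post)]                        # posterior mode, 0.7
```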
Unfortunately, my notation on the slides is inconsistent...we ought to have written X instead of D and Θ instead of A. This will be fixed for the next time the class is taught.
Finally, I want to point out two other blogs that are useful. The author of the book, Jim Albert, has a blog for his course here. You may find his discussion of the election interesting. This blog is not very active. And, I read Andrew Gelman's blog daily. You'll find lots of discussion of Bayesian things there. Andrew is a statistician/political scientist, and he's particularly interested in voting patterns and other issues of this sort.
Tuesday, January 20, 2009
STAT 295 1/20/09
For now, here are the recently-posted charts for Thursday.
Jeff started the lecture by displaying the axioms of probability. Jeff prefers not to call the definition of conditional probability an axiom. I like to call it an axiom. Regardless, we need to introduce the concept of conditional probability, and this formula is central to everything that we'll be doing.
Jeff mentioned that you should try to finish the proof that P(A∨B) = P(A) + P(B) − P(A,B). He also showed how a partition of the space into a set of mutually exclusive and exhaustive alternatives can be used to get the marginalization formula P(B) = Σ_i P(B|A_i) P(A_i).
(I have now learned how to put a limited amount of mathematics into this blog. But I'll leave a link to a useful LaTeX equation translator in the blog here. You can paste a LaTeX equation into the box, then hit "Render Expression" to see what it looks like in "real" mathematical notation. So I recommend bookmarking this page.)
I apologize for the poor rendition of tables in the projected version; this seems to be an incompatibility between the Mac version of PowerPoint and the Windows version that is used on the computer in the classroom. I will try to fix this in the future. Meantime, the downloaded pdf's should be fine.
An example of a probability distribution is from the astronomical HD (Henry Draper) catalog, an early 20th century effort to classify stars according to their spectra, that is to say, the pattern of light and dark regions when you spread the light from a star out into its component colors. Stars of different temperatures have different patterns. In the early days, the stars were classified A, B, ... , O, but later it was found that some classes were duplicates, so they were merged. Also, it was discovered that the reason for the different patterns had to do with the temperature of the star, which ended up with a different order than the original one, namely, OBAFGKM, which many generations of mostly male astronomy students memorized using the vaguely sexist "Oh, be a fine girl, kiss me!" Later generations have used more neutral mnemonics.
I pointed out that this distribution, which is bimodal, is for the stars that appear brightest in the sky, and that the intrinsic distribution has the number of stars in each class increasing from left to right as the temperatures decrease. There are more M stars than any other class, but they are underrepresented in this distribution because M stars are intrinsically faint and thus not in the catalog. At the same time, the peak at A is due to the fact that these stars are very bright, and thus seen to very great distances, and so are overrepresented in this survey, which cut off at a faint limit.
We described the joint distribution of meteorite finds (you just walk along and spy a meteorite on the ground) versus falls (where the object is seen to fall and people go out and look for it). Stones versus irons was the other classifier; stones look like ordinary stones, whereas iron meteorites are much denser, are magnetic, and have a distinctive appearance. We talked about the marginal distributions, gotten by summing a row or column. I noted that amongst falls, stones predominate, whereas amongst finds, irons are most common. This is because stones weather over time and eventually look just like terrestrial rocks, so they are overlooked by most people; irons, on the other hand, don't weather and continue to look different from terrestrial rocks. Presumably the actual proportions of stones and irons in the solar system are closer to those amongst the falls.
Jeff introduced the question of tossing a HH, a HT and a TT coin (using the notation on the charts). The question was, if you pick one of the three at random, and without looking at it toss it, and see that it comes up heads, what's the probability that the coin is HH? The (surprising to some) answer is that it is 2/3, not 1/2 as some surmised. Jeff then showed how this came about, first by drawing a probability tree and putting the P(HH)=P(HT)=P(TT)=1/3 branches at the root of the tree, and then the conditional probabilities P(S_H|HH)=1, P(S_H|HT)=1/2, etc., on the subbranches, and multiplying to get the joint probabilities at the leaves of the tree. Adding the joint probabilities that belong to S_H together we see that P(S_H)=1/2, and since P(S_H,HH)=1/3, dividing gives us P(HH|S_H)=2/3.
Another chart showed what happened when you flipped the tree.
Finally, the same thing was done using "natural frequencies." The idea here is to imagine a large number of identical cases, 3000 on the chart. Of these, 1000 will be a flip of the HH coin, and all of these will be seen as heads. 1000 will be a flip of the HT coin, but only 500 of these will be seen as heads. And of course, the 1000 flips of the TT coin will result in no heads being seen. But now there are twice as many cases where the HH coin was flipped and we see a head than when the HT coin is flipped. Again, we get 2/3.
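A quick Monte Carlo check of the 2/3 answer in R (the simulation setup mirrors the chart's notation but is my own construction):

```r
# Among tosses that show heads, about 2/3 should come from the HH coin.
set.seed(1)
n <- 100000
coin <- sample(c("HH", "HT", "TT"), n, replace = TRUE)  # pick a coin
p.heads <- c(HH = 1, HT = 0.5, TT = 0)[coin]            # P(heads | coin)
heads <- runif(n) < p.heads                             # toss it
mean(coin[heads] == "HH")                               # about 0.667
```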
The natural frequency method is especially good when trying to explain probability concepts to people who are mathematically naive. Keep it in your toolkit; it can be very useful for a professional statistician. It's described in great detail in this book by Gerd Gigerenzer.
Jeff then described the cognitive dissonance experiment, which has been done by psychologists in various forms for many years. What is found is that if you give a monkey, for example, a choice between two items and it picks one, and if you then give the same monkey a choice between the rejected item and a third item, the monkey will tend to pick the third item over the one that was initially rejected. Psychologists had explained this by saying that the monkey rationalizes its initial rejection ("I really don't like that one."), but in 2007 it was discovered that a purely statistical explanation exists. There are only three ways to rank three items such that the rejected item ranks below the one that was initially selected, and in two of those three ways, the third item ranks above the rejected one. So a lot of experiments on "cognitive dissonance" will have to be reexamined.
The example is related to the notorious Monty Hall Problem.
We discussed the Hubble Space Telescope. After the first launch, it was found that the solar panels had the unfortunate property of "creaking" when they went from the sun side of the Earth into the Earth's shadow. This caused "Hubble quakes," where the telescope would shake, producing unacceptably blurred images. We computed P(good data)=P(good data|no quake)P(no quake)+P(good data|quake)(1-P(no quake)).
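The calculation has this shape in R; the probabilities below are made up for illustration, not the numbers from the lecture:

```r
# Law of total probability for the Hubble example, with hypothetical
# (invented) inputs:
p.no.quake            <- 0.7    # hypothetical
p.good.given.no.quake <- 0.95   # hypothetical
p.good.given.quake    <- 0.2    # hypothetical
p.good <- p.good.given.no.quake * p.no.quake +
          p.good.given.quake * (1 - p.no.quake)
p.good                          # 0.725
```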
Jeff then introduced marginalization as an important tool in inference, and in particular in Bayesian inference where it is used all the time. By summing or integrating over parameters we are not interested in, we can obtain a distribution over the remaining parameters. We did this for the distribution of luminosity and temperature of stars, and for the genetics of coat color in rats. We skipped over the bridge example.
Jeff defined independence: A is independent of B iff P(A|B)=P(A) for all A and B. Equivalently, if A is independent of B then B is independent of A, and also if A and B are independent then P(A,B)=P(A)P(B) for all A and B. Any of these may be taken as the definition of independence, and the other relationships can be derived. We asked you to do this, i.e., show that the first definition implies the second and third, and that the third implies the first. (Homework, not to be turned in.)
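A small numeric check of the equivalence, using a toy joint distribution built to be independent by construction (the numbers are illustrative):

```r
# With P(A,B) = P(A)P(B) by construction, the conditional P(A|B)
# reduces to P(A) for every value of B.
pA <- c(0.3, 0.7)
pB <- c(0.4, 0.6)
joint <- outer(pA, pB)                 # joint[i,j] = P(A_i) * P(B_j)
pB.marginal <- colSums(joint)          # marginal of B, recovers pB
condA <- sweep(joint, 2, pB.marginal, "/")  # P(A_i | B_j), columnwise
condA                                  # each column equals pA
```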
We discussed the fact that higher 5-year survival rates for lung cancers detected by CT scans versus those discovered by X-rays can be a statistical artifact, and need not mean that mortality (the age at which a patient would die of the disease) has been reduced. For one thing, CT scans detect a tumor when it is smaller, and hence earlier; the patient might still die at the same age. Also, CT scans detect more indolent tumors, which would not have killed the patient. Turner Osler, who is familiar with the research, pointed out that it was funded by the tobacco industry, a fact that was not made clear when the paper was published. The subtext was: go ahead and smoke, just get CT scans. This is, of course, quite unethical.
Finally, Jeff showed two ways of computing the sample mean, and showed how the second one, in the limit, goes over to the usual way of computing expectations. He wanted me to add the term "relative" to one of the lines, thus "observed relative frequency of x".
Friday, January 16, 2009
Difficulties installing packages under Windows Vista
One person had difficulty installing LearnBayes under Windows Vista. An error message was returned, as follows:
In file.create(f.tg) :
  cannot create file 'C:\PROGRA~1\R\R-28~1.0/doc/html/packages.html', reason 'Permission denied'

Jeff suggests the following fix:
I'm not sure what the problem is, but I know I had problems using R and other software with Vista until I turned the "User Account Control" feature off. You'll need to do this. Turn it off and keep it off; I've had no ill effects over many months. Go to the Windows Security Center and it'll be clear how to turn it off.
Thursday, January 15, 2009
STAT 295 1/15/09
New charts for Tuesday: Simple Examples
Assignment (due Thursday, 1/22): Read Chapter 1 of the book; do problems 4 and 5 on pp. 16-17.
In the lecture, Jeff talked about a number of useful features of R:
Statements like y>3 are logical: if y is a vector, the result is a vector of TRUE/FALSE values, computed component-by-component. Then you can use z[y>3] to select out only those components of the vector z at the positions where the condition holds. The resulting vector will in general be shorter than the original z vector, since it keeps only the elements at positions where y>3 is TRUE.
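A minimal example of this logical indexing:

```r
# y > 3 is a logical vector; z[y > 3] keeps the elements of z at the
# positions where the corresponding element of y exceeds 3.
y <- c(1, 5, 2, 7)
z <- c(10, 20, 30, 40)
y > 3        # FALSE TRUE FALSE TRUE
z[y > 3]     # 20 40 -- generally shorter than z
```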
Jeff made a connection between data frames and other data structures including those from SAS. I know nothing about SAS. But the basic idea here is that you can have a line or row in a data frame that represents an observation, and the individual entries may have different types. I.e., they do not all have to be numbers (as in a matrix), the first could be an integer, the second could be a logical value (TRUE, FALSE), the third could be a character string ("TRUE", "FALSE"), etc.
You can attach a data frame. This allows you to access the entries in the data frame more easily (i.e., with fewer keystrokes). However, this poses many risks. For one thing, you may have several data frames. Which one will be fetched? Also, the variables are global, and generally, good software practice advises not using global variables.
Lists: This is the best way to return multiple items from a function. You can write
return(list(x,y,z))
and get an object, the components of which are x, y, z, even if x is a number, y is a matrix, and z is a list itself of other objects.
You can access the items of a list using subscripts, e.g., if b=list(x,y,z) then
b[[1]]
will return x, whatever it is (matrix, vector, number, etc.). And since b[[1]] is an object, it can be sub-accessed by whatever method is appropriate; for example, if it were a vector, then the 5th component could be obtained by writing
b[[1]][5]
If you write
f=function(...){return(list(a=x, b=y, c=z))}
then you will be able to access the objects of the returned list by using the names a, b, c. So, for example, if you've called the function f that returns that list, and said w=f(..), then if you write w$b, you will get the value of y that the function computed.
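Putting these pieces together (the function and names here are just for illustration):

```r
# Return several objects of different types as a named list, then
# access them by name with $ or by position with [[ ]].
f <- function(v) {
  list(a = mean(v), b = length(v), c = v^2)
}
w <- f(c(1, 2, 3))
w$a        # 2, by name
w[[2]]     # 3, by position
w$c[3]     # 9: sub-access the third element of the vector component
```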
Jeff pointed out that the R functions for standard statistical distributions (normal, binomial, etc.) are prefixed by 'r', 'p', 'q', or 'd', depending on what you want. So 'rnorm' generates a vector of normally distributed random variables, 'dnorm' calculates the normal density at the requested point, 'pnorm' gives the cumulative distribution function, and 'qnorm' is its inverse, the quantile function.
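For example, for the normal family:

```r
# The four prefixes, illustrated with the standard normal:
set.seed(1)
x <- rnorm(5)          # 'r': five random draws
dnorm(0)               # 'd': density at 0, 1/sqrt(2*pi), about 0.399
pnorm(1.96)            # 'p': CDF, about 0.975
qnorm(pnorm(1.96))     # 'q': inverse of pnorm, gives back 1.96
```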
On control structures, we recommend using a smart editor that pairs your braces {} so as to make the logic of the program clear. Indentation control is important!
Jeff then went into the discussion of the basics of probability. He showed that if a hypothesis A predicts that we will see evidence E more probably than hypothesis B predicts that we will observe E, then when we observe E, we should think that the evidence E supports A more than it does B.
Bill went into a song-and-dance demonstration about the differences between Bayesian and frequentist interpretations of the meaning of probability. Everyone agreed that before the coin was tossed, the probability that it would be 'heads' was 0.5; but once it was tossed, the answers were all over the place. The majority opinion was something like "it is 1 or 0, but I don't know which." The minority opinion was "it is 0.5."
Then I looked at the coin, and knew what it was. I reported that I had done this, but I didn't say what I saw. Nothing much changed in your opinions.
Then I told you that I saw 'heads'. Some changed opinions (this is significant).
A student looked at the coin, and said that it was 'heads'. Some changed opinions (also significant).
The point of this experiment was to reveal a critical difference between Bayesian and frequentist interpretations of probability.
In the frequentist view, you cannot talk about the probability that the coin is 'heads' or 'tails' after the coin has been tossed. This view of probability cannot talk about the probability of events that have already happened, since they are fixed.
In the Bayesian view, you can talk about such probabilities, and indeed these are the probabilities that are most important. Bayesians regard the data as fixed, and the "states of nature," AKA whether the coin is 'heads' or 'tails', as the proper thing to describe by a probability distribution.
The reason is that in the Bayesian view, probability describes your willingness to believe that a particular state of nature is true. So the 'probability' is "in your head," rather than being "out there."
We can measure this by thinking about how much you would bet that a particular proposition was true. Regardless of whether the proposition is in the future (which team wins the Super Bowl this year) or already determined but unknown to us (whether the 1,000,000th digit of the decimal expansion of π is '3'), you could still make a rational bet on the truth of the proposition.
Coherent betting behavior implies the axioms of probability theory. See this link for more information!
Jeff then picked up on this and presented the basic axioms of probability theory.
Assignment (due Thursday, 1/22):
Read Chapter 1 of the book; do problems 4 and 5 on pp. 16-17
In the lecture, Jeff talked about a number of useful features of R:
Statements like y>3 are logical: if y is a vector, the comparison is made component-by-component and you get a vector of TRUE/FALSE values. You can then use z[y>3] to select out only those components of z at the positions where the condition y>3 holds. The resulting vector will in general be shorter than the original z vector, since it keeps only the elements at positions where y exceeds 3.
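A short R sketch of this kind of logical indexing:

```r
y <- c(1, 5, 2, 7, 4)
z <- c(10, 20, 30, 40, 50)

y > 3      # FALSE TRUE FALSE TRUE TRUE, compared component-by-component
z[y > 3]   # 20 40 50 -- the components of z where the condition holds
```

Note that the selection is by position: z[y > 3] keeps z's elements at the positions where y exceeds 3, so y and z should have the same length.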
Jeff made a connection between data frames and other data structures including those from SAS. I know nothing about SAS. But the basic idea here is that you can have a line or row in a data frame that represents an observation, and the individual entries may have different types. I.e., they do not all have to be numbers (as in a matrix), the first could be an integer, the second could be a logical value (TRUE, FALSE), the third could be a character string ("TRUE", "FALSE"), etc.
You can attach a data frame. This allows you to access the entries in the data frame more easily (i.e., with fewer keystrokes). However, this poses many risks. For one thing, you may have several data frames. Which one will be fetched? Also, the variables are global, and generally, good software practice advises not using global variables.
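A small R sketch of the data-frame ideas above, with one row per observation and columns of different types (the data here are invented for illustration):

```r
d <- data.frame(id     = 1:3,
                passed = c(TRUE, FALSE, TRUE),
                grade  = c("A", "C", "B"),
                stringsAsFactors = FALSE)

d$grade    # access a column by name: "A" "C" "B"
d[2, ]     # access a row, i.e., one observation

# attach(d) would let you type 'grade' instead of 'd$grade', but it creates
# global bindings; with() gives the same convenience without the risk:
with(d, grade[passed])   # "A" "B"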
Lists: This is the best way to return multiple items from a function. You can write
return(list(x,y,z))
and get an object, the components of which are x, y, z, even if x is a number, y is a matrix, and z is a list itself of other objects.
You can access the items of a list using subscripts, e.g., if b=list(x,y,z) then
b[[1]]
will return x, whatever it is (matrix, vector, number, whatever)
And, since b[[1]] is an object, it can be sub-accessed by whatever method is appropriate; for example, if it happened to be a vector, then its 5th component could be obtained by writing
b[[1]][5]
If you write
f=function(...){return(list(a=x, b=y, c=z))}
then you will be able to access the objects of the returned list by using the names a, b, c. So, for example, if you've called the function f that returns that list, and said w=f(...), then if you write w$b, you will get the value of y that the function computed.
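Putting the pieces together, here is a small self-contained R sketch of returning a named list and accessing its components:

```r
f <- function() {
  x <- 3.14                    # a number
  y <- matrix(1:4, nrow = 2)   # a matrix
  z <- list("anything", TRUE)  # even another list
  return(list(a = x, b = y, c = z))
}

w <- f()
w$a        # 3.14
w$b[2, 1]  # 2 -- the returned matrix can be sub-accessed as usual
w[["c"]]   # the same object as w$c
```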
Jeff pointed out that the access to the R functions that compute standard statistical distributions like normal, binomial, etc., are prefixed by 'r', 'p', 'q', 'd', depending on the use of the desired function.
So, 'rnorm' generates a vector of normally distributed random variables. 'dnorm' calculates the normal density at the requested point. 'pnorm' gives the cumulative distribution function, and 'qnorm' is its inverse, the quantile function.
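For the standard normal, the four prefixes look like this in R:

```r
rnorm(3)       # three random draws from N(0, 1)
dnorm(0)       # density at 0: 1/sqrt(2*pi), about 0.3989
pnorm(1.96)    # P(Z <= 1.96), about 0.975
qnorm(0.975)   # the inverse of pnorm: about 1.96
```

The same pattern holds for the other distributions, e.g., rbinom, dbinom, pbinom, qbinom for the binomial.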
Tuesday, January 13, 2009
STAT 295 1/13/09
Just to remind you, the course home page is here. All assignments, charts, and other relevant material will be posted there.
The first items are the handout that Jeff wrote, on R; a chart on "preliminaries," which we did not use, but summarizes the syllabus that was handed out; and a chart set on the introduction to the basics of probability, which will be the next discussion after the introduction to R. This chart set has the fundamental equations we will be using all semester.
The usual frequentist and the Bayesian approaches to statistics differ in the way they use probability theory. The frequentist's main tool is the sampling distribution, that is, the probability distribution of the data, given some hypothesis (e.g., the null hypothesis of no effect, or the hypothesis that a parameter has a particular value). The frequentist looks at what happens if the data space is sampled repeatedly. This leads to notions like p-values, confidence intervals, and the other tools of the frequentist toolkit. The Bayesian approach is different: In Bayesian theory, the data are fixed at the actual value observed, and the probability distribution of interest is the one on the hypotheses. (In frequentist theory, this idea makes no sense, because in this view of statistics, only data, not hypotheses, can have a probability distribution; but the Bayesian viewpoint is different!)
Jeff lectured using his notes on R. We only got to the top of page 3, and the main things mentioned were:
- Your assignment is to install R on your computer (if you have one), and also to install the LearnBayes package that is companion to the Albert book. Google 'r', or click here.
- The R prompt is a "greater than" sign, >
- Get information about an R function by typing >?median , for example
- You can write >x=5, etc., using R as a fancy calculator with named variables.
- Vectors are not matrices. They have a length, but not dimensions. This is an annoying feature (see the top of p. 3 of the handout).
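The vector-versus-matrix point can be seen directly at the R prompt:

```r
v <- 1:6
length(v)   # 6
dim(v)      # NULL -- a vector has a length but no dimensions

m <- matrix(v, nrow = 2)
dim(m)      # 2 3 -- a matrix has dimensions
```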
Here is a link to the R reference that Jeff noted on page 1 of the handout.
Good luck, and see you on Thursday!
Monday, December 8, 2008
Class, 12/8
In class today, one of the things we discussed was Nassim Taleb's ideas, and I also mentioned Mark Thoma's blog. As it turns out, Mark Thoma put an article by Taleb on his blog just today. As you will see when you read it, it generated a lot of comments.
I also mentioned my friend and mentor (a person from whom I have learned much and who helped me greatly in my career), Jim Berger.
Mentors are very important for anyone's career. I have had several; Jim Berger was the latest, but also (in temporal order) Heinz Eichhorn (image on screen behind me), Harlan Smith, and Jurgen Moser. These were people who taught me much, and with whom I had very close professional and personal relationships. They helped me build my career. So, you should find mentors who can do for your career what mine have done for me.