Friday, February 6, 2009
4-up Chart Set 5
I figured out what was keeping me from saving PDF files 4-up. A 4-up version of the corrected Chart Set 5 is here.
STAT 295 2/5/09
First, the assignment for next Thursday can be found here.
Interesting class today. Jeff started out by asking about the homework. We suggest learning how to use the array capabilities of R (arrays of numbers, vectors, matrices). You can multiply an array by a scalar and get an array with the same number of entries. You can add, subtract, or multiply two arrays, or add a constant to an array; the operations take place pointwise. You can use sum() to add up the elements of an array. And so forth. Play around with R in calculator mode, where you enter a formula and look at what's been computed. Facility with these kinds of operations can really speed up a computation, as well as making for easier-to-read code.
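For concreteness, here is a minimal sketch of the kind of vectorized arithmetic we mean (the numbers are made up purely for illustration):

# Made-up data illustrating R's elementwise (vectorized) arithmetic
x <- c(1.2, 0.7, 3.1, 2.4)
y <- c(0.5, 1.5, 2.5, 3.5)
2 * x                  # multiply every element by a scalar
x + y                  # elementwise addition of two vectors
x * y                  # elementwise (pointwise) multiplication
x + 10                 # add a constant to every element
sum(x)                 # add up all the elements
sum((x - mean(x))^2)   # a sum of squared deviations in one line, no loop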
In #4, the likelihoods are proportional to each other. To show this, complete the square in the summation form of the exponent; when you do, the sum collapses to a term involving the sample mean, and the remaining factors inside the exp() are constant with respect to μ.
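For those who want the key step spelled out: assuming the two forms in question are the full-sample normal likelihood and the one written in terms of the sample mean, the identity to use is ∑(x_i − μ)² = ∑(x_i − x̄)² + n(x̄ − μ)². The first term on the right does not involve μ, so exp(−∑(x_i − μ)²/(2σ²)) is a constant multiple of exp(−n(x̄ − μ)²/(2σ²)), which is exactly the proportionality you are asked to show.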
The ESP problem seemed sensitive to the prior...not enough data to overwhelm it.
We suggested that anyone who plans to write scientific papers in their professional life would do well to learn TeX/LaTeX; LaTeX is the more useful of the two.
Jeff gave some background on the heart transplant mortality problem. Most of it was "chalk talk" on the blackboard. Typically the study starts by modeling the risk to individual patients using logistic regression. This powerful method allows us to estimate the probability that a patient will die as a function of covariates such as age, disease status, presence or absence of diabetes, and so forth. We would use logistic regression on a large number of patients to get the coefficients that are appropriate for each covariate. Then, when a new patient comes along, we can plug the covariates appropriate for that patient into our logistic model, and predict the probability of death from that.
Then, armed with these probabilities for each patient, we can predict the number of patients that are expected to die in a given hospital by just adding up those probabilities over all patients in the study. This is what Albert calls the exposure. Jeff used 'd' for that quantity.
Both Turner and Jeff are using this basic idea in their research...Turner in trauma cases, Jeff in neonatal cases.
In the example, a particular hospital had d=0.066 and had one death.
We model the observed number of deaths with a Poisson distribution:
Pr(y|μ) = μ^y e^(−μ) / y!
where μ=d⋅λ.
Note that the expectation E[Y|λ]=d⋅λ.
With a gamma prior g(λ|α,β) ∝ λ^(α−1) e^(−βλ) we get a gamma posterior, since the gamma distribution is the conjugate prior for the Poisson distribution. This is a gamma-Poisson model.
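To spell out the conjugacy in the notation above: the likelihood contributes a factor (dλ)^y e^(−dλ)/y!, so prior × likelihood ∝ λ^(α+y−1) e^(−(β+d)λ), which is again a gamma density, now with parameters α + y and β + d.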
To choose α and β, we imagine we have data on 10 other hospitals. We had a discussion about why we cannot use the hospital that we're studying to decide on the prior. Basically, to do that would be to use the data from that hospital twice, and that's not allowed. Strictly speaking, you cannot do this if you respect the fact that if data are dependent, then you cannot simply multiply probabilities, you have to use the conditional probability formula correctly. If you do this, you'll find that the rules of the probability calculus will automatically prevent you from using the data twice. Problem #4 in the new problem set addresses this issue.
Anyway, if z_j is the observed number of deaths at hospital j, and o_j the expected number, then we can write a Poisson likelihood term for each hospital, and, assuming the hospitals are independent, the product of these is the likelihood for our estimation. Then a standard prior on λ is 1/λ. [Brief interlude: This is a commonly used prior for scale variables, e.g., lengths, rates, and so forth. In such problems it should not matter whether we use a ruler calibrated in inches or in centimeters; as long as we are consistent, we should get the same result. Mathematically, this reflects invariance under the group of multiplications by positive real numbers: if c is a positive constant and x = cy, then dx/x = dy/y regardless of the value of c.]
The prior is improper; that is, its integral from 0 to ∞ diverges at both ends. So it's not a "real" prior, since it can't be normalized. However, if we truncate the integral to run from 1/n to n, it can be normalized. Then the question is: if you use such a prior and multiply by the likelihood, does the posterior remain normalizable as n→∞? If so, it is legitimate to use the improper prior; you won't get into trouble.
The posterior using the data from these 10 hospitals is on the chart set. This is what we'll use for the prior with the hospital we are studying.
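For the record (and assuming I have set this up the same way as the charts), with the 1/λ prior and independent Poisson terms for the ten hospitals, that posterior is proportional to λ^(Σz_j − 1) e^(−λ Σo_j), i.e., a gamma density with shape Σz_j and rate Σo_j. Those sums then play the role of α and β in the prior we carry forward.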
Digression: Neither Jeff nor I would use this method as described. Rather, we would use a hierarchical model where all hospitals, including the one under investigation, are analyzed together. We will return to this subject later.
Jeff finished the discussion by running the R code in the notes. Running it on the hospital with 1 death and exposure d=0.066, we saw that the 95% credible interval for λ contained 1 near the middle, so there was no evidence that this hospital had an excess death rate. This is so even though 1 is quite a bit larger than 0.066 (a factor of about 15). However, when we plugged 10 deaths into the calculation, the 95% credible interval did not include 1, so we would say that such a hospital has an elevated risk.
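For readers following along in R, here is a minimal sketch of that kind of calculation. It is not the code from the notes, and alpha0 and beta0 are placeholders rather than the prior values estimated from the ten other hospitals:

alpha0 <- 16     # hypothetical prior shape (placeholder value)
beta0  <- 15     # hypothetical prior rate (placeholder value)
d <- 0.066       # exposure for the hospital under study
y <- 1           # observed deaths; try y <- 10 for the second comparison
# By gamma-Poisson conjugacy the posterior for lambda is gamma(alpha0 + y, beta0 + d)
qgamma(c(0.025, 0.975), shape = alpha0 + y, rate = beta0 + d)   # 95% credible interval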
Wednesday, February 4, 2009
STAT 295 2/3/09
I have posted a revised Chart Set #5. Jeff noted some errors and they have been corrected. Unfortunately, since I moved to the latest version of MacOS, I am no longer able to produce 4-up pdf files, so this one (and some of the later ones) will be full size. I apologize for this and will consult with Small Dog to see if this can be fixed.
Jeff asked why the two forms of the likelihood for problem #4 (due 2/5) are equivalent. You should address this question in your turned-in assignment. Note that loop-free (vectorized) code in R is faster, so you should attempt that calculation without loops.
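As a quick, made-up illustration of the speed difference between a loop and the vectorized equivalent:

x <- runif(1e6)                                        # a million random numbers
system.time({ s <- 0; for (xi in x) s <- s + xi^2 })   # explicit loop
system.time(sum(x^2))                                  # vectorized, same answer, much faster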
On Chart Set #4, Chart 24, we had been discussing robustness. We noted that the mean and standard deviation of the posterior distribution do depend (although fairly insensitively) on the prior. In particular, with the beta prior the posterior had a smaller standard deviation and its mode was moved to the left, towards the peak of the beta prior, relative to where the mode was under the flat prior. This led to the notion of stable estimation: when we have a lot of data or very precise data, and the prior doesn't change much over the region where the likelihood peaks, the results won't be sensitive to the prior.
We considered continuous examples, of which the beta-binomial model is an example. Jeff justified the density approximation P(a<Y<b | c−ε<X<c+ε) ≈ ∫_a^b f(c,y)/g(c) dy. Thus, we can use ratios of joint to marginal densities in the continuous case, just as we can use the ratio of joint to marginal distributions in the discrete case.
We saw how increasing the amount of data tightens the posterior around the true value.
We then skipped to Chart Set #5. We will return to Chart Set #4 later.
We discussed the Poisson distribution and motivated it. Many things are well modeled as Poisson events, in addition to the ones on the chart set, requests to google.com, stars per square degree, etc.
We went on to the heart transplant mortality problem from Albert's book. The exposure for each patient is the probability that the particular patient will die in a given time frame after the operation. It depends on things like the patient's age, conditions like diabetes, and so on, and is presumed known from other studies. Then the exposures for each patient are added up over all the patients to estimate the overall exposure (risk) for the particular hospital. It is the expected number of patients that will die. The notes use the letter 'e', but Jeff remarked that it's easy to confuse that with the base of natural logarithms, so he changed it to 'd' in his chalkboard discussion. Then if Y is the random variable representing the observed number of deaths, the likelihood has Y~pois(λd) and λ is the parameter we wish to estimate for the hospital.
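A minimal sketch of this setup in R, with entirely hypothetical per-patient probabilities (the real ones would come from the logistic-regression study):

p <- c(0.02, 0.05, 0.10, 0.03)    # hypothetical death probabilities for four patients
d <- sum(p)                        # exposure: expected number of deaths at this hospital
y <- 1                             # observed number of deaths
lambda <- seq(0.1, 40, length.out = 200)
lik <- dpois(y, lambda * d)        # Poisson likelihood for lambda on a grid
plot(lambda, lik, type = "l")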
A gamma prior is chosen. It is flexible, with two adjustable parameters, and pedagogically convenient because it is the conjugate prior for a Poisson likelihood. However, it is a bad idea to use a prior simply because it makes the calculations easy; modern sampling techniques allow us to use any prior we wish, and if we know better, we should use it.
I've corrected Slide #8 on which z and o were transposed. See the chart set published above.
The heart transplant mortality problem will be continued next time.
Friday, January 30, 2009
STAT 295, 1/29/09
One of the problems involves the posterior predictive distribution. The idea is that once we have the posterior distribution of the parameters θ, we can multiply it by the likelihood for new observations y, integrate out θ, and get a distribution p(y|x) that predicts what kinds of observations we would expect in the future. For example, we may have observed the orbit of a planet or the weather at various points in the past; the posterior predictive distribution would allow us to predict the position of the planet, or the weather, at a later date.
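One concrete way to compute it is by simulation. Here is a minimal sketch using a beta-binomial setup with made-up numbers (not one of the course examples): draw θ from the posterior, then draw a future observation given each θ.

a <- 12; b <- 8                                   # hypothetical beta posterior for theta
m <- 10                                           # size of a future binomial sample
theta <- rbeta(100000, a, b)                      # draws of theta from the posterior
y_new <- rbinom(100000, size = m, prob = theta)   # a future count drawn for each theta
table(y_new) / length(y_new)                      # simulated posterior predictive distribution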
Jeff showed why we can ignore constants independent of the parameters, like Choose(n,s) in the likelihood and Γ(a+b)/(Γ(a)Γ(b)) in the prior. Because these factors appear in both the numerator and the denominator, and because, being independent of the parameters, they can be taken outside the integral that defines the marginal distribution of the data, they simply cancel out.
I noted that care needs to be taken when comparing models or averaging over models. For example, one might have several models in mind (e.g., a linear and a quadratic model to fit a run of data). Since the models are different, the likelihoods and priors are also different and will have different normalizing factors. In such a case you cannot ignore those factors because to do so would make the different models incompatible.
We ran the R code and looked at the posterior distribution, as given by a sample of size 100,000. From this we noted that the results were fairly stable whether we used the flat prior or the wide beta prior that Albert used. We noted that we will be making our inferences directly from the sample, using methods that don't require us to know the normalizing constant of the posterior distribution.
We went on to Chart Set 4. We discussed the posterior mean and median as summaries, and pointed out that these can be derived by minimizing the expected loss given by, respectively, the square of the difference between the true value and the estimate, and the absolute value of that difference.
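A quick numerical check of that claim, using draws from an arbitrary (hypothetical) posterior:

post <- rgamma(100000, shape = 3, rate = 2)   # any posterior sample will do for the demonstration
sq_loss  <- function(e) mean((post - e)^2)    # expected squared-error loss
abs_loss <- function(e) mean(abs(post - e))   # expected absolute-error loss
optimize(sq_loss,  range(post))$minimum       # essentially mean(post)
optimize(abs_loss, range(post))$minimum       # essentially median(post)
c(mean(post), median(post))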
One chart had "HDR" on it as a label. This stands for "Highest Density Region" and is just a Bayesian credible interval. My recollection now is that the book from which I got this terminology, Samuel Schmitt's Measuring Uncertainty: An Elementary Introduction to Bayesian Statistics, really referred to the shortest credible interval.
We discussed confidence intervals and credible intervals. A confidence interval is an interval that describes the distribution of data on repeated hypothetical trials, given a particular parameter. It is a way of describing the statistical properties of the procedure that is used to calculate the interval under repeated sampling. A credible interval describes the distribution of the parameter, given the particular data set we have actually observed.
Although we can sometimes interpret a confidence interval numerically as a corresponding credible interval (e.g., linear regression with normal errors), and although credible intervals sometimes have good frequentist coverage properties and so can, again numerically, be used as confidence intervals, they are not the same thing at all. In Bayesian theory, the data x, once observed, are regarded as fixed and appear in the result through the likelihood function, which is not a probability density on the parameters θ; there, θ is the random variable. In frequentist theory, by contrast, θ is not a random variable; x is an exemplar of the random variable X. So it is important to keep these two ideas separate.
In my experience with physical scientists, a large fraction of them want to interpret a confidence interval as a distribution on θ. This is of course wrong, but that is a natural tendency. So it seems that many physical scientists are naturally Bayesians!
Jeff did a blackboard calculation to show that, regardless of the prior, in the limit of a large amount of data (n→∞) the posterior mean in the sleep problem → s/n. With lots of data, the prior is pretty much irrelevant. (There are exceptions, which we will discuss later.)
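A quick numerical illustration of that limit, with an arbitrary beta prior and made-up data in which the observed proportion is held near 0.7:

a <- 3.4; b <- 7.4                  # any fixed beta prior (hypothetical values)
for (n in c(10, 100, 10000)) {
  s <- round(0.7 * n)               # s successes out of n
  print(c(n = n, posterior_mean = (a + s) / (a + b + n), s_over_n = s / n))
}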
Wednesday, January 28, 2009
Second Homework Set
You'll find the second homework set here. Please reread the discussion at the top of Tuesday's class regarding formatting homework and other things!
This set is due on February 5.
Tuesday, January 27, 2009
STAT 295 1/27/09
Important points about assignments:
- Assignments need to be put together carefully so that it is easy to grade them. They should start with a narrative that says what the assignment is about, then how you went about solving the problem, then the results. Use tables as appropriate to summarize. Put the R code into an appendix. As Jeff stated, statistics is not just about mathematics and programming, it is also about interpreting and conveying your results to others clearly.
- We ask you not to work alone. We prefer you to work in small groups of 2 or 3. Not only does this reduce the grading burden, more importantly it gives you practice in working with other colleagues, which is an essential aspect of professional statistical practice.
- Please type your assignments. This makes them easy to read.
- Be sure to turn in a hard copy of your assignment.
- Also, email R code to us; if the code isn't working right, we can then copy-paste into R which may make it easier to figure out what isn't right.
Please note that the likelihood function can be multiplied by any constant k>0 without changing the results from Bayes' theorem, because the numerator and the denominator will both be multiplied by k, which will cancel.
The likelihood is actually an equivalence class of functions; any two members of the class differ only by a constant multiple. It is important to understand that even though the likelihood P(D|Ai) is obtained from the sampling distribution of D given Ai, it is not the same thing as the sampling distribution, which tells us how different data D depend on Ai. The likelihood is always evaluated at the data D that were actually observed, and its important characteristic is how it varies as Ai is varied for fixed D; the sampling distribution's important characteristic is how it varies as D is varied for fixed Ai. Since the likelihood is only defined up to this equivalence, it does not integrate or sum to 1 over the states of nature Ai, and it is not a probability on Ai.
We discussed the hemoccult test and the consequences of making wrong decisions. A false positive can result in a colonoscopy, and colonoscopies can have adverse consequences for the patient, even death. On the other hand, a false negative means that a developing cancer may be missed. As Dr. Osler pointed out, the people who put out the test can adjust the reagents used in the test to produce more true positives, but only at the cost of increasing the number of false positives, or vice versa. Some of this is driven by economics and insurance companies, since the hemoccult test, which has many faults, is also very cheap (3 cents), whereas a colonoscopy is expensive (several thousands of dollars) as well as more risky. So there are tradeoffs. A complete analysis really involves decision theory, and is outside of the scope of this course, but we will mention some aspects of decision theory from time to time.
Jeff skipped several examples on the charts, which I may make some comments on later...stay tuned.
He went on to the capture-recapture problem for the fish population in a lake. We catch 60 fish, tag them, and release them. After the fish have had time to mix well with the untagged fish, we catch 100 of them and note that 10 are tagged. It is natural to estimate the number of fish in the lake as 600. Jeff noted that there is an extensive frequentist literature on this problem.
The Bayesian solution is quite simple. After some discussion, it was decided that the likelihood function is a hypergeometric distribution. In this case,
P(D|N) = C(n,x) · C(N−n, k−x) / C(N,k)
where N = number of fish in the lake, n = number caught = 100, k = number tagged = 60, and x = number of those caught that are tagged = 10. (Here C(m,r) denotes the binomial coefficient "m choose r.")
This likelihood can also be obtained without invoking the hypergeometric distribution, by considering the probability of sampling each tagged or untagged fish one at a time, noting that the number of fish remaining and the number remaining of the type just drawn each decrease by 1 with every draw (sampling without replacement).
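For those who want to try it in R, the built-in hypergeometric density makes this easy. A minimal sketch with the numbers from class (dhyper's arguments are the count of tagged fish in the catch, the tagged and untagged counts in the lake, and the catch size):

N <- 160:2000                                    # candidate lake populations
lik <- dhyper(10, m = 60, n = N - 60, k = 100)   # P(10 tagged in a catch of 100 | N)
N[which.max(lik)]                                # the maximizing N is near the intuitive 600
plot(N, lik, type = "l")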
We finished by starting the discussion of the sleep example from chapter 2 of the book. The beta prior was chosen by the investigator based on a notion of the mean and the 90th percentile of the sleepers. We also noted that if you have a beta prior in this binomial sampling situation, you will get a beta posterior. Also, even though the likelihood, prior and posterior all "look alike" in containing factors θ^u (1−θ)^v, the thing that is important in the likelihood is (u,v), the data, whereas the thing that is important in both the prior and the posterior is the unknown state of nature θ.
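A minimal sketch of the beta-binomial update in R, with hypothetical prior parameters and data rather than the book's numbers:

a <- 3; b <- 7      # prior Beta(a, b), chosen to reflect prior beliefs
s <- 11; f <- 16    # s students who slept enough, f who did not (made-up counts)
# With a Beta(a, b) prior and binomial sampling, the posterior is Beta(a + s, b + f)
curve(dbeta(x, a + s, b + f), from = 0, to = 1, ylab = "density")   # posterior
curve(dbeta(x, a, b), add = TRUE, lty = 2)                          # prior, for comparison
qbeta(c(0.05, 0.95), a + s, b + f)                                  # 90% credible interval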