The blog for the class has been moved to wordpress.com.
Future postings will be there. Past postings will remain here.
Bill
Wednesday, February 18, 2009
Improvements to the blog
I have learned how to include a modest amount of mathematics in the blog, and I have gone back and done this to all the earlier blog posts.
I may change the location of the blog; I have been pointed to another blog site (Wordpress.com) that has much better LaTeX support. Essentially I will be able to put a LaTeX expression into the blog, and it will automatically be rendered.
If I do decide to make such a change, I will announce it on this blog and on the website.
Tuesday, February 17, 2009
A bit of history: Why we call it a "random variable"
Here's a link to an article in Chance News, explaining how the term "random variable" got fixed in the terminology. I mentioned this in class today.
Stat 295 2/17/09
We discussed problem #3 on problem set 3 and showed that the condition for equality is that f(y|x,θ)/f(y|θ) be independent of θ. We then saw that if (for the hospital example) x={y,z}, where z is the data from all the other hospitals, then f(y|x,θ)=f(y|z,y,θ)=1, and the ratio is not independent of θ, so in this case you can't make this substitution. And since f(y|z,y,θ)=1, the posterior from trying to use y again is equal to the prior, so you can't use the data twice. Probability theory won't let you do that (legitimately).
Jeff talked about the curse of dimensionality. When you get to high-dimensional parameter spaces, you can't use the usual integration methods (e.g., integrating on a grid), since the number of grid points required grows exponentially with the dimension of the space. This is what kept Bayesian methods from wide use until the introduction of MCMC about 20 years ago. But with MCMC, people now routinely analyze models with thousands of variables.
New chart set: Introduction to Bayesian Computation.
Start with the Beta-binomial model for overdispersion. We wish to model stomach cancer death rates for 20 cities in Missouri.
Let y_j = the number of deaths in city j and n_j = the number of cases, j = 1, 2, ..., 20.
An initial attempt at a model would be y_j ~ bin(n_j, θ), i.e., a common probability of death for all cities. However, there is greater variability among the numbers of deaths than this model can fit, because not all cities have the same death rate. This is known as "overdispersion" (a term that Jeff doesn't like).
Instead, use y_j ~ bin(n_j, θ_j) with θ_j ~ beta(α,β). We then wish to integrate the θ_j out of this model and use the result as the likelihood.
The calculation is given in the chart set. We get
f(y|n,α,β) = C(n,y) B(α+y, β+n-y) / B(α,β)
where C(n,y) is the binomial coefficient and B(·,·) is the beta function (the normalizing constant of the beta density).
This is the likelihood for α and β, given data n,y.
Recall that if θ~beta(α,β) then E[θ] = α/(α+β) and Var(θ) = αβ/[(α+β)^2 (α+β+1)].
Reparameterize with η=E[θ], K=α+β. K is a precision parameter, related to 1/Var(θ). Indeed, Var(θ)=η(1-η)/(K+1)
α=Kη, β=K(1-η). K>0 and 0<η<1.
For the prior, use independent priors for η and K:
g(η,K) ∝ 1/(η(1-η)) · 1/(K+1)^2.
The posterior for η and K is
g(η,K|y_1,...,y_20) ∝ Π_{j=1}^{20} [B(Kη+y_j, K(1-η)+n_j-y_j)/B(Kη, K(1-η))] · 1/(η(1-η)) · 1/(K+1)^2.
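Here is a minimal R sketch of this log posterior (my own illustration, not from the charts; y and n are assumed to be the vectors of death counts and case counts for the 20 cities):

logpost <- function(eta, K, y, n) {
  # log of the unnormalized posterior for (eta, K):
  # beta-binomial likelihood terms plus the log prior
  sum(lbeta(K * eta + y, K * (1 - eta) + n - y) - lbeta(K * eta, K * (1 - eta))) -
    log(eta) - log(1 - eta) - 2 * log(K + 1)
}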
Look at contour plot of the posterior (on charts).
The contours go way off the map, which suggests using a log transform on K and a logit transform on η.
θ_1 = logit(η), θ_2 = log(K). Thus η = F(θ_1), where F is the inverse logit defined last time, and K = exp(θ_2).
We need to compute the Jacobian. We have
dη/dθ_1 = F(θ_1)(1 - F(θ_1)) = J_η, and dK/dθ_2 = exp(θ_2) = J_K. It turns out that J_η·g(η) = 1 in this parameterization, where g(η) is the prior on η. This may have motivated Albert's choice for the prior on η.
The prior on θ_2 (the transformed K) in this parameterization is g(K)·J_K = exp(θ_2)/(1 + exp(θ_2))^2.
Contour plot in these variables looks much better.
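In R, a sketch of this (mine, reusing logpost() from the sketch above; the grid limits are guesses, to be adjusted by eye against the charts):

logpost.t <- function(th1, th2, y, n) {
  eta <- 1 / (1 + exp(-th1))   # inverse logit
  K <- exp(th2)
  # add the log-Jacobians: log(d eta / d th1) = log(eta) + log(1 - eta),
  # and log(dK / d th2) = th2
  logpost(eta, K, y, n) + log(eta) + log(1 - eta) + th2
}
th1 <- seq(-8, -4.5, length = 60)
th2 <- seq(3, 16, length = 60)
z <- outer(th1, th2, Vectorize(function(a, b) logpost.t(a, b, y, n)))
contour(th1, th2, z - max(z))   # contours of the log posterior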
Next thing: Approximate this posterior density with a normal approximation. The contour plot looks pretty non-normal, so the fit won't be great. Nonetheless, we can get an idea of what's going on from this. Need to know means, variances and the covariance. How do we get these?
Laplace Approximation: derive an approximation to the posterior that's easy to calculate.
The approximation can be useful when the posterior is unimodal, which is typically true when we have enough data. How much is enough? A rule of thumb (Kass & Raftery, Journal of the American Statistical Association, Vol. 90, No. 430 (Jun., 1995), pp. 773-795) says that we need about 20 times as many data points as parameters for the Laplace approximation to work reasonably well (p. 778; note the discussion and cautions there).
Here's how to do it: the Taylor series expansion says
h(θ) ≈ h(θ_0) + h'(θ_0)(θ-θ_0) + 0.5 h''(θ_0)(θ-θ_0)^2.
The vector version of this is in the notes. The matrix h''(θ_0) of second derivatives is the Hessian:
h(θ) ≈ h(θ_0) + ∇h(θ_0)^T (θ-θ_0) + 0.5 (θ-θ_0)^T h''(θ_0) (θ-θ_0)
We want θ0 to be chosen as the posterior mode, so that the linear term vanishes.
Want h(θ) = log[f(y|θ)g(θ)] ≈ h(θ_0) + 0.5(θ-θ_0)^T h''(θ_0)(θ-θ_0).
When you exponentiate this, the result is a multivariate normal distribution. The mean is the posterior mode, and the variance-covariance matrix is the negative of the inverse of the Hessian (the Hessian is negative definite at a mode, so this is a proper covariance matrix).
To sum up, θ|y ≈ N(θ_0, -h''(θ_0)^(-1)).
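In R, optim() can find the mode and the Hessian in one call. A minimal sketch (mine), assuming logpost.t() from the sketch above and a starting guess in the right neighborhood:

fit <- optim(c(-7, 7), function(th) logpost.t(th[1], th[2], y, n),
             control = list(fnscale = -1),   # fnscale = -1 means maximize
             hessian = TRUE)
theta0 <- fit$par              # posterior mode
Sigma <- solve(-fit$hessian)   # variance-covariance: negative inverse Hessian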
The Laplace approximation is used much less than in the past, because MCMC techniques allow us to sample from the exact posterior distribution to obtain the information we need. It can still be useful in formulating a "proposal distribution," which is used in MCMC sampling.
Thursday, February 12, 2009
STAT 295 2/12/09
Happy 200th birthday, Charles Darwin!
Today we discussed the problem (#3) regarding the non-independent data. If you can't write f(x,y|θ) = k(x,y) f_x(x|θ) f_y(y|θ), then you cannot treat the sequential problem as if the y variable is independent of x (that is, the calculation given in the analogous problem in problem set #2 does not work). I need to do a bit of research on the k(x,y) piece of this.
Another example: Bioassay experiment. First: transforming parameters to (-∞,∞) is often a good idea. In this case we used a log transform. Similarly, the probabilities are transformed from (0,1) to (-∞,∞) using the logit transformation:
q=logit(p)=log(p/(1-p))
with q = β_0 + β_1 x, where p = F(q) = (1 + exp(-q))^(-1) is the inverse logit. That is,
log(p/(1-p)) = logit(p) = β_0 + β_1 x
β_1 is interpreted as the difference in log odds for a 1-unit change in x (the dose).
Logistic regression is very important in statistics. Working statisticians will see it over and over again.
Note that F'(v)=F(v)(1-F(v))
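In R (my sketch; I call the inverse logit invlogit rather than F, since F means FALSE in R), with a numerical check of that derivative identity:

logit <- function(p) log(p / (1 - p))
invlogit <- function(q) 1 / (1 + exp(-q))
v <- 0.3
(invlogit(v + 1e-6) - invlogit(v - 1e-6)) / 2e-6   # central-difference derivative
invlogit(v) * (1 - invlogit(v))                    # matches to several digits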
We have animal data from the book, and wish to make inference on the βs.
The likelihood is binomial in p_i (i indexes the group), with a product over all groups of animals. This is because each animal has two possible outcomes (it lived or it died), with probability p (given by the inverse logit) of the one outcome and (1-p) of the other. So the likelihood is a product of p's and (1-p)'s over all the animals in the study.
We'll use a sampling approach on a discretized (grid) version of β_0, β_1. But we need to get an idea of how far out to go. Find the mode (the MLE, since the priors are flat), then estimate the standard errors and go out several standard errors from the mode.
We did this in R using the R function glm (generalized linear model).
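For the bioassay data, that fit might look like the following sketch (mine; x, n, y are assumed to be the dose, group-size, and death-count vectors):

fit <- glm(cbind(y, n - y) ~ x, family = binomial)  # logistic regression
coef(fit)              # MLEs of beta0 and beta1
sqrt(diag(vcov(fit)))  # approximate standard errors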
We recommend looking at functions in Albert's LearnBayes package to see how he did it. You'll learn new things about R, and have a better idea about what Albert is doing.
We looked at the plot (Figure 4.3 of the book) and saw that we've got to go pretty far out (farther than Jeff guessed) in order to get a decent sample.
How to sample from the distribution? The contour plot gives us an idea. Chop the region up into a lot of little squares. We have something that is proportional to the density, and the height of that surface over a square is roughly proportional to the probability that (β_0, β_1) falls in that square. We can use the sample() function to do this (sampling with replacement). We did this, and the sample in Fig. 4.4 is the result; it looks pretty good.
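A minimal sketch of that recipe (mine; same x, n, y as above, with grid limits to be read off the contour plot):

loglik <- function(b0, b1) {
  p <- 1 / (1 + exp(-(b0 + b1 * x)))       # inverse logit
  sum(y * log(p) + (n - y) * log(1 - p))   # binomial log likelihood (flat prior)
}
grid <- expand.grid(b0 = seq(-5, 10, length = 100),
                    b1 = seq(-10, 40, length = 100))
lp <- mapply(loglik, grid$b0, grid$b1)
idx <- sample(nrow(grid), 5000, replace = TRUE, prob = exp(lp - max(lp)))
samp <- grid[idx, ]   # posterior sample of (beta0, beta1)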
We want to compute LD-50, the dose at which 50% of the animals die. Thus
Pr(y=1|x) = 1/(1 + exp(-(β_0 + β_1 x))) = 0.5
1 = exp(-(β_0 + β_1 x_LD50))
0 = -β_0 - β_1 x_LD50, so x_LD50 = -β_0/β_1
Now look at the Bayesian analysis. From the sample, compute -β_0/β_1 for each sampled pair and look at the 95% credible interval of the result. We did this, looked at the histogram, and computed the 95% credible interval. (The result is on the log-dose scale.)
We also used the density() function to draw the marginal posterior density of LD-50. This uses a nonparametric kernel estimator to draw a smooth curve. β_1 is the log odds ratio for a unit change in dose; to get the odds ratio, exponentiate.
(The nonparametric kernel estimator works this way: At each point of the sample, we draw a peaked, narrow bell-shaped curve with unit area. We then add the curves for the whole sample and divide by the number of samples to get an approximation to the probability density curve.)
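Putting the last few steps into R, continuing the grid-sample sketch above (samp holds the sampled pairs):

ld50 <- -samp$b0 / samp$b1        # LD-50 on the log-dose scale
hist(ld50)                        # histogram of the posterior sample
quantile(ld50, c(0.025, 0.975))   # 95% credible interval
plot(density(ld50))               # kernel-smoothed posterior density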
Sheila asked about hypothesis testing. We can get one-sided credible intervals (analogous to one-sided confidence intervals) quite simply. For example, if s$y contains the samples of some quantity β, we can compute the probability that β > 0 as mean(s$y > 0).
I pointed out that there is no Bayesian analog to frequentist two-sided hypothesis testing. We'll get to this later, when we discuss hypothesis testing.
Tuesday, February 10, 2009
STAT 295 2/10/09
I made some remarks on the homework. Won't repeat them here since they will appear on the papers that were turned back.
We talked about Jacobians (named for the 19th-century mathematician Carl Jacobi, who was German, not French as I had misremembered). I gave my reasoning, which relied on the change of variables when doing an integral:
∫ p(u) du = ∫ p(u(v)) (du/dv) dv = ∫ p(v) dv
The idea here is that if u is a variable and we transform to v, the integral has to be unchanged since it is the same probability that we're talking about.
The du/dv factor here is the Jacobian. Whenever you change variables in our probability statements, you have to include a Jacobian. So, the above equation says that
p(v)=p(u)(du/dv).
Jeff gave a different (but entirely equivalent) argument. His argument (which is hard to reproduce here, so I hope you got good notes) also starts with integrating, here to compute the probability that θ is less than x. That is an integral from -∞ to x of the density on θ. But if we transform to λ via θ=g(λ), then you get
Pr(θ < x) = Pr(λ < g^(-1)(x)) = ∫_{-∞}^{g^(-1)(x)} p(λ) dλ
To go from the distribution (Pr) to the density (p) you take the derivative with respect to x. When you do this, you use the fundamental theorem of calculus (the derivative of the integral is the function under the integral) and the chain rule (you have to multiply by dg^(-1)(x)/dx, which is the Jacobian).
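A quick worked example (my own, not from lecture): take the scale prior p(λ) ∝ 1/λ and the transformation λ = g(θ) = e^θ, i.e., θ = log λ. Then p(θ) = p(g(θ)) · dg(θ)/dθ = (1/e^θ) · e^θ = 1, so the 1/λ prior is flat on the log scale.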
On the homework for Thursday, Jeff generalized the problem to the following condition:...Assume that there does not exist k(x,y) independent of θ such that
f_x(x|θ) f_y(y|θ) = k(x,y) f(x,y|θ)
Then you should be able to show that you can't pretend that x and y are independent when you calculate the posterior distribution given x and y. You have to use the exact factorization f(x,y|θ) = f(y|x,θ) f(x|θ); in other words, the full conditional probability formula is required.
The corrections to the Albert book are here. You have the most recent (third) printing, in all likelihood, so most of these won't apply.
We then discussed the normal likelihood with both mean (μ) and variance (σ^2) unknown.
The charts have the equations. We took a prior that is flat in μ and proportional to 1/σ^2 in σ^2. We'll justify these priors as "noninformative" later in the course. Both priors are improper, but if the posterior is proper there will not be a problem.
The key observation here is that the posterior factors into a piece that depends only on σ^2, times another piece that involves both μ and σ^2. This means we can factorize the posterior as g(σ^2) g(μ|σ^2). We observed that the latter piece is normal, with mean the sample mean of the y's and with variance obtained from the sampled σ^2. We then found that the marginal distribution of σ^2 is a scaled inverse chi-square distribution with (n-1) degrees of freedom, where n is the number of data points. Jeff discussed several ways of sampling from an inverse chi-square distribution, and they are included in his R code, which will be found here.
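A minimal R sketch of this two-step sampler (mine; y is the data vector):

m <- 10000
n <- length(y); ybar <- mean(y)
S <- sum((y - ybar)^2)                   # sum of squared deviations
sigma2 <- S / rchisq(m, n - 1)           # sigma^2 from its inverse chi-square marginal
mu <- rnorm(m, ybar, sqrt(sigma2 / n))   # mu given sigma^2 is normal
quantile(mu, c(0.025, 0.975))            # 95% credible interval for mu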
Jeff then used the code to plot contours of the posterior distribution, plot sample points, and compute quantiles and other things of interest. It is known that the frequentist confidence interval on mu is given by a t distribution with (n-1) degrees of freedom, and Jeff computed that as well as the Bayesian credible interval from the sample. They coincided. This is one example where the credible interval and the confidence interval are the same.
Undergraduate opportunity
The Statistical and Applied Mathematical Sciences Institute (SAMSI), in Research Triangle Park, NC, is hosting a one week undergraduate workshop for college juniors and seniors focused on SAMSI research activities related to the statistical and applied mathematical modeling and analysis of experimental data. During the first day, a summary of research activities in the 2008-2009 programs on Sequential Monte Carlo Methods, and Algebraic Methods in Systems Biology and Statistics will be presented. In days two through five, participants will be involved in a hands-on experience. They will use mathematical and statistical models to analyze experimental data they collect in the CRSC/Math Instructional Research Lab on the NC State University campus.
Tutorials on modeling, mathematical and statistical methodology and on the physical experiments being used will be given. Participants working together in small teams will collect data and analyze it using mathematical and statistical software provided.
REGISTRATION
Applicants should use the on-line application form and also have one letter of recommendation. Full financial support for travel expenses, subsistence and lodging in university housing will be provided for all attendees. Due to space considerations, participation is restricted and will be offered to approximately 18 individuals selected from the applicant pool. Participants are expected to arrive for the workshop on Sunday, May 17, 2009 and remain in continuous attendance until 12:00 pm on Friday, May 22, 2009.
Further details, including information on how to register and where to send letters of recommendation can be found here.
Applications will be considered beginning February 10, 2009 and continuing until April 3, 2009, but registration will likely be closed before that date, as workshop capacity is reached. Successful applicants will be notified of their acceptance as soon as a decision on their application is reached.
[Note: Apply early, if you are interested!!!]
Upon acceptance for the program, individuals must confirm by email within three days their intention to participate (otherwise their place will be given to another applicant). Accepted participants will be advised regarding airfares and will need to purchase plane tickets with a three-week advance fare.
Please direct questions concerning the workshop to ugworkshop200905@samsi.info.
Friday, February 6, 2009
4-up Chart Set 5
I figured out what was keeping me from saving PDF files 4-up. A 4-up version of the corrected Chart Set 5 is here.
STAT 295 2/5/09
First, the assignment for next Thursday can be found here.
Interesting class today. Jeff started out by asking about the homework. We suggest learning how to use the array capabilities of R (arrays of numbers, vectors, matrices). You can multiply an array by a scalar and get an array with the same number of entries. You can add, subtract, or multiply two arrays, or add a constant to an array; the operations take place pointwise. You can use sum() to add up the elements of an array. And so forth. Play around with R in calculator mode, where you enter a formula and look at what's been computed. Facility with these kinds of calculations can really speed up a computation, as well as making for easier-to-read code. Some examples are sketched below.
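A few examples of what we mean (a sketch you can paste into R):

v <- c(1, 2, 3, 4)
2 * v    # scalar times array, pointwise: 2 4 6 8
v + 10   # add a constant to every element: 11 12 13 14
v * v    # pointwise product: 1 4 9 16
sum(v)   # add up the elements: 10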
In #4, the likelihoods are proportional to each other. To show this, complete the square in the summation form of the equation; when you do, the sum can be rewritten in terms of the sample mean, and the remaining terms inside the exp() are constant with respect to μ.
The ESP problem seemed sensitive to the prior...not enough data to overwhelm it.
We suggested that anyone who plans to write scientific papers in their professional life would do well to learn TeX/LaTeX. LaTeX is most useful.
Jeff gave some background on the heart transplant mortality problem. Most of it was "chalk talk" on the blackboard. Typically the study starts by modeling the risk to individual patients using logistic regression. This powerful method allows us to estimate the probability that a patient will die as a function of covariates such as age, disease status, presence or absence of diabetes, and so forth. We would use logistic regression on a large number of patients to get the coefficients that are appropriate for each covariate. Then, when a new patient comes along, we can plug the covariates appropriate for that patient into our logistic model, and predict the probability of death from that.
Then, armed with these probabilities for each patient, we can predict the number of patients that are expected to die in a given hospital by just adding up those probabilities over all patients in the study. This is what Albert calls the exposure. Jeff used 'd' for that quantity.
Both Turner and Jeff are using this basic idea in their research...Turner in trauma cases, Jeff in neonatal cases.
In the example, a particular hospital had d=0.066 and had one death.
We model the observed number of deaths with a Poisson distribution:
Pr(y|μ) = μ^y e^(-μ) / y!
where μ=d⋅λ.
Note that the expectation E[Y|λ]=d⋅λ.
With a gamma prior g(λ|α,β) ∝ λ^(α-1) e^(-βλ) we get a gamma posterior, since the gamma distribution is the conjugate prior for the Poisson distribution. This is a gamma-Poisson model.
To choose α and β, we imagine we have data on 10 other hospitals. We had a discussion about why we cannot use the hospital that we're studying to decide on the prior. Basically, to do that would be to use the data from that hospital twice, and that's not allowed. Strictly speaking, you cannot do this if you respect the fact that if data are dependent, then you cannot simply multiply probabilities, you have to use the conditional probability formula correctly. If you do this, you'll find that the rules of the probability calculus will automatically prevent you from using the data twice. Problem #4 in the new problem set addresses this issue.
Anyway, if z_j is the observed number of deaths at hospital j, and o_j the expected number (the exposure), then we can write a Poisson likelihood term for each hospital, and, assuming the hospitals are independent, the product of these is the likelihood for our estimation. Then a standard prior on λ is 1/λ. [Brief interlude: This is a commonly used prior for scale variables, e.g., lengths, rates, and so forth. In such problems it should not matter if we use a ruler calibrated in inches, for example, instead of centimeters; if we do it consistently we should always get the same result. Mathematically, this is a group-theoretic invariance under the group of multiplications by positive real numbers: if c is a constant and x = cy, then dx/x = dy/y regardless of the value of c.]
The prior is improper, that is, its integral from 0 to ∞ diverges at both ends. So it's not a "real" prior since it can't be normalized. However, if we imagine going from 1/n to n in the integral, you can normalize that. Then the question is, if you use such a prior, and multiply by the likelihood, does the posterior end up being normalizable as n→∞? If so, then it is legitimate to use the improper prior, you won't get into trouble.
The posterior using the data from these 10 hospitals is on the chart set. This is what we'll use for the prior with the hospital we are studying.
Digression: Neither Jeff nor I would use this method as described. Rather, we would use a hierarchical model where all hospitals, including the one under investigation, are analyzed together. We will return to this subject later.
Jeff finished the discussion by running the R code in the notes. Running it on the hospital with 1 death and exposure d=0.066, we saw that the 95% credible interval for lambda contained 1 near the middle, so there was no evidence that this hospital had an excess death rate. This is so even though 1 is quite a bit larger than 0.066 (about a factor of 15). However, when we plugged 10 deaths into the calculation, the 95% credible interval did not include 1 so we would say that such a hospital has an elevated risk.
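The whole calculation reduces to a one-line conjugate update. A sketch (mine; z and o are assumed to hold the deaths and exposures for the 10 reference hospitals, giving the gamma(Σ z_j, Σ o_j) prior from the charts):

a <- sum(z); b <- sum(o)                # gamma prior from the reference hospitals
y <- 1; d <- 0.066                      # this hospital: 1 death, exposure 0.066
qgamma(c(0.025, 0.975), a + y, b + d)   # 95% credible interval for lambda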
Wednesday, February 4, 2009
STAT 295 2/3/09
I have posted a revised Chart Set #5. Jeff noted some errors and they have been corrected. Unfortunately, since I moved to the latest version of MacOS, I am no longer able to produce 4-up pdf files, so this one (and some of the later ones) will be full size. I apologize for this and will consult with Small Dog to see if this can be fixed.
Jeff asked why the two forms of the likelihood for problem #4 (due 2/5) are equivalent. You should address this question in your turned-in assignment. Note that programming without loops is faster, so you should attempt doing that calculation without loops.
On Chart Set #4, Chart 24, we had been discussing robustness. We noted that the mean and standard deviations of the posterior distribution do depend (although fairly insensitively) on the prior. In particular, the beta prior had a smaller standard deviation and the mode was moved to the left, towards the peak of the beta prior, relative to where the mode was for the flat prior. This led to the notion of stable estimation, such that when we have a lot of data or very precise data, and the prior doesn't change much over the region where the likelihood peaks, then the results won't be sensitive to the prior.
We considered continuous examples, of which the beta-binomial model is an example. Jeff justified the density approximation P(a<Y<b | c-ε<X<c+ε) ≈ ∫_a^b f(c,y) dy / g(c). Thus, we can use ratios of joint to marginal densities in the continuous case, just as we can use the ratio of joint to marginal distributions in the discrete case.
We saw how increasing the amount of data tightens the posterior around the true value.
We then skipped to Chart Set #5. We will return to Chart Set #4 later.
We discussed the Poisson distribution and motivated it. Many things are well modeled as Poisson events, in addition to the ones on the chart set: requests to google.com, stars per square degree, etc.
We went on to the heart transplant mortality problem from Albert's book. The exposure for each patient is the probability that the particular patient will die in a given time frame after the operation. It depends on things like the patient's age, conditions like diabetes, and so on, and is presumed known from other studies. Then the exposures for each patient are added up over all the patients to estimate the overall exposure (risk) for the particular hospital. It is the expected number of patients that will die. The notes use the letter 'e', but Jeff remarked that it's easy to confuse that with the base of natural logarithms, so he changed it to 'd' in his chalkboard discussion. Then if Y is the random variable representing the observed number of deaths, the likelihood has Y~pois(λd) and λ is the parameter we wish to estimate for the hospital.
A gamma prior is chosen. It is flexible, with two adjustable parameters and pedagogically easy because it is the unique conjugate prior for a Poisson likelihood. However, it is a bad idea to use a prior simply because it makes calculations easy, since modern sampling techniques allow us to use any prior we wish, and if we know better, we should use it.
I've corrected Slide #8 on which z and o were transposed. See the chart set published above.
The heart transplant mortality problem will be continued next time.
Friday, January 30, 2009
STAT 295, 1/29/09
One of the problems involves the posterior predictive distribution. The idea here is that once we have the posterior distribution of the parameters θ, we can multiply this by the likelihood for new observations y, integrate out θ, and get a distribution on y|x that predicts what kinds of observations we would expect in the future. For example, we may have observed the orbit of a planet or the weather at various points in the past; the posterior predictive distribution would allow us to predict the position of the planet in the future or the weather at a later date.
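In symbols (my notation): the posterior predictive density is f(y|x) = ∫ f(y|θ) g(θ|x) dθ, where g(θ|x) is the posterior from the observed data x and f(y|θ) is the sampling model for the future observation y.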
Jeff showed why we can ignore constants independent of the parameters, like Choose(n,s) in the likelihood and Γ(a+b)/(Γ(a)Γ(b)) in the prior. Because these constants appear in both the numerator and the denominator, and because, being independent of the parameters, they can be taken outside the integral that defines the marginal distribution of the data, they simply cancel out.
I noted that care needs to be taken when comparing models or averaging over models. For example, one might have several models in mind (e.g., a linear and a quadratic model to fit a run of data). Since the models are different, the likelihoods and priors are also different and will have different normalizing factors. In such a case you cannot ignore those factors because to do so would make the different models incompatible.
We ran the R code and looked at the posterior distribution, as given by a sample of size 100,000. From this we noted that the results were fairly stable between the flat prior and the wide beta prior that Albert used. We noted that we will be making our inferences directly from the sample, using methods that don't require us to know the normalizing constant of the posterior distribution.
We went on to Chart Set 4. We discussed the summaries of the posterior mean and median, and pointed out that these can be derived by minimizing the loss given by the square of the difference between the true value and the estimated value, and the absolute value of that distance, respectively.
One chart had "HDR" on it as a label. This stands for "Highest Density Region" and is just a Bayesian credible interval. My recollection now is that the book from which I got this terminology, Samuel Schmitt's Measuring Uncertainty: An Elementary Introduction to Bayesian Statistics, really referred to the shortest credible interval.
We discussed confidence intervals and credible intervals. A confidence interval is an interval that describes the distribution of data on repeated hypothetical trials, given a particular parameter. It is a way of describing the statistical properties of the procedure that is used to calculate the interval under repeated sampling. A credible interval describes the distribution of the parameter, given the particular data set we have actually observed.
Although we can sometimes interpret a confidence interval numerically as a corresponding credible interval (e.g., linear regression with normal errors), and although sometimes credible intervals have good frequentist coverage properties and so can, again numerically, be used as a confidence interval, they are not the same thing at all. In Bayesian theory, data x, once observed, is regarded as fixed, and appears in the result through the likelihood function, which is not a probability density on the parameters θ. There, θ is considered a random variable. But in frequentist theory, θ is not a random variable, x is an exemplar of the random variable X. So it is important to keep these two ideas separate.
In my experience with physical scientists, a large fraction of them want to interpret a confidence interval as a distribution on θ. This is of course wrong, but that is a natural tendency. So it seems that many physical scientists are naturally Bayesians!
Jeff did a blackboard calculation to show that regardless of the prior, in the limit of large data (n→∞), the posterior mean in the sleep problem → s/n. With lots of data, the prior is pretty much irrelevant. (There are exceptions, which we will discuss later.)
Wednesday, January 28, 2009
Second Homework Set
You'll find the second homework set here. Please reread the discussion at the top of Tuesday's class regarding formatting homework and other things!
This set is due on February 5.
Tuesday, January 27, 2009
STAT 295 1/27/09
Important points about assignments:
- Assignments need to be put together carefully so that it is easy to grade them. They should start with a narrative that says what the assignment is about, then how you went about solving the problem, then the results. Use tables as appropriate to summarize. Put the R code into an appendix. As Jeff stated, statistics is not just about mathematics and programming, it is also about interpreting and conveying your results to others clearly.
- We ask you not to work alone. We prefer you to work in small groups of 2 or 3. Not only does this reduce the grading burden, more importantly it gives you practice in working with other colleagues, which is an essential aspect of professional statistical practice.
- Please type your assignments. This makes them easy to read.
- Be sure to turn in a hard copy of your assignment.
- Also, email R code to us; if the code isn't working right, we can then copy-paste into R which may make it easier to figure out what isn't right.
Please note that the likelihood function can be multiplied by any constant k>0 without changing the results from Bayes' theorem, because the numerator and the denominator will both be multiplied by k, which will cancel.
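To see the cancellation explicitly in the discrete case:
P(Ai|D) = kP(D|Ai)P(Ai) / Σj kP(D|Aj)P(Aj) = P(D|Ai)P(Ai) / Σj P(D|Aj)P(Aj).
The factor k appears in every term of the denominator's sum, so it cancels.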
The likelihood is actually an equivalence class of functions, whose members differ only by a positive constant multiple. It is important to understand that even though the likelihood P(D|Ai) has the same functional form as the sampling distribution of D given Ai, the two play different roles: the likelihood is always evaluated at the data D actually observed, and its important characteristic is how it varies as Ai varies for that fixed D, whereas the sampling distribution's important characteristic is how it varies as D varies for fixed Ai. And since the likelihood is only defined up to a constant multiple, it need not integrate or sum to 1 over the states of nature Ai; it is not a probability distribution on Ai.
We discussed the hemoccult test and the consequences of making wrong decisions. A false positive can lead to a colonoscopy, and colonoscopies can have adverse consequences for the patient, even death. On the other hand, a false negative means that a developing cancer may be missed. As Dr. Osler pointed out, the test's makers can adjust the reagents used in the test to produce more true positives, but only at the cost of more false positives, or vice versa. Some of this is driven by economics and insurance companies: the hemoccult test, despite its many faults, is very cheap (about 3 cents), whereas a colonoscopy is expensive (several thousand dollars) as well as riskier. So there are tradeoffs. A complete analysis really involves decision theory, which is outside the scope of this course, but we will mention some aspects of decision theory from time to time.
Jeff skipped several examples on the charts, which I may make some comments on later...stay tuned.
He went on to the capture-recapture problem for the fish population in a lake. We catch 60 fish, tag them, and release them. After the tagged fish have had time to mix with the untagged fish, we catch 100 fish and note that 10 are tagged. It is natural to estimate the number of fish in the lake at 600. Jeff noted that there is an extensive frequentist literature on this problem.
The Bayesian solution is quite simple. After some discussion, it was decided that the likelihood function is a hypergeometric distribution. In this case,
P(D|N) = C(n,x)·C(N-n, k-x) / C(N,k)   (C(a,b) denoting the binomial coefficient "a choose b")
where N=number of fish in the lake, n=number caught=100, k=number tagged=60, x=number of those caught that are tagged=10.
This likelihood can also be obtained without using hypergeometric functions by just considering the probability of sampling each tagged or untagged fish one-by-one, noting that the size of the sample and the number of fish of each type decrease by 1 each time a sample is tallied (sampling without replacement).
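To make the grid idea concrete, here is a small R sketch (my addition, not from class) that evaluates this likelihood over a range of candidate population sizes; the range 150:3000 is an arbitrary choice:

n <- 100   # fish caught in the second sample
k <- 60    # fish tagged in the first sample
x <- 10    # tagged fish observed among the n caught
N <- 150:3000                                        # candidate population sizes
lik <- choose(n, x) * choose(N - n, k - x) / choose(N, k)
N[which.max(lik)]   # the likelihood peaks near N = 600, the intuitive estimate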
We finished by starting the discussion of the sleep example from chapter 2 of the book. The beta prior was chosen by the investigator based on a notion of the mean and the 90th percentile of the sleepers. We also noted that if you have a beta prior in this binomial sampling situation, you will get a beta posterior. Also, even though the likelihood, prior and posterior all "look alike" in containing factors θ^u(1-θ)^v, the thing that is important in the likelihood is (u,v), the data, whereas the thing that is important in both the prior and the posterior is the unknown state of nature θ.
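In R the conjugate update is one line. A hedged sketch: the book elicits the prior from two quantiles using LearnBayes' beta.select, and the prior parameters and data counts below are the values as I recall them, so check chapter 2 for the exact numbers.

library(LearnBayes)
ab <- beta.select(list(p = 0.5, x = 0.3), list(p = 0.9, x = 0.5))  # roughly c(3.26, 7.19)
a <- ab[1]; b <- ab[2]
s <- 11; f <- 16   # heavy sleepers and others in the sample (values as I recall them)
curve(dbeta(x, a, b), from = 0, to = 1, lty = 2)   # prior
curve(dbeta(x, a + s, b + f), add = TRUE)          # posterior: again a beta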
Monday, January 26, 2009
Chart Set #4: Interpretation
Here's the next chart set, on ways to interpret the posterior distribution. See you on Tuesday.
Thursday, January 22, 2009
STAT 295 1/22/09
In class I mentioned advice on programming style. I recommend in particular Software Carpentry, which, although it is based on the Python language, has information that is useful for any programming project in a modern computer language. The website has mp3 files of a number of lectures on various aspects of programming, along with charts that go with the lectures. Besides programming style, it covers other very useful topics, such as version control software, which allows multiple programmers to work on the same project without stepping on each other's toes, and allows earlier versions to be resurrected in case a later version is "broken" in ways that are hard to trace.
More to come...
Sorry for the delay. I had to perform a complete backup on my computer, which took a long time.
Assignment: Learn R syntax as described by Jeff in his discussion.
With regard to the homework: You noted that the confidence interval coverage was not always as advertised. In fact, coverage falls short of the nominal level when np or n(1-p) is small; a common rule of thumb is that both np and n(1-p) should be at least 10 for the interval to perform as advertised.
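A quick simulation sketch (my addition, not from the homework) showing how the usual Wald interval's coverage degrades when np is small; the parameter values are illustrative:

n <- 20; p <- 0.05                       # np = 1, well below the rule of thumb
hits <- replicate(10000, {
  phat <- rbinom(1, n, p) / n
  se <- sqrt(phat * (1 - phat) / n)
  phat - 1.96 * se <= p && p <= phat + 1.96 * se
})
mean(hits)                               # typically far below the nominal 0.95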
We then reviewed some basic statistical notions: the sample mean and its relationship to expectations, how these are used in simple (frequentist) statistical estimation, and the meaning of variance and sample variance. The denominator (n-1) in the sample variance makes it unbiased (but note... the usual formula for the standard deviation, gotten by taking the square root of this, is not unbiased!).
If x is a discrete random variable that takes values x_1,...,x_q with probabilities p(x_1),...,p(x_q) respectively, then we define the expectation of x as E[x] = Σ_{i=1}^q x_i p(x_i).
More generally, for any function g(x) of x, we define the expectation of g(x) as E[g(x)] = Σ_{i=1}^q g(x_i) p(x_i).
For continuous random variables, the sum becomes an integral.
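For instance (my illustration, not from class), the expectation of a fair die roll computed directly from the definition in R:

xs <- 1:6; px <- rep(1/6, 6)   # values and probabilities for a fair die
sum(xs * px)                   # E[x] = 3.5
g <- function(x) x^2
sum(g(xs) * px)                # E[g(x)] = 91/6, about 15.17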
Repeated samples: How does the sample mean behave? This leads to confidence intervals, etc., based on the thought experiment "what happens if I take repeated samples?" But in reality we only take one sample. This is one feature that distinguishes Bayesian reasoning from frequentist reasoning... the Bayesian conditions on the one unique sample that we have, and doesn't ask what happens if we sample repeatedly in this thought experiment.
Tried some R code. We discussed in particular the invisible() function, accessing lists, and the fact that one must use print() within a function in order to get R to display something on the screen.
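A minimal sketch (mine) of the print()/invisible() point:

f <- function(x) {
  summary(x)          # computed, but not displayed when inside a function
  print(summary(x))   # print() is required to show it on the screen
  invisible(x)        # return x without auto-printing it at the prompt
}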
Probability densities are used for continuous random variables. We briefly set out the rules.
Note that a density can be zero over part of the interval and, unlike a probability, can be larger than 1.
In the definition of the beta density, we need the gamma function: Γ(a) = ∫_0^∞ x^(a-1) e^(-x) dx. For positive integers this gives a factorial: Γ(n) = (n-1)!. The gamma function extends the factorial to non-integer arguments (by the Bohr-Mollerup theorem it is the unique log-convex extension), which lets us interpolate the factorial function.
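In R the gamma function is built in, so you can check the factorial connection directly:

gamma(5)        # 24, the same as factorial(4)
factorial(4)    # 24
gamma(0.5)^2    # pi: the gamma function is defined between the integers too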
We started on the third chart set.
Bayes' picture, although widely seen on the web, is probably not of him.
Jeff gave his rant about naming theorems for the person instead of what it does. I have some sympathy for this viewpoint.
The proof of Bayes' theorem is trivial.
We regard Bayes' theorem as a model of learning. At any point in time, we have an opinion about hypotheses, parameters, etc., which is given by our prior probability distribution. When new data comes in, Bayes' theorem can be used to update our opinion, giving our posterior probability distribution.
Sometimes (as in the simulation approach we will use extensively in this course) you can bypass the computation of the denominator P(D) in Bayes' theorem, which is why we so often see it written as a proportionality. Until about 20 years ago, before the simulation methods were introduced, we could not bypass the computation of P(D), which was a major pain since it usually requires integrating over a high-dimensional space. So despite its elegance, the Bayesian calculation was difficult and was available only for very simple situations (e.g., normal regression). But that has all changed.
In particular, the denominator is given in the discrete case by ΣP(D|Ai)P(Ai) and in the continuous case by ∫p(x|θ)p(θ)dθ.
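In the discrete case the whole computation is a few lines of R. A sketch of mine, with three equally likely states of nature and a likelihood vector (the numbers anticipate the HH/HT/TT coin example discussed in an earlier class):

prior <- c(1, 1, 1) / 3                  # P(Ai) for the three coins HH, HT, TT
lik <- c(1.0, 0.5, 0.0)                  # P(see heads | coin)
post <- lik * prior / sum(lik * prior)   # the sum is the denominator P(D)
post                                     # 2/3, 1/3, 0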
Unfortunately, my notation on the slides is inconsistent...we ought to have written X instead of D and Θ instead of A. This will be fixed for the next time the class is taught.
Finally, I want to point out two other blogs that are useful. The author of the book, Jim Albert, has a blog for his course here. You may find his discussion of the election interesting. This blog is not very active. And, I read Andrew Gelman's blog daily. You'll find lots of discussion of Bayesian things there. Andrew is a statistician/political scientist, and he's particularly interested in voting patterns and other issues of this sort.
Tuesday, January 20, 2009
STAT 295 1/20/09
For now, here are the recently-posted charts for Thursday.
Jeff started the lecture by displaying the axioms of probability. Jeff prefers not to call the definition of conditional probability an axiom. I like to call it an axiom. Regardless, we need to introduce the concept of conditional probability, and this formula is central to everything that we'll be doing.
Jeff mentioned that you should try to finish the proof that P(A∨B)=P(A)+P(B)-P(A,B). He also showed how a partition of the space into a set of mutually exclusive and exhaustive alternatives can be used to get the marginalization formula P(B)=ΣP(B|Ai)P(Ai).
(I have now learned how to put a limited amount of mathematics into this blog. But I'll leave a link to a useful LaTeX equation translator in the blog here. You can paste a LaTeX equation into the box, then hit "Render Expression" to see what it looks like in "real" mathematical notation. So I recommend bookmarking this page.)
I apologize for the poor rendition of tables in the projected version; this seems to be an incompatibility between the Mac version of PowerPoint and the Windows version that is used on the computer in the classroom. I will try to fix this in the future. Meantime, the downloaded pdf's should be fine.
An example of a probability distribution is from the astronomical HD (Henry Draper) catalog, an early 20th century effort to classify stars according to their spectra, that is to say, the pattern of light and dark regions when you spread the light from a star out into its component colors. Stars of different temperatures have different patterns. In the early days, the stars were classified A, B, ... , O, but later it was found that some classes were duplicates, so they were merged. Also, it was discovered that the reason for the different patterns had to do with the temperature of the star, which ended up with a different order than the original one, namely, OBAFGKM, which many generations of mostly male astronomy students memorized using the vaguely sexist "Oh, be a fine girl, kiss me!" Later generations have used more neutral mnemonics.
I pointed out that this distribution, which is bimodal, is for the stars that appear brightest in the sky, and that the intrinsic distribution has the number of stars in each class increasing from left to right as the temperatures decrease. There are more M stars than any other class, but they are underrepresented in this distribution because M stars are intrinsically faint and thus not in the catalog. At the same time, the peak at A is due to the fact that these stars are very bright, and thus seen to very great distances, and so are overrepresented in this survey, which cut off at a faint limit.
We described the joint distribution of meteorite finds (you just walk along and spy a meteorite on the ground) versus falls (where the object is seen to fall and people go out and look for it). Stones and irons were the other classifier; stones look like stones, whereas iron meteorites are much denser and are magnetic; they also have a distinctive appearance. We talked about the marginal distributions, gotten by summing a row or column. I noted that amongst falls, stones were predominant whereas amongst finds, it's the irons that are most common. This is due to the fact that stones weather over time and eventually look just like terrestrial rocks, so are overlooked by most people. The irons, on the other hand, don't weather and continue to look different from terrestrial rocks. Presumably, the actual proportions of stones and irons in the solar system are more similar to those amongst the falls.
Jeff introduced the question of tossing a HH, a HT and a TT coin (using the notation on the charts). The question was, if you pick one of the three at random, and without looking at it toss it, and see that it comes up heads, what's the probability that the coin is HH? The (surprising to some) answer is that it is 2/3, not 1/2 as some surmised. Jeff then showed how this came about, first by drawing a probability tree and putting the P(HH)=P(HT)=P(TT)=1/3 branches at the root of the tree, and then the conditional probabilities P(S_H|HH)=1, P(S_H|HT)=1/2, etc., on the subbranches, and multiplying to get the joint probabilities at the leaves of the tree. Adding the joint probabilities that belong to S_H together we see that P(S_H)=1/2, and since P(S_H,HH)=1/3, dividing gives us P(HH|S_H)=2/3.
Another chart showed what happened when you flipped the tree.
Finally, the same thing was done using "natural frequencies." The idea here is to imagine a large number of identical cases, 3000 on the chart. Of these, 1000 will be a flip of the HH coin, and all of these will be seen as heads. 1000 will be a flip of the HT coin, but only 500 of these will be seen as heads. And of course, the 1000 flips of the TT coin will result in no heads being seen. But now there are twice as many cases where the HH coin was flipped and we see a head as when the HT coin is flipped. Again, we get 2/3.
The natural frequency method is especially good when trying to explain probability concepts to people who are mathematically naive. Keep it in your toolkit; it can be very useful for a professional statistician to know this. It's described in great detail in this book by Gerd Gigerenzer.
Jeff then described the cognitive dissonance experiment, which has been done by psychologists in various forms for many years. What is found is that if you give a monkey, for example, a choice between two items and it picks one, and if you then give the same monkey a choice between the item that was rejected and a third item, the monkey will tend to pick the third item over the one that was initially rejected. Psychologists had been explaining this by saying that the monkey rationalizes its initial rejection after the fact ("I really don't like that one."), but in 2007 it was discovered that a purely statistical explanation existed. The fact is that there are only three ways to rank three items such that the rejected item ranks below the one that was initially selected, and in two of those three ways, the third item ranks above the one that was rejected. So a lot of experiments on "cognitive dissonance" will have to be reexamined.
The example is related to the notorious Monty Hall Problem.
We discussed the Hubble Space Telescope. After the first launch, it was found that the solar panels had the unfortunate property of "creaking" when they went from the sun side of the Earth into the Earth's shadow. This caused "Hubble quakes," where the telescope would shake, producing unacceptably blurred images. We computed P(good data)=P(good data|no quake)P(no quake)+P(good data|quake)(1-P(no quake)).
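Plugging in made-up numbers (none were recorded in my notes; all three values below are hypothetical), the law-of-total-probability computation looks like this in R:

p_quake <- 0.3        # hypothetical P(quake)
p_good_noq <- 0.95    # hypothetical P(good data | no quake)
p_good_q <- 0.20      # hypothetical P(good data | quake)
p_good_noq * (1 - p_quake) + p_good_q * p_quake   # P(good data) = 0.725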
Jeff then introduced marginalization as an important tool in inference, and in particular in Bayesian inference where it is used all the time. By summing or integrating over parameters we are not interested in, we can obtain a distribution over the remaining parameters. We did this for the distribution of luminosity and temperature of stars, and for the genetics of coat color in rats. We skipped over the bridge example.
Jeff defined independence: A is independent of B iff P(A|B)=P(A) for all A and B. Equivalently, if A is independent of B then B is independent of A, and also if A and B are independent then P(A,B)=P(A)P(B) for all A and B. Any of these may be taken as the definition of independence, and the other relationships can be derived. We asked you to do this, i.e., show that the first definition implies the second and third, and that the third implies the first. (Homework, not to be turned in.)
We discussed the fact that higher 5-year survival rates for lung cancers detected by CT scans versus those discovered by X-rays can be a statistical artifact and need not mean that mortality (the age at which a patient would die of the disease) has been reduced. For one thing, CT scans will detect a tumor at a smaller size and hence an earlier time; the patient might still die at the same age. Also, CT scans will detect more indolent tumors, which will not kill the patient. Turner Osler, who is familiar with the research, pointed out that the research was funded by the tobacco industry, a fact that was not made clear when the paper was published. The subtext was, go ahead and smoke, just get CT scans. This is, of course, quite unethical.
Finally, Jeff showed two ways of computing the sample mean, and showed how the second one, in the limit, goes over to the usual way of computing expectations. He wanted me to add the term "relative" to one of the lines, thus "observed relative frequency of x".
Friday, January 16, 2009
Difficulties installing packages under Windows Vista
One person had difficulty installing LearnBayes under Windows Vista. An error message was returned, as follows:
In file.create(f.tg) : cannot create file 'C:\PROGRA~1\R\R-28~1.0/doc/html/packages.html', reason 'Permission denied'
Jeff suggests the following fix:
I'm not sure what the problem is, but I know I had problems using R and other software with vista until I turned the "user account control" feature to off. You'll need to do this. Turn it off and keep it off. I've had no ill effects over many months. Go to the windows security center and it'll be clear how to turn it off.
Thursday, January 15, 2009
STAT 295 1/15/09
New charts for Tuesday: Simple Examples
Assignment (due Thursday, 1/22):
Read Chapter 1 of the book; do problems 4 and 5 on pp. 16-17
In the lecture, Jeff talked about a number of useful features of R:
Statements like y>3 are logical, and if y is a vector you will get a vector of TRUE/FALSE values, component by component. Then you can use z[y>3] to select only those components of z at the positions where the corresponding element of y exceeds 3. The resulting vector will in general be shorter than the original z, since it keeps only the positions where the condition holds.
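For example:

y <- c(1, 5, 2, 7)
z <- c(10, 20, 30, 40)
y > 3       # FALSE TRUE FALSE TRUE
z[y > 3]    # 20 40: the components of z where y exceeds 3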
Jeff made a connection between data frames and other data structures, including those from SAS. I know nothing about SAS. But the basic idea here is that a row in a data frame represents an observation, and the individual entries may have different types; i.e., they do not all have to be numbers (as in a matrix): the first could be an integer, the second a logical value (TRUE, FALSE), the third a character string ("TRUE", "FALSE"), etc.
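A one-row illustration (mine):

df <- data.frame(id = 1L, passed = TRUE, name = "ann", stringsAsFactors = FALSE)
str(df)   # one observation: an integer, a logical, and a character column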
You can attach a data frame. This allows you to access the entries in the data frame more easily (i.e., with fewer keystrokes). However, this poses many risks. For one thing, you may have several data frames. Which one will be fetched? Also, the variables are global, and generally, good software practice advises not using global variables.
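A safer alternative I'd suggest is with(), which scopes the variable lookup to a single expression instead of putting the data frame on the global search path:

df <- data.frame(id = 1:3, score = c(10, 20, 30))
with(df, mean(score))   # 20, with none of attach()'s risks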
Lists: This is the best way to return multiple items from a function. You can write
return(list(x,y,z))
and get an object, the components of which are x, y, z, even if x is a number, y is a matrix, and z is a list itself of other objects.
You can access the items of a list using subscripts, e.g., if b=list(x,y,z) then
b[[1]]
will return x, whatever it is (matrix, vector, number, whatever).
And since b[[1]] is an object, it can be sub-accessed by whatever method is appropriate to its type; for example, if it were a vector, its 5th component could be obtained by writing
b[[1]][5]
If you write
f=function(...){return(list(a=x, b=y, c=z))}
then you will be able to access the objects of the returned list by using the names a, b, c. So, for example, if you've called the function f that returns that list, and said w=f(...), then writing w$b will give you the value of y that the function computed.
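Putting the pieces together (an illustrative sketch of mine, not code from the handout):

f <- function() {
  x <- 1
  y <- matrix(1:4, nrow = 2)
  list(a = x, b = y, c = list("anything", TRUE))
}
w <- f()
w$b          # the matrix y
w$b[2, 1]    # a single element of that matrix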
Jeff pointed out that the access to the R functions that compute standard statistical distributions like normal, binomial, etc., are prefixed by 'r', 'p', 'q', 'd', depending on the use of the desired function.
So, 'rnorm' generates a vector of normally distributed random variables, 'dnorm' calculates the normal density at the requested point, 'pnorm' gives the cumulative distribution function, and 'qnorm' is its inverse, the quantile function.
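For example, for the standard normal:

rnorm(3)        # three draws from the standard normal
dnorm(0)        # density at 0, about 0.3989
pnorm(1.96)     # cumulative probability, about 0.975
qnorm(0.975)    # quantile, about 1.96: the inverse of pnorm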
On control structures, we recommend using a smart editor that pairs your braces {} so as to make the logic of the program clear. Indentation control is important!
Jeff then went into the discussion of the basics of probability. He showed that if a hypothesis A predicts that we will see evidence E more probably than hypothesis B predicts that we will observe E, then when we observe E, we should think that the evidence E supports A more than it does B.
Bill went into a song-and-dance demonstration about the differences between Bayesian and frequentist interpretations of the meaning of probability. Everyone agreed that before the coin was tossed, the probability that it would be 'heads' was 0.5; but once it was tossed, the answers were all over the place. The majority opinion was something like "it is 1 or 0, but I don't know which." The minority opinion was "it is 0.5."
Then I looked at the coin, and knew what it was. I reported that I had done this, but I didn't say what I saw. Nothing much changed as regards your opinions.
Then I told you that I saw 'heads'. Some changed opinions (this is significant).
A student looked at the coin, and said that it was 'heads'. Some changed opinions (also significant).
The point of this experiment was to reveal a critical difference between Bayesian and frequentist interpretations of probability.
In the frequentist view, you cannot talk about the probability that the coin is 'heads' or 'tails' after the coin has been tossed. This view of probability cannot talk about the probability of events that have already happened, since they are fixed.
In the Bayesian view, you can talk about such probabilities, and indeed these are the probabilities that are most important. Bayesians regard the data as fixed, and the "states of nature," AKA whether the coin is 'heads' or 'tails', as the proper thing to describe by a probability distribution.
The reason is that in the Bayesian view, probability describes your willingness to believe that a particular state of nature is true. So the 'probability' is "in your head," rather than being "out there."
We can measure this by thinking about how much you would bet that a particular proposition was true. Regardless of whether the proposition is in the future (which team wins the Superbowl this year) or already determined but unknown to us (whether the 1,000,000th digit of the decimal expansion of π is '3'), you could still make a rational bet on the truth of the proposition.
Coherent betting behavior implies the axioms of probability theory. See this link for more information!
Jeff then picked up on this and presented the basic axioms of probability theory.
Tuesday, January 13, 2009
STAT 295 1/13/09
Just to remind you, the course home page is here. All assignments, charts, and other relevant material will be posted there.
The first items are the handout that Jeff wrote, on R; a chart on "preliminaries," which we did not use, but summarizes the syllabus that was handed out; and a chart set on the introduction to the basics of probability, which will be the next discussion after the introduction to R. This chart set has the fundamental equations we will be using all semester.
The usual frequentist and the Bayesian approaches to statistics differ in the way they use probability theory. The frequentist's main tool is the sampling distribution, that is, the probability distribution of the data, given some hypothesis (e.g., the null hypothesis of no effect, or the hypothesis that a parameter has a particular value). The frequentist looks at what happens if the data space is sampled repeatedly. This leads to notions like p-values, confidence intervals, and the other tools of the frequentist toolkit. The Bayesian approach is different: In Bayesian theory, the data are fixed at the actual value observed, and the probability distribution of interest is the one on the hypotheses. (In frequentist theory, this idea makes no sense, because in this view of statistics, only data, not hypotheses, can have a probability distribution; but the Bayesian viewpoint is different!)
Jeff lectured using his notes on R. We only got to the top of page 3, and the main things mentioned were:
- Your assignment is to install R on your computer (if you have one), and also to install the LearnBayes package that is companion to the Albert book. Google 'r', or click here.
- The R prompt is a "greater than" sign, >
- Get information about an R function by typing >?median , for example
- You can write >x=5, etc., using R as a fancy calculator with named variables.
- Vectors are not matrices. They have a length, but not dimensions. This is an annoying feature (see the top of p. 3 of the handout).
Here is a link to the R reference that Jeff noted on page 1 of the handout.
Good luck, and see you on Thursday!