Tuesday, February 17, 2009

Stat 295 2/17/09

We discussed problem #3 on problem set 3 and showed that the condition for equality was that f(y|x,θ)/f(y|θ) be independent of θ. We then saw that if (for the hospital example) x={y,z}, where z is the data from all the other hospitals, then f(y|x,θ)=f(y|y,z,θ)=1, and the ratio 1/f(y|θ) is not independent of θ, so in this case you can't make that substitution. But if f(y|y,z,θ)=1, then the posterior from trying to use y again is equal to the prior, so you can't use the data twice. Probability theory won't let you do that (legitimately).

Jeff talked about the curse of dimensionality. In high-dimensional parameter spaces you can't use the usual integration methods (e.g., integrating on a grid), since the number of grid points required grows exponentially with the dimension of the space. This is what kept Bayesian methods from wide use until the introduction of MCMC about 20 years ago. With MCMC, people now routinely analyze models with thousands of variables.

New chart set: Introduction to Bayesian Computation.

Start with the Beta-binomial model for overdispersion. We wish to model stomach cancer death rates for 20 cities in Missouri.

Let yj = the number of deaths in city j and nj the number of cases, for j = 1, 2, ..., 20.

An initial attempt at a model would be yj~bin(nj,θ), i.e., a common probability of death for all cities. However, there is greater variability among the observed numbers of deaths than this model can fit, because not all cities have the same death rate. This is known as "overdispersion" (a term that Jeff doesn't like).

Instead use yj~bin(nj,θj) with θj~beta(α,β), so each city gets its own death rate. We then wish to integrate the θj out of this model, and use the result as the likelihood.

The calculation is given in the chart set. We get

f(y|n,α,β) = C(n,y) B(α+y, n+β-y)/B(α,β)

where C(n,y) is the binomial coefficient "n choose y" and B(·,·) is the beta function (the normalizing constant of the beta density).

This is the likelihood for α and β, given data n,y.
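As a quick sanity check, here is a minimal sketch of this marginal likelihood in Python (my own illustration, not from the charts; y and n can be scalars or numpy arrays):

    import numpy as np
    from scipy.special import betaln, comb

    def log_betabinom(y, n, alpha, beta):
        # log f(y|n,alpha,beta) = log C(n,y) + log B(alpha+y, beta+n-y) - log B(alpha,beta)
        return (np.log(comb(n, y))
                + betaln(alpha + y, beta + n - y)
                - betaln(alpha, beta))

Because the cities are independent given (α,β), the full log likelihood is just the sum of log_betabinom over the 20 (yj, nj) pairs.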

Recall that if θ~beta(α,β) then E[θ] = α/(α+β) and Var(θ) = αβ/[(α+β)^2 (α+β+1)].

Reparameterize with η = E[θ] and K = α+β. K is a precision parameter, related to 1/Var(θ); indeed, Var(θ) = η(1-η)/(K+1).

α=Kη, β=K(1-η). K>0 and 0<η<1.
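As a quick numerical illustration (numbers made up, not from the charts): η = 0.01 and K = 100 give α = Kη = 1 and β = K(1-η) = 99, so E[θ] = 1/100 = 0.01 and Var(θ) = η(1-η)/(K+1) = 0.0099/101 ≈ 9.8×10^-5.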

For the prior, use independent priors for η and K:

g(η,K) ∝ 1/(η(1-η)) ⋅ 1/(K+1)^2.

The posterior for η and K is

g(η,K|y1,...,y20) ∝ Π_{j=1}^{20} [B(Kη+yj, K(1-η)+nj-yj)/B(Kη, K(1-η))] ⋅ 1/(η(1-η)) ⋅ 1/(K+1)^2.
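In code, this is a direct transcription (again a sketch with illustrative names; the binomial coefficients C(nj,yj) are omitted because they don't involve η or K, and y and n are the length-20 data arrays):

    import numpy as np
    from scipy.special import betaln

    def log_post_eta_K(eta, K, y, n):
        # log likelihood: sum over cities of log B(a+yj, b+nj-yj) - log B(a,b)
        a, b = K * eta, K * (1.0 - eta)
        loglik = np.sum(betaln(a + y, b + n - y) - betaln(a, b))
        # log prior: g(eta,K) proportional to 1/(eta(1-eta)) * 1/(K+1)^2
        logprior = -np.log(eta * (1.0 - eta)) - 2.0 * np.log(K + 1.0)
        return loglik + logprior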

Look at contour plot of the posterior (on charts).

The posterior runs way off the edge of the chart, which suggests using a log transform on K and a logit transform on η.

θ1=logit(η), θ2=log(K). Thus η=F(θ1) where the inverse logit, F, was defined last time. K=exp(θ2).

We need to compute the Jacobian. We have

dη/dθ1 = F(θ1)(1-F(θ1)) = Jη, and dK/dθ2 = exp(θ2) = JK. Since F(θ1) = η, we have Jη = η(1-η), which exactly cancels the prior: Jη⋅g(η) = η(1-η) ⋅ 1/(η(1-η)) = 1 in this parameterization, where g(η) is the prior on η. This may have motivated Albert's choice for the prior on η.

The prior on K in this parameterization is exp(θ2)/(1+exp(θ2))^2.

Contour plot in these variables looks much better.
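A sketch of the transformed posterior and its contour plot (the grid bounds are illustrative guesses, not values from the charts; scipy's expit is the inverse logit F):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.special import betaln, expit

    def log_post_theta(th1, th2, y, n):
        # back-transform: eta = F(theta1), K = exp(theta2)
        eta, K = expit(th1), np.exp(th2)
        a, b = K * eta, K * (1.0 - eta)
        loglik = np.sum(betaln(a + y, b + n - y) - betaln(a, b))
        # Jacobian-adjusted prior: the eta part is constant (J_eta * g(eta) = 1);
        # the K part is exp(theta2)/(1+exp(theta2))^2
        logprior = th2 - 2.0 * np.log1p(np.exp(th2))
        return loglik + logprior

    # contour plot of the log posterior over a grid in (theta1, theta2)
    t1, t2 = np.linspace(-8, -5, 100), np.linspace(2, 12, 100)
    Z = np.array([[log_post_theta(a, b, y, n) for a in t1] for b in t2])
    plt.contour(t1, t2, Z - Z.max(), levels=np.arange(-10, 0))
    plt.xlabel("theta1 = logit(eta)"); plt.ylabel("theta2 = log(K)")
    plt.show()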

Next: approximate this posterior density with a normal distribution. The contour plot looks pretty non-normal, so the fit won't be great; nonetheless, we can get an idea of what's going on from it. We need to know the means, the variances, and the covariance. How do we get these?

Laplace Approximation: Derive approximation to the posterior that's easy to calculate.

The approximation can be useful when the posterior is unimodal, which is typically true when we have enough data. How much is enough? A rule of thumb (Kass & Raftery, Journal of the American Statistical Association, Vol. 90, No. 430 (Jun., 1995), pp. 773-795) says that we need at least 20 times as many observations as parameters for the Laplace approximation to work reasonably well (p. 778; note the discussion and cautions there).

Here's how to do it. The Taylor series expansion says

h(θ) ≈ h(θ0) + h'(θ0)(θ-θ0) + 0.5 h"(θ0)(θ-θ0)^2.

The vector version of this is in the notes. The matrix h"(θ0) of second derivatives is the Hessian.

h(θ) ≈ h(θ0) + ∇h(θ0)^T (θ-θ0) + 0.5 (θ-θ0)^T h"(θ0) (θ-θ0)

We want θ0 to be chosen as the posterior mode, so that the linear term vanishes.

We want h(θ) = log[f(y|θ)g(θ)] ≈ h(θ0) + 0.5 (θ-θ0)^T h"(θ0)(θ-θ0).

When you exponentiate this, the result is a multivariate normal distribution. The mean is the posterior mode, and the variance-covariance matrix is the inverse of the Hessian matrix.

To sum up, θ|y ≈ N(θ0, -[h"(θ0)]^(-1)).
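A minimal sketch of the whole recipe with scipy, assuming the log_post_theta function and data arrays from above (the starting point x0 is a guess). BFGS maintains an approximation to the inverse Hessian of the function it minimizes; since we minimize -h, that is exactly the covariance -[h"(θ0)]^(-1) we want, though for more accuracy one could compute the Hessian at the mode by finite differences:

    import numpy as np
    from scipy.optimize import minimize

    # maximize h = log posterior by minimizing -h
    fit = minimize(lambda th: -log_post_theta(th[0], th[1], y, n),
                   x0=np.array([-7.0, 6.0]), method="BFGS")
    mode = fit.x        # posterior mode, theta0
    cov = fit.hess_inv  # approximates -[h''(theta0)]^(-1)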

The Laplace approximation is used much less than in the past, because MCMC techniques allow us to sample from the exact posterior distribution to obtain the information we need. It can still be useful in formulating a "proposal distribution," which is used in MCMC sampling.
