We discussed problem #3 on problem set 3 and showed that the condition for equality is that f(y|x,θ)/f(y|θ) be independent of θ. We then saw that if (in the hospital example) x={y,z}, where z is the data from all the other hospitals, then f(y|x,θ)=f(y|y,z,θ)=1, so the ratio equals 1/f(y|θ), which does depend on θ, and in this case you can't make this substitution. Moreover, since f(y|y,z,θ)=1, the "posterior" from trying to use y again is equal to the prior, so you can't use the data twice. Probability theory won't let you do that (legitimately).
Jeff talked about the curse of dimensionality. When you get to high-dimensional parameter spaces, you can't use the usual integration methods (e.g., integrating on a grid, since the number of points required is exponentially large in the dimension of the space). This is what kept Bayesian methods from wide usage until the introduction of MCMC about 20 years ago. But with MCMC people are now routinely analyzing models with thousands of variables.
New chart set: Introduction to Bayesian Computation.
Start with the Beta-binomial model for overdispersion. We wish to model stomach cancer death rates for 20 cities in Missouri.
Let yj=number of deaths at city j, nj the number of cases. j=1,2,...,20.
An initial attempt at a model would be yj~bin(nj,θ), i.e., a common probability of death for all cities. However, there is more variability in the death counts than this model can accommodate, because not all cities have the same death rate. This is known as "overdispersion" (a term that Jeff doesn't like).
Instead use yj~bin(nj,θj) with θj~beta(α,β) independently for each city. We then integrate each θj out of the model and use the resulting marginal distribution as the likelihood.
The calculation is given in the chart set. We get
f(y|n,α,β) = C(n,y) B(α+y, β+n−y) / B(α,β)
where C(n,y) is the binomial coefficient and B(·,·) is the beta function (the normalization constant of the beta density).
This is the likelihood for α and β, given data n,y.
Recall that if θ~beta(α,β) then E[θ]=α/(α+β) and Var(θ)=αβ/[(α+β)²(α+β+1)].
Reparameterize with η=E[θ], K=α+β. K is a precision parameter, related to 1/Var(θ). Indeed, Var(θ)=η(1-η)/(K+1)
α=Kη, β=K(1-η). K>0 and 0<η<1.
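These identities are easy to check numerically. A minimal Python sketch (illustrative only — the course software is R):

```python
# Check the beta mean/variance formulas and the (eta, K) reparameterization
# numerically (Python sketch; the course software is R).
import numpy as np

alpha, beta = 2.0, 5.0
mean_formula = alpha / (alpha + beta)       # E[theta]
var_formula = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

eta, K = mean_formula, alpha + beta         # eta = E[theta], K = alpha + beta
var_reparam = eta * (1 - eta) / (K + 1)     # same variance in the new form

rng = np.random.default_rng(0)
draws = rng.beta(alpha, beta, size=500_000) # Monte Carlo confirmation
```

The two variance expressions agree exactly, and the Monte Carlo mean and variance match the formulas to sampling error.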
For prior, use independent priors for eta and K:
g(η,K) ∝ 1/(η(1−η)) · 1/(K+1)².
The posterior for η and K is
g(η,K|y1,...,y20) ∝ ∏_{j=1}^{20} [B(Kη+yj, K(1−η)+nj−yj)/B(Kη, K(1−η))] · 1/(η(1−η)) · 1/(K+1)².
Look at contour plot of the posterior (on charts).
The chart goes way off the map, and suggests using a log transform on K and a logit on η.
θ1=logit(η), θ2=log(K). Thus η=F(θ1) where the inverse logit, F, was defined last time. K=exp(θ2).
We need to compute the Jacobian. We have
dη/dθ1 = F(θ1)(1−F(θ1)) = Jη, and dK/dθ2 = exp(θ2) = JK. It turns out that Jη·g(η) = 1 in this parameterization, where g(η) is the prior on η. This may have motivated Albert's choice of prior for η.
The prior on K in this parameterization is exp(θ2)/(1+exp(θ2))².
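Putting the likelihood, prior, and Jacobians together, the log posterior in (θ1, θ2) can be sketched as below. The y and n values are made-up placeholders, not the Missouri data from Albert's book; only the formulas follow the chart set.

```python
# Sketch of the beta-binomial log posterior in theta1 = logit(eta),
# theta2 = log(K).  The (y, n) values are hypothetical placeholders.
import numpy as np
from scipy.special import betaln, expit

y = np.array([0, 1, 2, 0, 1, 3, 0, 1, 1, 2])        # hypothetical death counts
n = np.array([50, 60, 80, 40, 55, 90, 45, 70, 65, 75])

def log_post(theta1, theta2):
    eta, K = expit(theta1), np.exp(theta2)
    a, b = K * eta, K * (1 - eta)                   # alpha = K*eta, beta = K*(1-eta)
    loglik = np.sum(betaln(a + y, b + n - y) - betaln(a, b))
    # prior 1/(eta(1-eta)(K+1)^2) times the Jacobians eta(1-eta) and K
    # leaves log prior = theta2 - 2*log(1 + exp(theta2))
    logprior = theta2 - 2 * np.log1p(np.exp(theta2))
    return loglik + logprior
```

With small observed rates, the posterior should strongly favor small η, i.e., negative θ1.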
Contour plot in these variables looks much better.
Next thing: Approximate this posterior density with a normal approximation. The contour plot looks pretty non-normal, so the fit won't be great. Nonetheless, we can get an idea of what's going on from this. Need to know means, variances and the covariance. How do we get these?
Laplace Approximation: Derive approximation to the posterior that's easy to calculate.
The approximation can be useful when the posterior is unimodal, which is typically true when we have enough data. How much is enough? A rule of thumb (Kass & Raftery, Journal of the American Statistical Association, Vol. 90, No. 430 (Jun., 1995), pp. 773-795) says that we need 20 times as much data as the number of parameters for the Laplace approximation to work reasonably well (p. 778; note the discussion and cautions).
Here's how to do it. The Taylor series expansion says
h(θ) ≈ h(θ0) + h′(θ0)(θ−θ0) + ½h″(θ0)(θ−θ0)².
The vector version of this is in the notes. The matrix h"(θ0) of second derivatives is the Hessian.
h(θ) ≈ h(θ0) + ∇h(θ0)ᵀ(θ−θ0) + ½(θ−θ0)ᵀ h″(θ0)(θ−θ0)
We want θ0 to be chosen as the posterior mode, so that the linear term vanishes.
Want h(θ) = log[f(y|θ)g(θ)] ≈ h(θ0) + ½(θ−θ0)ᵀ h″(θ0)(θ−θ0).
When you exponentiate this, the result is a multivariate normal distribution. The mean is the posterior mode, and the variance-covariance matrix is the inverse of the Hessian matrix.
To sum up, θ|y ≈ N(θ0, −[h″(θ0)]⁻¹).
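A generic Python sketch of the recipe: maximize h numerically, form the Hessian at the mode by finite differences, and invert. As a sanity check we apply it to an exactly quadratic h (a Gaussian log density with made-up mean and covariance), where the Laplace approximation is exact.

```python
# Sketch of the Laplace recipe: maximize h, take the Hessian at the mode,
# approximate the posterior by N(mode, -H^{-1}).  The quadratic h below is
# a check case where the method is exact.
import numpy as np
from scipy.optimize import minimize

mu = np.array([1.0, -2.0])                  # "true" mean for the check
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # "true" covariance
P = np.linalg.inv(Sigma)

def h(theta):                               # log posterior, up to a constant
    d = theta - mu
    return -0.5 * d @ P @ d

def laplace(h, theta_start, eps=1e-4):
    mode = minimize(lambda t: -h(t), theta_start).x
    k = mode.size
    H = np.empty((k, k))                    # finite-difference Hessian of h
    I = np.eye(k)
    for i in range(k):
        for j in range(k):
            ei, ej = eps * I[i], eps * I[j]
            H[i, j] = (h(mode + ei + ej) - h(mode + ei - ej)
                       - h(mode - ei + ej) + h(mode - ei - ej)) / (4 * eps**2)
    return mode, np.linalg.inv(-H)          # approximate mean and covariance

mode, cov = laplace(h, np.zeros(2))
```

For this check case, `mode` recovers mu and `cov` recovers Sigma.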
The Laplace approximation is used much less than in the past, because MCMC techniques allow us to sample from the exact posterior distribution to obtain the information we need. It can still be useful in formulating a "proposal distribution," which is used in MCMC sampling.
Tuesday, February 17, 2009
Thursday, February 12, 2009
STAT 295 2/12/09
Happy 200th birthday, Charles Darwin!
Today we discussed the problem (#3) regarding the non-independent data. If you can't write f(x,y|θ)=k(x,y)fy(y|θ)fx(x|θ), then you cannot treat the sequential problem as if the y variable is independent of x (that is, the calculation given in the analogous problem in problem set #2 does not work). I need to do a bit of research on the k(x,y) piece of this.
Another example: Bioassay experiment. First: transforming parameters to (-∞,∞) is often a good idea. In this case we used a log transform. Similarly, the probabilities are transformed from (0,1) to (-∞,∞) using the logit transformation:
q=logit(p)=log(p/(1-p))
with q = β0 + β1x, where p = F(q) = (1+exp(−q))⁻¹ and F is the inverse logit.
log(p/(1-p))=logit(p)=β0+β1x
β1 is interpreted as the difference in log odds for a 1-unit change in x (the dose).
Logistic regression is very important in statistics. Working statisticians will see it over and over again.
Note that F'(v)=F(v)(1-F(v))
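This identity, and the fact that F inverts the logit, are easy to confirm numerically (Python sketch):

```python
# Numerical check that F'(v) = F(v)(1 - F(v)) for the inverse logit F,
# and that F inverts the logit transform.
import numpy as np

def F(v):                                   # inverse logit
    return 1.0 / (1.0 + np.exp(-v))

v = np.linspace(-4, 4, 81)
h = 1e-6
numeric = (F(v + h) - F(v - h)) / (2 * h)   # central-difference derivative
analytic = F(v) * (1 - F(v))                # the claimed identity

p = np.array([0.1, 0.5, 0.9])
q = np.log(p / (1 - p))                     # logit(p)
roundtrip = F(q)                            # should give p back
```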
We have animal data from the book, and wish to make inference on the βs.
The likelihood is binomial in p_i (i indexes the group), and then we take the product over all groups of animals. This is because each animal is in one of two states (it lived or it died), with probability p (given by the inverse logit) that it died and (1−p) that it lived. So the likelihood is the product of p's and (1−p)'s over all the animals in the study.
We'll use a sampling approach on a discretized version of β0, β1. But need to get an idea of how far to go. Go for the mode (MLE since priors are flat). Then estimate the standard errors, go out several standard errors and use that.
We did this in R using the R function glm (generalized linear model).
We recommend looking at functions in Albert's LearnBayes package to see how he did it. You'll learn new things about R, and have a better idea about what Albert is doing.
We looked at the plot (Figure 4.3 of the book) and saw that we've got to go pretty far out (farther than Jeff guessed) in order to get a decent sample.
How do we sample from the distribution? The contour plot gives us an idea. Chop the region up into a lot of little squares. We have something that is proportional to the density, and the height of this surface is (proportional to) the probability that θ takes that value. We can use the sample() function to do this (sampling with replacement). We did this, and the sample in Fig. 4.4 is the result. Looks pretty good.
We want to compute LD-50, the dose at which 50% of the animals die. Thus
Pr(death|x) = 1/(1+exp(−(β0+β1x))) = 0.5
⇒ 1 = exp(−(β0 + β1·x_LD50))
⇒ 0 = β0 + β1·x_LD50, so x_LD50 = −β0/β1
Now look at the Bayesian analysis. From the sample, compute −β0/β1 for each sampled pair of βs, then look at the 95% credible interval of those values. We did this, looked at the histogram, and computed the 95% credible interval. (The result is in log dose.)
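The whole "discretize, sample, transform" pipeline can be sketched in a few lines. The four-group data below are the classic bioassay numbers (treat them as illustrative), the grid limits are the kind a contour plot would suggest, and the prior is flat; the course did this in R, so this Python version is just a sketch:

```python
# Sketch of the grid approach: discretize (beta0, beta1), normalize, sample
# with replacement (like R's sample()), then form LD50 = -beta0/beta1.
import numpy as np

x = np.array([-0.86, -0.30, -0.05, 0.73])   # log dose
n = np.array([5, 5, 5, 5])                  # animals per group
y = np.array([0, 1, 3, 5])                  # deaths

def log_post(b0, b1):
    # flat prior, so the log posterior is the binomial log likelihood
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    p = np.clip(p, 1e-12, 1 - 1e-12)        # guard log(0) at extreme grid points
    return np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

# grid wide enough to contain essentially all the posterior mass
b0g = np.linspace(-5, 10, 200)
b1g = np.linspace(-10, 40, 200)
B0, B1 = np.meshgrid(b0g, b1g)
lp = np.vectorize(log_post)(B0, B1)
prob = np.exp(lp - lp.max())                # unnormalized posterior on the grid
prob /= prob.sum()

rng = np.random.default_rng(1)
idx = rng.choice(prob.size, size=5000, replace=True, p=prob.ravel())
b0s, b1s = B0.ravel()[idx], B1.ravel()[idx]

ld50 = -b0s / b1s                           # in log-dose units
lo, hi = np.quantile(ld50, [0.025, 0.975])  # 95% credible interval
```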
We also used the density() function to draw the posterior density for the marginal of LD-50. This uses a nonparametric kernel estimator to draw a smooth line. β1 is the log odds ratio for a unit change in dose. To get the odds ratio, exponentiate.
(The nonparametric kernel estimator works this way: At each point of the sample, we draw a peaked, narrow bell-shaped curve with unit area. We then add the curves for the whole sample and divide by the number of samples to get an approximation to the probability density curve.)
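A bare-bones Python version of that recipe (a unit-area Gaussian bump at each sample point, summed and divided by the number of samples; the bandwidth and the standard-normal test samples are arbitrary choices for illustration):

```python
# Bare-bones Gaussian kernel density estimate: one unit-area bell curve per
# sample point, averaged over the sample.
import numpy as np

def kde(samples, grid, bandwidth):
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)               # average of the per-sample bumps

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=2000)   # pretend these are posterior draws
grid = np.linspace(-4, 4, 161)
dens = kde(samples, grid, bandwidth=0.3)

area = dens.sum() * (grid[1] - grid[0])     # should be close to 1
```

R's density() does essentially this, with an automatic bandwidth choice.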
Sheila asked about hypothesis testing. We can get one-sided credible intervals (analogous to one-sided confidence intervals) quite simply. For example, if s$y contains the samples of some quantity β, we can compute the probability that β>0 as mean(s$y > 0).
I pointed out that there is no Bayesian analog to frequentist two-sided hypothesis testing. We'll get to this later, when we discuss hypothesis testing.
Wednesday, February 11, 2009
Tuesday, February 10, 2009
STAT 295 2/10/09
I made some remarks on the homework. Won't repeat them here since they will appear on the papers that were turned back.
We talked about Jacobians (named for Carl Gustav Jacob Jacobi, a 19th-century mathematician; he was German, not French, as I had misremembered). I gave my reasoning, which relied on the change of variables when doing an integral:
∫p(u)du = ∫p(u(v))(du/dv)dv = ∫p(v)dv
The idea here is that if u is a variable and we transform to v, the integral has to be unchanged since it is the same probability that we're talking about.
The du/dv factor here is the Jacobian. Whenever you change variables in a probability statement, you have to include a Jacobian. So the above equation says that
p(v)=p(u)(du/dv).
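Here is a quick numerical check of that formula for the transform v = log(u) with u ~ Exponential(1), so du/dv = exp(v) and p(v) = exp(−exp(v))·exp(v): a histogram of transformed draws should match the transformed density.

```python
# Numerical check of p(v) = p(u)(du/dv) for v = log(u), u ~ Exponential(1):
# du/dv = exp(v), so p(v) = exp(-exp(v)) * exp(v).
import numpy as np

rng = np.random.default_rng(3)
u = rng.exponential(1.0, size=200_000)
v = np.log(u)

edges = np.linspace(-6, 3, 91)              # 0.1-wide bins
hist, _ = np.histogram(v, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
p_v = np.exp(-np.exp(centers)) * np.exp(centers)   # p_u(e^v) * du/dv

max_err = np.max(np.abs(hist - p_v))        # should be small everywhere
```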
Jeff gave a different (but entirely equivalent) argument. His argument (which is hard to reproduce here, so I hope you got good notes) also starts with integrating, here to compute the probability that θ is less than x. That is an integral from -∞ to x of the density on θ. But if we transform to λ via θ=g(λ), then you get
Pr(θ<x) = Pr(λ<g⁻¹(x)) = ∫_{−∞}^{g⁻¹(x)} p(λ)dλ (taking g to be increasing).
To go from the distribution (Pr) to the density (p), you take the derivative with respect to x. When you do this, you use the fundamental theorem of calculus (the derivative of the integral is the integrand evaluated at the upper limit) and the chain rule (you have to multiply by dg⁻¹(x)/dx, which is the Jacobian).
On the homework for Thursday, Jeff generalized the problem to the following condition: assume that there does not exist k(x,y), independent of θ, such that
fx(x|θ)fy(y|θ)=k(x,y)f(x,y|θ)
Then you should be able to show that you can't pretend that x and y are independent when you calculate the posterior distribution given x and y. You have to use the factorization f(x,y|θ) = f(y|x,θ)f(x|θ); in other words, the full conditional probability formula is required.
The corrections to the Albert book are here. You have the most recent (third) printing, in all likelihood, so most of these won't apply.
We then discussed the normal likelihood with both mean (μ) and variance (σ2) unknown.
The charts have the equations. We took a prior that is flat in μ and proportional to 1/σ² in σ². We'll justify these priors as "noninformative" later in the course. Both priors are improper, but if the posterior is proper there will not be a problem.
The key observation here is that the posterior factors into a piece that is free of μ times another piece involving both μ and σ², so we can factorize the posterior as g(σ²)g(μ|σ²). We observed that the latter piece is normal, with mean equal to the mean of the y's and variance σ²/n, with σ² obtained by sampling. Then we found that the marginal distribution of σ² is a scaled inverse chi-square distribution with (n−1) degrees of freedom, where n is the number of data points. Jeff discussed several ways of sampling from an inverse chi-square distribution, and they are included in his R code, which will be found here.
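The two-step sampling scheme (draw σ² from its scaled inverse chi-square marginal, then μ given σ²) can be sketched as follows; the data are simulated for illustration, since Jeff's dataset and R code aren't reproduced here:

```python
# Sketch: two-step draw from the joint posterior of (mu, sigma^2) under the
# flat-on-mu, 1/sigma^2 prior.  Data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(10.0, 2.0, size=30)          # illustrative data
n, ybar, s2 = y.size, y.mean(), y.var(ddof=1)

m = 20_000
# sigma^2 | y ~ (n-1) s^2 / chi^2_{n-1}   (scaled inverse chi-square)
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=m)
# mu | sigma^2, y ~ N(ybar, sigma^2 / n), one draw per sigma^2 draw
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

lo, hi = np.quantile(mu, [0.025, 0.975])    # 95% credible interval for mu
```

The resulting marginal for μ is a t with (n−1) degrees of freedom centered at ȳ, which is why the credible interval here matches the frequentist t interval.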
Jeff then used the code to plot contours of the posterior distribution, plot sample points, and compute quantiles and other quantities of interest. It is known that the frequentist confidence interval on μ is given by a t distribution with (n−1) degrees of freedom, and Jeff computed that as well as the Bayesian credible interval from the sample. They coincided. This is one example where the credible interval and the confidence interval are the same.
Undergraduate opportunity
The Statistical and Applied Mathematical Sciences Institute (SAMSI), in Research Triangle Park, NC, is hosting a one week undergraduate workshop for college juniors and seniors focused on SAMSI research activities related to the statistical and applied mathematical modeling and analysis of experimental data. During the first day, a summary of research activities in the 2008-2009 programs on Sequential Monte Carlo Methods, and Algebraic Methods in Systems Biology and Statistics will be presented. In days two through five, participants will be involved in a hands-on experience. They will use mathematical and statistical models to analyze experimental data they collect in the CRSC/Math Instructional Research Lab on the NC State University campus.
Tutorials on modeling, mathematical and statistical methodology and on the physical experiments being used will be given. Participants working together in small teams will collect data and analyze it using mathematical and statistical software provided.
REGISTRATION
Applicants should use the on-line application form and also have one letter of recommendation. Full financial support for travel expenses, subsistence and lodging in university housing will be provided for all attendees. Due to space considerations, participation is restricted and will be offered to approximately 18 individuals selected from the applicant pool. Participants are expected to arrive for the workshop on Sunday, May 17, 2009 and remain in continuous attendance until 12:00 pm on Friday, May 22, 2009.
Further details, including information on how to register and where to send letters of recommendation can be found here.
[Note: Apply early, if you are interested!!!]
Applications will be considered beginning February 10, 2009 and continuing until April 3, 2009, but registration will likely be closed before that date, as workshop capacity is reached. Successful applicants will be notified of their acceptance as soon as a decision on their application is reached.
Upon acceptance for the program, individuals must confirm by email within three days their intention to participate (otherwise their place will be given to another applicant). Accepted participants will be advised regarding airfares and will need to purchase plane tickets with a three-week advance fare.
Please direct questions concerning the workshop to ugworkshop200905@samsi.info.