Wednesday, February 11, 2009
Tuesday, February 10, 2009
STAT 295 2/10/09
I made some remarks on the homework. Won't repeat them here since they will appear on the papers that were turned back.
We talked about Jacobians (named for Carl Gustav Jacob Jacobi, a 19th-century mathematician; he was German, not French as I had misremembered). I gave my reasoning, which relied on the change of variables when doing an integral:
∫p(u)du = ∫p(u(v))(du/dv)dv = ∫p(v)dv
The idea here is that if u is a variable and we transform to v, the integral has to be unchanged since it is the same probability that we're talking about.
The factor du/dv here is the Jacobian. Whenever you change variables in our probability statements, you have to include a Jacobian. So the above equation says that
p(v)=p(u)(du/dv).
Jeff gave a different (but entirely equivalent) argument. His argument (which is hard to reproduce here, so I hope you got good notes) also starts with integrating, here to compute the probability that θ is less than x. That is an integral from -∞ to x of the density on θ. But if we transform to λ via θ=g(λ), then you get
Pr(θ < x) = Pr(λ < g⁻¹(x)) = ∫ from −∞ to g⁻¹(x) of p(λ)dλ   (assuming g is increasing)
To go from the distribution (Pr) to the density (p) you take the derivative with respect to x. When you do this, you use the fundamental theorem of calculus (the derivative of the integral is the function under the integral) and the chain rule (you have to multiply by dg⁻¹(x)/dx, which is the Jacobian).
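As a check on the formula, here is a small numerical experiment (my own sketch, not something we did in class): draw u from an exponential distribution, transform to v = log(u), and confirm that the density computed with the Jacobian matches a histogram of the transformed samples.

```r
# Numerical check of the Jacobian formula (sketch, not from class).
# Take u ~ Exponential(1) with density p(u) = exp(-u), and transform
# v = log(u), so u(v) = exp(v) and du/dv = exp(v).  The formula gives
# p(v) = p(u(v)) * du/dv = exp(-exp(v)) * exp(v).
set.seed(1)
u <- rexp(1e5)
v <- log(u)

p_v <- function(v) exp(-exp(v)) * exp(v)

# Compare the formula to a histogram density estimate at the bin midpoints
h <- hist(v, breaks = 100, plot = FALSE)
max(abs(h$density - p_v(h$mids)))   # small: the histogram matches the formula
```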
On the homework for Thursday, Jeff generalized the problem to the following condition: assume that there does not exist a k(x,y), independent of θ, such that
f_x(x|θ)f_y(y|θ) = k(x,y)f(x,y|θ)
Then you should be able to show that you can't pretend that x and y are independent when you calculate the posterior distribution given x and y. You have to use the formulation f(x,y|θ) = f(y|x,θ)f(x|θ); in other words, the full conditional probability formula is required.
The corrections to the Albert book are here. You have the most recent (third) printing, in all likelihood, so most of these won't apply.
We then discussed the normal likelihood with both mean (μ) and variance (σ²) unknown.
The charts have the equations. We took a prior that is flat on μ and proportional to 1/σ² on σ². We'll justify these priors as "noninformative" later in the course. Both priors are improper, but if the posterior is proper there will not be a problem.
The key observation here is that the posterior has a piece that is independent of μ, times another piece that involves both μ and σ². This means that we can factor the posterior as g(σ²)g(μ|σ²). We observed that the latter piece is normal, with mean equal to the mean of the y's and variance σ²/n, where σ² is obtained by sampling. Then we found that the marginal distribution of σ² is a scaled inverse chi-square distribution with (n−1) degrees of freedom, where n is the number of data points. Jeff discussed several ways of sampling from an inverse chi-square distribution, and they are included in his R code, which will be found here.
Jeff then used the code to plot contours of the posterior distribution, plot sample points, and compute quantiles and other things of interest. It is known that the frequentist confidence interval on mu is given by a t distribution with (n-1) degrees of freedom, and Jeff computed that as well as the Bayesian credible interval from the sample. They coincided. This is one example where the credible interval and the confidence interval are the same.
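The sampling scheme described above can be sketched in a few lines of R (my sketch, not Jeff's code; the data vector is made up for illustration):

```r
# Sketch: sampling the joint posterior for (mu, sigma^2) under the prior
# flat on mu and 1/sigma^2 on sigma^2.  The data here are hypothetical.
set.seed(2)
y <- c(10.2, 9.8, 11.1, 10.5, 9.9, 10.7, 10.0, 10.4)
n <- length(y); ybar <- mean(y); s2 <- var(y)

m <- 10000
# sigma^2 | y  ~  scaled inverse chi-square with (n-1) df and scale s^2:
sigma2 <- (n - 1) * s2 / rchisq(m, df = n - 1)
# mu | sigma^2, y  ~  Normal(ybar, sigma^2 / n):
mu <- rnorm(m, mean = ybar, sd = sqrt(sigma2 / n))

# 95% credible interval for mu from the sample
quantile(mu, c(0.025, 0.975))

# Frequentist t interval with (n-1) df -- the two essentially coincide
ybar + qt(c(0.025, 0.975), df = n - 1) * sqrt(s2 / n)
```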
Undergraduate opportunity
The Statistical and Applied Mathematical Sciences Institute (SAMSI), in Research Triangle Park, NC, is hosting a one week undergraduate workshop for college juniors and seniors focused on SAMSI research activities related to the statistical and applied mathematical modeling and analysis of experimental data. During the first day, a summary of research activities in the 2008-2009 programs on Sequential Monte Carlo Methods, and Algebraic Methods in Systems Biology and Statistics will be presented. In days two through five, participants will be involved in a hands-on experience. They will use mathematical and statistical models to analyze experimental data they collect in the CRSC/Math Instructional Research Lab on the NC State University campus.
Tutorials on modeling, mathematical and statistical methodology and on the physical experiments being used will be given. Participants working together in small teams will collect data and analyze it using mathematical and statistical software provided.
REGISTRATION
Applicants should use the on-line application form and also have one letter of recommendation. Full financial support for travel expenses, subsistence and lodging in university housing will be provided for all attendees. Due to space considerations, participation is restricted and will be offered to approximately 18 individuals selected from the applicant pool. Participants are expected to arrive for the workshop on Sunday, May 17, 2009 and remain in continuous attendance until 12:00 pm on Friday, May 22, 2009.
Further details, including information on how to register and where to send letters of recommendation can be found here.
Applications will be considered beginning February 10, 2009 and continuing until April 3, 2009, but registration will likely be closed before that date, as workshop capacity is reached. Successful applicants will be notified of their acceptance as soon as a decision on their application is reached.

[Note: Apply early, if you are interested!!!]
Upon acceptance for the program, individuals must confirm by email within three days their intention to participate (otherwise their place will be given to another applicant). Accepted participants will be advised regarding airfares and will need to purchase plane tickets with a three-week advance fare.
Please direct questions concerning the workshop to ugworkshop200905@samsi.info.
Friday, February 6, 2009
4-up Chart Set 5
I figured out what was wrong that was keeping me from saving pdf files 4-up. A 4-up version of the corrected Chart Set 5 is here.
STAT 295 2/5/09
First, the assignment for next Thursday can be found here.
Interesting class today. Jeff started out by asking about the homework. We suggest learning how to use the array capabilities of R (arrays of numbers, vectors, matrices). You can multiply an array by a scalar and get an array with the same number of entries. You can add, subtract, or multiply two arrays, or add a constant to an array; the operations take place pointwise. You can use sum() to add up the elements of an array. And so forth. Play around with R in calculator mode, where you enter a formula and look at what's been computed. Facility with these operations can really speed up a calculation, as well as making for easier-to-read code.
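For example (a quick illustration of the operations just mentioned):

```r
# Pointwise array arithmetic in R
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

2 * x    # scalar times array:      2  4  6  8
x + y    # pointwise addition:     11 22 33 44
x * y    # pointwise product:      10 40 90 160
x + 1    # add a constant:          2  3  4  5
sum(x)   # add up the elements:    10
```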
In #4, the likelihoods are proportional to each other. To show this, you should complete the square in the summation form of the equation, and note that when you do this the sum can be turned into the sample mean, and the rest of the items in the exp() function are constant with respect to μ.
The ESP problem seemed sensitive to the prior...not enough data to overwhelm it.
We suggested that anyone who plans to write scientific papers in their professional life would do well to learn TeX/LaTeX. LaTeX is most useful.
Jeff gave some background on the heart transplant mortality problem. Most of it was "chalk talk" on the blackboard. Typically the study starts by modeling the risk to individual patients using logistic regression. This powerful method allows us to estimate the probability that a patient will die as a function of covariates such as age, disease status, presence or absence of diabetes, and so forth. We would use logistic regression on a large number of patients to get the coefficients that are appropriate for each covariate. Then, when a new patient comes along, we can plug the covariates appropriate for that patient into our logistic model, and predict the probability of death from that.
Then, armed with these probabilities for each patient, we can predict the number of patients that are expected to die in a given hospital by just adding up those probabilities over all patients in the study. This is what Albert calls the exposure. Jeff used 'd' for that quantity.
Both Turner and Jeff are using this basic idea in their research...Turner in trauma cases, Jeff in neonatal cases.
In the example, a particular hospital had d=0.066 and had one death.
We model the observed number of deaths with a Poisson distribution:
Pr(y|μ) = μ^y e^(−μ)/y!
where μ=d⋅λ.
Note that the expectation E[Y|λ]=d⋅λ.
With a gamma prior g(λ|α,β) ∝ λ^(α−1) e^(−βλ) we get a gamma posterior, since the gamma distribution is the conjugate prior for the Poisson distribution. This is a gamma-Poisson model.
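The conjugate update is simple enough to sketch directly (the values of α and β below are hypothetical, not the ones derived in class): the posterior is Gamma(α + y, β + d), which we can confirm numerically.

```r
# Gamma-Poisson conjugacy check (sketch; alpha and beta are hypothetical).
alpha <- 16; beta <- 15   # hypothetical prior parameters
d <- 0.066                # exposure from the example
y <- 1                    # observed deaths

# Conjugate update: posterior is Gamma(alpha + y, beta + d)
alpha_post <- alpha + y
beta_post  <- beta + d

# Numerical confirmation: prior times likelihood is proportional to the
# claimed posterior, i.e. the ratio is constant in lambda
lam   <- seq(0.2, 3, length.out = 20)
ratio <- dgamma(lam, alpha, rate = beta) * dpois(y, d * lam) /
         dgamma(lam, alpha_post, rate = beta_post)
max(ratio) / min(ratio)   # equals 1 up to rounding
```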
To choose α and β, we imagine we have data on 10 other hospitals. We had a discussion about why we cannot use the hospital that we're studying to decide on the prior. Basically, to do that would be to use the data from that hospital twice, and that's not allowed. Strictly speaking, you cannot do this if you respect the fact that if data are dependent, then you cannot simply multiply probabilities, you have to use the conditional probability formula correctly. If you do this, you'll find that the rules of the probability calculus will automatically prevent you from using the data twice. Problem #4 in the new problem set addresses this issue.
Anyway, if z_j is the observed number of deaths at hospital j, and o_j the expected number, then we can write a Poisson likelihood term for each hospital, and assuming the hospitals are independent, the product of these is the likelihood for our estimation. Then a standard prior on λ is 1/λ. [Brief interlude: This is a commonly used prior for scale variables, e.g., lengths, rates, and so forth. In such problems it should not matter if we use a ruler calibrated in inches, for example, instead of centimeters...if we do it consistently we should always get the same result. Mathematically, it is seen as a group-theoretic feature under the group of multiplications by positive real numbers: if c is a constant and x = cy, then dx/x = dy/y regardless of the value of c.]
The prior is improper; that is, its integral from 0 to ∞ diverges (at both ends). So it's not a "real" prior, since it can't be normalized. However, if we instead integrate from 1/n to n, the prior can be normalized. Then the question is: if you use such a prior and multiply by the likelihood, does the posterior end up being normalizable as n→∞? If so, then it is legitimate to use the improper prior; you won't get into trouble.
The posterior using the data from these 10 hospitals is on the chart set. This is what we'll use for the prior with the hospital we are studying.
Digression: Neither Jeff nor I would use this method as described. Rather, we would use a hierarchical model where all hospitals, including the one under investigation, are analyzed together. We will return to this subject later.
Jeff finished the discussion by running the R code in the notes. Running it on the hospital with 1 death and exposure d=0.066, we saw that the 95% credible interval for lambda contained 1 near the middle, so there was no evidence that this hospital had an excess death rate. This is so even though 1 is quite a bit larger than 0.066 (about a factor of 15). However, when we plugged 10 deaths into the calculation, the 95% credible interval did not include 1 so we would say that such a hospital has an elevated risk.
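That comparison can be sketched with qgamma (again with hypothetical prior parameters α and β, chosen so the numbers reproduce the qualitative finding, not Jeff's actual values):

```r
# Does the 95% credible interval for lambda contain 1?
# Sketch; alpha and beta are hypothetical prior parameters.
alpha <- 16; beta <- 15
d <- 0.066   # exposure from the example

ci <- function(y) qgamma(c(0.025, 0.975), shape = alpha + y, rate = beta + d)

ci(1)    # contains 1: no evidence of an excess death rate
ci(10)   # lies entirely above 1: evidence of elevated risk
```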
Wednesday, February 4, 2009
STAT 295 2/3/09
I have posted a revised Chart Set #5. Jeff noted some errors and they have been corrected. Unfortunately, since I moved to the latest version of MacOS, I am no longer able to produce 4-up pdf files, so this one (and some of the later ones) will be full size. I apologize for this and will consult with Small Dog to see if this can be fixed.
Jeff asked why the two forms of the likelihood for problem #4 (due 2/5) are equivalent. You should address this question in your turned-in assignment. Note that programming without loops is faster, so you should attempt doing that calculation without loops.
On Chart Set #4, Chart 24, we had been discussing robustness. We noted that the mean and standard deviations of the posterior distribution do depend (although fairly insensitively) on the prior. In particular, the beta prior had a smaller standard deviation and the mode was moved to the left, towards the peak of the beta prior, relative to where the mode was for the flat prior. This led to the notion of stable estimation, such that when we have a lot of data or very precise data, and the prior doesn't change much over the region where the likelihood peaks, then the results won't be sensitive to the prior.
We considered continuous examples, of which the beta-binomial model is an example. Jeff justified the density approximation P(a<Y<b | c−ε<X<c+ε) ≈ ∫ from a to b of f(c,y)/g(c) dy. Thus, we can use ratios of joint to marginal densities in the continuous case, just as we can use ratios of joint to marginal probabilities in the discrete case.
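Here is a small numerical check of that approximation (my own sketch, with a bivariate normal standing in for f(x,y) rather than the beta-binomial example):

```r
# Check: P(a < Y < b | c-eps < X < c+eps) is close to the integral of
# f(c, y)/g(c) over (a, b).  Bivariate normal with correlation 0.5.
set.seed(3)
n <- 1e6
x <- rnorm(n)
y <- 0.5 * x + sqrt(0.75) * rnorm(n)   # so Y | X = c ~ N(0.5c, 0.75)

c0 <- 1; eps <- 0.05; a <- 0; b <- 1

# Left side: condition on X falling in a small window around c
sel <- abs(x - c0) < eps
lhs <- mean(y[sel] > a & y[sel] < b)

# Right side: the integral of f(c, y)/g(c) dy, which for this joint
# density is just P(a < Y < b | X = c)
rhs <- diff(pnorm(c(a, b), mean = 0.5 * c0, sd = sqrt(0.75)))

c(lhs, rhs)   # the two agree closely
```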
We saw how increasing the amount of data tightens the posterior around the true value.
We then skipped to Chart Set #5. We will return to Chart Set #4 later.
We discussed the Poisson distribution and motivated it. Many things are well modeled as Poisson events, in addition to the ones on the chart set, requests to google.com, stars per square degree, etc.
We went on to the heart transplant mortality problem from Albert's book. The exposure for each patient is the probability that the particular patient will die in a given time frame after the operation. It depends on things like the patient's age, conditions like diabetes, and so on, and is presumed known from other studies. Then the exposures for each patient are added up over all the patients to estimate the overall exposure (risk) for the particular hospital. It is the expected number of patients that will die. The notes use the letter 'e', but Jeff remarked that it's easy to confuse that with the base of natural logarithms, so he changed it to 'd' in his chalkboard discussion. Then if Y is the random variable representing the observed number of deaths, the likelihood has Y~pois(λd) and λ is the parameter we wish to estimate for the hospital.
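The exposure calculation itself is just a sum (the per-patient probabilities below are made up, chosen so they happen to total the example's d = 0.066):

```r
# Exposure as a sum of per-patient death probabilities
# (sketch; the individual risks are hypothetical)
p_death <- c(0.010, 0.005, 0.020, 0.008, 0.012, 0.011)
d <- sum(p_death)   # expected number of deaths at this hospital
d                   # 0.066
```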
A gamma prior is chosen. It is flexible, with two adjustable parameters and pedagogically easy because it is the unique conjugate prior for a Poisson likelihood. However, it is a bad idea to use a prior simply because it makes calculations easy, since modern sampling techniques allow us to use any prior we wish, and if we know better, we should use it.
I've corrected Slide #8 on which z and o were transposed. See the chart set published above.
The heart transplant mortality problem will be continued next time.
Monday, February 2, 2009