First, the assignment for next Thursday can be found here.
Interesting class today. Jeff started out by asking about the homework. We suggest learning how to use the array capabilities of R (arrays of numbers, vectors, matrices). You can multiply an array by a scalar and get an array with the same number of entries. You can add, subtract, or multiply two arrays, or add a constant to an array; the operations take place pointwise. You can use sum() to add up the elements of an array. And so forth. Play around with R in calculator mode, where you enter a formula and look at what's been computed. Facility with these kinds of operations can really speed up a calculation, as well as make for easier-to-read code.
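A quick sketch of what this looks like in calculator mode (the values here are just for illustration):

```r
# Pointwise array arithmetic in R
x <- c(1, 2, 3, 4)          # a numeric vector
2 * x                       # scalar times array: 2 4 6 8
x + 10                      # constant added pointwise: 11 12 13 14
y <- c(10, 20, 30, 40)
x * y                       # two arrays multiplied pointwise: 10 40 90 160
sum(x)                      # add up the elements: 10
m <- matrix(1:6, nrow = 2)  # matrices work the same way
m * 2                       # every entry doubled
```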
In #4, the likelihoods are proportional to each other. To show this, you should complete the square in the summation form of the equation, and note that when you do this the sum can be turned into the sample mean, and the rest of the items in the exp() function are constant with respect to μ.
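A sketch of the algebra, assuming (as I believe #4 does) a normal likelihood with known variance σ². The key identity is

```latex
\sum_{i=1}^{n} (x_i - \mu)^2
  \;=\; \sum_{i=1}^{n} (x_i - \bar{x})^2 \;+\; n(\bar{x} - \mu)^2,
\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

The first term on the right does not involve μ, so inside exp() it contributes only a constant factor, and the likelihood is proportional to exp(−n(x̄−μ)²/2σ²), i.e., to a likelihood written in terms of the sample mean alone.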
The ESP problem seemed sensitive to the prior...not enough data to overwhelm it.
We suggested that anyone who plans to write scientific papers in their professional life would do well to learn TeX/LaTeX; of the two, LaTeX is the more useful.
Jeff gave some background on the heart transplant mortality problem. Most of it was "chalk talk" on the blackboard. Typically the study starts by modeling the risk to individual patients using
logistic regression. This powerful method allows us to estimate the probability that a patient will die as a function of covariates such as age, disease status, presence or absence of diabetes, and so forth. We would use logistic regression on a large number of patients to get the coefficients that are appropriate for each covariate. Then, when a new patient comes along, we can plug the covariates appropriate for that patient into our logistic model, and predict the probability of death from that.
Then, armed with these probabilities for each patient, we can predict the number of patients that are expected to die in a given hospital by just adding up those probabilities over all patients in the study. This is what Albert calls the exposure. Jeff used 'd' for that quantity.
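A sketch of the two-step procedure in R, using made-up patient data (the variable names 'age', 'diabetes', 'died' are illustrative, not from the notes):

```r
# Simulate a pool of patients (purely illustrative)
set.seed(1)
n <- 200
patients <- data.frame(age = rnorm(n, 55, 10),
                       diabetes = rbinom(n, 1, 0.3))
p_true <- plogis(-6 + 0.08 * patients$age + 0.9 * patients$diabetes)
patients$died <- rbinom(n, 1, p_true)

# Step 1: fit the logistic regression on the large patient pool
fit <- glm(died ~ age + diabetes, data = patients, family = binomial)

# Step 2: plug a new patient's covariates in to predict death probability
new_patient <- data.frame(age = 60, diabetes = 1)
p_new <- predict(fit, newdata = new_patient, type = "response")

# The exposure d for a hospital is the sum of the predicted
# probabilities over its patients (here, the whole fitted pool)
d <- sum(predict(fit, type = "response"))
```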
Both Turner and Jeff are using this basic idea in their research...Turner in trauma cases, Jeff in neonatal cases.
In the example, a particular hospital had d=0.066 and had one death.
We model the observed number of deaths with a Poisson distribution:
Pr(y|μ) = μ^y e^(−μ) / y!, where μ = d⋅λ.
Note that the expectation E[Y|λ]=d⋅λ.
With a gamma prior g(λ|α,β) ∝ λ^(α−1) e^(−βλ) we get a gamma posterior, since the gamma distribution is the conjugate prior for the Poisson distribution. This is a gamma-Poisson model.
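Concretely, with exposure d and observed count y, conjugacy means the Gamma(α, β) prior updates to a Gamma(α + y, β + d) posterior. A minimal sketch, with assumed prior values (the actual α and β come from the other-hospitals data, below):

```r
# Gamma-Poisson conjugate update (prior values are illustrative assumptions)
alpha <- 2;  beta <- 4       # assumed prior Gamma(alpha, beta) on lambda
y <- 1;      d <- 0.066      # one death, exposure from the example
alpha_post <- alpha + y      # posterior shape
beta_post  <- beta + d       # posterior rate
alpha_post / beta_post       # posterior mean of lambda
```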
To choose α and β, we imagine we have data on 10 other hospitals. We had a discussion about why we cannot use the hospital that we're studying to decide on the prior. Basically, to do that would be to use the data from that hospital twice, and that's not allowed. Strictly speaking, if the data are dependent you cannot simply multiply probabilities; you have to apply the conditional probability formula correctly. If you do, the rules of the probability calculus will automatically prevent you from using the data twice. Problem #4 in the new problem set addresses this issue.
Anyway, if z_j is the observed number of deaths at hospital j, and o_j the expected number, then we can write a Poisson likelihood term for each hospital, and assuming the hospitals are independent, the product of these is the likelihood for our estimation. Then a standard prior on λ is 1/λ. [Brief interlude: This is a commonly used prior for scale variables, e.g., lengths, rates, and so forth. In such problems it should not matter if we use a ruler calibrated in inches, for example, instead of centimeters...if we do it consistently we should always get the same result. Mathematically, this is a group-theoretic invariance under the group of multiplications by positive real numbers: if c is a constant and x=cy, then dx/x=dy/y regardless of the value of c.]
The prior is improper; that is, its integral from 0 to ∞ diverges at both ends. So it's not a "real" prior, since it can't be normalized. However, if we restrict the integral to run from 1/n to n, it can be normalized. Then the question is: if you use such a prior and multiply by the likelihood, does the posterior end up being normalizable as n→∞? If so, then it is legitimate to use the improper prior; you won't get into trouble.
The posterior using the data from these 10 hospitals is on the chart set. This is what we'll use for the prior with the hospital we are studying.
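I believe the calculation behind that chart goes like this: combining the 1/λ prior with the 10 independent Poisson terms gives

```latex
p(\lambda \mid z) \;\propto\; \frac{1}{\lambda}\,\prod_{j=1}^{10}
  \frac{(o_j \lambda)^{z_j}\, e^{-o_j \lambda}}{z_j!}
  \;\propto\; \lambda^{\left(\sum_j z_j\right) - 1}\, e^{-\lambda \sum_j o_j},
```

i.e., a Gamma distribution with shape ∑_j z_j and rate ∑_j o_j. This posterior is normalizable as long as at least one death was observed across the 10 hospitals, which is the sense in which the improper 1/λ prior causes no trouble here.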
Digression: Neither Jeff nor I would use this method as described. Rather, we would use a hierarchical model where all hospitals, including the one under investigation, are analyzed together. We will return to this subject later.
Jeff finished the discussion by running the R code in the notes. Running it on the hospital with 1 death and exposure d=0.066, we saw that the 95% credible interval for λ contained 1 near the middle, so there was no evidence that this hospital had an excess death rate. This is so even though 1 is quite a bit larger than 0.066 (larger by a factor of about 15). However, when we plugged 10 deaths into the calculation, the 95% credible interval did not include 1, so we would say that such a hospital has an elevated risk.
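The notes' actual prior parameters aren't reproduced here, but the qualitative comparison survives even under an illustrative Gamma(1, 1) prior, since the posterior is Gamma(α + y, β + d):

```r
# Comparing 1 death vs. 10 deaths at exposure d = 0.066, using an
# illustrative Gamma(1, 1) prior (not the prior from the notes)
d <- 0.066
ci <- function(y, alpha = 1, beta = 1)
  qgamma(c(0.025, 0.975), shape = alpha + y, rate = beta + d)

ci(1)    # 1 death: the 95% interval comfortably contains lambda = 1
ci(10)   # 10 deaths: the 95% interval lies entirely above 1
```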