
School

Harvard University
Department

Epidemiology

Course Code

Epidemiology EPI202-01

Professor

Murray Mittleman

Description

EPI Qual Course Notes/Questions
EPI 202
Notes
Second pass
Risk = individual; CI = average
Can estimate risk directly when follow-up starts at same time, little loss to follow-
up, and no death from competing risks.
Risk is a monotonic measure
Number at risk at baseline difficult to determine with staggered entry.
Need to assume uninformative censoring in order for incidence rate/ hazard rate to
solve problem of variable entry, loss to follow-up, and competing risks.
CI ≈ IΔt is wrong b/c even if no loss to follow-up/competing risks, the # people at
risk decreases. So it will overestimate the # of cases.
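The point that CI ≈ IΔt overestimates can be illustrated with the exponential formula for risk under a constant rate (the numbers below are hypothetical):

```python
# Sketch (hypothetical rate and follow-up): with a constant rate I and no
# censoring, exact risk over dt is CI = 1 - exp(-I*dt); the approximation
# I*dt ignores depletion of the at-risk pool and therefore overestimates.
import math

I = 0.05   # cases per person-year (assumed)
dt = 10.0  # years of follow-up (assumed)

approx = I * dt                  # naive CI ~ I*dt
exact = 1 - math.exp(-I * dt)    # exponential-formula risk

print(approx, exact)  # approx (0.5) exceeds exact (~0.393)
```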
Cross-sectional studies can’t distinguish between duration and incidence.
When there is homogeneity of a measure of effect on the ratio scale, there is
usually EMM on the difference scale, and vice versa.
AR%: fraction of the disease among the exposed attributed to the exposure.
PAR%: excess or deficit risk which would occur in the population if the exposure
were removed from population.
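The AR% and PAR% definitions above can be sketched numerically (hypothetical RR and exposure prevalence; Levin's formula assumed for PAR%):

```python
# Hypothetical sketch of AR% and PAR% under assumed values.
RR = 2.5    # rate ratio (assumed)
p_e = 0.3   # prevalence of exposure in the population (assumed)

# Fraction of disease among the exposed attributable to exposure
AR_pct = 100 * (RR - 1) / RR

# Excess risk in the whole population removable by eliminating exposure
PAR_pct = 100 * p_e * (RR - 1) / (1 + p_e * (RR - 1))

print(round(AR_pct, 1), round(PAR_pct, 1))  # 60.0, 31.0
```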
To check for confounding, treat confounder as “exposure” and: 1) check if
confounder associated with outcome among the unexposed and calculate OR, 2)
check if confounder associated with exposure in the study base (i.e. controls) and
calculate prevalence ratio.
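The two confounding checks above can be sketched with hypothetical 2x2 counts, treating the confounder C as the "exposure":

```python
# Check 1: C-outcome OR among the UNEXPOSED (assumed counts).
cases_c1, cases_c0 = 30, 20
noncases_c1, noncases_c0 = 100, 200
or_cd = (cases_c1 * noncases_c0) / (cases_c0 * noncases_c1)  # confounder-disease OR

# Check 2: prevalence ratio of C by exposure in the study base (controls).
p_c_exposed, p_c_unexposed = 0.40, 0.25  # assumed prevalences of C
prev_ratio = p_c_exposed / p_c_unexposed

# Both must depart from 1 for C to act as a confounder
print(or_cd, prev_ratio)  # 3.0, 1.6
```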
Closed cohorts only need one common time scale for which the membership
event is defined.
Risks cannot be directly measured in an open cohort.
Etiologically relevant exposure depends on time (induction period)
o The average induction time can be evaluated empirically based on where
the RR peaks, or by using an indicator variable (per 2006 exam, Q4).
Person-time grid can be used to represent intensity of exposure (measured as
average, maximum, lagged, etc.).
Controls are a sample of the person-time that gave rise to the cases (the study
base). Gives estimate of the odds of exposure in the person-time at risk.
o Controls are a direct random (or conditionally random – matched) sample
of the study base.
o Exception to at risk assumption (i.e. in study base pool): sampling by
proxy (e.g. blood type and nuns STDs).
Sampling fraction
If the controls are sampled independently of exposure, the same fraction is taken
from the exposed and unexposed person-time pools (i.e. controls represent
exposure person-time in study base).
If sampled, cases must also be sampled independently of exposure.
Density sampling: unconditional logistic regression
o Risk set sampling: conditional logistic regression
Closed cohort with complete follow-up: 2X2 tables, linear regression (continuous),
logistic regression (dichotomous), poisson – if have person-time data.
Cumulative incidence sampling: unconditional logistic regression
o If CI would have been used in cohort study, then this sampling can be
used.
Case-cohort sampling: Cox, with variance adjustment.
o With risk set matching: CLR, Cox
Case-crossover: conditional logistic regression (each indiv=strata)
Does Cox and CLR always need dichotomous outcome variables? What happens
if we have person-time data and want to model a continuous outcome?
In a cohort study, the exposed group is matched to the unexposed group on the
matching factor(s)
o Crude unbiased, but adjusted is more precise
o Effects of matching factors can be evaluated
o Effect modification by matching factors can be evaluated
In a case-control study, the case group is matched to the control (referent) group
on the matching factor(s)
o Crude biased toward null if matching factor true confounder, must adjust
o Effect of matching factor cannot be evaluated
o Effect modification by matching factor can be evaluated
o Matching can improve precision (efficiency) but it is not, in itself, a
measure to achieve (or improve) validity: stratification is!
o Analysis:
Appropriate matching: stratified analyses valid, but matched more
efficient
Unnecessary matching (uncorrelated with exposure): all valid, but
lose power with stratified analyses (with binary outcome)
Overmatching (uncorrelated with disease) – BAD: unstratified
unmatched analysis valid and most precise. Stratified matched and
unmatched also valid, but less precise.
Matching on an intermediate – really really bad: only
unstratified unmatched analysis valid
Confounding and matching should be looked upon conditionally on other
confounding factors, which are already accounted for
P-value: the probability that a result as extreme or more extreme than the one we
observed would occur due to chance variation, if the null were true.
Hypothesis tests don’t give info on magnitude/direction/range/power. Chi-square
test always two-tailed! Only gives info about consistency of data with the null;
doesn’t provide info on consistency of data with alternative states of nature
(doesn’t take the alternative into account → consider Bayesian methods or
confidence intervals!)
If the data-based CI excludes the null, then we can be sure that the results are
statistically significant. If the CI does not exclude the null, then the result is
inconclusive (maybe still significant)
o i.e., when the 100(1−α)% CI does not include E(X|H0), we may reject the
null and conclude that our findings are significant at the specified α-level
Variance of the binomial distribution reaches its maximum when the binomial
proportion = 0.5
The p-value changes not only as a function of overall sample size (N1 + N0), but
also as a function of other features of the study design, such as balance:
N1 / (N1 + N0)
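The binomial-variance point can be checked directly:

```python
# Sketch: binomial variance n*p*(1-p) is maximized at p = 0.5.
n = 100
ps = [i / 100 for i in range(1, 100)]
variances = [n * p * (1 - p) for p in ps]
p_max = ps[variances.index(max(variances))]
print(p_max, max(variances))  # 0.5, 25.0
```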
source of random variability in RCT: random treatment assignment – unmeasured
confounding; source of random variability in obs. study: a) sampling of the
conditional relation, i.e., sampling unrepresentative individuals from
superpopulation, b) randomization by nature, i.e., after control for confounding,
the exposure can be regarded as being randomly assigned by nature.
Interpreting rate difference = 1.85/1000 PY: These data indicate an excess of 1.85
CHD deaths per 1000 PY due to smoking among british male doctors.
Interpreting CI: These data are consistent with IRDs ranging from 1.2-2.5/1000
PY with 95% confidence, assuming no confounding, selection bias, and
information bias
Interpreting ratio: there is a 2.5-fold increase in CHD risk/rate/odds from high
CAT levels in these data.
The independent relationship between the confounder and the disease is an
intrinsic feature of disease biology that cannot be altered by study design
The relationship between the confounder and the exposure of interest is a feature
of the particular population base chosen for the study and of the specific
individuals who are sampled. This association can be altered at the study design
stage by
o Matching
o Restriction
o Selection of a population base that is not characterized by this confounder-
exposure association
o Randomization
The maximum extent to which crude relative risk can be attributed to the effects
of a confounding variable is limited by the minimum of:
o Relative risk of the outcome (and exposure) from that confounding
variable
o Prevalence ratio of prevalence of confounder in the exposed group /
prevalence of confounder in unexposed group
o Prevalence of confounder: 50% maximum bias
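One way to see the bound above is the standard confounding-ratio formula for a binary confounder (a sketch with assumed values; the formula and numbers are illustrative, not from the notes):

```python
# Sketch: for a binary confounder with confounder-disease RR = RR_cd and
# prevalence p1 among exposed, p0 among unexposed, the confounding ratio
# crude_RR / adjusted_RR = (p1*(RR_cd - 1) + 1) / (p0*(RR_cd - 1) + 1),
# which cannot exceed min(RR_cd, p1/p0).
RR_cd = 3.0          # confounder-disease relative risk (assumed)
p1, p0 = 0.6, 0.2    # confounder prevalence in exposed / unexposed (assumed)

conf_ratio = (p1 * (RR_cd - 1) + 1) / (p0 * (RR_cd - 1) + 1)
bound = min(RR_cd, p1 / p0)
print(conf_ratio, bound)  # confounding ratio stays below the bound
```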
All estimates, regardless of the weights, are unbiased estimates, under the
assumption, which is critical for this analysis, of no effect modification.
Don’t use Miettinen’s test-based CI for MH summaries (gives CIs that are too
narrow b/c the variance under the null is smaller than the variance under the
observed data)
o Becomes increasingly biased as the point estimate departs from the null
Summary ratio measures:
o Equal weights: ignores that some strata are more informative than others
o Proportionate to sample size: doesn’t take balance of the data into account
o Inverse variance: when data sparse, weight goes to 0 (b/c have 1/0 in the
denominator) and lose all information in the stratum; most efficient with
large samples
o MH weights: appropriate for sparse and large data
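A minimal sketch of the MH summary rate ratio for stratified person-time data (hypothetical counts):

```python
# Mantel-Haenszel rate ratio across strata. Per stratum:
# a = exposed cases, PT1 = exposed person-time,
# b = unexposed cases, PT0 = unexposed person-time.
strata = [
    # (a, PT1, b, PT0)
    (10, 1000.0, 5, 1000.0),   # stratum 1: IRR = 2
    (40, 2000.0, 10, 1000.0),  # stratum 2: IRR = 2
]

num = sum(a * pt0 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
den = sum(b * pt1 / (pt1 + pt0) for a, pt1, b, pt0 in strata)
irr_mh = num / den
print(irr_mh)  # ~2.0 when every stratum's IRR is 2
```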
Test of heterogeneity:
o H0: OR is the same across all l levels of the stratification variable(s)
o HA: OR is not the same across all l levels of the stratification variable(s)
(i.e., ORi ≠ ORj for some i, j)
o Failure to reject the null may imply insufficient power and NOT
homogeneity
o When rejected, best to report separate results or a summary from weights
that do not reflect arbitrary features of the study design (i.e.,
standardization with appropriate population-based weights)
Effect modification is not the only reason for stratum-to-stratum variation in
effect estimates. Selection bias, information bias, confounding, and chance may
also produce variations. In case of chance, test of heterogeneity help assess
whether random sampling variation can explain difference
OR_MH disguises effect modification b/c this average is not taken over any
recognizable age distribution representing a population to which we may wish to
generalize the results, but over the distribution of within-stratum variances
(reflects arbitrary features of the study design), which were observed in this
particular study.
Effect modifier may be a risk factor of disease only in the presence of the
exposure (not necessarily independent of exposure like a confounder)
Confounding exists independent of scale
Matching is a form of stratification (i.e., in a matched study, a matched set is
exactly equivalent to a stratum in an unmatched stratified analysis)
Want to use matched analysis when the “level” of potential confounders is unique
to each case in the study, or nearly so; one must match in order to obtain
appropriate controls.
Want to use stratified analysis when each matched set represents a level of potential
confounders that is observed repeatedly during the study
Case control: match for efficiency NOT validity
Analyze matched case-control with matched-pair table!!
Crude analysis of matched case-control data generally biases the estimate of the
OR towards the null
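The matched-pair analysis above uses only the discordant pairs (hypothetical pair counts below):

```python
# 1:1 matched case-control study, pairs cross-classified as:
#                      control exposed   control unexposed
# case exposed                e                  f
# case unexposed              g                  h
e, f, g, h = 25, 30, 10, 35  # assumed pair counts

# Conditional (matched-pair) OR: ratio of the two discordant cells only
or_matched = f / g
print(or_matched)  # 3.0
```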
Standardization weights chosen based on distribution of effect modifier in
population of interest to you!
Standardization is computationally identical to stratified analysis methods, just
with different weights derived from the population.
Direct: proportion of cases expected among the unexposed, using stratum-specific
rates observed among the exposed (comparison: “if your unexposed were
exposed” / unexposed)
Indirect: proportion of cases expected among the exposed, using stratum-specific
rates observed among the unexposed (comparison: exposed / “if your exposed
were unexposed”) → simplifies to O / E!!
SMR advantages: simple to compute – only need total # cases (don’t need to
know who became ill), efficient statistically (uses stratum-specific rates in
unexposed – bigger group more stable estimate), counterfactual
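The SMR = O/E computation can be sketched with hypothetical stratum data:

```python
# Indirect standardization: apply the unexposed stratum-specific rates to
# the exposed person-time to get expected cases, then SMR = observed/expected.
strata = [
    # (exposed_cases, exposed_person_time, unexposed_rate_per_PY) - assumed
    (12, 1000.0, 0.004),
    (30, 1500.0, 0.010),
]

observed = sum(c for c, pt, r0 in strata)
expected = sum(pt * r0 for c, pt, r0 in strata)
smr = observed / expected
print(observed, expected, smr)  # 42 observed vs 19 expected
```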
Traditional approach:
o The directly standardized rate among the exposed is the crude rate
expected in the unexposed if the rates were as observed in the exposed
o The directly standardized rate among the unexposed is the crude rate in the
unexposed
If no EMM, SMR = SRR = MH
Sampling in case-control studies estimates the relative pool of T1 to T2 (exposed
and unexposed)
Assignments
Exams
EPI 203
Notes
Closed cohort: assumes that exposure is unchanging during follow-up, unless you
follow up at multiple time-points (is the Nurses’ Health Study open or closed?)
For case-cohort (case-base) sampled from open cohort (person-time), can use Cox
with variance adjustment. Standard hazard ratio estimate, account for the fact that the
risk set samples are not independent of one another in variance.
In case-crossover, can stratify on time of day, day of the week to control for potential
cyclic confounders.
In matched nested case-control study, use CLR (stratified by time)
In open cohort, use Cox (but still, only risk sets with cases will be informative).
In RCTs, the intervention is the ACT of telling a participant to take the treatment.
Noncompliance = instructions not relayed properly.
Exclude participants not at risk of outcome (not in study base).
Power may only be useful for future studies (not the current one).
Censoring is a form of missing data problem. Ideally, both the birth and death dates
of a subject are known, in which case the lifetime is known.
o Right censoring: if it is known only that the date of death is after some date,
this is called right censoring. Right censoring will occur for those subjects
whose birth date is known but who are still alive when they are lost to follow-
up or when the study ends.
o Left censoring: if a subject's lifetime is known to be less than a certain
duration, the lifetime is said to be left-censored (i.e. died before end of follow-
up = case). For a left censored datum, we know the subject exists.
“Community effects” may affect effect estimates (e.g. fumes from a factory get out to
surrounding areas).
When want everybody that can take a drug to take it (last treatment option), then want
FP of hypersensitivity to drug to be as low as possible (i.e. 100% specificity). E.g. if
drug is lifesaving, but hypersensitivity is not fatal.
Rate is poisson distributed, with increasing rate, increasing variance (cuz variance =
mean).
High-risk populations → greater # cases → greater power
Risk set analysis: get covariate distribution from person-time at which events
occurred. Not representative of entire study period.
Beware of collinear covariates
Controls only have to represent the exposure distribution in the study base from
which the cases arose (e.g. blood type → STDs: can use nuns as controls for
young men, even though not at risk of the outcome).
When study base is difficult to define (e.g. case-crossover study in which cases define
secondary base), can say study base is people that would’ve reported events (can’t
verify) OR all of potential study people (with incomplete ascertainment of cases) (all
may be better).
When there are no events arising from the person-time in a stratum, the stratum does
not contribute to the rate ratio estimate by any technique that incorporates the
stratification, such as MH estimate or a stratified MLE of the hazard ratio (Cox).
Assignments
Collecting time-varying covariates as non-time-varying will facilitate more practical
data collection (collection at only one point).
In some situations, even if: 1) controls may not represent the study base with respect
to age; 2) age is a confounding factor in the study base, controlling for age in the
case-control analysis will yield a valid age-adjusted estimate of the effect of smoking
on mortality IF the controls accurately reflects the exposure distribution of the study
base within age strata (i.e. distribution of the smokers and non-smokers must be the
same in each age stratum of the controls as in that age stratum of the source
population).
o If a factor is a confounder, then controls must represent the exposure
distribution of the study base from which the cases arose within each strata
of the confounder (after stratification).
o Confounding is also a source of selection bias? When don’t stratify by
age, will have selection bias.
At any PS, the expected proportion of subjects who are treated will be the same, no
matter what the covariate level that gave rise to the score. Thus, treatment and
covariate level are uncorrelated at any specific PS; therefore, the expected proportion
of treated at each covariate level is the same as the expected proportion of untreated.
SO, among persons with a given PS, the distribution of the covariates X is on average
the same among the treated and untreated.
Covariates → PS → Treatment → Outcome
Takehome: whether they got treatment is independent of covariates at specific PS.
o PS and treatment modeled separately:
PS = linear combination of covariates
Pr(treatment) = PS + linear combination of other variables
Conditionally on PS, treatment and covariate levels are uncorrelated, so covariate
levels aren’t confounders.
At each PS, there is a balance of covariates between treated and untreated.
SMR has smaller variance than PS-matched data (cuz you use everyone). Also not
affected by further uncontrolled confounding attributable to imperfect matching.
*Jenn’s notes*
Cross-sectional study: a type of point-in-time survey where all subjects are survivors
with characteristics that are all functions of past history/events; date of survey is a
key operational element; disease present reflects cumulative incidence over the past;
can compute prevalence, odds, or any functions of these
Closed cohort with fixed follow-up: past time is reflected in the cohort baseline
characteristics,
and effects of time play out during observation
Open cohorts: person-time categories summarize relevant history; person-time itself is
stage on
which effects play out; person-time from different people considered to be
interchangeable, so
possible to compare different people or different periods of life in the same people
Stratification by time: interested in risk for an event at a particular moment;
continuously measured
person-time unnecessary; only days with events enter the analysis; formation of
risk sets;
random sampling of risk sets will give same results as a full cohort survival
analysis; conditional logistic regression matched on date; no rare disease
assumption needed because of the fine stratification of time, and the summary
odds ratio across strata is a direct estimate of the rate ratio
Stratification by person: used in case-crossover design; only people with events enter
analysis;
conditional logistic regression matched on person
Controls: controls in a case-control study are a group of persons or person-days whose
exposure status
collectively provides information about the distribution of exposure in the persons
or person-
time giving rise to the cases
Effect modification: a change in the biological effect of an exposure by the direct or
indirect physical
consequences of another factor
Source population: the individuals from whom the study population is selected
Standard: the set of weights used for standardization; weights sum to 1
Study base: the person-time in the study population that gives rise to the cases
Notes
1. In a cross-sectional study, subject selection does not need to be entirely at random.
Just ensure that there is no differential selection due to both exposure and disease.
Sometimes population representativeness will suffer in favor of efficiency and
comparability.
2. When a study is looking at many outcomes, they will typically be related to one
another, so a power calculation for one expected effect can reasonably be applied
to the other outcomes too. These outcomes are probably manifestations of the same
underlying process.
3. For clinical trials, you want to treat people in whom the average treatment effect
would be beneficial.
4. Power cannot be used to interpret results. We use estimates and confidence
intervals to interpret results. Power is computed for some study in the future.
5. Endogenous variables change over time as they interact with each other.
Exogenous variables arise from outside the study (e.g., random number
generators).
6. Noncompliance could bias effect estimates either way.
7. In a randomized study, residual confounding is more likely when the sample size
is small.
8. The power of a study is the probability of rejecting the null hypothesis. When you
increase sample size, the distributions for the confidence limits become narrower
and closer to one another. However, we should not only consider the null
hypothesis because we could have an upper level of concern (i.e. some other
relevant alternative).
9. If interested in group-level effects, compliance at the individual level may not be
of concern. However, there could be community-level effects interacting with
individual effects, which makes correction for compliance much more difficult,
and possibly impossible.
10. In a prospective cohort study, where cohort sizes can become seriously
imbalanced in later years, power calculations for the original research questions
cannot be applied because the research question has changed over time.
11. We spend so much time studying rates, but with rapidly time-varying
exposures, % measures are more clinically useful.
12. Person-time gives a good overall exposure distribution. Risk sets depend on event
times and exposure distribution at the time of assessment.
13. Person-time analysis uses volumes of experience (i.e., the time contributed by
each person). By lumping together person-time into some category, you assume
that within that category, the risks are homogenous. Person-time analysis adjusts
for time of follow-up per person.
14. With random risk set sampling, you can extrapolate exposure-specific follow-up
times based on the sampling fraction.
15. In a matched case-control study, an unmatched analysis will be biased (to the null)
when the matching factor has a strong association to exposure. Strata would have
only small differences in exposure that get obscured and you lose contrast.
16. Secondary study bases are created by case eligibility and exclusion criteria.
17. We technically stratify on time with risk set sampling, but time is not a strata
itself.
18. Controls can clearly not be representative of the source population but still be
useful.
19. Effects may be amplified in univariate analyses.
20. Matching redefines risk sets. Stratification of the cohort by matched factors
essentially samples from a smaller risk set.
21. The case-crossover is a finely-stratified cohort study. The case-crossover design is
applicable when the postulated risk factor is common, to allow for ease of
exposure recall, and transient, to allow for the ability to create a demarcation of
exposure and so that the period of observation is related to the time you can
reasonably gather information. The time-varying nature of exposure is related to
how long we can collect information. We can look at exposure status in relation to
risk in many time intervals.
22. Effect measures can be modified by differential loss of information in one group.
23. The PAR% formula can only be used in case-crossover studies if the exposed and
unexposed periods correspond to the appropriate demarcations of time (i.e., the
windows are correct).
24. Recall bias: In a case-control study, we are concerned with different people
having different patterns of recall, but in a case-crossover study, we are concerned
with one person having different patterns of recall (i.e., sleep deprivation &
sharps injury).
25. Matching in a cohort study ensures that the prevalence of each matched factor is
the same in compared cohorts, that for any combination of covariates there will be
at least one control with which to compare a case, and takes care of any
interactions between the matched factors.
26. If matching is impossible using propensity scores, maybe an observational study
is not appropriate because the groups aren’t comparable.
27. Propensity scores are often used when lots of information is available for
starting the analysis. You should get the same results using propensity scores and
ordinary covariates in a logistic regression. Propensity scores have the advantage
of being a single variable, while there is a 10:1 event-covariate guideline for
regression models.
28. Propensity scores should be estimated separately for cohort-accrual blocks when
attempting to control for time-varying covariates. This results in a prospective
study design that captures a sociologic phenomenon that might not apply to other
places and times.
29. A factor of low prevalence affects only a small % of the population, but it can still
have a high OR if strongly associated.
30. When thinking about possible confounding variables, think about the possibility
of intermediate variables and possible mechanisms of action.
31. Dichotomization of variables is an attempt to limit the time-varying
characteristics.
32. Selection bias can be introduced when the sampling fractions for individuals in
the base depend on an exposure variable in an unknown way. An analysis of the
effects of other variables will be unbiased when the source of the dependence can
be identified and handled in the analysis as if it were a confounder. Especially
problematic are common effects in a case-control study.
33. If a factor among controls is not representative of the study base, a factor-adjusted
effect estimate will still be valid if, within each stratum of that factor, the
distribution of exposure among controls resembles the distribution in the source
population. Thus, controls are representative of the exposure distribution in the
study base within each factor-strata, IF we account for the stratification in our
analyses.
34. Among persons with a given propensity score, the distribution of the covariates X
is on average the same among the treated and untreated because propensity score
is independent of the actual treatment. The propensity score is a summary score
that gives the probability of treatment, and does not take into account actual
treatment. A propensity score is created from the joint distribution of X’s.
35. The standardized rate in the denominator of the SMR represents the constructed
crude rate for a hypothetical population that has a population distribution over
covariate levels that is the same as the distribution of covariates among the
exposed subjects only (i.e., the weights).
36. SMR-weighted estimators could have smaller confidence intervals than estimates
arising from propensity matching because SMR-weighted estimators use all
information (i.e., bigger power because of larger sample size with no loss of
information).
37. Matching on risk set is necessary to adjust for time-varying covariates that cannot
be controlled through randomization.
EPI 204
*read papers in notes*
*summarize models*
*watch the modeling lectures*
Notes
Second pass
When using Cox, PH necessary because assuming no effect modification of main
effect (unless specifically add stratum-specific coefficients or ixn term)
sampling fraction means fraction sampled from exposed and nonexposed person-time.
Do only discordant pairs contribute to conditional logistic regression? A: YES, the
cross-product must be ≠ 0.
Induction period: time required for the effects of a specific exposure to become
manifest
o Latency period: portion of the induction period during which disease is
present but unmanifest
Cohort effect: changes in disease frequency that are shared by all members of a group
who entered follow-up at common time
Inception cohort: the persons who are under observation at the beginning of an
exposure that defines cohort membership
Survivor cohort: the persons who remain under observation at some point after the
beginning of an exposure that defines cohort membership
Age effect: a change in disease incidence that is due to a biological concomitant of
aging
Period effect: changes in disease frequency that are specific to a calendar time
o Commonly result of secular changes in definition of disease or diagnostic
practices, or secular changes in exposure prevalence (with short induction
periods)
Immortal person-time: person time prior to becoming eligible for a study (not part of
the denominator of a rate calculation when observable person-time lies entirely
outside the bounds of cohort membership). i.e. participation = had not died earlier
Models
o Poisson: analysis of rates from cohort studies → baseline risks, excess relative
or absolute risks
Parameter estimates for poisson regression models computed using
MLE using log-likelihood that would arise if the event counts in the
table were independent Poisson random variables.
Models in which rates depend on parameters through a linear function
(GLMs: linear regression) can be computed through least squares.
Cupples paper: pooled logistic regression
o Assumptions:
The underlying risk of outcome in each interval is the same
The relationship between risk factors and outcome is the same for
every interval
Only current risk profile is needed to predict outcome
o Generalized person-years approach. Treats each observation interval (of equal
length) as a mini-follow-up study in which the current risk factor
measurements are employed to predict an event of interest in the interval
among persons free of the event at the beginning of the interval. Observations
over multiple intervals are pooled into a single sample to predict the short-
term risk of an event.
o MODELS
MH:
Parameters: I = incidence rate with given covariates at time t;
I0 = baseline incidence rate at time t, when covariates = 0; β1 =
log incidence rate ratio for a one-unit change in X
Assumptions: fewest assumptions; proportional over all the
age strata (i.e., no ixn of main effect w/ any of stratifying
variables) so can get weighted average
Implicit model: if stratify on t, Z, and L
o Crude: I = I0 e^(β1X)
o Stratified 2x2 tables: I(t | X, Z, L) = I0,ZL(t) e^(β1X)
o ≡ I(t | X, Z, L) = I0(t) e^(β1X + β2Z + β3L + ixn terms between everything
but with main effect…)
Cox PH (Andersen-Gill):
Assumptions: PH assumption I(t | X=1) / I(t | X=0) = e^(β1);
stratified Cox allows different baseline hazards over time; at
each time all betas same and proportional; other covariates
don’t ixt (those without specific ixn terms / stratified betas)
Crude: I = I0 e^(β1X)
Time-adjusted: I(t | X, Z) = I0(t) e^(β1X + β2Z) (will be the same as
crude if t and Z are not confounders)
Stratified I: I(t | X, Z, S) = I0S(t) e^(β1X + β2Z) (will be the same as
the time-adjusted if no interaction between S and t, and S is not a
confounder)
Stratified II: I(t | X, Z, S) = I0S(t) e^(β1S·X + β2Z) (will be the same as
stratified I if there is no EMM by S*X)
Poisson (Andersen-Gill):
Assumptions: within each category (set of covariates), the rate
belongs to everyone in that set
Standard: I(t | X, Z) = I0(t) e^(β1X + β2Z)
Same as Cox, BUT NO stratification, just one giant pool of
data!
Baseline must be explicitly stated → can calculate absolute
measures, but also more likely to be misspecified.
Counts or person-time
Pooled logistic (stratified = conditional) (Andersen-Gill):
Assumptions: parametrization fulfills beta-string; log odds
related to exp / (1+exp); the underlying risk of outcome in each
interval is the same; the relationship between risk factors and
outcome is the same for every interval; only current risk profile
is needed to predict outcome
NO stratification! Treat each interval as a cohort study and pool
intervals when analyzing
Standard model: I(t | X, Z) = e^(β0 + β1X + β2Z) / (1 + e^(β0 + β1X + β2Z))
Baseline must be specified; modeling logit[I(t)], which is linear
in the covariates of the model. On the log of the odds of the
incidence rate scale
Incidence rate can be directly estimated; inherently
multiplicative
When disease rare, logit[I(t)] ≈ log[I(t)]
Conditional logistic:
Calculating within each set; betas same across all sets;
stratifying on time (if matched on age/calendar year)
CLR likelihood algebraically identical to Cox PH partial
likelihood → the odds ratio estimated will approach the IRR estimated
from Cox as the matching ratio goes to infinity (becomes a
cohort!)
Standard model: I(t) = I0i(t) e^(β1X), where i = qcpair
(encompasses all matching factors)
Parameters: I0i(t) = baseline incidence rate for subjects when
covariates = 0, at the value of the matching factors
corresponding to the matching factors of the case in qcpair
All possible cross-classifications of matching factors are
controlled for (fully saturated w.r.t. matching factors)
Unconditional logistic regression would not have the qcpair
variable (i) → unstratified model
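A minimal sketch of the pooled logistic setup (assumed coefficients; each equal-length interval treated as a mini follow-up study, so interval risk is expit of a linear predictor and intervals are pooled rather than stratified):

```python
# Hypothetical pooled logistic sketch: interval risk = expit(b0 + b1*X + b2*Z).
import math

def expit(v):
    return math.exp(v) / (1 + math.exp(v))

b0, b1, b2 = -4.0, 0.7, 0.3  # assumed log-odds coefficients

def interval_risk(x, z):
    # short-term risk of an event in one interval, given current risk profile
    return expit(b0 + b1 * x + b2 * z)

def cumulative_risk(x, z, k):
    # risk over k pooled intervals for a fixed covariate profile
    q = interval_risk(x, z)
    return 1 - (1 - q) ** k

print(interval_risk(1, 0), cumulative_risk(1, 0, 10))
```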
Assignments
Exams
EPI 289
*read Hernan and Robins 2006 for more IPW info
*read Sato paper
2008 Notes
Second pass
For this class, we assume: 1) causal effect in the entire population, 2) dichotomous
variables, 3) deterministic counterfactuals, 4) no interference
Pseudo-population created by IPW is unconditionally exchangeable (same individuals
used to make exposed and unexposed).
Standardization = IPW (algebraically equivalent). Weighting is equivalent to
simulating what would happen in the study population if everybody had received
treatment a.
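The algebraic equivalence above can be verified on a toy stratified cohort (all counts and risks invented for illustration): standardizing the stratum-specific risks to the L distribution gives the same number as averaging IP-weighted cases over the pseudo-population:

```python
# Hypothetical stratified cohort: counts n[L][A] and risks Pr[Y=1 | A, L].
n = {0: {0: 80, 1: 20}, 1: {0: 30, 1: 70}}
r = {0: {0: 0.10, 1: 0.20}, 1: {0: 0.30, 1: 0.50}}

N = sum(n[l][a] for l in n for a in (0, 1))

# Standardization: weight stratum-specific risks under A=1 by Pr[L=l].
std = sum(r[l][1] * (n[l][0] + n[l][1]) / N for l in n)

# IPW: weight the exposed cases in each stratum by 1 / Pr[A=1 | L=l],
# then average over the whole pseudo-population of size N.
ipw = sum(n[l][1] * r[l][1] / (n[l][1] / (n[l][0] + n[l][1])) for l in n) / N

print(round(std, 4), round(ipw, 4))  # both 0.35
```

Both routes give 0.35 here: standardization averages conditional risks over L, and IPW reweights the exposed so that their L distribution matches the whole population, which is the same computation rearranged.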
Null paradox: plug in standard models to estimate each component of the g-formula,
but doesn’t work because even if causal null true, the effect estimate will not be null
(no parameter for null hypothesis). Model misspecified before collection of data!
Can use time-varying weights with IPW. Inverse probability of having your observed
treatment history through t given your L history through t.
Problem with RCT: loss to follow-up, noncompliance, unblinding, other
In an observational study, average causal effects can be calculated under the
assumptions of consistency, positivity, and exchangeability conditional on the
covariates (by IPW, standardization).
Conditions for causal inference from ideal randomized experiments:
o Consistency
If you are treated, your counterfactual outcome under treatment is your
observed outcome
If you are not treated, your counterfactual outcome under no treatment
is your observed outcome
o Positivity
Some subjects receive treatment and some subjects receive no
treatment
o Exchangeability
If the treated had been untreated, they would have been expected to
have the same average outcome as the untreated (AND the other way
around for full exchangeability) → MCAR
In each level of L if conditional randomization → MAR
Commonly used methods to control for confounding (e.g. stratification, matching)
only estimate effects in a subset of the population.
The equivalent of pooling when using a regression model is not to include product
(interaction) terms between A and L.
Problems with stratification/causal interpretation of conditional association measure
o 1) heterogeneity (minor)
o 2) collapsibility (minor)
o 3) Time-varying exposures (potentially major)
If there is no heterogeneity then each conditional risk ratio is equal to the
IPW/standardized risk ratio. If there is heterogeneity, then each conditional risk ratio
is different and pooling makes no sense.
Matching yields conditional causal effects (i.e. effect in the exposed if each exposed
subject is matched to one unexposed subject, then the matched sample will have the
same distribution of risk factors as in the exposed, not as in the entire population).
To compute the marginal effect in the population:
o IPW/standardization require conditional exchangeability of the exposed and
the unexposed across the entire population
o Stratification further requires no heterogeneity.
Selection bias is not an issue of generalizability but of lack of exchangeability.
Problem: detection bias under null
o A: exogenous estrogens, Y: endometrial cancer, C: vaginal bleeding, Y’: EC
diagnosis (mismeasured!)
DAG: A → C ← Y, C → Y' (C is a collider)
o Solution: screen women for endometrial cancer every 3-6 months whether
they exhibit bleeding or not. This will eliminate the arrow going from C
Y’, and we wouldn’t have to stratify on C either since it’s a collider.
D-separation: two variables are marginally or conditionally independent, depending
on whether it was necessary to control for any variables.
o A path is blocked if and only if it contains a noncollider that has been
conditioned on, or it contains a collider that has not been conditioned on and has
no descendants that have been conditioned on.
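The blocking rule above is mechanical enough to code. This is a hypothetical sketch (the edge set and helper names are invented, and it checks one explicit path rather than performing full d-separation), using the detection-bias DAG with C as a collider:

```python
# Toy DAG as a set of directed edges; "Yp" stands in for Y'.
edges = {("A", "Y"), ("Y", "C"), ("C", "Yp"), ("U", "A"), ("U", "C")}

def descendants(node):
    """All nodes reachable from `node` by directed edges."""
    out, stack = set(), [node]
    while stack:
        v = stack.pop()
        for (p, c) in edges:
            if p == v and c not in out:
                out.add(c)
                stack.append(c)
    return out

def blocked(path, conditioned):
    """Path-blocking rule: blocked iff the path has a conditioned-on
    noncollider, or a collider that is neither conditioned on nor has
    a conditioned-on descendant."""
    for i in range(1, len(path) - 1):
        prev, mid, nxt = path[i - 1], path[i], path[i + 1]
        collider = (prev, mid) in edges and (nxt, mid) in edges
        if not collider and mid in conditioned:
            return True   # conditioned noncollider blocks
        if collider and mid not in conditioned and \
           not (descendants(mid) & conditioned):
            return True   # unconditioned collider blocks
    return False

# Path A <- U -> C <- Y: C is a collider, so the path starts out blocked...
print(blocked(["A", "U", "C", "Y"], set()))   # True
# ...but conditioning on C opens it (no blocking element remains).
print(blocked(["A", "U", "C", "Y"], {"C"}))   # False
```

Conditioning on a descendant of the collider (here Y', i.e. stratifying on diagnosis) also opens the path, which is exactly why the detection-bias example is dangerous.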
DAGs cannot represent EMM (since DAGs represent population or individual causal
effects): unfaithfulness.
o Can have cancellation of arrows (e.g. common-causes bias cancels out with
common-effects bias, as with matching in a cohort study(?)).
Identifiability: the causal effect can be identified if 1) no common causes (i.e. no
backdoor path, no confounding → exchangeability holds), OR 2) common causes
BUT enough data to block the backdoor paths (no unmeasured confounders).
Confounder: 1) Collapsibility, 2) Standard (derived from comparability), 3) Causal ~
comparability.
o Problem with collapsibility defn: OR is noncollapsible (except at null), maybe
looking at common effect (even in randomized trials).
o Problem with standard definition: M-bias
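The noncollapsibility problem noted above can be shown with numbers: even with L independent of A (as in a randomized trial, so no confounding) and identical stratum-specific ORs, the crude OR differs. All risks below are invented:

```python
# Pr[Y=1 | A, L]; the 1/7 is chosen so both stratum-specific ORs equal 4.
pL = 0.5                       # Pr[L=1], independent of A
risk = {
    (1, 1): 0.8, (0, 1): 0.5,
    (1, 0): 0.4, (0, 0): 1 / 7,
}

def odds(p):
    return p / (1 - p)

or_l1 = odds(risk[1, 1]) / odds(risk[0, 1])
or_l0 = odds(risk[1, 0]) / odds(risk[0, 0])

# Crude risks marginalize over L (valid because L is independent of A).
crude = {a: pL * risk[a, 1] + (1 - pL) * risk[a, 0] for a in (0, 1)}
or_crude = odds(crude[1]) / odds(crude[0])

print(round(or_l1, 2), round(or_l0, 2), round(or_crude, 2))  # 4.0 4.0 3.17
```

The crude OR (3.17) sits below the common conditional OR (4.0) without any confounding, which is why the collapsibility definition of confounding fails for odds ratios away from the null.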
IV methods: consistently estimate causal effects in the absence of conditional
exchangeability (i.e. with unmeasured confounding between A and Y). Require FOUR
assumptions.
o (1)-(3) Definition of instrument Z (non-negotiable):
1) Z and A are associated (b/c of a causal effect, e.g. RCT, or a
common cause, e.g. lactose intolerance gene)
2) Z affects the outcome Y only through A – no direct effect
(unverifiable)
3) Z does not share common causes with Y (unverifiable)
o 4) INSTRUMENT NOT ENOUGH: In order for IV estimator equality to hold,
a fourth unverifiable condition must hold.
NO IXN
No between-subject heterogeneity: effect of A → Y same for
every individual (extreme no ixn). Isn't this normally assumed
when we calculate RRs? A: Yes, but must assume this to even
use the IV. No alternatives (e.g. no standardization can adjust).
No interaction between instrument and exposure (can assume
just no additive ixn – standard IV estimator; or no multiplicative
ixn – a different IV estimator). Effect of A → Y same in levels of
Z (or U*) for treated AND untreated.
MONOTONICITY is an alternative to no ixn assumptions. Defined
for individual (i.e. no defiers – there is no individual who would have
taken A=0 if assigned Z=1 AND taken A=1 if assigned Z=0).
Under monotonicity, the IV estimator estimates the average effect
of A → Y only in the compliers (but we don't know who these
people are! Can't distinguish them from always-takers or
never-takers. Generalizability?).
o Intention-to-treat effect in the numerator inflated by noncompliance
(measured in denominator) of standard IV estimator.
o Problems: 1) it is impossible to verify that Z is an instrument – bad instrument
may be more biased than unadjusted estimate, 2) a weak instrument Z blows
up the bias, 3) Instruments are insufficient to estimate causal effects, 4) can’t
deal with time-varying exposures. Effects of exposure estimated by IV
methods may be much larger than effects estimated by conventional
adjustment methods (b/c numerator small)
o Claim: adjusting for confounders always gets you closer to the truth (will never
overshoot the true effect). Not necessarily – not with a poorly measured confounder.
o Instrument only allows to compute bounds? Standard IV estimator provides
point estimate, not bounds.
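The standard (Wald) IV estimator referenced above divides the ITT contrast by the compliance contrast. A minimal sketch with invented arm-level quantities, assuming Z is randomized assignment and A is treatment actually taken:

```python
# Hypothetical RCT with noncompliance. Under the four IV conditions the
# Wald estimator (E[Y|Z=1]-E[Y|Z=0]) / (E[A|Z=1]-E[A|Z=0]) recovers the
# causal risk difference (in the compliers, under monotonicity).
ey_z1, ey_z0 = 0.14, 0.10   # outcome risk by assignment arm (ITT contrast)
ea_z1, ea_z0 = 0.80, 0.00   # proportion actually treated, by arm

itt = ey_z1 - ey_z0          # diluted by the 20% noncompliance
compliance = ea_z1 - ea_z0
iv_estimate = itt / compliance

print(round(itt, 3), round(iv_estimate, 3))  # 0.04 inflated to 0.05
```

This makes the note about the numerator concrete: the ITT effect (0.04) is scaled up by the compliance difference in the denominator, and the weaker the instrument (small denominator), the more any small bias in the numerator blows up.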
o Differential loss to follow-up can occur in observational and randomized
studies.
o Volunteer bias (self-selection bias) cannot occur in RCT (no association
between treatment and selection).
o Healthy worker bias another selection bias, will occur in RCT or observational
study b/c initial study just incorrect.
o Our example: would have to stratify on age (or time scale) to eliminate this
selection bias. This is time-varying, have to deal with that!
(DAG from the course example relating Vit D, HD, CC, and Age; figure lost in extraction)
o Selection bias can be solved by: 1) stratification, or 2) IPW. IPW always
works; however, stratification requires that the confounding variable (A in
above) CANNOT be a collider as well! IPW can simultaneously adjust for
confounding and selection bias.
o Hazard at t = Pr[Yt=1 | Yt-1=0], i.e. hazard at t1 = Pr[Y1=1], hazard at t2 =
Pr[Y2=1 | Y1=0]
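The discrete-time hazard defined above conditions on surviving the previous interval, so each hazard divides new cases by the current risk set. A small sketch with invented counts:

```python
# Hypothetical closed cohort: hazard at interval k is
# Pr[Y_k = 1 | Y_{k-1} = 0] = (new cases in k) / (at risk entering k).
at_risk = 1000
new_cases = [50, 40, 30]   # cases in intervals 1..3 (invented)

hazards = []
survivors = at_risk
for k, d in enumerate(new_cases, start=1):
    h = d / survivors       # conditional on having survived so far
    hazards.append(h)
    survivors -= d
    print(k, round(h, 4))
```

Note the denominator shrinks each interval (1000, 950, 910), which is exactly why treating CI as hazard x time overestimates cases: the risk set is not constant.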
o In RCT and obs. studies, built in selection bias structure, data cannot
determine between selection bias or EMM.
o If HR reverses, strong indication of survivor cohort.
o Measurement error: think 1) independence (i.e. errors for the exposure and
outcome are independent), 2) nondifferentiality (i.e. errors for the exposure
(outcome) are independent of the true value of the outcome (exposure))
o Measurement bias can exist under any of the 4 types of error (and can be non-
conservative). Nondifferential and independent → no bias under the null.
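The "no bias under the null" point can be demonstrated by applying the same sensitivity/specificity to cases and noncases (independent, nondifferential exposure misclassification). All counts and error rates below are invented:

```python
# Nondifferential exposure misclassification: same sensitivity and
# specificity in cases and noncases (independent errors).
se, sp = 0.8, 0.9

def misclassify(exposed, unexposed):
    """Expected counts classified as (exposed, unexposed)."""
    obs_exp = se * exposed + (1 - sp) * unexposed
    return obs_exp, exposed + unexposed - obs_exp

def risk_ratio(a1, n1, a0, n0):
    return (a1 / n1) / (a0 / n0)

# Truth: 1000 exposed with risk 0.2, 1000 unexposed with risk 0.1 (RR = 2).
tot_e, tot_u = misclassify(1000, 1000)
cases_e, cases_u = misclassify(200, 100)
rr_alt = risk_ratio(cases_e, tot_e, cases_u, tot_u)
print(round(rr_alt, 3))   # 1.598: attenuated from 2 toward the null

# Under the null (both risks 0.1) the observed RR stays exactly 1.
cases_e, cases_u = misclassify(100, 100)
rr_null = risk_ratio(cases_e, tot_e, cases_u, tot_u)
print(round(rr_null, 3))  # 1.0
```

Away from the null the same errors bias the RR toward 1 (here 2 becomes about 1.6), which is the usual "nondifferential misclassification attenuates" result; at the null there is nothing to attenuate.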
o Sources of structural bias: 1) pre-existing: common causes, 2) study-related:
selection bias, measurement bias.
o GLMs: 1) a functional form, 2) a statistical distribution (e.g. linear models:
errors are assumed to be independent and normally distributed ~ N(0, var.)).
o MSM: marginal = in pseudo-population, structural = causal interpretation.
o Create pseudo-population by IPW, then fit model to pseudo-population
to model counterfactual outcome variables (causal interpretation under
cond’l exchangeability). IP weights can be estimated by using models,
too.
o Unstabilized weights don't work well when modeling (high weight to subjects
with low probability of receiving the exposure level that they received;
estimators with large variance – only an issue when modeling).
o For IPW weights, if want to adjust for L, then L must not be in the numerator,
only in the denominator.
o IPW/g-estimation MSM: 1) model exposure given covariates (weights), 2)
model outcome given the exposure. Model misspecified if either incorrect.
o When a covariate is in the numerator and denominator of IP weights, it is not
adjusted for.
2009 Notes
Two types of violations of positivity
o Structural: subjects with certain confounder values cannot possibly be
exposed
Causal inferences cannot be made about subsets with structural
nonpositivity
o Random: sample not infinite so, if stratify on many confounders, start finding
zero cells
Use parametric models to smooth over the zeros (borrowing info from
other subjects), assuming random nonpositivity (MAR)
IPW creates pseudo-population with unconditional exchangeability
Estimation of parameters of MSMs (Robins 1998, Hernan 2000 BIO 223)
o Use weighted regression model under the assumptions of conditional
exchangeability
o Parameter estimate for β obtained using the weights
o Use robust variance to compute a conservative 95% CI
Structural: models for counterfactual outcome variables; Marginal: unconditional
Time-varying exposures: effect estimates from conditional models may not have a
causal interpretation, even under conditional exchangeability!
Using IPW to estimate parameters in MSM is analogous to 1) building PS model
(weights), 2) plugging PS into model that models outcome!!
IPW: individuals who are most underrepresented in relative treatment assignments
must be given proportionally higher weights – becomes unstable!
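The instability described above is visible in a point-exposure toy example (probabilities invented, with Pr[A=1|L] pushed near a positivity violation): subjects whose treatment was unlikely given L get enormous unstabilized weights, and stabilizing shrinks them:

```python
# Pr[A=1|L] depends strongly on L, so some subjects are heavily
# underrepresented in their treatment arm.
pr_L1 = 0.5
pr_A_given_L = {0: 0.05, 1: 0.95}          # near-positivity violation
pr_A = pr_L1 * 0.95 + (1 - pr_L1) * 0.05   # marginal Pr[A=1] = 0.5

rows = []
for L in (0, 1):
    for A in (0, 1):
        p = pr_A_given_L[L] if A == 1 else 1 - pr_A_given_L[L]
        w = 1 / p                                   # unstabilized weight
        sw = (pr_A if A == 1 else 1 - pr_A) / p     # stabilized weight
        rows.append((L, A, w, sw))
        print(L, A, round(w, 2), round(sw, 2))
```

A treated subject with L=0 gets W = 20 but SW = 10; with more extreme Pr[A=1|L] the gap grows, which is why stabilized weights (marginal probability in the numerator) give less variable estimators when the weight model is used for modeling.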
To show relationships between SNMs:
o We can write conditional exchangeability as: Y^a indep. A | L=l for all a, i.e.
Pr[A=1 | L=l, Y^a=1] = Pr[A=1 | L=l, Y^a=0] = Pr[A=1 | L=l]
o Thus, probability of exposure does not depend on value of counterfactual
outcome, conditional on covariates
IF we had counterfactual outcomes, could fit logistic model to check for conditional
exchangeability:
o logit Pr[A=1 | L, Y^(a=0)] = α0 + α1·Y^(a=0) + α2·L
o check if α1 = 0!
G-null test
o Test of sharp null hypothesis: Y a=0= Y a=1for all subjects i
o Assume conditional exchangeability holds
o Test: if the sharp null hypothesis is true: logit Pr[A=1 | L, Y^(a=0)] = α0 + α1·Y^(a=0) + α2·L,
where α1 = 0
o model used for g-null test is same model used for estimating denominator of
IP weights plus Y
o tested null by regressing exposure on outcome, instead of other way around
works for time-varying covariates
o IPW estimates parameters in MSMs, g-estimation estimates parameters in
SNMs
Can use g-estimation AND IPW simultaneously to adjust for confounding (g-
estimation) AND selection bias (IPW)
o i.e., find the value of ψ that produces the value of α1 closest to 0 in the
IP-weighted model: logit Pr[A=1 | L, H(ψ)] = α0 + α1·H(ψ) + α2·L,
using estimates of the weights SW*
Causal effects with censoring: E[Y^(a=1,c=0)] – E[Y^(a=0,c=0)], i.e., the causal effect of
exposure had nobody's outcome been censored
If possible, use both methods separately and compare answers!
Summary (reconstructed table): assumptions and IPW/MSM for non-dynamic and dynamic regimes

Assumptions
o Non-time-varying exposure:
1) Consistency: if A=a for a given subject, then Y^a = Y for that subject
2) Conditional exchangeability: Y^a indep. A | L=l for each possible value a of A and l of L
3) Positivity: there are treated and untreated subjects (think about this in terms of intractable confounding)
When exposure is unconditionally randomized, both the g-formula and the IPTW formula equal the crude association
o Time-varying exposure:
1) Consistency: if A_bar = a_bar for a given subject, then Y^a_bar = Y for that subject
2) Conditional exchangeability: Y^a_bar indep. A(t) | A_bar(t-1) = a_bar(t-1), L_bar(t) = l_bar(t) for all regimes a_bar and all l_bar(t) (i.e., sequential randomization)
3) Positivity: there are treated and untreated for each treatment and covariate history (i.e., f(a(t)) > 0 at all t)
o Dynamic regimes:
A random dynamic regime = sequentially randomized experiment
Strengthened identifiability conditions: 1) strengthened consistency, 2) strengthened conditional exchangeability, 3) strengthened positivity
Under these conditions, g-methods can be used to estimate not only E[Y^g] but also the optimal treatment regime; the optimal (deterministic) regime will always be non-random
o Notes:
When consistency and conditional exchangeability fail to hold, IPW and the g-formula are still well-defined, but have no causal interpretation
When positivity fails to hold for treatment level a, IPW remains well-defined but has no causal interpretation, while the g-formula is undefined for treatment level a

IPW / MSM
o Effectively simulates the data that would have been observed had, contrary to fact, exposure been unconditionally randomized
o Both SW and W create pseudo-populations in which (i) the mean of Y^a is identical to that in the actual study population but (ii) exposure A is independent of L, so there is no L → A arrow in the pseudo-population; the only difference: under W, Pr(A=1) = 1/2, while under SW, Pr(A=1) is as in the actual population (where association = causation)
o Steps: 1) model the IP weights (PS) using logistic regression, 2) estimate the MSM parameters by calculating means in the pseudo-population (i.e., weighted regression); can also include baseline covariates to evaluate effect modification
o Non-time-varying (two-period example):
Saturated MSM: E[Y^(a0,a1)] = β0 + β1·a0 + β2·a1 + β3·a0·a1
If treatment is assumed additive (non-saturated): E[Y^(a0,a1)] = β0 + β1·(a0 + a1)
o Time-varying exposures:
MSMs for time-varying exposures are typically non-saturated
SW = prod over k of f{A(k) | A_bar(k-1)} / f{A(k) | A_bar(k-1), L_bar(k)}
W = prod over k of 1 / f{A(k) | A_bar(k-1), L_bar(k)}
If the numerator is misspecified, this will not result in bias!
Can add baseline covariates into the numerator, but then must control for them in the MSM!!
Use logistic regression to estimate the weight components, i.e., the PS and 1−PS (e.g., numerator p*_0i = P[A0 = a_0i], denominator p_0i = P[A0 = a_0i | L0 = l_0i])
o Dynamic regimes:
MUST USE UNSTABILIZED WEIGHTS: W = prod of 1 / [f{A(0)} · f{A(1) | A(0), L(1)}]
The IPW estimate of E[Y^(g=(1,L(1)))] is the average of Y among the subjects in the unstabilized pseudo-population who followed regime g = (1, L(1)), so calculate E(Y) only for those that followed the regime
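The time-varying SW and W products above can be made concrete for a two-interval treatment history. The conditional probabilities are invented; the sketch just multiplies the per-interval factors:

```python
# Two-interval history: SW multiplies f(A(k)|A_bar(k-1)) / f(A(k)|A_bar(k-1), L_bar(k))
# over k; the unstabilized W uses only the denominator.
num = {0: 0.6, 1: 0.7}   # f(A(k)=a_k | treatment history)          (invented)
den = {0: 0.4, 1: 0.9}   # f(A(k)=a_k | history plus covariate L)   (invented)

sw = 1.0
w = 1.0
for k in (0, 1):
    sw *= num[k] / den[k]
    w *= 1 / den[k]       # unstabilized: denominator only

print(round(sw, 3), round(w, 3))  # 1.167 vs 2.778
```

The stabilized weight stays near 1 because the numerator cancels most of the denominator, while the unstabilized weight grows with every interval, which is why SW is preferred for non-saturated MSMs but unstabilized W is required for dynamic regimes.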