Item Regression: Multivariate Regression Models
[Figure: path diagrams of covariates X1 and X2, items Y, and latent variable η]
Why Multivariate Approach?
• Latent variable approach makes stronger
assumptions
• Assumes underlying construct for which
Y’s are “symptoms”
• Multivariate model is more exploratory
• Based on findings from MV model, we may
adopt latent variable approach.
Data Setup for Individuals 1 and 2
item (Y)    ID    Visual Acuity    Age
y11         1     x11              x12      (Person 1)
We have a “block”
$$\sum_{k=2} \delta_k \, I(j = k) \times va_i$$
$$se(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2 \sum_{i=1}^{N} \left[ (x_i - \bar{x})^2 / n_i \right]}{\left[ \sum_{i=1}^{N} (x_i - \bar{x})^2 \right]^2}}$$
• This is a valid analysis: We first fit the SLR and then
correct the standard error of the slope.
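As a rough illustration of this two-step idea, the sketch below fits the SLR by ordinary least squares and then computes the slope standard error as the square root of σ̂² Σ[(xi − x̄)²/ni] divided by [Σ(xi − x̄)²]². The function name, the toy data, and passing σ̂² in as an argument are assumptions made for illustration; the slides do not say how σ̂² is obtained.

```python
import math

def slope_with_corrected_se(x, y, n, sigma2):
    """OLS slope of y on x, with the slope SE corrected for unequal n_i.

    x, y   : one value per unit (y[i] is based on n[i] observations)
    n      : number of observations behind each y[i]
    sigma2 : assumed estimate of the per-observation variance (hypothetical input)
    """
    N = len(x)
    xbar = sum(x) / N
    ybar = sum(y) / N
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Step 1: ordinary SLR slope
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    # Step 2: corrected variance of the slope
    var_slope = sigma2 * sum((xi - xbar) ** 2 / ni for xi, ni in zip(x, n)) / sxx ** 2
    return slope, math.sqrt(var_slope)

# Made-up data for illustration only
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
n = [2, 20, 20, 2]
slope, se = slope_with_corrected_se(x, y, n, sigma2=4.0)
```

When every ni equals 1, the expression reduces to the usual SLR result, sqrt(σ̂² / Σ(xi − x̄)²).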
Fitting Approach #2
• Marginal Model (GEE or ML)
– Approach #1 is okay, but not as good as simultaneously
estimating the mean model and the association model
(i.e., we can iterate between the two and update the
estimates each time).
– We estimate regression coefficients using a procedure
that accounts for lack of independence, and specifically
the correlation structure that you specify.
– Correlation structure is estimated as part of the model.
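The iteration between the mean model and the association model can be sketched for the simplest case: a linear model with one covariate and an exchangeable working correlation. This is only a sketch under stated assumptions. It uses simple moment estimators without degrees-of-freedom corrections, so its numbers need not match any package. It does not compute robust standard errors, and it does not guard against a correlation estimate outside the valid range. All function names and the toy data are made up.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def apply_exch_inverse(v, rho):
    """Multiply v by the inverse of the exchangeable matrix (1 - rho) * I + rho * J."""
    n = len(v)
    c = rho / (1.0 + (n - 1) * rho)
    s = sum(v)
    return [(vi - c * s) / (1.0 - rho) for vi in v]

def update_beta(clusters, rho):
    """Mean model step: solve the estimating equations for (b0, b1) at a fixed rho."""
    a00 = a01 = a11 = c0 = c1 = 0.0
    for xs, ys in clusters:
        ones = [1.0] * len(xs)
        r_one = apply_exch_inverse(ones, rho)
        r_x = apply_exch_inverse(xs, rho)
        a00 += dot(ones, r_one)
        a01 += dot(xs, r_one)
        a11 += dot(xs, r_x)
        c0 += dot(r_one, ys)
        c1 += dot(r_x, ys)
    det = a00 * a11 - a01 * a01
    return (c0 * a11 - a01 * c1) / det, (a00 * c1 - a01 * c0) / det

def update_rho(clusters, b0, b1):
    """Association step: moment estimate of the common within-unit correlation."""
    sum_sq = sum_pairs = 0.0
    n_obs = n_pairs = 0
    for xs, ys in clusters:
        res = [y - b0 - b1 * x for x, y in zip(xs, ys)]
        sq = dot(res, res)
        sum_sq += sq
        n_obs += len(res)
        sum_pairs += (sum(res) ** 2 - sq) / 2.0  # sum of res_j * res_k over pairs j < k
        n_pairs += len(res) * (len(res) - 1) // 2
    phi = sum_sq / n_obs  # residual variance
    return sum_pairs / (n_pairs * phi)

def fit_gee_exchangeable(clusters, n_iter=20):
    rho = 0.0
    b0, b1 = update_beta(clusters, rho)  # rho = 0 is ordinary least squares
    for _ in range(n_iter):
        rho = update_rho(clusters, b0, b1)
        b0, b1 = update_beta(clusters, rho)
    return b0, b1, rho

# Made-up data: 4 units, 3 observations each, with a unit-level offset
offsets = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, -1.0, 0.5]
xs = [0.0, 1.0, 2.0]
clusters = [(xs, [1.0 + 2.0 * x + u + e for x, e in zip(xs, noise)]) for u in offsets]
b0, b1, rho = fit_gee_exchangeable(clusters)
```

Each pass re-estimates the correlation from the current residuals and then re-solves for the coefficients using that correlation.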
Related Example Revisited:
Drinks per week
• If Y1 is based on 2 observations (i.e. 2 weeks), and
Y2 is based on 20 observations (i.e. 20 weeks), we
want to account for that.
• We want to “weight” individuals with more
observations more heavily because they have more
“precision” in their estimate of Y.
• Result: The weight for individual i is proportional to ni.
• The resulting regression is more efficient because the
estimation procedure accounts for this.
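The weighting idea can be sketched as a weighted least squares fit in which individual i gets weight ni. This is a minimal illustration, not the course's own code; the function name and the toy numbers are made up.

```python
def weighted_slr(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x with weights w."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical data: y[i] is person i's mean drinks per week, averaged over n[i] weeks
x = [0.0, 1.0, 2.0]
y = [1.0, 2.0, 9.0]
n = [20, 20, 2]

b0_w, b1_w = weighted_slr(x, y, n)          # weights proportional to n_i
b0_u, b1_u = weighted_slr(x, y, [1, 1, 1])  # unweighted fit for comparison
```

In this toy data the third person has only 2 weeks of observations, so the weighted fit lets that imprecise value pull the slope much less than the unweighted fit does.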
Fitting Approach #2 (continued)
• Here we use the within unit correlation to compute
the weights.
• GEE solution: “working correlation”
• If the specified structure is good, the regression
coefficients are estimated very efficiently.
• If the specified structure is bad, the coefficients and
standard errors are still valid, but less efficient.
• ROBUST PROCEDURE
Fitting this for the Vision example
Approach 1: too complex to be feasible. Need to know all of the
associations and adjust many estimates.
Approach 2: account for correlation in estimation procedure
In Stata:
Logistic model:
xtgee y va age i2 i3 i4 i5 va2 va3 va4 va5, i(id) ///
    family(binomial) link(logit) corr(exchangeable) robust
Linear model:
xtgee y x, i(id) corr(exchangeable) robust
Problem with Approach #1
• Often correlation structure is more complex (our
example was very simple compared to most
situations)
• Post-hoc adjustments won’t always work because
estimating the correlation structure is not as
simple.
• In general, people don’t use Approach #1,
especially because many statistics packages (Stata, S-PLUS,
R, SAS) now handle the adjustment directly.
How do I know the correlation
structure?
• You don’t usually.
• Approaches commonly used for multivariate
outcome
– Exchangeable:
• An individual’s items are all equally correlated with each other.
• Simple and intuitive, easy to estimate and describe.
• Could be a bad assumption.
– Unstructured:
• Uses empirical estimates from the data.
• Less prone to model misspecification.
• Less powerful, since more correlation parameters must be estimated.
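To make the trade-off concrete, the sketch below builds an exchangeable correlation matrix and counts how many correlation parameters each structure has to estimate. The function names, the choice of 5 items, and the value 0.3 are illustrative assumptions.

```python
def exchangeable_corr(n_items, rho):
    """n_items x n_items correlation matrix with the same rho off the diagonal."""
    return [[1.0 if j == k else rho for k in range(n_items)] for j in range(n_items)]

def n_corr_params(n_items, structure):
    """Number of correlation parameters the structure has to estimate."""
    if structure == "exchangeable":
        return 1  # one shared correlation
    if structure == "unstructured":
        return n_items * (n_items - 1) // 2  # one per pair of items
    raise ValueError("unknown structure: " + structure)

R = exchangeable_corr(5, 0.3)
n_exch = n_corr_params(5, "exchangeable")
n_unst = n_corr_params(5, "unstructured")
```

With 5 items, exchangeable needs 1 correlation parameter and unstructured needs 10, one for each pair of items.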
Summarizing Findings
(1) Constrain equal slopes across items
(2) Constrain slopes that should be
constrained, and allow others to vary
(3) Detailed summary discussion that covers
everything
(4) Complicated: joint tests/CI’s for groups
of items
Multiple Regression Results:
Odds Ratio between items estimated to be 8.69
West SK, Munoz B, Rubin GS, Schein OD, Bandeen-Roche K, Zeger S, German S, Fried LP. Function
and visual impairment in a population-based study of older adults. The SEE project. Salisbury Eye
Evaluation. Invest Ophthalmol Vis Sci. 1997 Jan;38(1):72-82.
Alternate Approach
• Use Bayesian (hierarchical) approach to model estimation
• Models correlation by assuming that ‘like’ parameters come
from a common distribution.
$$\beta_j \sim N(\beta, \sigma_\beta^2)$$
• We estimate β and σβ as part of the model.
• If the βj’s are not similar, then σβ will be large.
• Like a ‘random effects’ model, but broader.
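One way to see what the common distribution does is the textbook normal-normal case. Suppose item j has an estimate bj with standard error sj, and βj ~ N(β, σβ²) with β and σβ treated as known. Then the posterior mean of βj is a precision-weighted average of bj and β. Fixing β and σβ is a simplification made only for this sketch (in the approach described here they are estimated as part of the model), and the function name and all numbers are made up.

```python
def shrunken_estimate(b_j, s_j, beta, sigma_beta):
    """Posterior mean of beta_j when b_j ~ N(beta_j, s_j^2) and beta_j ~ N(beta, sigma_beta^2)."""
    w_data = 1.0 / s_j ** 2          # precision of the item's own estimate
    w_prior = 1.0 / sigma_beta ** 2  # precision of the common distribution
    return (w_data * b_j + w_prior * beta) / (w_data + w_prior)

# Small sigma_beta: the estimate is pulled strongly toward the common mean
tight = shrunken_estimate(b_j=2.0, s_j=1.0, beta=0.0, sigma_beta=0.5)
# Large sigma_beta: the estimate stays close to the item's own value
loose = shrunken_estimate(b_j=2.0, s_j=1.0, beta=0.0, sigma_beta=10.0)
```

A large σβ means the items are treated as dissimilar and little strength is borrowed across them. A small σβ pools them heavily.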
A New Example:
Hyper-Methylation of Genes and Breast Cancer
• Background:
– Methylation of certain genes is thought to be associated with
different prognosis for breast cancer
– Goal is to determine what risk factors are associated with
methylation of genes
– Methylation statuses of the genes are highly correlated.
– We don’t have a very big dataset (N=111 breast cancer tissue
samples)
Mehrotra, J., Ganpat, M.M., Kanaan, Y., Fackler, M.J., McVeigh, M., Lahti-Domenici, J.,
Polyak, K., Argani, P., Naab, T., Garrett, E.S., Parmigiani, G., Broome, C., Sukumar, S.
ER/PR-negative breast cancers of young African American women have a higher frequency
of methylation of multiple genes than those of Caucasian women. Clinical Cancer
Research, 10(6):2052-2057, 2004.
Data
• Genes: HIN-1, Twist, Cyclin D2, RAR-beta, and
RASSF1A
• Risk factors:
– AfAm vs. Caucasian
– Age < 50 versus > 50
– Estrogen Receptor Status (+/-)
• Only 111 patients in the dataset
• Data is somewhat ‘sparse’
– For HIN-1, if we tabulate methylation by race, age, and
ER, we have empty cells.
– Can’t estimate saturated model (i.e. three-way
interaction)
Modeling Issues
• By fitting multivariate model, we get good stuff:
– WE ACCOUNT FOR CORRELATION AMONG GENES
– WE BORROW STRENGTH ACROSS GENES
– WE CAN SUMMARIZE ASSOCIATIONS OF RISK FACTORS
WITH GENES
• Notation:
– yij = 1 if gene j in tumor i is methylated.
– racei = 1 if tumor i is from an AfAm patient
– ERi = 1 if tumor i is ER+
– agei = 1 if the age of person i is < 50
• Notation is simplified from previous example.
Started with main effects ‘hierarchical’ model:
$$\operatorname{logit} P(y_{ij} = 1) = \beta_0 + \beta_j + \gamma_j \, \mathrm{race}_i + \alpha_j \, \mathrm{age}_i + \delta_j \, \mathrm{ER}_i$$
Assume that ‘like’ parameters are from common distribution
With 5 genes and 3 covariates, we have 20 parameters* to
estimate in this model.
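To show how the main-effects model above turns covariates into a methylation probability, the sketch below evaluates the linear predictor for one gene and applies the inverse logit. The coefficient values are hypothetical, chosen only to illustrate the calculation; they are not estimates from the study.

```python
import math

def methylation_prob(beta0, beta_j, gamma_j, alpha_j, delta_j, race, age, er):
    """P(y_ij = 1) for gene j under the main-effects model, via the inverse logit."""
    eta = beta0 + beta_j + gamma_j * race + alpha_j * age + delta_j * er
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for one gene (not from the study)
p_ref = methylation_prob(0.0, 0.0, 0.7, -0.2, 0.4, race=0, age=0, er=0)
p_afam = methylation_prob(0.0, 0.0, 0.7, -0.2, 0.4, race=1, age=0, er=0)
```

With these made-up values, changing race from 0 to 1 adds γj to the log-odds, so exp(γj) is the odds ratio for race on gene j.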