Checking the model: linearity
• Average value of outcome initially assumed to be linear function of continuous predictors
– slope of regression line assumed constant
– equivalently, regression line has no curvature
• If model is correct
– residuals have mean zero at every value of predictor
• If assumption badly violated, result can be
– biased coefficient estimates, residual confounding
– reduced precision and power, missed real effects
– misleading, over-simplified conclusions
Three departures from linearity
[Figure: panels plotting E[y|x] against x, each comparing a linear fit with a LOWESS smooth]
Diagnostics: RVP and CPR plots
• To account for effects of other predictors, diagnostics use
residuals rather than outcome
• Basic approach: check for non-linear patterns in residual versus predictor (RVP) plots, one for each continuous predictor
• Better alternative: component plus residual (CPR) plots
– component due to predictor added back into residual
• CPR plots better for diagnosing non-linearity:
– show trend, RVP plots do not
– easier to add LOWESS smooth
• Need to use RVP for quadratic, other polynomial models
– e.g., $E[Y|X] = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$
• In both CPR and RVP: a mismatch between the linear regression line and the LOWESS smooth indicates lack of linearity
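
A minimal sketch of both diagnostics in Stata. It assumes Stata's bundled auto dataset as a stand-in for the BMD data, so the variable names (price, weight, mpg, foreign) are illustrative only.

```stata
* Stand-in data: Stata's bundled auto dataset, not the BMD data from the slides
sysuse auto, clear
regress price weight mpg foreign

* CPR plot for weight, with a LOWESS smooth added to the linear component
cprplot weight, lowess

* RVP plot for weight; yline(0) marks the expected mean of the residuals
rvpplot weight, yline(0)

* RVP plot built by hand so that a LOWESS smooth of the residuals can be overlaid
predict res, residuals
twoway (scatter res weight) (lowess res weight), yline(0)
```

In either plot, a smooth that bends away from the straight line is the signal to look for.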
RVP plot for weight and BMD
[Figure: BMD residuals versus weight (kg), with LOWESS smooth of the residuals]
CPR plot for weight and BMD
[Figure: BMD component plus residual versus weight (kg)]
Solution: transform continuous predictors
• Smooth predictor transformations to fix non-linearity:
– log(x), provided E[Y|X] is “monotone”
– square root, cube root, other fractional powers of x
– $x^2$, $x^3$ (lower order terms usually included in the model)
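
The transformations listed above might be tried as follows. This is a sketch only, again assuming the bundled auto dataset as a stand-in; lweight, weight2 and res_quad are names introduced for the example.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Log transformation of the predictor, then re-check linearity with a CPR plot
generate lweight = ln(weight)
regress price lweight mpg foreign
cprplot lweight, lowess

* Quadratic model: keep the lower-order term in the model
generate weight2 = weight^2
regress price weight weight2 mpg foreign

* For polynomial models, check the residuals against the original predictor
predict res_quad, residuals
twoway (scatter res_quad weight) (lowess res_quad weight), yline(0)
```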
Predictor transformations
[Figure: four panels plotting transformations of x against x: square of x, square and cube of x, log of x, square root of x]
CPR plot for log-weight and BMD
[Figure: BMD component plus residual versus natural log of weight]
RVP plot for log-weight and BMD
[Figure: BMD residuals versus natural log of weight, with LOWESS smooth of the residuals]
Alternatives: categorize the predictor
• Split at quantiles or clinically familiar cutpoints
• Models mean as a “step function”
• Flexible, familiar, clinically interpretable, but
– ‘unrealistic’ if the regression line changes smoothly, sensitive to choice of cutpoints, inefficient compared to smooth transformations
• Number of categories must balance fit against noisiness
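
A sketch of the quantile-based version, assuming the bundled auto dataset as a stand-in; the choice of four categories and the name wtcat are illustrative.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Split the predictor at its quartiles
xtile wtcat = weight, nquantiles(4)

* Check the size and range of each category
tabstat weight, by(wtcat) statistics(n min max)

* Fit the step function: one mean shift per category relative to the lowest
regress price i.wtcat mpg foreign

* Joint test of the category indicators
testparm i.wtcat
```

More categories track the shape more closely but leave fewer observations per step, which is the trade-off noted above.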
• Similar code for testing within wild type group
Full disclosure: testing for between-group differences is complicated
foreach day in 30 60 90 {
    * calculate values of spline variables at 30, 60, and 90 days after infection
    * see mkspline entry of Stata online PDF manual, page 1057
    * requires variables k1-k5 giving knot locations
    local sp1 = `day'
    forvalues i = 1/3 {
        local j = `i'+1
        local sp`j' = (max(0,(`day'-k`i')^3)- ///
            (max(0,(`day'-k4)^3)*(k5-k`i')-max(0,(`day'-k5)^3)*(k4-k`i'))/(k5-k4))/(k5-k1)^2
    }
    * estimate and test difference between wild type and drug resistant groups
    lincom Anyresistance ///
        + `sp1'*(dursp1_1-dursp1_0) ///
        + `sp2'*(dursp2_1-dursp2_0) ///
        + `sp3'*(dursp3_1-dursp3_0) ///
        + `sp4'*(dursp4_1-dursp4_0)
    display "Above: test for between-group differences at day `day'"
}
But results are suggestive ....
------------------------------------------------------------------------------
       logvl |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.2881681   .1521503    -1.89   0.058    -.5863772     .010041
------------------------------------------------------------------------------
Above: test for between-group differences at day 30

------------------------------------------------------------------------------
       logvl |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.3794769   .1082518    -3.51   0.000    -.5916466   -.1673072
------------------------------------------------------------------------------
Above: test for between-group differences at day 60

------------------------------------------------------------------------------
       logvl |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |  -.2368644   .0982155    -2.41   0.016    -.4293632   -.0443657
------------------------------------------------------------------------------
Above: test for between-group differences at day 90
Checking linearity: summary
• Diagnostics:
– linear models: curved LOWESS smooth in CPR or RVP plot
– more generally (i.e., linear, logistic, Cox models): fit restricted cubic spline, test for departure from linearity using testparm for all but first spline component
• Solutions: transform predictor, use linear or cubic splines
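
The spline-based test can be sketched as below. It assumes the bundled auto dataset as a stand-in and an arbitrary choice of 4 knots; wsp is a name introduced for the example.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Restricted cubic spline in weight with 4 knots: creates wsp1, wsp2, wsp3
* (wsp1 is weight itself; wsp2 and wsp3 carry the non-linear part)
mkspline wsp = weight, cubic nknots(4)

regress price wsp1 wsp2 wsp3 mpg foreign

* Test for departure from linearity: all spline components except the first
testparm wsp2 wsp3
```

The same mkspline and testparm steps apply after logistic or stcox, since the test only involves the spline coefficients.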
Checking the model: normality
• t- and F-tests, CIs based on Normality of errors
• Fairly robust to violations, especially short-tailed errors in larger samples
• However, long-tailed errors can degrade power, precision
• Diagnostics: Q-Q and other plots of residuals
– formal tests for normality lack power in small samples, where violations matter most
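
A sketch of the residual plots, assuming the bundled auto dataset as a stand-in for the course data; res is a name introduced for the example.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear
regress price weight mpg foreign
predict res, residuals

* Quantile-normal (Q-Q) plot: curvature away from the diagonal suggests non-normality
qnorm res

* Kernel density estimate and histogram, each with a normal density overlaid
kdensity res, normal
histogram res, normal

* Boxplot to show skewness and long tails
graph box res
```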
Diagnosing departures from normality
[Figure: panels of residuals, including density plots and a quantile-normal plot (residuals versus Inverse Normal)]
Solution: transform the outcome
• Residuals skewed (usually to the right):
– log, square root, other power transformations
– may need to add constant to make all values positive
• Search for best transformation using qladder command
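
A sketch of the search, assuming the bundled auto dataset, with price standing in for a right-skewed outcome such as LDL; lprice and lres are names introduced for the example.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Quantile-normal plots of the outcome under each ladder-of-powers transformation
qladder price

* Numerical companion: normality test for each transformation
ladder price

* Refit on the log scale and re-check the residuals
generate lprice = ln(price)
regress lprice weight mpg foreign
predict lres, residuals
qnorm lres
```

Note that qladder looks at the outcome itself, so it is worth re-checking the residuals after refitting, as in the last three lines.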
[Figure: quantile-normal plots by transformation for LDL cholesterol, mg/dL]
Residuals of log-transformed LDL
[Figure: panels of residuals, including a histogram (Fraction), a kernel density estimate, and a quantile-normal plot (residuals versus Inverse Normal)]
Another solution: bootstrap CIs
• Resample N observations with replacement from data, re-fit model, store estimates, repeat 100, 500, 1,000 times or more
• Distribution of bootstrap estimates approximates sampling distribution of actual estimate
• Quick, partial solution:
1. replace model-based SE by SD of bootstrap estimates
2. construct CIs assuming Normality
A better solution: percentile bootstrap CIs
• 95% CI: 2.5th to 97.5th percentile of bootstrap estimates
• Bias-correction shifts CI slightly to right or left
• Slower but avoids making Normality assumption
• Requires using many (≥ 1,000) bootstrap samples
– extreme percentiles are noisy!
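
Both bootstrap approaches can be sketched with Stata's bootstrap prefix. The example assumes the bundled auto dataset as a stand-in; the seed and the 1,000 replications are illustrative choices.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Fix the seed so the resampling is reproducible
set seed 12345

* Resample observations with replacement, re-fit, and store the coefficients.
* The default output uses the bootstrap SE with Normal-based CIs (the quick solution).
bootstrap _b, reps(1000) nodots: regress price weight mpg foreign

* Percentile and bias-corrected CIs from the same bootstrap estimates
estat bootstrap, percentile bc
```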
Solution: model a transform of the mean
(rather than a transform of the outcome)
• Logistic model for binary outcomes uses logit transformation of E[Y|X] = Pr[Y = 1|X]:
$$\log \frac{E[Y|X]}{1 - E[Y|X]} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \tag{1}$$
• Other generalized linear models (GLMs) avoid dichotomizing
outcome, generally use log E[Y |X] (Biostat 209)
– gamma, Poisson, negative binomial, zero-inflated Poisson
and negative binomial
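
A sketch of the two cases, assuming the bundled auto dataset as a stand-in: foreign plays the role of a binary outcome and price the role of a right-skewed positive outcome.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Binary outcome: logistic regression models the logit of E[Y|X] = Pr[Y = 1|X]
logit foreign weight mpg

* Right-skewed positive outcome: gamma GLM with log link models log E[Y|X]
glm price weight mpg foreign, family(gamma) link(log)

* Count outcomes follow the same pattern, e.g. family(poisson) or family(nbinomial)
```

With the log link, the outcome stays on its original scale and exponentiated coefficients are read as ratios of means.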
Another solution: ordinal models
• Agatston scores for coronary artery calcium (CAC) mostly
zeroes with long right tail
• Log-transformation (after adding 1) does not help: still mostly
zeroes with long right tail
• Could dichotomize outcome as CAC > 0 or CAC > 10, use
logistic model – but potentially wasteful
• Alternatively, categorize CAC as 0, 1-9, 10-99, 100-399, ≥ 400, use regression model for ordinal outcomes
– proportional odds (ologit)
– continuation ratio (ocratio)
• Proportional odds assumption relaxed using gologit2
• Steve will briefly cover these
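
A sketch of the proportional odds model only, assuming the bundled auto dataset as a stand-in: price is cut at its quartiles to mimic a categorized skewed outcome, and pricecat is a name introduced for the example. The ocratio and gologit2 commands are user-written and are not shown here.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Categorize a skewed continuous outcome into ordered groups (quartiles here;
* the CAC example on the slide uses clinically chosen cutpoints instead)
xtile pricecat = price, nquantiles(4)
tabulate pricecat

* Proportional odds model, reporting odds ratios
ologit pricecat weight mpg foreign, or
```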
Checking normality: summary
• Diagnostics: curvature in QQ-plot
• Solutions: transform outcome, use bootstrap percentile CIs,
or GLM or ordinal model
Checking the model: constant variance
• If constant variance assumption is violated
– coefficient estimates unbiased but inefficient
– tests for between-group differences may be invalid
– unlike Normality problems, larger samples don’t help
Diagnostics: constant variance
• Plot residuals against fitted values, predictors
– check for horizontal funnel shapes
• Compare sample size, variance of residuals across subgroups:
– watch out if both differ by factors of more than 2
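
A sketch of both checks, assuming the bundled auto dataset as a stand-in, with foreign playing the role of the subgroup variable; res is a name introduced for the example.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear
regress price weight mpg foreign

* Residuals versus fitted values, and versus a predictor: look for funnel shapes
rvfplot, yline(0)
rvpplot weight, yline(0)

* Sample size and residual variance by subgroup
predict res, residuals
tabstat res, by(foreign) statistics(n variance) nototal
```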
RVF plot to diagnose non-constant variance
[Figure: residuals versus fitted values]
Solution: transform outcome
• Choice of transformation depends on the outcome:
– variance ∝ mean
– SD ∝ mean
– proportions
– correlations
Comparing N, residual variance by subgroup
. tabstat resid, by(physact) stat(n var) nototal

         physact |         N  variance
-----------------+--------------------
much less active |        26  1198.729
somewhat less ac |        46  746.4037
 about as active |        87  990.6615
somewhat more ac |        85   527.047
much more active |        32  124.3417
--------------------------------------

. tabstat resid, by(diabetes) stat(n var) nototal

diabetes |         N  variance
---------+--------------------
      no |       196   100.288
     yes |        80  2244.603
------------------------------
[Table: models by outcome type (continuous; successes in n trials; clustered successes; counts), with a footnote ∗ for over-dispersed outcomes. See Table 8.8, VGSM]
Checking constant variance: summary
• Diagnostics: funnel shapes in RVP plot, variable Ns, SDs
across subgroups
• Solutions: transform outcome, use robust SEs or GLM
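
Robust standard errors are the simplest of these to try. A sketch, assuming the bundled auto dataset as a stand-in:

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Model-based standard errors, which assume constant variance
regress price weight mpg foreign

* Same coefficients, with robust (sandwich) standard errors
regress price weight mpg foreign, vce(robust)
```

The coefficient estimates are identical in the two fits; only the standard errors, tests and CIs change.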
Checking the model: high leverage and influential points
• High-leverage:
– ≥ 1 extreme predictor, or anomalous combination
– potential to influence coefficient estimates unduly
• Influential:
– high-leverage plus big impact on coefficients
• Inferences based on a few observations potentially misleading
Simple outlier, high leverage, high influence
[Figure: three scatterplots of y versus x, each showing the fitted line for all data points and the fitted line omitting the point marked X]
– X - low leverage outlier: leverage = 0.04, dfbeta = -0.25
– X - high leverage point: leverage = 0.52, dfbeta = -.61
– X - high leverage outlier: leverage = 0.52, dfbeta = -2.09
Diagnostics: boxplots of dfbeta statistics
• dfbeta statistics measure changes in each βj when each data
point is omitted
• Defined for each observation and predictor in model
• Check for outliers in boxplots of dfbetas
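
A sketch of the boxplot check, assuming the bundled auto dataset as a stand-in; the dfb_ variable names are introduced for the example.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear
regress price weight mpg foreign

* One dfbeta variable per predictor: the standardized change in that coefficient
* when each observation is omitted
predict dfb_weight, dfbeta(weight)
predict dfb_mpg, dfbeta(mpg)
predict dfb_foreign, dfbeta(foreign)

* Side-by-side boxplots: points far outside the whiskers deserve a closer look
graph box dfb_weight dfb_mpg dfb_foreign
```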
Boxplots of dfbetas for BMI - LDL model
[Figure: boxplots of DFbmi, DFnonwhite, DFdrinkany, DFage10, DFsmoking]
Solution
• Identify up to 10 observations with biggest DFbetas
• Check for data errors or other anomaly
• Refit model without influential points, re-assess conclusions,
report sensitivities
• Consider deleting influential points if they represent a different population
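
The steps above might look as follows in Stata. This is a sketch assuming the bundled auto dataset as a stand-in; the 0.2 cutoff is only illustrative, and dfb_weight and absdfb are names introduced for the example.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear
regress price weight mpg foreign
predict dfb_weight, dfbeta(weight)

* List the 10 observations with the largest dfbetas in absolute value
generate absdfb = abs(dfb_weight)
gsort -absdfb
list make price weight dfb_weight in 1/10

* Sensitivity analysis: refit without the points above the cutoff
regress price weight mpg foreign if absdfb <= 0.2
```

Comparing the two sets of coefficients and P-values side by side gives the sensitivity analysis to report.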
Sensitivity of LDL model to 4 influential points with dfbetas > 0.2 in absolute value

Predictor        All observations       Omitting 4 points
variable          β̂       P-value       β̂       P-value
BMI              0.36      0.007        0.34      0.010
Age             –1.89      0.090       –1.86      0.090
Nonwhite         5.22      0.025        4.19      0.066
Smoking          4.75      0.032        3.78      0.072
Alcohol Use     –2.72      0.069       –2.64      0.072
Checking influential points: summary
• Diagnostics: boxplots of dfbetas
Checking the model: covariate overlap
• Observational analysis of binary exposure problematic if exposed and unexposed groups are too dissimilar
• Lack of overlap makes true model hard to find, especially in small datasets
• Comparing each covariate in exposed and unexposed may not be enough, because covariates are correlated:
– some combinations of covariates may be unrepresented in one group
Lack of age overlap in model for effect of treatment on Beck Depression Inventory score
[Figure: change in BDI score versus age, showing the true model for BDI change in treated and the true model for BDI change in controls]
No power to detect interaction
. regress del_bdi i.treatment##c.age
      Source |       SS           df       MS      Number of obs
-------------+----------------------------------   F(3, 27)
       Model |  46.3692007         3  15.4564002   Prob > F
    Residual |  27.0583639        27  1.00216163   R-squared
-------------+----------------------------------   Adj R-squared
       Total |  73.4275647        30  2.44758549   Root MSE

Diagnosing lack of overlap
• Compare mean, quartiles, range of covariates in exposed and
unexposed
• Use propensity scores
– fit logistic model for primary predictor
∗ include a minimally sufficient adjustment set (MSAS) for the exposure-outcome relationship
∗ capture non-linearities and interactions
– get fitted values (on linear predictor or probability scale)
– plot the results by primary predictor and check overlap
Propensity score model for statin use
. * logistic model for statin use
. quietly logistic statins agesp* i.raceth i.educ_cat ///
>         i.smoking##i.lessactive diabetes
. * calculate logit propensity score
. predict logit_ps, xb
. * density plots of logit scores in statin users and non-users
. twoway (kdensity logit_ps if statins==1, area(1) lpattern(solid)) ///
>        (kdensity logit_ps if statins==0, area(1) lpattern(longdash)), ///
>        ytitle("Density") xtitle("Logit Propensity Score") ///
>        legend(order(1 "Treated" 2 "Untreated")) ///
>        saving(pscores, replace)
Overlap diagnostics for statin use
[Figure: density of logit propensity score in treated and untreated]
Solution: lack of overlap
• Restrict inference to region of good overlap
• Match on prognostic covariates or propensity scores
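
One simple way to restrict inference is to keep only observations whose logit propensity score lies in the range shared by both groups. The sketch below assumes the bundled auto dataset as a stand-in, with foreign playing the role of the exposure and price the outcome; the common-range rule and all scalar and variable names are assumptions for illustration, not the method used in the slides.

```stata
* Stand-in data: Stata's bundled auto dataset
sysuse auto, clear

* Logit propensity score for the "exposure"
quietly logistic foreign weight mpg
predict logit_ps, xb

* Range of the score in each group
quietly summarize logit_ps if foreign == 1
scalar ps_min_exp = r(min)
scalar ps_max_exp = r(max)
quietly summarize logit_ps if foreign == 0
scalar ps_min_unexp = r(min)
scalar ps_max_unexp = r(max)

* Flag observations inside the range common to both groups
generate byte overlap = inrange(logit_ps, ///
    max(ps_min_exp, ps_min_unexp), min(ps_max_exp, ps_max_unexp))
tabulate foreign overlap

* Estimate the exposure effect within the region of overlap only
regress price i.foreign weight mpg if overlap
```

Conclusions from the restricted model apply only to the kind of observations retained, so the tabulate step is worth reporting alongside the estimate.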
Restricting inference to region of overlap
[Figure: change in Beck Depression Inventory score versus age, with the inference region marked]
Checking overlap: summary
• Diagnostics: compare covariates, density plots of logit-propensity
scores in exposed, unexposed
• Solutions: restrict inference to region of good overlap, possibly by matching
Model checking: to transform or not
• Transformations can help meet assumptions
– but make results harder to interpret
• If violations mild, results robust, reasonable not to transform
• If conclusions change substantially after transformation
– model that meets assumptions better is more reliable
Model checking: summary
• Non-linearity:
– Diagnostics: curved Lowess smooth in CPR or RVP plot
– Solutions: transform predictor, including splines
• Non-normality:
– Diagnostics: curvature in QQ-plot
– Solutions: transform outcome, use bootstrap CIs, GLM
or ordinal model
• Non-constant variance:
– Diagnostics: funnel shapes in RVP plot, SDs differ across
unequal size subgroups
– Solutions: transform outcome, use GLM, robust SEs
• Influential points:
– Diagnostics: boxplots of dfbeta statistics
– Solutions: identify up to 10 influential points, correct data
errors, omit influential points if justifiable, present sensitivity analysis