Psychology 253, Section 4: Regularized Regression, Lecture 1

• Regularized (or penalized, or ridge) regression can be done in R with the functions lm.ridge {MASS} and glmpath {glmpath}.

• We examine Alan Gordon’s EEG data (Thanks, Alan, for this gift!) and the {glmpath} data set, ‘heart.data’, both with binary DVs. Hence we use the generalized linear model for logistic regression. But before tackling yet another regression package, let us take stock.


R packages for regression analyses — Ewart Thomas (May 2012)
1. lm {stats}: Quantitative dependent variable (DV), fixed effects OLS model. This workhorse function, the General Linear Model, in the base package accepts quantitative or categorical predictors.
2. glm {stats}: Quantitative DV (including binary DV), fixed effects logistic regression model. This Generalized Linear Model in the base package accepts quantitative or categorical predictors, as well as link functions other than the logistic.
3. lmer {lme4}: Quantitative DV (including binary DV), mixed effects model, with a choice of link functions. This is an expansion of lm() to handle random effects.

4. mlogit {mlogit}: Categorical DV (or DV is a ranking of k options), multinomial logistic regression model, with alternative-specific and/or participant-specific variables, and extensions to random effects.

5. clm/clm2 {ordinal}: Ordinal DV, cumulative link model (CLM). The latent variable distribution can be Logistic, Normal, etc., and the cut-points, or thresholds, on the latent variable scale can be estimated. Random effects can be handled by clmm(), using the same syntax as lmer(), and by the older clmm2(), using the same syntax as lme().
6. lrm {rms}: Ordinal DV, logistic regression models (LRM) in the package, Regression Modeling Strategies. This gives similar output to clm().

7. glmpath {glmpath}: Quantitative DV (including binary DV), fixed effects regression model, with a penalty on the magnitude of the coefficients, e.g., on the L1- and L2-norms. This is a generalized linear model that allows for family = c(‘binomial’, ‘gaussian’, ‘poisson’). Regularization with the L1-norm is known by the meaningful acronym, LASSO (Least Absolute Shrinkage & Selection Operator); that with a mixture of norms is called an elastic net.
8. lm.ridge {MASS}: Quantitative DV, fixed effects regression model, with a penalty on the L2-norm of the coefficients. Unlike glmpath(), it handles only Gaussian DVs (i.e., it is like lm(), not glm()). This is ridge regression, or Tikhonov regularization, and is adequate for many purposes. Now to discuss regularized regression!
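As a quick orientation before the main examples, a minimal lm.ridge() sketch (using the built-in ‘longley’ data set, which is an assumption of this sketch, not part of the lecture's data):

```r
# Ridge regression over a grid of penalties with MASS::lm.ridge();
# each coefficient shrinks toward 0 as lambda grows.
library(MASS)

fit = lm.ridge(Employed ~ GNP + Unemployed + Population, data = longley,
               lambda = seq(0, 10, by = 0.1))
plot(fit)     # one path per coefficient, analogous to glmpath's path plots
select(fit)   # lambda values suggested by HKB, L-W, and generalized CV
```

Note that lm.ridge() returns its own ‘ridgelm’ object rather than an lm fit, so the usual summary() output is not available.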

EEG Data from Alan Gordon, PhD student
On May 7, 2012, at 6:01 PM, Alan Gordon wrote:
Hi Ewart, I'm happy to help. Attached is a dataset I use in penalized logistic regression. The question here is: can we use EEG data from the parietal lobe to predict whether people are correctly identifying a stimulus as previously encountered (hits) or correctly identifying a stimulus as not previously encountered (correct rejections). Each EEG feature is a given time bin of amplitude, for a given channel located on the parietal lobe. So, for instance, one feature might be the mean amplitude of channel 45 from time 100-150ms after stimulus onset. The data is organized as 254 (memory trials) by 523 (features). The 254-length label vector is 1 for hits, -1 for correct rejections. For this subject, I can classify at about 65%, where chance is ~50%. Let me know if you'd like more information. I'm happy to see that regularized regression is being discussed in Psy 253! -Alan

Sparse logistic regression for whole brain classification of fMRI data. Srikanth Ryali, Kaustubh Supekar, Daniel A. Abrams, and Vinod Menon, Stanford University. Neuroimage, 2010 June; 51(2): 752–764. Multivariate pattern recognition methods are increasingly being used to identify multiregional brain activity patterns that collectively discriminate one cognitive condition or experimental group from another, using fMRI data. The performance of these methods is often limited because the number of regions considered in the analysis of fMRI data is large compared to the number of observations (trials or participants). Existing methods that aim to tackle this dimensionality problem are less than optimal because they either over-fit the data or are computationally intractable. Here, we describe a novel method based on logistic regression using a combination of L1 and L2 norm regularization that more accurately estimates discriminative brain regions across multiple conditions or groups. The L1 norm, computed using a fast estimation procedure, ensures a fast, sparse and generalizable solution; the L2 norm ensures that correlated brain regions are included in the resulting solution, a critical aspect of fMRI data analysis often overlooked by existing methods. …

Ill-conditioned problems (Wiki) • Hadamard’s (1865–1963) definition of a well-posed problem: (i) A solution exists. (ii) The solution is unique. And (iii) The solution depends continuously on the data. • If a problem is ill-conditioned (i.e., not well-posed), it needs to be re-formulated for numerical treatment. Typically this involves including additional assumptions, such as simplicity and smoothness of the solution. This process is known as regularization, and Tikhonov regularization, in which the L2 norm is used, is popular for the regularization of linear ill-conditioned problems. [One approach to data mining.]
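Tikhonov regularization has a closed-form solution, b = (X'X + λI)⁻¹ X'y: adding λ to the diagonal of X'X makes the matrix invertible even when the unpenalized problem is ill-conditioned. A minimal sketch with simulated data (nothing here comes from the lecture's data sets):

```r
# Closed-form ridge (Tikhonov) estimate: b = (X'X + lambda*I)^(-1) X'y.
set.seed(1)
n = 50; p = 5
X = matrix(rnorm(n * p), n, p)
y = X %*% c(2, -1, 0, 0, 1) + rnorm(n)

ridge = function(X, y, lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

# lambda = 0 recovers OLS; lambda = 10 shrinks every coefficient toward 0.
round(cbind(ols = ridge(X, y, 0), ridge = ridge(X, y, 10)), 2)
```

The shrinkage is the price paid for a stable, unique solution: some bias is accepted in exchange for much lower variance.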

Overdetermined data sets • Suppose we have many fewer observations, yj, j = 1, 2, …, n, than predictor variables, xij, i = 1, 2, …, p; p >> n. We wish to fit the logistic model,

logit(Pr(yj = 1)) = Σi=1..p bi xij + ej.

• There would be very many solutions, {bi}, and each would yield perfect prediction of the existing data, but poor prediction of new data. • We need to add constraints to the optimization in order to obtain ‘desired’ solutions. These constraints are often on the size (smaller coeffs are preferred), or number (fewer coeffs are preferred, i.e., parsimony) of the bi.
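The point can be seen in a small simulation (all data simulated; a sketch only): with p >> n, an L1 penalty picks out a small subset of features instead of one of the many perfect-fitting solutions.

```r
# With p >> n, unpenalized logistic regression can separate the training
# data perfectly; the L1 penalty in glmpath() selects a sparse subset.
library(glmpath)
set.seed(2)
n = 40; p = 200
x = matrix(rnorm(n * p), n, p)
y = rbinom(n, 1, plogis(x[, 1] - x[, 2]))   # only 2 features truly matter

fit = glmpath(x, y, family = binomial)
best = which.min(fit$aic)                    # AIC-best step along the path
sum(abs(fit$b.predictor[best, -1]) > 1e-8)   # non-zero coeffs: far fewer than p
```

(The component names ‘aic’ and ‘b.predictor’ follow the glmpath object described in ?glmpath; verify against your installed version.)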

An agenda for glmpath() 1. Generate plots of the path of each bi from 0 to its final value, as λ steps from a large value to 0. This is feasible when p ≈ 10, but not when p is ‘large.’ 2. Plot AIC, BIC for the best model at each λ. And/or print a table with AIC and BIC values. Use these to determine the best value of λ. 3. Which coefficients, bi , are non-zero at the best λ? Does this selection of features lead to any insights about the generation of y ? 4. These plots, tables and coefficient sets are produced by glmpath().

An agenda for glmpath() 5. In addition to AIC and BIC, the cross-validation (CV) accuracy of the best model at each λ can be computed, and the ‘best’ value of λ taken to be the value that maximizes accuracy. 6. The CV plots are produced by cv.glmpath(). 7. Bootstrapping can be used to get the distribution of the coefficients in the best model. For each bi, the histogram can be plotted and the mean and sd can be calculated. 8. These can be done with bootstrap.path() and plot.bootpath().
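Items 7–8 might be sketched as follows, using the ‘heart.data’ set that ships with {glmpath} (the exact call signature should be checked against ?bootstrap.path in your installed version):

```r
# Bootstrap the selected-model coefficients and summarize each distribution.
library(glmpath)
data(heart.data)
x = heart.data$x; y = heart.data$y

boot = bootstrap.path(x, y, B = 25, family = binomial)  # B small for speed
plot(boot)                           # histogram of each bootstrapped bi
round(colMeans(boot), 3)             # mean of each coefficient
round(apply(boot, 2, sd), 3)         # sd of each coefficient
```

A larger B (e.g., 200+) would be used in practice; 25 replicates keep the sketch fast.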

An agenda for glmpath() 9. If we have independent information about the p features, can this information explain the sign and magnitude of the bi’s? Here, ‘bi’ might be the coefficient estimated from the original data, or the mean of the bootstrapped distribution. Plots and lm() might be applicable here. 10. The default in glmpath() is to set λ2 = 1e-05 (i.e., 10^-5) as the penalty on the L2-norm. Might other values of λ2 yield even lower AIC or higher CV-accuracy? This can easily be checked.
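The check in item 10 amounts to refitting the path over a grid of λ2 values and comparing the best AIC from each fit. A sketch using the {glmpath} ‘heart.data’ set:

```r
# Agenda item 10: does a different L2 penalty (lambda2) give a lower AIC?
library(glmpath)
data(heart.data)
x = heart.data$x; y = heart.data$y

for (l2 in c(1e-5, 1e-3, 1e-1, 1)) {
  fit = glmpath(x, y, family = binomial, lambda2 = l2)
  cat("lambda2 =", l2, "  min AIC =", round(min(fit$aic), 2), "\n")
}
```

The same loop could track CV-accuracy instead of AIC by calling cv.glmpath() at each λ2.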

glmpath() with Alan’s EEG data • Agenda Items, AI.1-AI.4. In this data set p = 533, so that visualising the path of all the bi’s is not feasible. However, the tables of AIC, BIC, etc. produced by glmpath() suggest that the optimal λ = 0.028, and the optimal model uses about 107 (about 20%) features, with half of the 107 bi’s positive. • AI.5-AI.6. Cross-validation with cv.glmpath() plots the model performance (CV-accuracy) for the ‘best’ model at each value of λ in seq(0, 1, 100). With Alan’s EEG data, accuracy rates of about 72% are obtained, comparing favorably with the 65% reported earlier. • AI.7-AI.9. Each feature is a known electrode, 1:13, at a known time bin, 1:41. So ‘electrode’ and ‘time’ can be used to ‘explain’ bi. • Use smaller data set to fully explore AI.1-AI.11.

library(glmpath)
sink('rglmpath2a.r')
eeg = read.csv('EEGDat.csv', header = FALSE)
eeg = as.matrix(eeg)
lab = read.csv('EEGlabels.csv', header = FALSE)
lab = as.vector(lab[, 1])       # first (only) column as a plain vector
lab = ifelse(lab == 1, 1, 0)    # recode -1 (correct rejections) to 0
rs1 = glmpath(eeg, lab, family = binomial)

cv.eeg1 = cv.glmpath(eeg, lab, family = binomial)                    # deviance
cv.eeg2 = cv.glmpath(eeg, lab, family = binomial, type = 'response') # prediction error

AI.11-AI.12. Comparison with Traditional Stepwise Regression (‘rglmpath1b.r’)
# glmpath exercises: compare with step()
d0 = data.frame(cbind(x, y))
res1 = glm(y ~ ., family = binomial, d0)
print(summary(res1))
res2 = step(res1)
print(summary(res2))

Results (‘rglmpath1b.r’)

• Both the full logistic fit with glm() (AIC = 492.14) and the stepwise function, step() (AIC = 487.69), yield the same ‘best’ set of predictors: tobacco, ldl, famhist, typea, and age. • glmpath() gives a minimum AIC of 489.27 at Step 10. At Step 9, the active set includes 6 predictors, the above set plus sbp; at Step 10, obesity is added. • The results are similar. • AI.5-AI.9. Predictive accuracy needs to be compared, and bootstrapping used to get the distributions of the coefficients.

[Figure: bootstrapped coefficient distributions for the heart.data predictors — sbp, tobacco, ldl, adiposity, famhist, typea, obesity, alcohol, age. One histogram per predictor; x-axis: Bootstrap coefficient, y-axis: Frequency.]

Comments on HW-6

• Compare glmpath() with stepwise regression as an automated selection of relevant features, using the self-ratings from HW-3. • Try to explain coefficient values as a function of the ‘loadings’ of the feature (= self-rating) on, say, PC1 and PC2. • CV-accuracy: how is this computed?
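On the last bullet: CV-accuracy is computed by holding out each fold in turn, fitting on the remaining folds, predicting the held-out labels, and scoring the proportion correct. A by-hand sketch (here ‘x’ and ‘y’ are assumed to be a predictor matrix and 0/1 labels; an unpenalized glm() stands in for the fitted model, so this is illustrative, not how cv.glmpath() is implemented internally):

```r
# k-fold cross-validated classification accuracy, computed by hand.
cv_accuracy = function(x, y, k = 10) {
  fold = sample(rep(1:k, length.out = length(y)))  # random fold assignment
  correct = logical(length(y))
  d = data.frame(x)
  for (f in 1:k) {
    fit = glm(y[fold != f] ~ ., data = d[fold != f, ], family = binomial)
    p = predict(fit, newdata = d[fold == f, ], type = 'response')
    correct[fold == f] = (p > 0.5) == (y[fold == f] == 1)
  }
  mean(correct)   # proportion of held-out trials classified correctly
}
```

For a penalized path, the same loop would refit glmpath() on each training fold and score the held-out fold at each λ, reporting the λ with the highest mean accuracy.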

