
Class notes from a Yale course on machine learning.

Data Mining and Machine Learning
Sarah Constantin, April 16, 2012

1  Lecture 1

Textbook: Hastie's "Elements of Statistical Learning." Grade: 60 percent biweekly homework, 40 percent final project. Class demonstrations are in R; work may be in R or MATLAB.

There are two basic types of problems: classification and regression. Regression relates input variables to a numerical response, trying to predict the response variable from the input variables. Classification is the same thing, except the output variable is discrete. Regression example: predict house price from things like size, school district, number of bedrooms, etc. Classification example: distinguishing spam from ham emails based on the text of an email. We'll start with traditional linear models for classification and regression, and from there try to make linear models more flexible. The other main issue to consider is high-dimensional data with many possible features, for example a grayscale image with many pixels.

Example: autism. One of the characteristics of autism is impaired social interaction. When watching a movie, autistic subjects pay less attention to social scenes than neurotypical subjects do. Eye-tracking equipment records where each subject is looking on the screen. Subjects watched "Who's Afraid of Virginia Woolf?"; each subject has a classification label (autistic or neurotypical), and each frame has a data point indicating where the subjects are looking. Can we use this data to build a binary classifier? Some frames have little discriminatory power, but some frames show a significant difference between autistic and neurotypical subjects, so part of this is a variable selection problem. Additionally, we need to take into account the time ordering of the frames.


1.1  Least Squares and Nearest Neighbors

Toy example: a simple prediction method. Two input variables, $x_1$ and $x_2$, and a class label, red or green. Fit a linear function of the inputs,

$$\hat{Y} = \hat\beta_0 + \sum_j X_j \hat\beta_j,$$

or, in other words (absorbing the intercept into $X$),

$$\hat{Y} = X^T \hat\beta.$$

The residual sum of squares is given by

$$RSS(\beta) = \sum_i (y_i - x_i^T\beta)^2 = (y - X\beta)^T (y - X\beta).$$

This is a measure of the goodness of fit of the linear model. K-nearest-neighbors algorithm: for each query point, rank the distances to all training points and identify the k nearest neighbors; the majority vote among them assigns the classification. For each grid point, calculate the k nearest neighbors in the data set to draw the decision regions. This creates a classification boundary which is not necessarily linear. Note that not every data point has an effect on the classification boundary. It is more of a local method, using local features of the data for the classification rule, whereas the linear method is more of a global method.
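
A minimal R sketch of the two procedures on simulated two-class data; the Gaussian blobs, the 0.5 threshold on the linear fit, and k = 15 are choices made here for illustration, and knn() from the recommended class package stands in for a hand-rolled neighbor search.

library(class)   # for knn()

set.seed(1)
n <- 100
# Two classes, each a Gaussian blob in (x1, x2)
x <- rbind(matrix(rnorm(n * 2, mean = 0),   ncol = 2),
           matrix(rnorm(n * 2, mean = 1.5), ncol = 2))
y <- rep(c(0, 1), each = n)                 # 0 = green, 1 = red
train <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# Linear rule: regress the 0/1 label on x1, x2 and threshold at 0.5
fit <- lm(y ~ x1 + x2, data = train)
yhat_lm <- as.numeric(predict(fit, train) > 0.5)
mean(yhat_lm != y)                          # training error of the linear rule

# k-nearest neighbors: majority vote among the k closest training points
yhat_knn <- knn(train = x, test = x, cl = y, k = 15)
mean(as.character(yhat_knn) != as.character(y))   # training error of 15-NN

The linear rule draws a straight boundary; the 15-NN rule bends around the local structure of the data.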

2  Lecture 2

Simulation example: red points and green points; we assume we don't know the underlying probability distribution. Two input variables, $X_1$ and $X_2$, and classification labels. Can we find a classification rule that predicts the label of a new data point? Two procedures: least squares and k-nearest neighbors. Least squares treats $Y$ as a linear function of $X_1$ and $X_2$: if the prediction is > 0.5, predict red, otherwise predict green. K-nearest-neighbors says that if two data points are close in terms of their input variables, we expect their labels to be similar, so prediction is based on neighboring points: for any data point, look at its k nearest neighbors and give the new point the label of the majority class. The smaller k is, the more seriously the rule takes outliers, so how to choose k is a very important question. As the size of the neighborhood increases, are you using more degrees of freedom or fewer? Fewer: at the extreme where the neighborhood is all the data, there is only one possible prediction. Define N/k, the approximate number of distinct neighborhoods, as the effective degrees of freedom. Comparing the linear method with the k-nearest-neighbor procedure by plotting degrees of freedom against error, there is a minimum-error point for the test data (on training data, of course, more fitting is always better).

2.1  Statistical Decision Theory

Choose $f(x)$ to minimize the expected squared error loss $(Y - f(X))^2$. The expected prediction error is

$$EPE(f) = \int\!\!\int (y - f(x))^2\, p(y|x)\, p(x)\, dy\, dx = \int\Big[\int (y - f(x))^2 p(y|x)\, dy\Big] p(x)\, dx = E_X\, E_{Y|X}\big([Y - f(X)]^2 \mid X\big),$$

so it is sufficient to minimize $E_{Y|X}([Y - f(X)]^2 \mid X)$ pointwise. The least-squares solution is the regression function $f(x) = E(Y \mid X = x)$. K-nearest neighbors approximates this by

$$\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)).$$

For linear regression, $f(x) = X^T\beta$, and the EPE is minimized by $\beta = [E(XX^T)]^{-1}E[XY]$. For classification, k-nearest-neighbors directly approximates the Bayes classifier: the conditional probability at a point is relaxed to the conditional probability within a neighborhood of the point, and the probabilities are estimated by training-sample proportions.

3  Lecture 3

Linear regression: assume

$$y_i = x_i^T\beta + \epsilon_i,$$

where $E[\epsilon_i] = 0$, $Var[\epsilon_i] = \sigma^2$, and $Cov(\epsilon_i, \epsilon_j) = 0$. We minimize the sum of squared errors

$$S(\beta) = \sum_i (y_i - x_i^T\beta)^2 = (y - X\beta)^T(y - X\beta)$$

by setting

$$0 = \frac{dS}{d\beta} = \frac{d}{d\beta}\big(y^Ty - \beta^TX^Ty - y^TX\beta + \beta^TX^TX\beta\big) = -2X^Ty + 2X^TX\beta,$$

which gives the least-squares estimator

$$\hat\beta = (X^TX)^{-1}X^Ty.$$

It is an unbiased estimate, $E[\hat\beta] = \beta$. Indeed,

$$E[\hat\beta] = E[(X^TX)^{-1}X^T(X\beta + \epsilon)] = \beta + E[(X^TX)^{-1}X^T\epsilon] = \beta + E\big[E[(X^TX)^{-1}X^T\epsilon \mid X]\big] = \beta,$$

because $E[\epsilon \mid X] = 0$. Its covariance is

$$Cov(\hat\beta) = (X^TX)^{-1}\sigma^2.$$

The typical estimate of $\sigma^2$,

$$\hat\sigma^2 = \frac{1}{N - p - 1}\sum_i (y_i - \hat y_i)^2,$$

is an unbiased estimator of $\sigma^2$. If we further assume that the errors $\epsilon \sim N(0, \sigma^2 I)$ follow a multivariate normal distribution, then $\hat\beta$ follows a multivariate normal distribution,

$$\hat\beta \sim N\big(\beta, (X^TX)^{-1}\sigma^2\big),$$

so we can do statistical tests on whether $\beta_j = 0$ or not.
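
A small R sketch computing the closed-form estimator and its estimated covariance, then comparing against lm(); the simulated design and true coefficients are arbitrary.

set.seed(2)
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))        # include an intercept column
beta_true <- c(1, 2, -1, 0.5)
y <- drop(X %*% beta_true) + rnorm(n, sd = 1)

# beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Unbiased estimate of sigma^2 and the covariance of beta_hat
res      <- y - drop(X %*% beta_hat)
sigma2   <- sum(res^2) / (n - ncol(X))           # equals N - p - 1 here, since ncol(X) counts the intercept
cov_beta <- sigma2 * solve(t(X) %*% X)
sqrt(diag(cov_beta))                             # standard errors of the coefficients

# Same numbers from lm() (up to naming)
fit <- lm(y ~ X - 1)
cbind(by_hand = drop(beta_hat), lm = coef(fit))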

If we simulate data generated from an underlying distribution, with a randomly generated error term, and look at the coefficient estimate for each sample so generated, the expectation, the center of the sampling distribution, will be the underlying truth.

Gauss-Markov Theorem: the least squares estimate has the smallest variance among all linear unbiased estimates. Since

$$MSE(\hat\theta) = E[(\hat\theta - \theta)^2] = \mathrm{Bias}^2 + \mathrm{Variance},$$

if the bias is 0, minimum MSE means minimum variance. There may, however, exist a biased estimator with smaller MSE: we can trade a little bias for a large reduction in variance.

Consider the inputs $X_j$, $j = 1, \dots, p$. If these are all independent, then in the regression $y \sim x_1 + \dots + x_p$ the effect of each $x_j$ is the same as if you did a simple linear regression on it alone. But in practice the x's very often have some dependence, and then the coefficients obtained from multiple linear regression differ from the simple-regression coefficients. How different? The multiple-regression coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ to $y$ after $x_j$ has been adjusted for all the other x's. When we do a linear regression, the residual is orthogonal to the input space, therefore orthogonal to all the input variables; you can regress on the variables one at a time against such residuals, and this gives the correct coefficients. It is a way of finding the pure effect of a variable. Collinearity can lead to unstable estimates: if the variables are highly correlated, the residual of ($x_j \sim$ the other x's) has small variance, the residual vector is very close to zero, and the coefficient $\hat\beta_j$ is unstable.

Problems with least squares estimates: prediction accuracy. Least squares estimates often have low bias but large variance. Can we select variables (subset selection) so that the estimator is biased but has much lower variance? Best subset regression finds, for each k < p, the subset of size k that gives the smallest residual sum of squares. Unfortunately the number of subsets is $2^p$, so it is hard to search through all of subset space. There are two commonly used search procedures, forward stepwise selection and backward stepwise selection. Forward: start with no variables; among all predictors not in the model, add the one that most improves a variable selection criterion such as AIC or BIC; continue until no new predictor improves the criterion. Backward is the same procedure reversed: start with all the variables, and remove one at a time until you can't improve the variable selection criterion any more. Akaike Information Criterion:

$$AIC(\hat\theta) = -2\log(\text{likelihood}) + 2k,$$

where k is the number of parameters. Bayesian Information Criterion:

$$BIC(\hat\theta) = -2\log(\text{likelihood}) + (\log N)\,k,$$

where N is the sample size. Consistency of model selection: if the true f is among the candidate families of regression functions, the probability of selecting the true model by BIC approaches 1 as $n \to \infty$.
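
A sketch of forward and backward stepwise selection with base R's step(), which searches on AIC by default (k = 2) and on BIC when k = log(n); the simulated data frame with ten predictors is invented for illustration.

set.seed(3)
n <- 100; p <- 10
dat <- as.data.frame(matrix(rnorm(n * p), n, p))
names(dat) <- paste0("x", 1:p)
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)          # only x1 and x2 matter

full <- lm(y ~ ., data = dat)
null <- lm(y ~ 1, data = dat)

# Forward selection under AIC; use k = log(n) instead of 2 for BIC
fwd <- step(null, scope = formula(full), direction = "forward", k = 2, trace = 0)

# Backward elimination starting from the full model
bwd <- step(full, direction = "backward", k = 2, trace = 0)

coef(fwd); coef(bwd)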


4  Lecture 4

Now we look at coefficient shrinkage by ridge regression. This is a more stable estimator: shrink the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,

$$\hat\beta^{\text{ridge}} = \arg\min_\beta \sum_i \Big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\Big)^2 + \lambda\sum_j \beta_j^2.$$

As $\lambda$ increases, you shrink the coefficients harder; this can make the mean squared error smaller than least squares. For the prostate cancer example (Hastie, Chapter 3), the coefficients of some of the input variables fall as $\lambda$ increases, while others grow. The OLS estimate corresponds to $\lambda = 0$; if $\lambda = \infty$, all coefficients are forced to zero. If the OLS estimate of a coefficient is nonzero, then the ridge estimate is also nonzero: basically everything has a nonzero coefficient. If you want to reduce the number of input variables, this is not ideal. The lasso instead penalizes the $\ell_1$ norm,

$$\hat\beta^{\text{lasso}} = \arg\min_\beta \sum_i \Big(y_i - \beta_0 - \sum_j x_{ij}\beta_j\Big)^2 + \lambda\sum_j |\beta_j|.$$

The effect of raising $\lambda$ on the coefficients is quite different: the number of nonzero coefficients gets smaller and smaller. Geometric intuition: the elliptical RSS contours approaching the $\ell_1$ ball tend to hit it at a corner such as (1, 0), while contours approaching the $\ell_2$ ball hit it at a point away from the axes.

Cross-validation: build your model using your training set, evaluate it using your test set. This avoids overfitting, since you can minimize training error without being good at predicting new data. If you have a lot of data, you can split it into different subsets, train a model on one of them and test it on the others. K-fold cross-validation:

$$E_k(\lambda) = \sum_{i \in \text{part } k}\big(y_i - x_i^T\hat\beta^{-k}(\lambda)\big)^2,$$

where $\hat\beta^{-k}(\lambda)$ is fit with the kth part held out, so you have a prediction from each training set for each of several choices of $\lambda$. Ridge regression coefficient paths can show multiple correlated variables converging toward each other. The shrinkage methods standardize the input variables so they have the same standard deviation, so that the coefficients are comparable to each other. Least Angle Regression is a commonly used algorithm for implementing the lasso. Looking at the range of cross-validated MSE and how it changes with the $\ell_1$ norm lets you be more confident about choosing the minimum.
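
A sketch with the glmnet package (an assumption; the notes do not name an implementation) fitting ridge (alpha = 0) and lasso (alpha = 1) with cross-validation over lambda; the sparse simulated coefficients are invented. glmnet standardizes the inputs by default, matching the remark above about comparable coefficients.

library(glmnet)

set.seed(4)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))        # only the first three predictors matter
y <- drop(X %*% beta) + rnorm(n)

ridge_cv <- cv.glmnet(X, y, alpha = 0)      # ridge: L2 penalty
lasso_cv <- cv.glmnet(X, y, alpha = 1)      # lasso: L1 penalty

# Lasso sets most coefficients exactly to zero; ridge only shrinks them
coef(ridge_cv, s = "lambda.min")
coef(lasso_cv, s = "lambda.min")
plot(lasso_cv)                              # cross-validated MSE as a function of log(lambda)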

5  Lecture 5

There are still other regression techniques: group lasso, principal components regression, etc. Fused lasso: suppose the variables $x_1, \dots, x_p$ have some time ordering or spatial structure. The penalty

$$\lambda_1\sum_{t=1}^{p}|\beta_t| + \lambda_2\sum_{t=2}^{p}|\beta_t - \beta_{t-1}|$$

penalizes differences between adjacent coefficients as well as their size.

Today is all about classification problems: you are interested in predicting categories. Linear methods means the decision boundaries are linear, of the form $\alpha_0 + \sum_j\alpha_j x_j = 0$: two regions separated by a hyperplane. The expected prediction error is $E[L(G, \hat G(X))]$, where the loss is 0 if you assign a point to the right category and 1 otherwise; this is called 0-1 loss. The Bayes classifier classifies to the most probable class, using the conditional distribution: $\max_g P(g \mid X = x)$.

Linear regression of an indicator matrix: define K class indicators $Y_k$, with $Y_k = 1$ if $G = k$ and 0 otherwise, so that each can be treated as a numeric value in a regression,

$$\hat\beta_k = (X^TX)^{-1}X^Ty_k, \qquad \hat y_k = X\hat\beta_k.$$

Training data is of the form $(x_i, g_i)$, data point and classification. Compare the fitted values $\hat Y_1, \hat Y_2, \dots$ from the regressions and pick the class with the highest $\hat Y_k$. The actual $\hat Y$ can be negative or above one, because of the nature of the linear regression line; this is one problem with linear regression for classification: you are not guaranteed that your predicted value is in the appropriate range.

An alternative is a generative model: model each class density as a Gaussian (more generally, a mixture of Gaussians with different scales and means),

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Big(-\tfrac12(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\Big).$$

Linear discriminant analysis: with a common covariance $\Sigma$, the linear discriminant functions

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$$
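
A sketch of linear regression on an indicator matrix in R; iris is used only as a convenient built-in three-class example, and the explicit intercept column is a choice made here.

data(iris)
X <- as.matrix(iris[, 1:4])
G <- iris$Species
K <- nlevels(G)

# Build the N x K indicator matrix Y, one column per class
Y <- sapply(levels(G), function(k) as.numeric(G == k))

# One least-squares fit per class: beta_k = (X'X)^{-1} X' y_k
Xd   <- cbind(1, X)                          # add an intercept
Bhat <- solve(t(Xd) %*% Xd, t(Xd) %*% Y)     # (p+1) x K coefficient matrix
Yhat <- Xd %*% Bhat                          # fitted values, one column per class

# Classify each observation to the class with the largest fitted value
pred <- levels(G)[max.col(Yhat)]
mean(pred != as.character(G))                # training error rate

Note the fitted values in Yhat are not constrained to [0, 1], which is exactly the drawback mentioned above.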

6  Lecture 6

Masking problem: with more than two classes, linear regression on the class indicators can completely miss ("mask") a class, for example the middle one of three classes along a line, whose fitted value is never the largest; the simple linear classifier then behaves as if there were fewer classes than there really are. So we need other options, for instance linear discriminant analysis. The linear discriminant functions are

$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac12\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k.$$

Maximizing this is equivalent to maximizing the posterior probability of the data if we model each class density as a multivariate Gaussian,

$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\Big(-\tfrac12(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\Big),$$

with a common covariance matrix $\Sigma_k = \Sigma$. By Bayes' rule,

$$P(G = k, X = x) = P(X = x \mid G = k)\,P(G = k) = f_k(x)\,\pi_k,$$

the class densities are $f_1(x), \dots, f_K(x)$, and the marginal is $P(X = x) = \sum_k P(X = x, G = k)$. To maximize the posterior, maximize the log-likelihood, $\log f_k(x) + \log\pi_k \sim \delta_k(x)$ modulo constant terms. The classification boundary between classes k and l is where $\delta_k = \delta_l$, which works out to the linear boundary

$$\log\frac{\pi_k}{\pi_l} - \tfrac12(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l) = 0.$$

What are the estimates of the Gaussian parameters? Quite simple:

$$\hat\pi_k = N_k/N,$$

the proportion in the kth class;

$$\hat\mu_k = \sum_{g_i = k} x_i / N_k,$$

the mean of each class; and the pooled covariance

$$\hat\Sigma = \sum_k\sum_{g_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K).$$

Each data point gets equal weight. This is reasonable; it estimates the variance-covariance structure for each Gaussian and assumes they are equal. LDA and linear regression on the indicator are equivalent when $N_1 = N_2$.

Quadratic discriminant analysis is what happens if you don't assume all the $\Sigma_k$ to be equal. The discriminant functions are quadratic,

$$\delta_k(x) = -\tfrac12\log|\Sigma_k| - \tfrac12(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k,$$

so the decision boundary between each pair of classes is a quadratic curve. Where does it come from? Diagonalize $\hat\Sigma_k = V_kD_k^2V_k^T$, where $V_k$ is a p by p orthonormal matrix and $D_k^2$ is a diagonal matrix of non-negative eigenvalues $d_{kl}^2$. Then

$$(x - \hat\mu_k)^T\hat\Sigma_k^{-1}(x - \hat\mu_k) = [D_k^{-1}V_k^T(x - \hat\mu_k)]^T[D_k^{-1}V_k^T(x - \hat\mu_k)]$$

and

$$\log|\hat\Sigma_k| = 2\sum_l\log d_{kl}.$$

What's going on: the within-class variance is $\hat W = \hat\Sigma$, and the between-class variance is $\hat B = \sum_k\hat\pi_k(\hat\mu_k - \hat\mu)(\hat\mu_k - \hat\mu)^T$, which measures how much the classes differ from the center of all the classes. The Fisher method is to spread out the between-class variance as much as possible relative to the within-class variance: for a projection $Z = a^TX$, maximize

$$\frac{a^TBa}{a^TWa}.$$
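
A sketch of LDA and QDA with the MASS package (an assumption; the notes do not tie the method to a package), using iris only as a convenient three-class data set.

library(MASS)

fit_lda <- lda(Species ~ ., data = iris)     # common covariance: linear boundaries
fit_qda <- qda(Species ~ ., data = iris)     # class-specific covariances: quadratic boundaries

pred_lda <- predict(fit_lda, iris)$class
pred_qda <- predict(fit_qda, iris)$class

mean(pred_lda != iris$Species)               # training error, LDA
mean(pred_qda != iris$Species)               # training error, QDA

fit_lda$prior                                # estimated priors pi_k = N_k / N
fit_lda$means                                # estimated class means mu_k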

7  Lecture 7

One approach to compromise between linear and quadratic discriminant analysis is to allow each class covariance to be a weighted sum of the pooled covariance and the individual one. At one extreme it is QDA, with a different covariance for each class, and at the other extreme it is LDA, pooled across all the classes. The tuning parameter can be decided by the data.

Dimensionality-reduction perspective: project the data onto a sequence of directions so that the classes are as well separated as possible. Consider two-dimensional data with two classes, concentrated around two overlapping ellipses. How can we project the data onto a one-dimensional direction so we can separate the classes as well as possible? Finding this direction is a generalized eigendecomposition problem; both the individual and the pooled variance-covariance structure play a role. There are two goals: separate the classes, and be orthogonal to (uncorrelated with) the previous directions.

Logistic regression is a generalized linear model. It is still a linear model, in the sense that you are modeling a linear function of your input variables, but the generalization comes from the fact that you now have a classification problem, so you model a transformation,

$$\log\frac{p_i}{1 - p_i} = \beta_0 + \beta^Tx_i.$$

The 0-1 response $y_i$ is generated from a Bernoulli distribution with probability $p_i$; the link function links the underlying parameter to the 0-1 response. The generalization to K classes is

$$\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^Tx.$$

The coefficients are calculated by maximum likelihood. The likelihood is $L(\beta) = \prod_i p_{g_i}(x_i, \beta)$ with

$$P(Y = y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i},$$

so the log-likelihood for the two-class case is

$$l(\beta) = \sum_i\log p_{g_i}(x_i, \beta) = \sum_i\big[y_i\log p(x_i, \beta) + (1 - y_i)\log(1 - p(x_i, \beta))\big].$$

Maximizing this usually can't be done explicitly; you have to use Newton's method to find the roots of the score equations. The $p_i$ should get close to 1/2 near the classification boundary, and closer to 0 or 1 far away from it, so the weights put more emphasis on the more difficult cases: $p(1 - p)$ is large when p is close to 1/2 and small when p is close to 0 or 1.

Logistic regression or LDA? Under LDA the log-posterior odds between class k and class K are linear functions of x,

$$\log\frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \log\frac{\pi_k}{\pi_K} - \tfrac12(\mu_k + \mu_K)^T\Sigma^{-1}(\mu_k - \mu_K) + x^T\Sigma^{-1}(\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^Tx.$$

This linearity comes from the Gaussian assumption and the assumption of a common covariance matrix.
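
A sketch of two-class logistic regression with base R's glm(), whose fitting routine is the iteratively reweighted least squares / Newton-type procedure described above; the simulated data and coefficients are arbitrary.

set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
p  <- 1 / (1 + exp(-(0.5 + 2 * x1 - x2)))    # true P(Y = 1 | x)
y  <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x1 + x2, family = binomial)   # maximum likelihood via IRLS
summary(fit)$coefficients                    # MLEs, standard errors, Wald tests of beta_j = 0

# Predicted probabilities and the 0.5-threshold classification rule
phat <- predict(fit, type = "response")
mean((phat > 0.5) != y)                      # training error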

8  Lecture 9

(Missed lecture 8.) Consider a univariate input variable: there is an unknown true relationship between y and x; add some noise and the points are the observed data. We want to fit a nonlinear trend to the data. How can we do this systematically? What basis functions should we use? Piecewise constant; piecewise linear; continuous piecewise linear, where the "knots" are the points at which continuity is enforced. (You can also force the first and second derivatives to be continuous at the knots for more smoothness.) Piecewise cubic polynomials can be discontinuous, continuous, have a continuous first derivative, or a continuous second derivative. This is the idea of the so-called regression spline: in each local region, to which degree do you want to fit a polynomial, and where are the knots? Then you have a set of basis functions and you can just treat the problem as a linear regression problem.

Another method is smoothing splines. This avoids the knot selection problem by using a maximal set of knots and minimizing

$$RSS(f, \lambda) = \sum_i(y_i - f(x_i))^2 + \lambda\int f''(t)^2\,dt,$$

which penalizes curvature, or wiggling; we assume f has continuous second derivatives. If $\lambda = 0$, f can be any function (no penalty on wiggling); if $\lambda = \infty$, no curvature is tolerated and we get the least-squares line. The solution has the form

$$f(x) = \sum_{j=1}^{N} N_j(x)\,\theta_j,$$

where $N_1(x), \dots, N_N(x)$ are basis functions for a natural spline basis with knots at the data points. For a specific choice of knots you fit up to a third-degree polynomial, but beyond the boundary knots you are only allowed to fit a linear rather than a cubic function: the data become sparse there and you don't want to overfit. The criterion reduces to

$$RSS(\theta, \lambda) = (y - N\theta)^T(y - N\theta) + \lambda\theta^T\Omega_N\theta, \qquad (\Omega_N)_{jk} = \int N_j''(t)N_k''(t)\,dt.$$

Compare with ridge regression, $(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$: here, instead of the identity, we use the penalty matrix $\Omega_N$. The effective degrees of freedom are the trace of $S_\lambda$, where

$$\hat f = S_\lambda y, \qquad S_\lambda = N(N^TN + \lambda\Omega_N)^{-1}N^T,$$

analogous to ridge regression. Recall $y = f(x) + \epsilon$. The expected prediction error combines bias and variance:

$$EPE(\hat f_\lambda) = E_{X,Y}E_T\big(Y - \hat f_\lambda(X)\big)^2 = E_{X,Y}E_T\big(Y - f(X) + f(X) - E_T\hat f_\lambda(X) + E_T\hat f_\lambda(X) - \hat f_\lambda(X)\big)^2$$
$$= E_{X,Y}(Y - f(X))^2 + E_{X,Y}E_T\big[(f(X) - E_T\hat f_\lambda(X))^2 + (E_T\hat f_\lambda(X) - \hat f_\lambda(X))^2\big],$$

irreducible error plus bias squared plus variance.
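
A sketch with base R's smooth.spline(), which fits exactly this penalized criterion; the sine-plus-noise data are invented.

set.seed(8)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)

# Smoothing parameter chosen automatically (generalized cross-validation by default)
fit_auto <- smooth.spline(x, y)
fit_auto$df                          # effective degrees of freedom, trace(S_lambda)

# Or fix the effective degrees of freedom directly
fit_df5 <- smooth.spline(x, y, df = 5)

plot(x, y, col = "grey")
lines(predict(fit_auto, x), lwd = 2)
lines(predict(fit_df5, x), lty = 2)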

9  Lecture 10

Nonparametric logistic regression: smoothing splines for classification. Consider two-class logistic regression with a single input X,

$$\log\frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = f(x),$$

which implies that the probability that Y = 1 given X = x is

$$p(x) = \frac{e^{f(x)}}{1 + e^{f(x)}}.$$

Fit by maximizing the log-likelihood criterion penalized with curvature,

$$\sum_i\big[y_i\log p(x_i) + (1 - y_i)\log(1 - p(x_i))\big] - \tfrac12\lambda\int f''(t)^2\,dt = \sum_i\big[y_if(x_i) - \log(1 + e^{f(x_i)})\big] - \tfrac12\lambda\int f''(t)^2\,dt.$$

The optimal f is a finite-dimensional natural spline with knots at the values of the $x_i$.

Suppose you have more than one input and want to model $Y \sim f(x_1, x_2)$. One option is a nonlinear basis expansion of $x_1$ and $x_2$ separately, plus interaction terms between $x_1$ and $x_2$; this uses the tensor product of the two sets of basis functions: if $h_j(x_1)$ are basis functions for $x_1$ and $g_l(x_2)$ are basis functions for $x_2$, the tensor product is the set of all possible products of h's and g's. The additive model without an interaction term is $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$. More generally, the generalized additive model is

$$E(Y \mid X_1, X_2, \dots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p),$$

where the $f_j$ are smooth functions, each fit with, say, a cubic smoothing spline. A penalized residual sum of squares can be specified as the criterion to minimize,

$$\sum_i\Big(y_i - \alpha - \sum_j f_j(x_{ij})\Big)^2 + \sum_j\lambda_j\int f_j''(t_j)^2\,dt_j,$$

with a different $\lambda_j$ for each component. The fit uses an iterative (backfitting) approach, fitting the smooth functions one at a time: let $\alpha$ be the mean of y, and for each j in turn smooth the partial residuals

$$y_i - \alpha - \sum_{k \ne j} f_k(x_{ik})$$

against $x_j$. Generalized cross-validation is used to choose each value of $\lambda_j$.
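
A sketch of such an additive fit using the mgcv package (an assumption; the notes do not name an implementation), with two invented smooth functions.

library(mgcv)

set.seed(9)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

# E(Y | X1, X2) = alpha + f1(X1) + f2(X2), each f_j a penalized spline with its
# own smoothing parameter chosen by (generalized) cross-validation
fit <- gam(y ~ s(x1) + s(x2), method = "GCV.Cp")
summary(fit)
plot(fit, pages = 1)          # the two estimated smooth components

# For two-class data, the same idea with a logistic link:
# gam(y ~ s(x1) + s(x2), family = binomial)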

10  Lecture 11

Tree-based methods. So far we extended linear methods to nonlinear models via basis expansion: a nonlinear transformation of the input variables turns the problem back into a linear fitting problem. In the multivariate case the tensor-product space makes the dimensionality of the problem high; the generalized additive model deals with this by ignoring the interaction terms between pairs of variables, so the size of the problem grows linearly instead of exponentially.

An alternative is the tree-based method. In the 2-d case: split the input space into small subregions (rectangles) and fit the simplest possible function, a constant, within each region. This is not a smooth model, but it takes account of interactions, and if Y does not vary very much within each subregion it is not that bad. Identify smaller regions where you notice different values of Y within a region: divide the input space into smaller and smaller regions so that each is homogeneous in the response; homogeneity of the subregions is the criterion for the partition. A binary tree: the first splitting point is, say, $X_1 \le t_1$, giving two subsets, and so on.

Growing a regression tree: look at the marginal distributions of all input variables (note spikes, which suggest things about how the data were collected). For some pairs of variables a linear relationship is not enough; the curve looks like the slope is changing. How do we choose the splitting value? Fit one constant in one subset and another constant in the other, and search over all splitting values (the sorted values of the variable) for the one minimizing the residual sum of squares

$$\sum_{i \in R_1}(y_i - \hat c_1)^2 + \sum_{i \in R_2}(y_i - \hat c_2)^2.$$

With fewer data points in a subset it is easier to fit a constant. If the reduction in error from further splitting is small you could stop, but that rule only finds local minima, so instead grow a large tree and prune it. Tree pruning: take a subtree and collapse some of its internal nodes together.

11  Lecture 12

This is from Chapter 8 of Hastie. Divide the data into 200 bootstrap samples and fit a classification tree to each of them; if the trees are highly variable, then the model has high variance. In the example, the data have two classes and 5 features, each Gaussian distributed with pairwise correlation 0.95, and the response Y depends only on the first input.

Bootstrap estimation: given training data $(x_i, y_i)$, give the observed values equal probability. A bootstrap draw $(x^*, y^*)$ comes from the empirical distribution putting probability 1/N on each of the training points. A bootstrapped sample of the original data draws the same number of data points, but with replacement, so you may see some repeated points. From the bootstrapped samples you can repeat the parameter estimation and see how variable your parameter estimate is.

Bagging averages the models fitted to the bootstrap samples,

$$\hat f_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f^{*b}(x),$$

where $\hat f^{*b}$ is the model fitted to bootstrap sample $Z^{*b}$. For K-class classification, you can take the majority vote of the trees, or average the class probabilities of the B trees. This gives each data point a better chance of being repeatedly used in the training model. Why does bagging (averaging multiple trees) work? Assume the ideal case, where the $(x_i, y_i)$ are drawn from the population distribution P, and the ideal aggregation is $f_{\text{ag}}(x) = E_P\hat f^*(x)$, the expectation of a model fitted to a fresh sample. Then $E[(Y - f_{\text{ag}}(x))^2] \le E[(Y - \hat f^*(x))^2]$. Why? A bias-variance decomposition:

$$E_P\big(Y - \hat f^*(x)\big)^2 = E_P\big(Y - f_{\text{ag}}(x) + f_{\text{ag}}(x) - \hat f^*(x)\big)^2 = \big(Y - f_{\text{ag}}(x)\big)^2 + E_P\big(f_{\text{ag}}(x) - \hat f^*(x)\big)^2 \ge \big(Y - f_{\text{ag}}(x)\big)^2,$$

since the cross term vanishes ($E_P\hat f^*(x) = f_{\text{ag}}(x)$).
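
A hand-rolled sketch of bagging regression trees in R with rpart as the base learner; the simulated data and B = 200 are arbitrary choices.

library(rpart)

set.seed(11)
n   <- 400
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- ifelse(dat$x1 < 0.5, 1, 3) + rnorm(n, sd = 0.5)

B <- 200
boot_fits <- vector("list", B)
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)                  # bootstrap sample: n points drawn with replacement
  boot_fits[[b]] <- rpart(y ~ x1 + x2, data = dat[idx, ])
}

# Bagged prediction: average the B bootstrap-fitted trees
pred_matrix <- sapply(boot_fits, predict, newdata = dat)   # n x B matrix of tree predictions
f_bag <- rowMeans(pred_matrix)                             # f_bag(x) = (1/B) sum_b f*b(x)
head(f_bag)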

12  Lecture 13

Bagging, or bootstrap aggregation, is best for unstable methods; the tree method is an example of an unstable method that responds very well to bagging. Construct a tree for each bootstrap sample and base the prediction on the average of the trees. For classification there are two ways of combining the trees: majority vote, or averaging the classification probabilities.

Random forests are an improved version of the bagging idea: reduce the correlation between the trees grown on different bootstrap samples. Averaging reduces variance for independent identically distributed random variables: if $Var(x_i) = \sigma^2$, then

$$Var\Big(\frac{1}{B}\sum_{i=1}^{B}x_i\Big) = \sigma^2/B.$$

But the different trees are drawn from the same distribution, so they have some positive correlation, say $cor(x_i, x_j) = \rho$, and we can't use the independence assumption:

$$Var\Big(\frac{1}{B}\sum_{i=1}^{B}x_i\Big) = \frac{1}{B^2}\Big(\sum_i Var(x_i) + \sum_{i \ne j} cov(x_i, x_j)\Big) = \frac{1}{B^2}\big(B\sigma^2 + B(B - 1)\rho\sigma^2\big) = \sigma^2\Big(\frac{1}{B} + \big(1 - \tfrac{1}{B}\big)\rho\Big) = \rho\sigma^2 + \frac{1 - \rho}{B}\sigma^2.$$

Random forest algorithm (see Hastie, Ch. 15): for each of B bootstrap samples drawn from the training data, grow a random-forest tree by recursively repeating the following steps at each terminal node: 1. select m variables at random from the p variables; 2. pick the best variable/split-point among those m predictors; 3. split the node into two daughter nodes. The smaller m is, the lower the chance that any given split picks the overall best variable, but the tradeoff is that smaller m makes the trees less correlated with one another; you have to tune this parameter. In a bootstrap sample, some observations are repeated and some are left out. What does a random forest do with the left-out (out-of-bag, OOB) observations? You evaluate the prediction accuracy on (x, y) based on the predictions from those trees in which (x, y) didn't show up, so you get an out-of-sample performance evaluation from that data point automatically. The OOB error drops sharply after only a fairly small number of trees. Variable importance measures the prediction strength of each variable: record the OOB prediction accuracy, then randomly permute the values of the jth variable in the OOB samples and compute the accuracy again. Permuting takes out the predictive effect of $x_j$, since its relationship with y is broken: if $x_j$ was irrelevant, the permutation won't change the predictions much, but if it is relevant, the prediction accuracy will decrease.
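
A sketch with the randomForest package (an assumption; the notes do not prescribe an implementation), showing the OOB error and the permutation importance just described; the simulated data are invented.

library(randomForest)

set.seed(12)
n <- 300; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- factor(ifelse(X[, 1] + rnorm(n, sd = 0.5) > 0, "A", "B"))   # only x1 matters

# mtry is the number m of variables tried at each split
fit <- randomForest(X, y, ntree = 500, mtry = 2, importance = TRUE)

fit$err.rate[500, "OOB"]     # out-of-bag error estimate after 500 trees
importance(fit)              # permutation-based variable importance
varImpPlot(fit)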

13  Lecture 14

The idea of boosting is to build a classifier

$$G(x) = \mathrm{sign}\Big[\sum_{m=1}^{M}\alpha_mG_m(x)\Big].$$

This is a weighted sum rather than a plain average. Also, the $G_m$ are not fit to bootstrap samples; each is fit to a reweighted version of the training data, reweighted in some intelligent way. The samples are not independent, as in bootstrapping; here, each depends on the performance of the previous classifier, so there is a sequential order. AdaBoost defines how to generate the weights. For m = 1, ..., M: fit the classifier $G_m$ to the training data using weights $w_i$; compute the weighted error rate

$$\mathrm{err}_m = \frac{\sum_i w_iI(y_i \ne G_m(x_i))}{\sum_i w_i}$$

(how many mistakes did you make, weighted by the $w_i$); compute

$$\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big),$$

the logit of the error rate; and set

$$w_i \leftarrow w_i\exp\big(\alpha_mI(y_i \ne G_m(x_i))\big),$$

which upweights the misclassified points. Then output the weighted sum of classifiers

$$G(x) = \mathrm{sign}\Big[\sum_{m=1}^{M}\alpha_mG_m(x)\Big].$$

Boosting fits an additive model,

$$f(x) = \sum_m\beta_mb(x;\gamma_m).$$

Tree-based methods are examples of this: the basis functions are step functions, and $\gamma$ parametrizes the split variables and split points. Such models are fit by minimizing an average loss over the training data,

$$\min_{\{\beta_m,\gamma_m\}}\sum_iL\Big(y_i,\sum_m\beta_mb(x_i;\gamma_m)\Big).$$

Forward stagewise additive modeling computes the best $\beta$ and $\gamma$ one term at a time,

$$(\beta_m,\gamma_m) = \arg\min_{\beta,\gamma}\sum_iL\big(y_i, f_{m-1}(x_i) + \beta b(x_i;\gamma)\big),$$

and sets $f_m(x) = f_{m-1}(x) + \beta_mb(x;\gamma_m)$. AdaBoost is an example of this: it is equivalent to forward stagewise additive modeling using the exponential loss function $L(y, f(x)) = e^{-yf(x)}$, though other monotone decreasing functions of the margin are possible. With this loss,

$$(\beta_m, G_m) = \arg\min_{\beta,G}\sum_iw_i^{(m)}\exp\big(-y_i\beta G(x_i)\big).$$

If you fix $\beta$ and optimize with respect to G, you are minimizing the weighted classification error,

$$G_m = \arg\min_G\sum_iw_i^{(m)}I(y_i \ne G(x_i)).$$

Plugging this $G_m$ in and solving for $\beta$, one obtains

$$\beta_m = \tfrac12\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}.$$

The exponential loss is more sensitive to changes in the estimated class probabilities than the 0-1 loss; the misclassification error rate will suggest you stop sooner than the exponential loss.
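
A hand-rolled sketch of AdaBoost.M1 in R using stumps (depth-1 rpart trees) as the base classifiers; the circular simulated boundary, M = 100 rounds, and the rpart settings are choices made here, not the notes'.

library(rpart)

set.seed(13)
n  <- 400
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- ifelse(x1^2 + x2^2 > 2, 1, -1)          # a nonlinear class boundary, labels coded -1/+1
dat <- data.frame(x1, x2, y = factor(y))

M <- 100
w <- rep(1 / n, n)                             # initial observation weights
alpha <- numeric(M)
stumps <- vector("list", M)
Fx <- numeric(n)                               # running weighted sum of classifiers

for (m in 1:M) {
  stumps[[m]] <- rpart(y ~ x1 + x2, data = dat, weights = w,
                       control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
  Gm   <- ifelse(predict(stumps[[m]], dat, type = "class") == "1", 1, -1)
  miss <- as.numeric(Gm != y)
  err  <- sum(w * miss) / sum(w)               # weighted error rate err_m
  alpha[m] <- log((1 - err) / err)             # alpha_m, the logit of the error rate
  w <- w * exp(alpha[m] * miss)                # upweight the misclassified points
  Fx <- Fx + alpha[m] * Gm
}

mean(sign(Fx) != y)                            # training error of the boosted classifier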

14  Lecture 15

Generalize the idea of AdaBoost to a model of boosted trees,

$$f_M(x) = \sum_{m=1}^{M}T(x;\Theta_m),$$

a sum of individual trees, each parametrized by $\Theta_m$, which encodes the tree structure: the splitting variables, splitting values, and the constants in the terminal regions. The training loss is

$$\sum_{i=1}^{N}L\big(y_i, f_M(x_i)\big);$$

for example, the loss function can be $(y_i - f_M(x_i))^2$, the regression or L2 loss, or the exponential loss $e^{-y_if_M(x_i)}$. In practice it should be differentiable. Solving

$$\Theta_m = \arg\min_{\Theta}\sum_{i=1}^{N}L\big(y_i, f_{m-1}(x_i) + T(x_i;\Theta)\big)$$

gives you the best choice of the next tree, given all the trees you have so far. To find the fitted value for each subregion, you just optimize the loss function over the points falling in that subregion to choose the optimal constant.

We fit the model by gradient descent: go in the direction of steepest descent, choosing $f_m = -\rho_mg_m$, where $\rho_m$ is a scalar step length and $g_m$ is the gradient of $L(f)$ evaluated at $f = f_{m-1}$, then update the solution, $f_m = f_{m-1} - \rho_mg_m$. Gradient tree boosting (MART): at each step, compute the derivative of the loss function at each point i, giving targets $r_{im}$ (the negative gradients); fit a regression tree to these targets, giving terminal regions $R_{jm}$; choose the optimal coefficients

$$\gamma_{jm} = \arg\min_{\gamma}\sum_{x_i \in R_{jm}}L\big(y_i, f_{m-1}(x_i) + \gamma\big);$$

and update

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m}\gamma_{jm}I(x \in R_{jm}).$$

Shrinkage: shrink the contribution of each tree by a factor v when it is added to the current approximation,

$$f_m(x) = f_{m-1}(x) + v\sum_j\gamma_{jm}I(x \in R_{jm}).$$

This is analogous to penalized least squares (like ridge regression or the lasso). J, the tree size, is a meta-parameter which determines how much interaction the model can capture and how much overfitting happens.
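
A hand-rolled sketch of gradient boosting for squared-error loss with shrinkage, using small rpart trees as the base learners; for L2 loss the negative gradient is just the residual, so the leaf means of the fitted tree already give the optimal constants. The data, M = 200, v = 0.1, and maxdepth = 2 are arbitrary choices.

library(rpart)

set.seed(14)
n <- 500
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

M <- 200            # number of trees
v <- 0.1            # shrinkage factor
f <- rep(mean(y), n)                     # initialize with the constant fit
trees <- vector("list", M)

for (m in 1:M) {
  r <- y - f                             # negative gradient of (1/2)(y - f)^2 is the residual
  trees[[m]] <- rpart(r ~ x, data = data.frame(x = x, r = r),
                      control = rpart.control(maxdepth = 2))
  f <- f + v * predict(trees[[m]], data.frame(x = x))   # shrunken update f_m = f_{m-1} + v * tree
}

mean((y - f)^2)      # training MSE after M boosting steps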

15  Lecture 16

Non-parametric, unsupervised learning techniques: we have no labeled training data. PCA: a linear approximation of the data which captures as much of the variation in the data as possible, a projection onto a smaller number of dimensions. It is useful in high-dimensional situations, and also for visualization and compression. The first linear component is

$$z_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1p}x_p,$$

where the sample variance of the projection $z_1$ is greatest among all such linear combinations with $\|a_1\| = 1$. The second linear component is

$$z_2 = a_{21}x_1 + a_{22}x_2 + \dots + a_{2p}x_p$$

such that $a_2^Ta_1 = 0$ (orthogonal to the first projection), $\|a_2\| = 1$, and the variance is maximized. In general, the jth principal component $z_j = Xa_j$ is the linear combination with the greatest variance subject to $\|a_j\| = 1$ and $a_j$ orthogonal to all previous components. Note that

$$\mathrm{var}(z_1) = \mathrm{var}(Xa_1) = a_1^TSa_1, \qquad S = \frac{1}{N - 1}X^TX$$

for centered X, the sample variance-covariance matrix, and $\|a_1\|^2 = \sum_la_{1l}^2$. Your optimization problem, maximizing the variance of the linear combination subject to unit norm, can be solved as an eigendecomposition problem.

Intuitively: you have a data cloud; the direction of greatest eccentricity, the greatest radius, is the first principal direction. There is a theorem that any symmetric matrix has a decomposition $A = \Gamma\Lambda\Gamma^T$, where $\Lambda$ is diagonal and $\Gamma$ is orthogonal. Why is PCA optimal? Consider a rank-q linear model for representing the observations, $x_i = \mu + V_q\lambda_i + \epsilon_i$. Fitting this by least squares means minimizing the reconstruction error $\sum_i\|x_i - \mu - V_q\lambda_i\|^2$, or

$$\sum_i\|(x_i - \bar x) - V_qV_q^T(x_i - \bar x)\|^2$$

after plugging in the optimal $\mu$ and $\lambda_i$; if we assume $\bar x = 0$, the fitted representation is the projection $V_qV_q^Tx_i$. The singular value decomposition $X = UDV^T$ gives an optimal choice for $V_q$ (the first q columns of V).
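
A sketch computing principal components two ways in R: by eigendecomposition of the sample covariance matrix, and with prcomp(), which uses the SVD of the centered data matrix. The correlated simulated columns are invented.

set.seed(15)
n  <- 200
x1 <- rnorm(n)
X  <- cbind(x1 = x1, x2 = 0.8 * x1 + rnorm(n, sd = 0.4), x3 = rnorm(n))
Xc <- scale(X, center = TRUE, scale = FALSE)        # center the data

S   <- cov(Xc)                                      # sample covariance, X'X / (N - 1)
eig <- eigen(S)
eig$vectors[, 1]                                    # a_1: the first principal direction
eig$values                                          # variances of the components

# prcomp gives the same answer (up to sign)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
pc$rotation[, 1]                                    # matches eig$vectors[, 1] up to sign
head(pc$x[, 1])                                     # scores z_1 = X a_1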

16  Lecture 17

PCA is a linear method: it projects onto eigenvectors, which are linear combinations of the original basis vectors. Kernel PCA maps the input domain to a feature space, $\Phi: \mathcal{X} \to H$; this transformation is generally nonlinear, and we look at the data in the new feature space. Consider the covariance matrix in feature space, $S = \frac{1}{n}\sum_{j=1}^{n}\Phi(x_j)\Phi(x_j)^T$, and find its eigendecomposition. Any eigenvector v of S with eigenvalue $\lambda$ satisfies $\lambda\langle\Phi(x_i), v\rangle = \langle\Phi(x_i), Sv\rangle$ for all i, and it can be written as

$$v = \sum_ia_i\Phi(x_i),$$

where the $a_i$ are unknown. Putting these together, with $K_{ij} = \langle\Phi(x_i), \Phi(x_j)\rangle$,

$$n\lambda Ka = K^2a.$$

This is the kernel trick: formulate PCA as an eigendecomposition of the kernel matrix. Center K by subtracting the column means and the row means (and adding back the grand mean). Then compute the projections onto the eigenvectors,

$$\langle v^j, \Phi(x)\rangle = \sum_i\alpha_i^jk(x_i, x),$$

where the $\alpha^j$ are the eigenvector expansion coefficients. Examples of positive definite kernels: the linear kernel $K(x, x') = x^Tx'$; polynomial kernels $K(x, x') = (c + x^Tx')^d$; Gaussian kernels $K(x, x') = e^{-\|x - x'\|^2/2\sigma^2}$. By Mercer's theorem, a positive definite kernel corresponds to a feature map

$$\Phi(x) = \big(\sqrt{\lambda_1}\Phi_1(x), \sqrt{\lambda_2}\Phi_2(x), \dots\big).$$

17  Lecture 18

Sparse PCA: formulate PCA as a regression-type optimization problem, impose the lasso constraint, and solve the penalized optimization problem. The lasso penalty, $\|Y - X\beta\|^2 + \lambda\|\beta\|_1$, imposes sparseness; the ridge penalty is $\|Y - X\beta\|^2 + \lambda\|\beta\|^2$. If

$$(\hat\alpha, \hat\beta) = \arg\min_{\alpha,\beta}\sum_i\|x_i - \alpha\beta^Tx_i\|^2 + \lambda\|\beta\|^2$$

with $\|\alpha\|^2 = 1$, then $\hat\beta$ is proportional to $V_1$, the first principal component direction; this is the reconstruction formulation of PCA, with some slackness allowed between $\alpha$ and $\beta$. For the first k principal components, let $A_{p\times k} = [\alpha_1, \dots, \alpha_k]$ and $B_{p\times k} = [\beta_1, \dots, \beta_k]$ and take

$$(\hat A, \hat B) = \arg\min_{A,B}\sum_i\|x_i - AB^Tx_i\|^2 + \lambda\sum_{j=1}^{k}\|\beta_j\|^2$$

subject to $A^TA = I_{k\times k}$; then $\hat\beta_j$ is proportional to $V_j$. Alternative: the elastic net, which adds an L1 penalty on the $\beta_j$ as well and so gives sparse loadings. If you have variables $x_1, x_2, \dots$ and a subset of them captures most of the variation but is highly collinear, then instead of picking just one of them to include in the model, they can be picked up as a subset.

Why does the ridge penalty not change the directions? Write the SVD $X = UDV^T$, so the principal components are the columns of $XV = UD$. Ridge-regressing the ith principal component $XV_i$ on X gives

$$\hat\beta_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^T(XV_i) = V\Big(\frac{D^2}{D^2 + \lambda I}\Big)V^TV_i = V_i\,\frac{D_i^2}{D_i^2 + \lambda},$$

which is just $V_i$ rescaled by the shrinkage factor $D_i^2/(D_i^2 + \lambda)$.
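
A small numerical check of this shrinkage factor (my own sketch, not from the notes); the simulated X, lambda = 10, and i = 1 are arbitrary.

set.seed(17)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
s <- svd(X)                                    # X = U D V'
i <- 1; lambda <- 10
vi <- s$v[, i]
y  <- X %*% vi                                 # scores of the i-th principal component

beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(ridge   = drop(beta_ridge),
      formula = vi * s$d[i]^2 / (s$d[i]^2 + lambda))   # the two columns agree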

18  Lecture 19

Clustering: partition objects into homogeneous groups, so that objects are more similar within each cluster than between clusters. The obvious measure of dissimilarity is distance: Euclidean distance, or the sum of absolute differences (L1 distance). In practice you might have categorical data, binary data, etc. For categorical variables, use a cost matrix with zeros along the diagonal, where d(A, B) is otherwise a cost value based on your domain knowledge: how bad is that particular wrong match?

K-means algorithm: choose K centroids, one for each class, then iterate the following: (1) assign each object to the cluster with the closest centroid; (2) update the centroid of each cluster to the mean of all objects in the cluster. This minimizes the squared error criterion

$$W(S, c) = \sum_{k=1}^{K}\sum_{i \in S_k}d(i, c_k):$$

the assignment step minimizes it given the choice of centroids c, and the update step minimizes it given the set of clusters S, so the algorithm converges in a finite number of steps. Let

$$T(Y) = \sum_{i=1}^{N}(Y_i - \bar Y)^2$$

be the total variance,

$$W(S, C) = \sum_{k=1}^{K}\sum_{i \in S_k}(Y_i - \bar Y_k)^2$$

the within-cluster variance, and

$$B(S, C) = \sum_{k=1}^{K}N_k(\bar Y_k - \bar Y)^2$$

the between-cluster variance. Then

$$T(Y) = W(S, C) + B(S, C),$$

a decomposition much like the bias-variance tradeoff. The cross term

$$\sum_k\sum_{i \in S_k}(Y_i - \bar Y_k)(\bar Y_k - \bar Y) = \sum_k(\bar Y_k - \bar Y)\sum_{i \in S_k}(Y_i - \bar Y_k)$$

equals 0, because the inner sum is 0. How do we choose K? Plot the within-cluster variance against K and look for a kink, the point beyond which the K-cluster solution barely improves on the (K-1)-cluster solution. Another check is consistency: how much does the cluster assignment change over different random subsets of the data? (A form of cross-validation.)
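
A sketch of K-means in R with the kink ("elbow") plot of within-cluster variance; the three simulated clusters and the range of K are arbitrary.

set.seed(18)
X <- rbind(cbind(rnorm(50, 0), rnorm(50, 0)),
           cbind(rnorm(50, 4), rnorm(50, 4)),
           cbind(rnorm(50, 0), rnorm(50, 8)))

# Total within-cluster sum of squares for K = 1, ..., 8
wss <- sapply(1:8, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")   # look for the kink

fit <- kmeans(X, centers = 3, nstart = 20)
fit$centers                          # the three centroids
table(fit$cluster)                   # cluster sizes
fit$betweenss + fit$tot.withinss     # equals fit$totss: the decomposition T = W + B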

19  Lecture 20

Last time: K-means. The between-cluster sum of squares should be large (how separated the clusters are), and the within-cluster sum of squares should be as small as possible. Model-based clustering assumes there is an underlying probability model, a mixture

$$f(y) = \sum_kp_k\,\phi(y; \mu_k, \Sigma_k),$$

where the $p_k$ are the mixture probabilities. Estimate $p_k$, $\mu_k$, $\Sigma_k$ by maximum likelihood, via an iterative procedure, the EM algorithm. The latent variable is the class membership; if you knew it, the estimates of $\mu$ and $\Sigma$ would be straightforward. In the E-step, estimate the expectation of the latent variables given the current parameters; in the M-step, re-estimate the parameters given those expectations. At each E-step, estimate the posterior probability for each data point,

$$g_{ik} = \frac{p_k\,\phi(y_i; \mu_k, \Sigma_k)}{\sum_lp_l\,\phi(y_i; \mu_l, \Sigma_l)}.$$

At each M-step, given the $g_{ik}$, find the $p_k$, $\mu_k$, $\Sigma_k$ maximizing the log-likelihood:

$$\mu_k = \sum_ig_{ik}y_i/g_k, \qquad \Sigma_k = \sum_ig_{ik}(y_i - \mu_k)(y_i - \mu_k)^T/g_k,$$

where $g_k = \sum_ig_{ik}$. Each data point is assigned to the k for which $g_{ik}$ is maximal.

Connection with K-means: if we assume $\Sigma_k = \sigma^2I$ (diagonal, no correlation between the components of the multivariate data), then maximizing the log-likelihood amounts to minimizing

$$l(\mu_k, \sigma^2, S_k) = \sum_k\sum_{i \in S_k}(y_i - \mu_k)^T(y_i - \mu_k)/2\sigma^2.$$

This is the squared error criterion in the K-means algorithm. In the extreme case, if you observe two well-separated clusters, what would you expect to see if you fit a single regression line? A line that follows the axis between the clusters, even though the relationship within each cluster would be two separate regression lines.
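
A hand-rolled sketch of EM for a two-component univariate Gaussian mixture; the data, starting values, and fixed 100 iterations are arbitrary choices (a package such as mclust offers a full multivariate implementation).

set.seed(19)
y <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1.5))

# Initial guesses for the mixture probabilities, means, and variances
p  <- c(0.5, 0.5)
mu <- c(min(y), max(y))
s2 <- c(var(y), var(y))

for (iter in 1:100) {
  # E-step: posterior probability g_i1 that point i belongs to component 1
  d1 <- p[1] * dnorm(y, mu[1], sqrt(s2[1]))
  d2 <- p[2] * dnorm(y, mu[2], sqrt(s2[2]))
  g  <- d1 / (d1 + d2)

  # M-step: weighted maximum-likelihood updates
  p  <- c(mean(g), mean(1 - g))
  mu <- c(sum(g * y) / sum(g), sum((1 - g) * y) / sum(1 - g))
  s2 <- c(sum(g * (y - mu[1])^2) / sum(g),
          sum((1 - g) * (y - mu[2])^2) / sum(1 - g))
}

rbind(p = p, mu = mu, sigma2 = s2)           # estimated mixture parameters
cluster <- ifelse(g > 0.5, 1, 2)             # assign each point to its most probable component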
