As boosting proceeds, the log-likelihood for both training and validation
sets (TrainLL and ValidLL) increases towards 0.
The syntax of the gbm command needs some explanation. The formula and
data frame should be self-explanatory. The other options are as follows:
• distribution='bernoulli' : this requests that we optimize the binomial
deviance. Other implemented options include adaboost, gaussian,
laplace, poisson, and coxph. A multi-class criterion is not yet implemented.
• n.trees=100 : The number of boosting steps to take. As we’ll see below,
it’s possible to add more steps later.
• train.fraction=0.50 : This parameter was not specified in the call
above, but is the default value. This means that spam.train will be divided into two parts. The first 50% will be used for boosting (the column
TrainLL). The rest will be used to choose parameters such as number of
boosting iterations, interaction depth, and shrinkage (validation, represented by the column ValidLL).
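Since TrainLL and ValidLL are Bernoulli log-likelihoods, it may help to see the criterion written out. Below is a small sketch in Python rather than R (the function name and the toy probabilities are mine, not gbm's):

```python
import math

def bernoulli_loglik(y, p):
    """Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p).
    y holds 0/1 labels, p the predicted spam probabilities.
    The sum is always <= 0; closer to 0 means a better fit."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1]
good = bernoulli_loglik(y, [0.90, 0.10, 0.80, 0.95])  # well-calibrated model
bad = bernoulli_loglik(y, [0.60, 0.50, 0.55, 0.60])   # hesitant model
```

As boosting improves the fit, this quantity climbs towards 0, which is exactly the behaviour of the TrainLL and ValidLL columns.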
The function gbm.perf will identify the best number of iterations used
among those carried out so far. Here “best” means producing the best value
of the loss function (or largest log-likelihood) in the validation set. As a byproduct, it plots the likelihood as a function of boosting iteration for both
train and test sets:
> gbm.perf(gbm1,oobag.curve=F,method='test')
Test set iteration estimate: 100
• shrinkage=0.10 : This parameter was not specified in the call above,
but is the default value. It corresponds to the shrinkage parameter ν in
(10.40) of HTF on p. 326.
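The effect of ν can be seen in a toy calculation. The Python sketch below (not gbm code; the function and target are invented) adds ν times the best constant fit to the current residual at each step, so a smaller ν needs proportionally more steps:

```python
def steps_to_fit(target, nu, tol=0.01):
    """Boost a single constant: each step adds nu times the best
    fit to the current residual, F_m = F_{m-1} + nu * h_m."""
    f, steps = 0.0, 0
    while abs(target - f) > tol:
        f += nu * (target - f)
        steps += 1
    return steps

fast = steps_to_fit(1.0, nu=0.5)   # big steps, few iterations
slow = steps_to_fit(1.0, nu=0.1)   # shrunken steps, many iterations
```

This mirrors the shrinkage table later in the handout, where shrinkage 0.01 needs thousands of iterations while 0.50 needs under a hundred.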
Before we start, note that gbm uses the first part of the data for training, and
the second part for validation. The validation part would typically be used to
choose the number of boosting iterations, as well as the shrinkage parameter
and perhaps the interaction depth. In order to make use of this feature, we
need to randomly permute the rows of the spam.train matrix. The ordering
of rows in the original dataset is such that all the spam emails occur first.
• keep.data=T : Necessary to keep a copy of the data if you want to add
more boosting steps later.
> library(gbm)
• interaction.depth=2 : This corresponds to the highest order interaction
that can be represented by the model. A value of 1 grows a stump, etc.
Stat 946, March 10, 2003
In this handout we’ll look at the performance of Friedman’s gradient boosting algorithm. The implementation used is the gbm library (version 0.70, which
is not released yet, but is available on the 946 webpage).
Gradient Boosting: Spam data
[Figure: Bernoulli log-likelihood vs. boosting iteration (0-100) for the training and validation sets]
Looks like we should keep going, since the likelihood is still improving...
The choice of the right number of boosting iterations can be automated.
The loop below takes 100 steps at a time, and then calculates the optimal
number of steps so far. If the optimal number of steps is within 10 iterations
of where we stopped, we'll do another 100.
> best.iter <- gbm.perf(gbm1,method='test',plot=F)
> while (gbm1$n.trees - best.iter < 10) {
# do 100 more iterations...
gbm1 <- gbm.more(gbm1,100)
best.iter <- gbm.perf(gbm1,method='test',plot=F)
}
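The stopping logic of the loop can also be sketched outside R. The Python stand-in below (the helper names and the toy validation curve are hypothetical, not gbm output) keeps adding batches of 100 iterations until the best iteration falls at least 10 steps before the current end:

```python
def best_iter(valid_ll):
    """1-based index of the largest validation log-likelihood so far."""
    return max(range(len(valid_ll)), key=lambda i: valid_ll[i]) + 1

def run_with_early_stop(valid_ll_at, batch=100):
    """Mimic the R loop: keep growing in batches of 100 while the best
    iteration is within 10 steps of the end (as gbm.more does)."""
    n = batch
    curve = valid_ll_at(n)
    while n - best_iter(curve) < 10:
        n += batch
        curve = valid_ll_at(n)
    return best_iter(curve), n

def toy_curve(n):
    # hypothetical validation curve peaking at iteration 341
    return [-(i - 341) ** 2 / 100.0 - 237.6 for i in range(1, n + 1)]
```

With this curve the loop grows the model to 400 trees and then stops, reporting 341 as the best iteration.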
Test set iteration estimate: 341
> gbm1$valid.error[341]
[1] -237.6465
> gbm.perf(gbm1,oobag.curve=F,method='test')
Test set iteration estimate: 341
So after 341 iterations, the boosting algorithm stops. The plot of the
log-likelihood as a function of boosting iterations is below. We see that for the
validation set, it has stabilized.
[Figure: Bernoulli log-likelihood vs. boosting iteration for the training and validation sets, after continued boosting]
Of course, we could play around with various parameters and see how this
affects the validation error (remember, we haven't touched the test set, which
is the object spam.test). Below, I experiment with some different values of
the shrinkage parameter ν:
> gbm1a <- gbm(spam~.,distribution='bernoulli',data=spam.train,
n.trees=100,interaction.depth=2,keep.data=T,shrinkage=0.01)
> best.iter <- gbm.perf(gbm1a,method='test',plot=F)
> while (gbm1a$n.trees - best.iter < 10) {
gbm1a <- gbm.more(gbm1a,100)
best.iter <- gbm.perf(gbm1a,method='test',plot=F)
}
> gbm1a$valid.error[best.iter]
Note that although I have only one set of code above with ν = 0.01, I
actually ran it with ν = 0.05 and ν = 0.01. Results in the table below indicate
that the performance is a bit better as the shrinkage gets smaller. It's not
worth using ν = 0.01, however, since it takes much longer.
shrinkage  best.iter  valid.error
1.00       11         -300.0728
0.50       76         -245.0840
0.25       198        -235.6720
0.10       341        -237.6465
0.05       588        -236.5157
0.01       3326       -236.2470 (after one early stop)
Now refit the model using all of spam.train. Note that in the above work,
we only used 50% of spam.train for boosting, and 50% for validation. I'll
use the best shrinkage of 0.25.
Note that the plot of relative improvements is not on the square-root scale,
as HTF suggest. The code below generates a second plot of square-root relative
improvements. Of course the rankings are still the same. See 10.13.1 of HTF.
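The rescaling inside the barplot call is just sqrt(I_j) / sum_k sqrt(I_k) * 100. A quick Python check of the same arithmetic, with made-up influence values:

```python
import math

def sqrt_relative_influence(influence):
    """Rescale square-rooted influences to percentages that sum to 100,
    matching the expression in the barplot call."""
    roots = [math.sqrt(v) for v in influence]
    total = sum(roots)
    return [100.0 * r / total for r in roots]

pct = sqrt_relative_influence([25.0, 16.0, 4.0, 1.0])
# roots are 5, 4, 2, 1 out of 12, so the top variable gets 100*5/12 percent
```

The square root compresses the gap between the top few variables and the rest, but leaves the ranking unchanged, as the text notes.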
Names are printed out below so we can get the indexes of the best variables.
> barplot(sqrt(junk[57:1,2])/sum(sqrt(junk[,2]))*100,
names=as.character(junk[57:1,1]),
xlab='Relative influence (sqrt scale)', horiz=T,las=1)
> names(spam)
[1] "wf.make"
[5] "wf.our"
[9] "wf.order"
[13] "wf.people"
[17] "wf.business"
[21] "wf.your"
[25] "wf.hp"
[29] "wf.lab"
[33] "wf.data"
[37] "wf.1999"
[41] "wf.cs"
[45] "wf.re"
[49] "cf.semicolon"
[53] "cf.dollar"
[57] "captot"
Now we want to look at some of the "partial dependence plots" as discussed
in 10.13 of HTF. These are the same four effects that are presented in Figure 10.7 of
HTF. Note, however, that the x-axis ranges are different...
> par(mfrow=c(2,2))
> for (i in 1:4) plot(gbm2,c(52,7,46,25)[i])
interaction depth. If there's no difference in fit, then no interactions are important.
Next, we look at several performance measures:
Compare this deviance with a previous best of 598.8 for bagging
[Figure: partial dependence of f on cf.exclaim]
> gainchart(spam.test.y,yhat)
mean precision is: 559.018 with se 0.0008062594
Previous best was 554.9 for Bayes, and about 555.0 for MARS.
We can repeat similar calculations for the exponential loss, which will give
us just the adaboost algorithm.
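For reference, with the labels recoded to ±1 the exponential criterion is the sum of exp(-y_i f(x_i)). A small Python sketch with invented margins (not part of the gbm session):

```python
import math

def exp_loss(y_pm1, f):
    """AdaBoost's exponential loss: sum of exp(-y_i * f_i) for
    labels coded +1/-1 and real-valued scores f_i."""
    return sum(math.exp(-yi * fi) for yi, fi in zip(y_pm1, f))

y = [1, -1, 1]
confident_right = exp_loss(y, [2.0, -2.0, 2.0])   # correct signs, small loss
confident_wrong = exp_loss(y, [-2.0, 2.0, -2.0])  # wrong signs, large loss
```

Unlike the Bernoulli deviance, this loss grows exponentially in the margin of a confidently wrong prediction, which makes adaboost more sensitive to mislabelled points.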
It's also possible to have partial dependence plots of functions of two variables rather than one. A bit of a problem there is the fact that there are many
pairs of variables that you could plot. Which ones should you examine? Some
hints may be given by:
• A variable that has a high value of relative importance, but whose partial
dependence plot of f on the single variable seems reasonably flat. This
means that the effect of the variable may be due mainly to interactions
with other variables.
• Correlations between F(X_S, X_C) and F(X_S), F(X_C). It seems that if X1
and X2 had a big interaction, then putting 1 ∈ S, 2 ∈ C would result in a
lower correlation between F(X_S, X_C) and F(X_S), F(X_C).
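The second hint can be checked numerically. The Python toy below is a crude stand-in for partial dependence (averaging over a 5x5 grid, with invented functions): an additive surface correlates essentially perfectly with the sum of its marginal effects, while a pure interaction correlates noticeably less:

```python
def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

grid = [(x1, x2) for x1 in range(5) for x2 in range(5)]
additive = [x1 + 2 * x2 for x1, x2 in grid]  # no interaction
interact = [x1 * x2 for x1, x2 in grid]      # pure interaction

def marginal_sum(vals):
    """Average over one variable at a time, then add the two marginal
    effects -- a grid version of F(X_S) + F(X_C)."""
    by_x1 = [sum(vals[i * 5:(i + 1) * 5]) / 5 for i in range(5)]
    by_x2 = [sum(vals[j::5]) / 5 for j in range(5)]
    return [by_x1[i] + by_x2[j] for i in range(5) for j in range(5)]

c_add = corr(additive, marginal_sum(additive))
c_int = corr(interact, marginal_sum(interact))
```

A pair whose joint surface correlates poorly with the additive reconstruction is a candidate for a two-variable plot.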
Before looking for significant interactions, it’d be a good idea to fit a model
with interaction depth of 1, and compare it with models that have a higher
> best.iter <- gbm.perf(gbm3,method='test',plot=F)
> while (gbm3$n.trees - best.iter < 10) {
# do 100 more iterations...
gbm3 <- gbm.more(gbm3,100)
best.iter <- gbm.perf(gbm3,method='test',plot=F)
}
# now refit the model using all the training data...
> gbm4 <- gbm(spam~.,distribution='adaboost',data=spam.train,n.trees=best.iter,
interaction.depth=2,keep.data=T,train.fraction=1)
Calculations similar to before give the following performance results:
Method     Deviance  Misclass  Mean Precision
bernoulli  471.1     80        559.0
adaboost   673.9     83        559.5