As boosting proceeds, the log-likelihood for both training and validation
sets (TrainLL and ValidLL) increases towards 0.
The syntax of the gbm command needs some explanation. The formula and
data frame should be self-explanatory. The other options are as follows:
• distribution='bernoulli' : this requests that we optimize the binomial
deviance. Other implemented options include adaboost, gaussian,
laplace, poisson, and coxph. A multi-class criterion is not yet implemented.
• n.trees=100 : The number of boosting steps to take. As we’ll see below,
it’s possible to add more steps later.
• train.fraction=0.50 : This parameter was not specified in the call
above, but is the default value. This means that spam.train will be divided into two parts. The first 50% will be used for boosting (the column
TrainLL). The rest will be used to choose parameters such as number of
boosting iterations, interaction depth, and shrinkage (validation, represented by the column ValidLL).
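Since TrainLL and ValidLL are Bernoulli log-likelihoods, it may help to see the criterion written out. Below is a small sketch in Python rather than R (the function name and the toy probabilities are mine, not gbm's):

```python
import math

def bernoulli_loglik(y, p):
    """Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p).
    y holds 0/1 labels, p the predicted spam probabilities.
    The sum is always <= 0; closer to 0 means a better fit."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1, 1]
good = bernoulli_loglik(y, [0.90, 0.10, 0.80, 0.95])  # well-calibrated model
bad = bernoulli_loglik(y, [0.60, 0.50, 0.55, 0.60])   # hesitant model
```

As boosting improves the fit, this quantity climbs towards 0, which is exactly the behaviour of the TrainLL and ValidLL columns.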
The function gbm.perf will identify the best number of iterations used
among those carried out so far. Here “best” means producing the best value
of the loss function (or largest log-likelihood) in the validation set. As a byproduct, it plots the likelihood as a function of boosting iteration for both
train and test sets:
> gbm.perf(gbm1,oobag.curve=F,method='test')
Test set iteration estimate: 100
• shrinkage=0.10 : This parameter was not specified in the call above,
but is the default value. It corresponds to the shrinkage parameter ν in
(10.40) of HTF on p. 326.
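The effect of ν can be seen in a toy calculation. The Python sketch below (not gbm code; the function and target are invented) adds ν times the best constant fit to the current residual at each step, so a smaller ν needs proportionally more steps:

```python
def steps_to_fit(target, nu, tol=0.01):
    """Boost a single constant: each step adds nu times the best
    fit to the current residual, F_m = F_{m-1} + nu * h_m."""
    f, steps = 0.0, 0
    while abs(target - f) > tol:
        f += nu * (target - f)
        steps += 1
    return steps

fast = steps_to_fit(1.0, nu=0.5)   # big steps, few iterations
slow = steps_to_fit(1.0, nu=0.1)   # shrunken steps, many iterations
```

This mirrors the shrinkage table later in the handout, where shrinkage 0.01 needs thousands of iterations while 0.50 needs under a hundred.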
Before we start, note that gbm uses the first part of the data for training, and
the second part for validation. The validation part would typically be used to
choose the number of boosting iterations, as well as the shrinkage parameter
and perhaps the interaction depth. In order to make use of this feature, we
need to randomly permute the rows of the spam.train matrix. The ordering
of rows in the original dataset is such that all the spam emails occur first.
• keep.data=T : Necessary to keep a copy of the data if you want to add
more boosting steps later.
> library(gbm)
• interaction.depth=2 : This corresponds to the highest order interaction
that can be represented by the model. A value of 1 grows a stump, etc.
Stat 946, March 10, 2003
In this handout we’ll look at the performance of Friedman’s gradient boosting algorithm. The implementation used is the gbm library (version 0.70, which
is not released yet, but is available on the 946 webpage).
Gradient Boosting: Spam data
[Figure: Bernoulli log-likelihood vs. boosting iteration (0-100) for the training and validation sets]
Looks like we should keep going, since the likelihood is still improving...
The choice of the right number of boosting iterations can be automated.
The loop below takes 100 steps at a time, and then calculates the optimal
number of steps so far. If the optimal number of steps is within 10 iterations
of where we stopped, we'll do another 100.
> best.iter <- gbm.perf(gbm1,method='test',plot=F)
> while (gbm1$n.trees - best.iter < 10) {
# do 100 more iterations...
gbm1 <- gbm.more(gbm1,100)
best.iter <- gbm.perf(gbm1,method='test',plot=F)
}
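The stopping logic of the loop can also be sketched outside R. The Python stand-in below (the helper names and the toy validation curve are hypothetical, not gbm output) keeps adding batches of 100 iterations until the best iteration falls at least 10 steps before the current end:

```python
def best_iter(valid_ll):
    """1-based index of the largest validation log-likelihood so far."""
    return max(range(len(valid_ll)), key=lambda i: valid_ll[i]) + 1

def run_with_early_stop(valid_ll_at, batch=100):
    """Mimic the R loop: keep growing in batches of 100 while the best
    iteration is within 10 steps of the end (as gbm.more does)."""
    n = batch
    curve = valid_ll_at(n)
    while n - best_iter(curve) < 10:
        n += batch
        curve = valid_ll_at(n)
    return best_iter(curve), n

def toy_curve(n):
    # hypothetical validation curve peaking at iteration 341
    return [-(i - 341) ** 2 / 100.0 - 237.6 for i in range(1, n + 1)]
```

With this curve the loop grows the model to 400 trees and then stops, reporting 341 as the best iteration.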
Test set iteration estimate: 341
> gbm1$valid.error[341]
[1] -237.6465
> gbm.perf(gbm1,oobag.curve=F,method='test')
Test set iteration estimate: 341
So after 341 iterations, the boosting algorithm stops. The plot of the
log-likelihood as a function of boosting iterations is below. We see that for the
validation set, it has stabilized.
[Figure: Bernoulli log-likelihood vs. boosting iteration for the training and validation sets, after continued boosting]
Of course, we could play around with various parameters and see how this
affects the validation error (remember, we haven't touched the test set, which
is the object spam.test). Below, I experiment with some different values of
the shrinkage parameter ν:
> gbm1a <- gbm(spam~.,distribution='bernoulli',data=spam.train,
n.trees=100,interaction.depth=2,keep.data=T,shrinkage=0.01)
> best.iter <- gbm.perf(gbm1a,method='test',plot=F)
> while (gbm1a$n.trees - best.iter < 10) {
gbm1a <- gbm.more(gbm1a,100)
best.iter <- gbm.perf(gbm1a,method='test',plot=F)
}
> gbm1a$valid.error[best.iter]
Note that although I have only one set of code above with ν = 0.01, I
actually ran it with ν = 0.05 and ν = 0.01. Results in the table below indicate
that the performance is a bit better as the shrinkage gets smaller. It's not
worth using ν = 0.01, however, since it takes much longer.
shrinkage  best.iter  valid.error
1.00       11         -300.0728
0.50       76         -245.0840
0.25       198        -235.6720
0.10       341        -237.6465
0.05       588        -236.5157
0.01       3326       -236.2470 (after one early stop)
Now refit the model using all of spam.train. Note that in the above work,
we only used 50% of spam.train for boosting, and 50% for validation. I'll
use the best shrinkage of 0.25.
Note that the plot of relative improvements is not on the square-root scale,
as HTF suggest. The code below generates a second plot of square-root relative
improvements. Of course the rankings are still the same. See 10.13.1 of HTF.
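The rescaling inside the barplot call is just sqrt(I_j) / sum_k sqrt(I_k) * 100. A quick Python check of the same arithmetic, with made-up influence values:

```python
import math

def sqrt_relative_influence(influence):
    """Rescale square-rooted influences to percentages that sum to 100,
    matching the expression in the barplot call."""
    roots = [math.sqrt(v) for v in influence]
    total = sum(roots)
    return [100.0 * r / total for r in roots]

pct = sqrt_relative_influence([25.0, 16.0, 4.0, 1.0])
# roots are 5, 4, 2, 1 out of 12, so the top variable gets 100*5/12 percent
```

The square root compresses the gap between the top few variables and the rest, but leaves the ranking unchanged, as the text notes.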
Names are printed out below so we can get the indexes of the best variables.
> barplot(sqrt(junk[57:1,2])/sum(sqrt(junk[,2]))*100,
names=as.character(junk[57:1,1]),
xlab='Relative influence (sqrt scale)', horiz=T,las=1)
> names(spam)
[1] "wf.make"
[5] "wf.our"
[9] "wf.order"
[13] "wf.people"
[17] "wf.business"
[21] "wf.your"
[25] "wf.hp"
[29] "wf.lab"
[33] "wf.data"
[37] "wf.1999"
[41] "wf.cs"
[45] "wf.re"
[49] "cf.semicolon"
[53] "cf.dollar"
[57] "captot"
Now we want to look at some of the "partial dependence plots" as discussed
in 10.13 of HTF. These are the same four effects that are presented in Figure 10.7 of
HTF. Note, however, that the x-axis ranges are different...
> par(mfrow=c(2,2))
> for (i in 1:4) plot(gbm2,c(52,7,46,25)[i])
interaction depth. If there's no difference in fit, then no interactions are important.
Next, we look at several performance measures:
Compare this deviance with a previous best of 598.8 for bagging
[Figure: partial dependence of f on cf.exclaim]
> gainchart(spam.test.y,yhat)
mean precision is: 559.018 with se 0.0008062594
Previous best was 554.9 for Bayes, and about 555.0 for MARS.
We can repeat similar calculations for the exponential loss, which will give
us just the adaboost algorithm.
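For reference, with the labels recoded to ±1 the exponential criterion is the sum of exp(-y_i f(x_i)). A small Python sketch with invented margins (not part of the gbm session):

```python
import math

def exp_loss(y_pm1, f):
    """AdaBoost's exponential loss: sum of exp(-y_i * f_i) for
    labels coded +1/-1 and real-valued scores f_i."""
    return sum(math.exp(-yi * fi) for yi, fi in zip(y_pm1, f))

y = [1, -1, 1]
confident_right = exp_loss(y, [2.0, -2.0, 2.0])   # correct signs, small loss
confident_wrong = exp_loss(y, [-2.0, 2.0, -2.0])  # wrong signs, large loss
```

Unlike the Bernoulli deviance, this loss grows exponentially in the margin of a confidently wrong prediction, which makes adaboost more sensitive to mislabelled points.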
It's also possible to have partial dependence plots of functions of two variables rather than one. A bit of a problem there is the fact that there are many
pairs of variables that you could plot. Which ones should you examine? Some
hints may be given by:
• A variable that has a high value of relative importance, but whose partial
dependence plot of f on the single variable seems reasonably flat. This
means that the effect of the variable may be due mainly to interactions
with other variables.
• Correlations between F(X_S, X_C) and F(X_S), F(X_C). It seems that if X1
and X2 had a big interaction, then putting 1 ∈ S, 2 ∈ C would result in a
lower correlation between F(X_S, X_C) and F(X_S), F(X_C).
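The second hint can be checked numerically. The Python toy below is a crude stand-in for partial dependence (averaging over a 5x5 grid, with invented functions): an additive surface correlates essentially perfectly with the sum of its marginal effects, while a pure interaction correlates noticeably less:

```python
def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

grid = [(x1, x2) for x1 in range(5) for x2 in range(5)]
additive = [x1 + 2 * x2 for x1, x2 in grid]  # no interaction
interact = [x1 * x2 for x1, x2 in grid]      # pure interaction

def marginal_sum(vals):
    """Average over one variable at a time, then add the two marginal
    effects -- a grid version of F(X_S) + F(X_C)."""
    by_x1 = [sum(vals[i * 5:(i + 1) * 5]) / 5 for i in range(5)]
    by_x2 = [sum(vals[j::5]) / 5 for j in range(5)]
    return [by_x1[i] + by_x2[j] for i in range(5) for j in range(5)]

c_add = corr(additive, marginal_sum(additive))
c_int = corr(interact, marginal_sum(interact))
```

A pair whose joint surface correlates poorly with the additive reconstruction is a candidate for a two-variable plot.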
Before looking for significant interactions, it’d be a good idea to fit a model
with interaction depth of 1, and compare it with models that have a higher
> best.iter <- gbm.perf(gbm3,method='test',plot=F)
> while (gbm3$n.trees - best.iter < 10) {
# do 100 more iterations...
gbm3 <- gbm.more(gbm3,100)
best.iter <- gbm.perf(gbm3,method='test',plot=F)
}
# now refit the model using all the training data...
> gbm4 <- gbm(spam~.,distribution='adaboost',data=spam.train,n.trees=best.iter,
interaction.depth=2,keep.data=T,train.fraction=1)
Calculations similar to before give the following performance results:
Method     Deviance  Misclass  Mean Precision
bernoulli  471.1     80        559.0
adaboost   673.9     83        559.5