CSE 575: Statistical Machine Learning
Jingrui He
CIDSE, ASU

Instance-based Learning

1-Nearest Neighbor

Four things make a memory-based learner:
1. A distance metric
   Euclidean (and many more)
2. How many nearby neighbors to look at?
   One
3. A weighting function (optional)
   Unused
4. How to fit with the local points?
   Just predict the same output as the nearest neighbor.
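The four-part recipe above translates almost directly into code. Below is a minimal 1-NN sketch in NumPy (my example, not from the slides); it assumes a training set X_train, y_train and uses Euclidean distance with no weighting.

```python
import numpy as np

def predict_1nn(X_train, y_train, x_query):
    """Predict the output for x_query as the output of its single
    nearest training point under Euclidean distance."""
    # 1. Distance metric: Euclidean distance to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Number of neighbors: one.  3. Weighting function: unused.
    nearest = np.argmin(dists)
    # 4. Fit to the local points: copy the nearest neighbor's output.
    return y_train[nearest]
```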
 


 

Consistency of 1-NN

• Consider an estimator fn trained on n examples
  – e.g., 1-NN, regression, ...
• Estimator is consistent if true error goes to zero as the amount of data increases
  – e.g., for noise-free data, consistent if: (a standard form of the condition is given below)
• Regression is not consistent!
  – Representation bias
• 1-NN is consistent (under some mild fine print)

What about variance???
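The condition itself did not survive the text extraction. A standard way to state consistency on noise-free data, writing f_n for the estimator trained on n examples and f for the true function (my reconstruction, not necessarily the slide's exact formula), is:

```latex
\lim_{n \to \infty} \; \mathbb{E}_{x}\!\left[ \bigl(f_n(x) - f(x)\bigr)^2 \right] = 0
```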
 

1-NN overfits?
 

k-Nearest Neighbor

Four things make a memory-based learner:
1. A distance metric
   Euclidean (and many more)
2. How many nearby neighbors to look at?
   k
3. A weighting function (optional)
   Unused
4. How to fit with the local points?
   Just predict the average output among the k nearest neighbors.
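As with 1-NN, the recipe is short in code. A minimal NumPy sketch (again my example, assuming Euclidean distance and an unweighted average of the k nearest outputs):

```python
import numpy as np

def predict_knn(X_train, y_train, x_query, k=9):
    """Predict the average output of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest_k = np.argsort(dists)[:k]                  # indices of the k closest points
    return y_train[nearest_k].mean()                   # unweighted average (regression)
```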
 


 
 


 
 

k-Nearest Neighbor (here k=9)

k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.
What can we do about all the discontinuities that k-NN gives us?
 

Curse of dimensionality for instance-based learning

• Must store and retrieve all data!
  – Most real work done during testing
  – For every test sample, must search through the whole dataset – very slow!
  – There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods (a tree-based sketch follows this slide)
• Instance-based learning is often poor with noisy or irrelevant features
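As a concrete example of the tree-based speedups mentioned above, SciPy's KD-tree can replace the brute-force scan over the whole dataset. A rough sketch with made-up toy data (my example, not the course's code):

```python
import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(100_000, 10)   # toy dataset: 100k points in 10 dimensions
y_train = np.random.rand(100_000)

tree = cKDTree(X_train)                            # build the index once
dist, idx = tree.query(np.random.rand(10), k=1)    # nearest neighbor of one query point
prediction = y_train[idx]                          # 1-NN prediction without a full scan
```

Note that tree-based indexes themselves degrade as the dimensionality grows, which is part of the curse this slide refers to.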
 

Support Vector Machines
 

Linear classifiers – Which line is better?

Data: {(x_i, y_i)}, i = 1, ..., n
Example i: feature vector x_i with label y_i ∈ {+1, -1}
w.x = Σ_j w(j) x(j)
 

Pick the one with the largest margin!

(Figure: candidate separating hyperplane w.x + b = 0.)

w.x = Σ_j w(j) x(j)
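For concreteness, the linear classifier on these slides predicts with the sign of w.x + b. A two-line sketch (illustrative only):

```python
import numpy as np

def predict_linear(w, b, x):
    """Classify x as +1 or -1 by which side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```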
 

Maximize the margin

(Figure: separating hyperplane w.x + b = 0 with its margin.)
 

But there are many planes…

(Figure: separating hyperplane w.x + b = 0.)
 

Review: Normal to a plane

(Figure: the plane w.x + b = 0 and its normal vector w.)
 

Normalized margin – Canonical hyperplanes

(Figure: canonical hyperplanes w.x + b = +1 and w.x + b = -1 on either side of w.x + b = 0, with margin 2γ between the closest points x+ and x-.)
 

Margin maximization using canonical hyperplanes

(Figure: the canonical hyperplanes w.x + b = -1, 0, +1 and the margin 2γ.)
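The optimization problem on this slide was lost in extraction; the standard margin-maximization problem with canonical hyperplanes (labels y_i ∈ {+1, -1}) is:

```latex
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \quad \forall i
```

With the canonical scaling, the half-margin is γ = 1/||w||, so minimizing w.w is the same as maximizing the margin.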
 


Support vector machines (SVMs)

(Figure: maximum-margin separator w.x + b = 0 with canonical hyperplanes w.x + b = ±1 and margin 2γ.)

• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
• Hyperplane defined by support vectors
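In practice the QP is handled by library code. A minimal scikit-learn sketch of a linear SVM on toy data (my example, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # solves the (dual) QP internally
print(clf.support_vectors_)                   # the hyperplane is defined by these points
print(clf.coef_, clf.intercept_)              # w and b for the linear kernel
```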
 

What if the data is not linearly separable?

Use features of features
of features of features….
 

What if the data is still not linearly separable?

• Minimize w.w and the number of training mistakes
  – Tradeoff two criteria?
• Tradeoff #(mistakes) and w.w
  – 0/1 loss
  – Slack penalty C
  – Not QP anymore
  – Also doesn't distinguish near misses and really bad mistakes

Slack variables – Hinge loss

• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
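Written out, the slack-variable formulation these two slides describe is the standard soft-margin SVM (C is the slack penalty, ξ_i the slack variables):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

This is equivalent to charging each point the hinge loss max(0, 1 - y_i(w.x_i + b)): zero when the margin is at least 1, a linear penalty otherwise.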
 

Side note: What's the difference between SVMs and logistic regression?

SVM:

Logistic regression:
Log loss:
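The loss formulas compared on this slide are not in the extracted text; the standard forms, writing f(x) = w.x + b and y ∈ {+1, -1}, are:

```latex
\text{SVM (hinge loss):} \quad \ell_{\text{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\,f(x)\bigr)
\qquad
\text{Logistic regression (log loss):} \quad \ell_{\log}\bigl(y, f(x)\bigr) = \log\bigl(1 + e^{-y\,f(x)}\bigr)
```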
 

Constrained optimization
 

Lagrange multipliers – Dual variables

Moving the constraint into the objective function
Lagrangian:
Solve:
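For the separable SVM primal above (constraints y_i(w.x_i + b) ≥ 1), moving the constraints into the objective with multipliers α_i ≥ 0 gives the standard Lagrangian:

```latex
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
  - \sum_{i=1}^{n} \alpha_i \Bigl[\, y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \,\Bigr],
\qquad
\text{solve } \ \min_{\mathbf{w},\, b}\ \max_{\boldsymbol{\alpha} \ge 0}\ L(\mathbf{w}, b, \boldsymbol{\alpha})
```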
 

Lagrange multipliers – Dual variables

Solving:
 

Dual SVM derivation (1) – the linearly separable case
 

Dual SVM derivation (2) – the linearly separable case
 

Dual SVM interpretation

(Figure: separating hyperplane w.x + b = 0.)
 

Dual SVM formulation – the linearly separable case
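The formulation on this slide was lost in extraction; the standard dual of the linearly separable SVM is:

```latex
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
\quad \text{s.t.} \quad \alpha_i \ge 0 \ \ \forall i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```

The weights are recovered as w = Σ_i α_i y_i x_i, and the points with α_i > 0 are exactly the support vectors.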
 

Dual SVM derivation – the non-separable case
 

Dual SVM formulation – the non-separable case
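Again as a hedged reconstruction: the non-separable (soft-margin) dual is the same objective and equality constraint as above, except each multiplier is boxed by the slack penalty:

```latex
0 \le \alpha_i \le C \quad \forall i
```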
 

Why did we learn about the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal
• But, more importantly, the "kernel trick"!!!
  – Another little detour…
 

Reminder from last time: What if the data is not linearly separable?

Use features of features
of features of features….

Feature space can get really large really quickly!
 

Higher order polynomials

(Figure: number of monomial terms vs. number of input dimensions, for polynomial degrees d=2, d=3, d=4.)

m – number of input features
d – degree of polynomial

The number of monomial terms grows fast! For d = 6, m = 100: about 1.6 billion terms.
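The 1.6 billion figure can be checked with a standard counting formula: the number of monomials of degree exactly d in m variables is C(m + d - 1, d). A quick check (my calculation, not from the slides):

```python
from math import comb

m, d = 100, 6
print(comb(m + d - 1, d))   # 1609344100, i.e. about 1.6 billion degree-6 monomials
```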
 

Dual formulation only depends on dot-products, not on w!
 

Dot-product of polynomials
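The worked example on this slide is not in the extracted text; the key identity it builds toward is that if Φ(x) contains all degree-d monomials of x (with suitable multinomial coefficients), the dot product in feature space collapses to a single power:

```latex
\Phi(\mathbf{u}) \cdot \Phi(\mathbf{v}) = (\mathbf{u} \cdot \mathbf{v})^d
```

So a dot product in the huge feature space costs only one ordinary dot product plus a power, which is the point of the next slide.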
 

Finally: the "kernel trick"!

• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
• Very interesting theory – Reproducing Kernel Hilbert Spaces
 

Polynomial kernels

• All monomials of degree d in O(d) operations:
• How about all monomials of degree up to d?
  – Solution 0:
  – Better solution: (standard forms given below)
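The kernel formulas are omitted in the extraction; the standard choices are the degree-d kernel, and, presumably the slide's "better solution", the kernel covering all monomials of degree up to d:

```latex
K_d(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v})^d
\qquad\qquad
K_{\le d}(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^d
```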
 

Common kernels

• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian kernels
• Sigmoid
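For reference, a small sketch of the usual formulas behind this list. The exact parameterizations on the slide are not preserved; sigma, eta, and nu below are my own symbols:

```python
import numpy as np

def poly_kernel(u, v, d):            # polynomials of degree d
    return np.dot(u, v) ** d

def poly_up_to_kernel(u, v, d):      # polynomials of degree up to d
    return (np.dot(u, v) + 1) ** d

def gaussian_kernel(u, v, sigma):    # Gaussian (RBF) kernel
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta, nu):   # sigmoid kernel
    return np.tanh(eta * np.dot(u, v) + nu)
```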
 

Overfitting?

• Huge feature space with kernels – what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for a simple hypothesis with large margin
  – Often robust to overfitting
 

What about at classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
 

SVMs with kernels

• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and their coefficients αi
• At classification time, compute the kernelized decision value and classify by its sign (see the sketch below)
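The decision rule itself is missing from the extracted text; the standard kernelized form, with dual coefficients α_i, labels y_i, support vectors x_i, and bias b, is sign(Σ_i α_i y_i K(x_i, x) + b). A short sketch under those assumptions (not the slide's code):

```python
def svm_predict(x, support_vectors, sv_labels, alphas, b, kernel):
    """Kernel SVM decision: weighted kernel sum over the support vectors, then take the sign."""
    score = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, sv_labels, support_vectors)) + b
    return 1 if score >= 0 else -1
```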
 

What's the difference between SVMs and Logistic Regression?

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          No
 

Kernels in logistic regression

• Define the weights in terms of the support vectors:
• Derive a simple gradient descent rule on αi
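Spelled out in the standard form (my reconstruction, not taken verbatim from the slide): writing the weights in the span of the training points turns the linear score into a kernel sum,

```latex
\mathbf{w} = \sum_{i=1}^{n} \alpha_i\, \Phi(\mathbf{x}_i)
\qquad \Longrightarrow \qquad
\mathbf{w} \cdot \Phi(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, K(\mathbf{x}_i, \mathbf{x})
```

so the log-loss gradient can be taken with respect to the α_i directly, giving kernelized logistic regression.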
 

What's the difference between SVMs and Logistic Regression? (Revisited)

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       "Margin"      Real probabilities
 
