CSE 575: Statistical Machine Learning
Jingrui He
CIDSE, ASU
Instance-based Learning: 1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric – Euclidean (and many more)
2. How many nearby neighbors to look at? – One
3. A weighting function (optional) – Unused
4. How to fit with the local points? – Just predict the same output as the nearest neighbor.
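The four choices above can be sketched in a few lines of plain Python (a minimal sketch; the function and variable names are illustrative, not from the slides):

```python
import math

def nearest_neighbor_predict(train, x):
    """1-NN: return the output of the single closest training point.

    train: list of (features, label) pairs; x: a feature tuple.
    Euclidean distance metric, one neighbor, no weighting,
    and the neighbor's output is copied verbatim.
    """
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    _, label = min(train, key=lambda pair: euclidean(pair[0], x))
    return label

# Tiny example: two clusters in 1-D.
data = [((0.0,), "left"), ((1.0,), "left"), ((9.0,), "right")]
print(nearest_neighbor_predict(data, (8.0,)))  # "right"
```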
Consistency of 1-NN
• Consider an estimator f_n trained on n examples
  – e.g., 1-NN, regression, ...
• An estimator is consistent if the true error goes to zero as the amount of data increases
  – e.g., for noise-free data, f_n is consistent if its error goes to 0 as n goes to infinity
• Regression is not consistent!
  – Representation bias
• 1-NN is consistent (under some mild fine print)

What about variance???

1-NN overfits?
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric – Euclidean (and many more)
2. How many nearby neighbors to look at? – k
3. A weighting function (optional) – Unused
4. How to fit with the local points? – Just predict the average output among the k nearest neighbors.
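The only changes from 1-NN are the number of neighbors and the averaging step; a minimal regression sketch (names are illustrative):

```python
import math

def knn_predict(train, x, k):
    """k-NN regression: average the outputs of the k nearest points.

    train: list of (features, value) pairs; x: a feature tuple.
    Euclidean distance, k neighbors, no weighting, mean output.
    """
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    return sum(value for _, value in neighbors) / k

data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((10.0,), 100.0)]
print(knn_predict(data, (0.5,), k=3))  # mean of 1.0, 3.0, 5.0 -> 3.0
```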
k-Nearest Neighbor (here k = 9)
[Figure: k-NN fits to noisy data]
K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?
Curse of dimensionality for instance-based learning
• Must store and retrieve all data!
  – Most real work done during testing
  – For every test sample, must search through the whole dataset – very slow!
  – There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ...
• Instance-based learning is often poor with noisy or irrelevant features
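The "search through the whole dataset" cost can be seen directly: a brute-force query computes one distance per stored example, every time (a sketch; the counting is only for illustration):

```python
def brute_force_query(train, x):
    """Return the nearest stored point and the number of distance
    computations performed: one per stored example, per query."""
    comparisons = 0
    best, best_dist = None, float("inf")
    for features, label in train:
        comparisons += 1
        d = sum((a - b) ** 2 for a, b in zip(features, x))
        if d < best_dist:
            best, best_dist = (features, label), d
    return best, comparisons

data = [((float(i),), i % 2) for i in range(10_000)]
best, work = brute_force_query(data, (42.3,))
print(work)  # 10000 distance computations for a single test sample
```

Tree-based and hashing methods exist precisely to avoid touching all n points per query.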
Support Vector Machines

Linear classifiers – Which line is better?
Data: example i: (x_i, y_i), y_i ∈ {+1, -1}
w.x = Σ_j w(j) x(j)
[Figure: candidate separating lines w.x + b = 0]

Pick the one with the largest margin!

Maximize the margin
[Figure: margin around the line w.x + b = 0]

But there are many planes…
[Figure: rescaled versions of the same hyperplane w.x + b = 0]

Review: Normal to a plane
[Figure: w is normal to the plane w.x + b = 0]
Normalized margin – Canonical hyperplanes
[Figure: points x+ and x- on the hyperplanes w.x + b = +1 and w.x + b = -1, with w.x + b = 0 between them; margin 2γ]
Margin maximization using canonical hyperplanes
[Figure: hyperplanes w.x + b = -1, w.x + b = 0, w.x + b = +1; margin 2γ]

Support vector machines (SVMs)
• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
• Hyperplane defined by support vectors
[Figure: support vectors on the canonical hyperplanes; margin 2γ]
What if the data is not linearly separable?
Use features of features of features of features….
What if the data is still not linearly separable?
• Minimize w.w and number of training mistakes
  – Tradeoff two criteria?
• Tradeoff #(mistakes) and w.w
  – 0/1 loss
  – Slack penalty C
  – Not QP anymore
  – Also doesn't distinguish near misses and really bad mistakes

Slack variables – Hinge loss
• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
Side note: What's the difference between SVMs and logistic regression?
SVM: hinge loss, max(0, 1 - y(w.x + b))
Logistic regression: log loss, log(1 + exp(-y(w.x + b)))
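The two losses can be compared numerically as functions of the signed margin y(w.x + b) (a sketch, assuming the common log(1 + e^(-margin)) form of the log loss):

```python
import math

def hinge_loss(margin):
    """SVM: exactly zero once the margin reaches 1, linear below."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic regression: smooth, strictly positive everywhere."""
    return math.log(1.0 + math.exp(-margin))

for m in (-2.0, 0.0, 1.0, 3.0):
    print(f"margin {m:+.1f}: hinge {hinge_loss(m):.3f}, log {log_loss(m):.3f}")
```

Note the qualitative difference: hinge is flat at zero past margin 1 (giving sparse solutions), while log loss keeps shrinking but never vanishes.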
Constrained optimization

Lagrange multipliers – Dual variables
Moving the constraint to the objective function
Lagrangian:
Solve:

Lagrange multipliers – Dual variables
Solving:
Dual SVM derivation (1) – the linearly separable case

Dual SVM derivation (2) – the linearly separable case

Dual SVM interpretation
[Figure: w.x + b = 0 with support vectors]

Dual SVM formulation – the linearly separable case

Dual SVM derivation – the non-separable case

Dual SVM formulation – the non-separable case
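The equations on these derivation slides did not survive extraction; the standard derivation they refer to runs as follows (textbook form, not recovered from the slides themselves):

```latex
% Primal (linearly separable case):
\min_{w,b}\ \tfrac{1}{2}\, w \cdot w
\quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 \ \ \forall i

% Lagrangian (constraints moved into the objective, \alpha_i \ge 0):
L(w, b, \alpha) = \tfrac{1}{2}\, w \cdot w
  - \sum_i \alpha_i \left[ y_i\,(w \cdot x_i + b) - 1 \right]

% Stationarity conditions:
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0

% Substituting back gives the dual:
\max_{\alpha}\ \sum_i \alpha_i
  - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j)
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0

% Non-separable case: identical dual, with the box constraint
% 0 \le \alpha_i \le C replacing \alpha_i \ge 0.
```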
Why did we learn about the dual SVM?
• There are some quadratic programming algorithms that can solve the dual faster than the primal
• But, more importantly, the "kernel trick"!!!
  – Another little detour…
Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features….
Feature space can get really large really quickly!
Higher order polynomials
[Figure: number of monomial terms vs. number of input dimensions, for degree d = 2, 3, 4; m – input features, d – degree of polynomial]
The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms
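The "1.6 billion" figure can be checked with a stars-and-bars count, assuming the slide counts monomials of degree exactly d: there are C(m + d - 1, d) of them in m variables.

```python
from math import comb

def monomials_of_degree(m, d):
    """Number of distinct monomials of degree exactly d in m variables
    (stars and bars: choose d factors from m with repetition)."""
    return comb(m + d - 1, d)

print(monomials_of_degree(100, 6))  # 1609344100 -- about 1.6 billion
```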
Dual formulation only depends on dot-products, not on w!

Dot-product of polynomials
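The dot-product-of-polynomials identity can be verified numerically: for the degree-2 feature map Φ(x) = (x1², √2·x1·x2, x2²), the explicit dot product Φ(u)·Φ(v) equals (u.v)² (a small check in plain Python; this particular feature map is the standard one, not taken verbatim from the slide):

```python
import math

def phi(x):
    """Explicit degree-2 monomial features of a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(u, v):
    """Same quantity in O(m) time: (u.v)^2, never forming phi."""
    return sum(a * b for a, b in zip(u, v)) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))
print(explicit, poly2_kernel(u, v))  # both equal 1.0
```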
Finally: the "kernel trick"!
• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
• Very interesting theory – Reproducing Kernel Hilbert Spaces
Polynomial kernels
• All monomials of degree d in O(d) operations: K(u, v) = (u.v)^d
• How about all monomials of degree up to d?
  – Solution 0: add up the kernels (u.v)^k for k = 1, …, d
  – Better solution: K(u, v) = (u.v + 1)^d
Common kernels
• Polynomials of degree d: K(u, v) = (u.v)^d
• Polynomials of degree up to d: K(u, v) = (u.v + 1)^d
• Gaussian kernels: K(u, v) = exp(-||u - v||^2 / (2σ^2))
• Sigmoid: K(u, v) = tanh(η u.v + ν)
Overfitting?
• Huge feature space with kernels, what about overfitting???
  – Maximizing margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting
What about at classification time?
• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with kernels
• Choose a set of features and a kernel function
• Solve the dual problem to obtain support vectors α_i
• At classification time, compute: w.Φ(x) + b = Σ_i α_i y_i K(x_i, x) + b
  Classify as sign(Σ_i α_i y_i K(x_i, x) + b)
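The classification-time computation above can be sketched directly (the support vectors and α values here are hand-picked for illustration, not the output of a solved QP):

```python
def kernel_svm_predict(support, alphas, b, kernel, x):
    """Evaluate sign( sum_i alpha_i * y_i * K(x_i, x) + b ).

    support: list of (x_i, y_i) support vectors; alphas: their dual
    weights. Only support vectors enter the sum, so prediction cost
    scales with the number of support vectors, not with phi(x).
    """
    score = b + sum(a * y * kernel(xi, x)
                    for a, (xi, y) in zip(alphas, support))
    return 1 if score >= 0 else -1

def linear(u, v):
    return sum(a * b for a, b in zip(u, v))

# Illustrative 1-D problem with two support vectors.
support = [((-1.0,), -1), ((1.0,), 1)]
alphas = [0.5, 0.5]
print(kernel_svm_predict(support, alphas, 0.0, linear, (2.0,)))   # 1
print(kernel_svm_predict(support, alphas, 0.0, linear, (-3.0,)))  # -1
```

Swapping `linear` for a Gaussian or polynomial kernel changes the decision surface without ever materializing Φ(x).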
What’s
the
difference
between
SVMs
and
Logis*c
Regression?
Loss function
High dimensional
features with
kernels
SVMs
Logistic
Regression
Hinge loss
Log-loss
Yes!
No
43
Kernels in logistic regression
• Define weights in terms of support vectors: w = Σ_i α_i Φ(x_i)
• Derive a simple gradient descent rule on α_i
What’s
the
difference
between
SVMs
and
Logis*c
Regression?
(Revisited)
Loss function
High dimensional
features with
kernels
Solution sparse
Semantics of
output
SVMs
Logistic
Regression
Hinge loss
Log-loss
Yes!
Yes!
Often yes!
Almost always no!
“Margin”
Real probabilities
45