CSE 575: Statistical Machine Learning
Jingrui He
CIDSE, ASU
Instance-based Learning: 1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric – Euclidean (and many more)
2. How many nearby neighbors to look at? – One
3. A weighting function (optional) – Unused
4. How to fit with the local points? – Just predict the same output as the nearest neighbor.
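The four choices above can be sketched in a few lines of plain Python (a minimal sketch; the function and variable names are illustrative, not from the slides):

```python
import math

def nearest_neighbor_predict(train, x):
    """1-NN: return the output of the single closest training point.

    train: list of (features, label) pairs; x: a feature tuple.
    Euclidean distance metric, one neighbor, no weighting,
    and the neighbor's output is copied verbatim.
    """
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    _, label = min(train, key=lambda pair: euclidean(pair[0], x))
    return label

# Tiny example: two clusters in 1-D.
data = [((0.0,), "left"), ((1.0,), "left"), ((9.0,), "right")]
print(nearest_neighbor_predict(data, (8.0,)))  # "right"
```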
Consistency of 1-NN
• Consider an estimator f_n trained on n examples
  – e.g., 1-NN, regression, ...
• An estimator is consistent if the true error goes to zero as the amount of data increases
  – e.g., for noise-free data, f_n is consistent if its error goes to 0 as n goes to infinity
• Regression is not consistent!
  – Representation bias
• 1-NN is consistent (under some mild fine print)

What about variance???

1-NN overfits?
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric – Euclidean (and many more)
2. How many nearby neighbors to look at? – k
3. A weighting function (optional) – Unused
4. How to fit with the local points? – Just predict the average output among the k nearest neighbors.
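The only changes from 1-NN are the number of neighbors and the averaging step; a minimal regression sketch (names are illustrative):

```python
import math

def knn_predict(train, x, k):
    """k-NN regression: average the outputs of the k nearest points.

    train: list of (features, value) pairs; x: a feature tuple.
    Euclidean distance, k neighbors, no weighting, mean output.
    """
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    return sum(value for _, value in neighbors) / k

data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((10.0,), 100.0)]
print(knn_predict(data, (0.5,), k=3))  # mean of 1.0, 3.0, 5.0 -> 3.0
```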
k-Nearest Neighbor (here k = 9)
[Figure: k-NN fits to noisy data]
K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?
Curse of dimensionality for instance-based learning
• Must store and retrieve all data!
  – Most real work done during testing
  – For every test sample, must search through the whole dataset – very slow!
  – There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ...
• Instance-based learning is often poor with noisy or irrelevant features
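The "search through the whole dataset" cost can be seen directly: a brute-force query computes one distance per stored example, every time (a sketch; the counting is only for illustration):

```python
def brute_force_query(train, x):
    """Return the nearest stored point and the number of distance
    computations performed: one per stored example, per query."""
    comparisons = 0
    best, best_dist = None, float("inf")
    for features, label in train:
        comparisons += 1
        d = sum((a - b) ** 2 for a, b in zip(features, x))
        if d < best_dist:
            best, best_dist = (features, label), d
    return best, comparisons

data = [((float(i),), i % 2) for i in range(10_000)]
best, work = brute_force_query(data, (42.3,))
print(work)  # 10000 distance computations for a single test sample
```

Tree-based and hashing methods exist precisely to avoid touching all n points per query.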
Support Vector Machines

Linear classifiers – Which line is better?
Data: example i: (x_i, y_i), y_i ∈ {+1, -1}
w.x = Σ_j w(j) x(j)
[Figure: candidate separating lines w.x + b = 0]

Pick the one with the largest margin!

Maximize the margin
[Figure: margin around the line w.x + b = 0]

But there are many planes…
[Figure: rescaled versions of the same hyperplane w.x + b = 0]

Review: Normal to a plane
[Figure: w is normal to the plane w.x + b = 0]
Normalized margin – Canonical hyperplanes
[Figure: points x+ and x- on the hyperplanes w.x + b = +1 and w.x + b = -1, with w.x + b = 0 between them; margin 2γ]
Margin maximization using canonical hyperplanes
[Figure: hyperplanes w.x + b = -1, w.x + b = 0, w.x + b = +1; margin 2γ]

Support vector machines (SVMs)
• Solve efficiently by quadratic programming (QP)
  – Well-studied solution algorithms
• Hyperplane defined by support vectors
[Figure: support vectors on the canonical hyperplanes; margin 2γ]
What if the data is not linearly separable?
Use features of features of features of features….
What if the data is still not linearly separable?
• Minimize w.w and number of training mistakes
  – Tradeoff two criteria?
• Tradeoff #(mistakes) and w.w
  – 0/1 loss
  – Slack penalty C
  – Not QP anymore
  – Also doesn't distinguish near misses and really bad mistakes

Slack variables – Hinge loss
• If margin ≥ 1, don't care
• If margin < 1, pay linear penalty
Side note: What's the difference between SVMs and logistic regression?
SVM: hinge loss, max(0, 1 - y(w.x + b))
Logistic regression: log loss, log(1 + exp(-y(w.x + b)))
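The two losses can be compared numerically as functions of the signed margin y(w.x + b) (a sketch, assuming the common log(1 + e^(-margin)) form of the log loss):

```python
import math

def hinge_loss(margin):
    """SVM: exactly zero once the margin reaches 1, linear below."""
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic regression: smooth, strictly positive everywhere."""
    return math.log(1.0 + math.exp(-margin))

for m in (-2.0, 0.0, 1.0, 3.0):
    print(f"margin {m:+.1f}: hinge {hinge_loss(m):.3f}, log {log_loss(m):.3f}")
```

Note the qualitative difference: hinge is flat at zero past margin 1 (giving sparse solutions), while log loss keeps shrinking but never vanishes.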
Constrained optimization

Lagrange multipliers – Dual variables
Moving the constraint to the objective function
Lagrangian:
Solve:

Lagrange multipliers – Dual variables
Solving:
Dual SVM derivation (1) – the linearly separable case

Dual SVM derivation (2) – the linearly separable case

Dual SVM interpretation
[Figure: w.x + b = 0 with support vectors]

Dual SVM formulation – the linearly separable case

Dual SVM derivation – the non-separable case

Dual SVM formulation – the non-separable case
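The equations on these derivation slides did not survive extraction; the standard derivation they refer to runs as follows (textbook form, not recovered from the slides themselves):

```latex
% Primal (linearly separable case):
\min_{w,b}\ \tfrac{1}{2}\, w \cdot w
\quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1 \ \ \forall i

% Lagrangian (constraints moved into the objective, \alpha_i \ge 0):
L(w, b, \alpha) = \tfrac{1}{2}\, w \cdot w
  - \sum_i \alpha_i \left[ y_i\,(w \cdot x_i + b) - 1 \right]

% Stationarity conditions:
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0

% Substituting back gives the dual:
\max_{\alpha}\ \sum_i \alpha_i
  - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j)
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0

% Non-separable case: identical dual, with the box constraint
% 0 \le \alpha_i \le C replacing \alpha_i \ge 0.
```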
Why did we learn about the dual SVM?
• There are some quadratic programming algorithms that can solve the dual faster than the primal
• But, more importantly, the "kernel trick"!!!
  – Another little detour…
Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features….
Feature space can get really large really quickly!
Higher order polynomials
[Figure: number of monomial terms vs. number of input dimensions, for degree d = 2, 3, 4; m – input features, d – degree of polynomial]
The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms
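The "1.6 billion" figure can be checked with a stars-and-bars count, assuming the slide counts monomials of degree exactly d: there are C(m + d - 1, d) of them in m variables.

```python
from math import comb

def monomials_of_degree(m, d):
    """Number of distinct monomials of degree exactly d in m variables
    (stars and bars: choose d factors from m with repetition)."""
    return comb(m + d - 1, d)

print(monomials_of_degree(100, 6))  # 1609344100 -- about 1.6 billion
```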
Dual formulation only depends on dot-products, not on w!

Dot-product of polynomials
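The dot-product-of-polynomials identity can be verified numerically: for the degree-2 feature map Φ(x) = (x1², √2·x1·x2, x2²), the explicit dot product Φ(u)·Φ(v) equals (u.v)² (a small check in plain Python; this particular feature map is the standard one, not taken verbatim from the slide):

```python
import math

def phi(x):
    """Explicit degree-2 monomial features of a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(u, v):
    """Same quantity in O(m) time: (u.v)^2, never forming phi."""
    return sum(a * b for a, b in zip(u, v)) ** 2

u, v = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))
print(explicit, poly2_kernel(u, v))  # both equal 1.0
```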
Finally: the "kernel trick"!
• Never represent features explicitly
  – Compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
• Very interesting theory – Reproducing Kernel Hilbert Spaces
Polynomial kernels
• All monomials of degree d in O(d) operations: K(u, v) = (u.v)^d
• How about all monomials of degree up to d?
  – Solution 0: add up the kernels (u.v)^k for k = 1, …, d
  – Better solution: K(u, v) = (u.v + 1)^d
Common kernels
• Polynomials of degree d: K(u, v) = (u.v)^d
• Polynomials of degree up to d: K(u, v) = (u.v + 1)^d
• Gaussian kernels: K(u, v) = exp(-||u - v||^2 / (2σ^2))
• Sigmoid: K(u, v) = tanh(η u.v + ν)
Overfitting?
• Huge feature space with kernels, what about overfitting???
  – Maximizing margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting
What about at classification time?
• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with kernels
• Choose a set of features and a kernel function
• Solve the dual problem to obtain support vectors α_i
• At classification time, compute: w.Φ(x) + b = Σ_i α_i y_i K(x_i, x) + b
  Classify as sign(Σ_i α_i y_i K(x_i, x) + b)
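The classification-time computation above can be sketched directly (the support vectors and α values here are hand-picked for illustration, not the output of a solved QP):

```python
def kernel_svm_predict(support, alphas, b, kernel, x):
    """Evaluate sign( sum_i alpha_i * y_i * K(x_i, x) + b ).

    support: list of (x_i, y_i) support vectors; alphas: their dual
    weights. Only support vectors enter the sum, so prediction cost
    scales with the number of support vectors, not with phi(x).
    """
    score = b + sum(a * y * kernel(xi, x)
                    for a, (xi, y) in zip(alphas, support))
    return 1 if score >= 0 else -1

def linear(u, v):
    return sum(a * b for a, b in zip(u, v))

# Illustrative 1-D problem with two support vectors.
support = [((-1.0,), -1), ((1.0,), 1)]
alphas = [0.5, 0.5]
print(kernel_svm_predict(support, alphas, 0.0, linear, (2.0,)))   # 1
print(kernel_svm_predict(support, alphas, 0.0, linear, (-3.0,)))  # -1
```

Swapping `linear` for a Gaussian or polynomial kernel changes the decision surface without ever materializing Φ(x).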
What’s
the
difference
between
SVMs
and
Logis*c
Regression?
Loss function
High dimensional
features with
kernels
SVMs
Logistic
Regression
Hinge loss
Log-loss
Yes!
No
43
Kernels in logistic regression
• Define weights in terms of support vectors: w = Σ_i α_i Φ(x_i)
• Derive a simple gradient descent rule on α_i
What’s
the
difference
between
SVMs
and
Logis*c
Regression?
(Revisited)
Loss function
High dimensional
features with
kernels
Solution sparse
Semantics of
output
SVMs
Logistic
Regression
Hinge loss
Log-loss
Yes!
Yes!
Often yes!
Almost always no!
“Margin”
Real probabilities
45