EE103 (Fall 2011-12)
8. Linear least-squares
• definition
• examples and applications
• solution of a least-squares problem, normal equations
8-1
Definition
overdetermined linear equations
Ax = b (A is m×n with m > n)
if b ∉ range(A), cannot solve for x
least-squares formulation
minimize ‖Ax − b‖ = ( Σ_{i=1}^m ( Σ_{j=1}^n a_{ij} x_j − b_i )² )^{1/2}
• r = Ax − b is called the residual or error
• x with smallest residual norm ‖r‖ is called the least-squares solution
• equivalent to minimizing ‖Ax − b‖²
Linear least-squares 8-2
Example
A =
⎡  2  0 ⎤
⎢ −1  1 ⎥
⎣  0  2 ⎦
,  b =
⎡  1 ⎤
⎢  0 ⎥
⎣ −1 ⎦
least-squares solution
minimize (2x_1 − 1)² + (−x_1 + x_2)² + (2x_2 + 1)²

to find the optimal x_1, x_2, set the derivatives w.r.t. x_1 and x_2 equal to zero:

10x_1 − 2x_2 − 4 = 0,  −2x_1 + 10x_2 + 4 = 0

solution: x_1 = 1/3, x_2 = −1/3
(much more on practical algorithms for LS problems later)
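as a quick numerical check (not part of the original slides), a minimal Python/numpy sketch that solves this example via the normal equations derived above:

    import numpy as np

    # the 3x2 example from this slide
    A = np.array([[2.0, 0.0],
                  [-1.0, 1.0],
                  [0.0, 2.0]])
    b = np.array([1.0, 0.0, -1.0])

    # solve the normal equations (A^T A) x = A^T b
    x = np.linalg.solve(A.T @ A, A.T @ b)
    print(x)  # [ 0.3333 -0.3333 ], i.e., x_1 = 1/3, x_2 = -1/3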
Linear least-squares 8-3
(figure: surface plots of r_1² = (2x_1 − 1)², r_2² = (−x_1 + x_2)², r_3² = (2x_2 + 1)², and the sum r_1² + r_2² + r_3² as functions of (x_1, x_2))
Linear least-squares 8-4
Outline
• definition
• examples and applications
• solution of a least-squares problem, normal equations
Data fitting
fit a function

g(t) = x_1 g_1(t) + x_2 g_2(t) + · · · + x_n g_n(t)

to data (t_1, y_1), . . . , (t_m, y_m), i.e., choose coefficients x_1, . . . , x_n so that

g(t_1) ≈ y_1,  g(t_2) ≈ y_2,  . . . ,  g(t_m) ≈ y_m

• g_i(t) : R → R are given functions (basis functions)
• problem variables: the coefficients x_1, x_2, . . . , x_n
• usually m ≫ n, hence no exact solution with g(t_i) = y_i for all i
• applications: developing a simple, approximate model of observed data
Linear least-squares 8-5
Least-squares data fitting
compute x by minimizing

Σ_{i=1}^m (g(t_i) − y_i)² = Σ_{i=1}^m (x_1 g_1(t_i) + x_2 g_2(t_i) + · · · + x_n g_n(t_i) − y_i)²
in matrix notation: minimize ‖Ax − b‖² where
A =
⎡ g_1(t_1)  g_2(t_1)  g_3(t_1)  · · ·  g_n(t_1) ⎤
⎢ g_1(t_2)  g_2(t_2)  g_3(t_2)  · · ·  g_n(t_2) ⎥
⎢    ⋮         ⋮         ⋮                ⋮     ⎥
⎣ g_1(t_m)  g_2(t_m)  g_3(t_m)  · · ·  g_n(t_m) ⎦
,  b =
⎡ y_1 ⎤
⎢ y_2 ⎥
⎢  ⋮  ⎥
⎣ y_m ⎦
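one possible way to build this matrix in Python/numpy (a sketch, not from the slides; the basis functions and data below are made up for illustration):

    import numpy as np

    def design_matrix(basis, t):
        # A[i, j] = g_j(t_i): evaluate each basis function at all data points
        return np.column_stack([g(t) for g in basis])

    # hypothetical basis g_1(t) = 1, g_2(t) = t, g_3(t) = sin(t), and data
    basis = [lambda t: np.ones_like(t), lambda t: t, np.sin]
    t = np.linspace(0.0, 10.0, 50)
    y = 1.0 + 0.5 * t + 2.0 * np.sin(t)

    A = design_matrix(basis, t)
    x, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes ||Ax - y||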
Linear least-squares 8-6
Example: data fitting with polynomials
g(t) = x_1 + x_2 t + x_3 t² + · · · + x_n t^{n−1}

basis functions are g_k(t) = t^{k−1}, k = 1, . . . , n
A =
⎡ 1  t_1  t_1²  · · ·  t_1^{n−1} ⎤
⎢ 1  t_2  t_2²  · · ·  t_2^{n−1} ⎥
⎢ ⋮   ⋮    ⋮               ⋮    ⎥
⎣ 1  t_m  t_m²  · · ·  t_m^{n−1} ⎦
,  b =
⎡ y_1 ⎤
⎢ y_2 ⎥
⎢  ⋮  ⎥
⎣ y_m ⎦
interpolation (m = n): can satisfy g(t_i) = y_i exactly by solving Ax = b

approximation (m > n): make the error small by minimizing ‖Ax − b‖
Linear least-squares 8-7
example. fit a polynomial to f(t) = 1/(1 + 25t²) on [−1, 1]

• pick m = n points t_i in [−1, 1], and calculate y_i = 1/(1 + 25t_i²)
• interpolate by solving Ax = b
• interpolate by solving Ax = b
(figure: interpolating polynomials for n = 5 and n = 15; dashed line: f; solid line: polynomial g; circles: the points (t_i, y_i))
increasing n does not improve the overall quality of the fit
Linear least-squares 8-8
same example by approximation
• pick m = 50 points t_i in [−1, 1]
• fit the polynomial by minimizing ‖Ax − b‖
(figure: approximating polynomials for n = 5 and n = 15; dashed line: f; solid line: polynomial g; circles: the points (t_i, y_i))
much better fit overall
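both experiments can be reproduced in a few lines of numpy (a sketch under the assumptions above; fit_poly is a hypothetical helper, and the plots themselves are omitted):

    import numpy as np

    def f(t):
        return 1.0 / (1.0 + 25.0 * t**2)

    def fit_poly(m, n):
        # sample f at m points in [-1, 1] and fit a polynomial of degree n-1
        # by minimizing ||Ax - b||; m = n amounts to interpolation
        t = np.linspace(-1.0, 1.0, m)
        A = np.vander(t, n, increasing=True)  # columns 1, t, ..., t^(n-1)
        x, *_ = np.linalg.lstsq(A, f(t), rcond=None)
        return x

    x_interp = fit_poly(15, 15)  # interpolation: large oscillations near ±1
    x_approx = fit_poly(50, 15)  # approximation: much better fit overall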
Linear least-squares 8-9
Least-squares estimation
y = Ax + w
• x is what we want to estimate or reconstruct
• y is our measurement(s)
• w is an unknown noise or measurement error (assumed small)
• ith row of A characterizes ith sensor or ith measurement
least-squares estimation
choose as estimate the vector x̂ that minimizes

‖Ax̂ − y‖

i.e., minimize the deviation between what we actually observed (y), and
what we would observe if x = x̂ and there were no noise (w = 0)
Linear least-squares 8-10
Navigation by range measurements
find position (u, v) in a plane from distances to beacons at positions (p_i, q_i)
(figure: unknown position (u, v), four beacons at positions (p_1, q_1), . . . , (p_4, q_4), and measured ranges ρ_1, . . . , ρ_4)
four nonlinear equations in two variables u, v:
√((u − p_i)² + (v − q_i)²) = ρ_i for i = 1, 2, 3, 4
ρ_i is the measured distance from the unknown position (u, v) to beacon i
Linear least-squares 8-11
linearized distance function: assume u = u_0 + ∆u, v = v_0 + ∆v where

• u_0, v_0 are known (e.g., position a short time ago)
• ∆u, ∆v are small (compared to the ρ_i's)
√((u_0 + ∆u − p_i)² + (v_0 + ∆v − q_i)²)
   ≈ √((u_0 − p_i)² + (v_0 − q_i)²) + ((u_0 − p_i)∆u + (v_0 − q_i)∆v) / √((u_0 − p_i)² + (v_0 − q_i)²)
gives four linear equations in the variables ∆u, ∆v:
((u_0 − p_i)∆u + (v_0 − q_i)∆v) / √((u_0 − p_i)² + (v_0 − q_i)²) ≈ ρ_i − √((u_0 − p_i)² + (v_0 − q_i)²)

for i = 1, 2, 3, 4
Linear least-squares 8-12
linearized equations
Ax ≈ b
where x = (∆u, ∆v) and A is 4 × 2 with

b_i = ρ_i − √((u_0 − p_i)² + (v_0 − q_i)²)

a_{i1} = (u_0 − p_i) / √((u_0 − p_i)² + (v_0 − q_i)²),  a_{i2} = (v_0 − q_i) / √((u_0 − p_i)² + (v_0 − q_i)²)
• due to linearization and measurement error, we do not expect an exact
solution (Ax = b)
• we can try to find ∆u and ∆v that ‘almost’ satisfy the equations
Linear least-squares 8-13
numerical example
• beacons at positions (10, 0), (−10, 2), (3, 9), (10, 10)
• measured distances ρ = (8.22, 11.9, 7.08, 11.33)
• (unknown) actual position is (2, 2)
linearized range equations (linearized around (u_0, v_0) = (0, 0)):

⎡ −1.00   0.00 ⎤            ⎡ −1.77 ⎤
⎢  0.98  −0.20 ⎥ ⎡ ∆u ⎤  ≈  ⎢  1.72 ⎥
⎢ −0.32  −0.95 ⎥ ⎣ ∆v ⎦     ⎢ −2.41 ⎥
⎣ −0.71  −0.71 ⎦            ⎣ −2.81 ⎦
least-squares solution: (∆u, ∆v) = (1.97, 1.90) (norm of error is 0.10)
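a numpy sketch reproducing this example (not from the slides; small differences from the printed numbers may remain because the slide values are rounded):

    import numpy as np

    beacons = np.array([[10.0, 0.0], [-10.0, 2.0], [3.0, 9.0], [10.0, 10.0]])
    rho = np.array([8.22, 11.9, 7.08, 11.33])  # measured distances
    u0, v0 = 0.0, 0.0                          # linearization point

    # distances from (u0, v0) to the beacons
    d = np.sqrt((u0 - beacons[:, 0])**2 + (v0 - beacons[:, 1])**2)

    # linearized range equations A (du, dv) ~ b from page 8-13
    A = np.column_stack([(u0 - beacons[:, 0]) / d, (v0 - beacons[:, 1]) / d])
    b = rho - d

    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(x)                          # approximately (1.97, 1.90)
    print(np.linalg.norm(A @ x - b))  # approximately 0.10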
Linear least-squares 8-14
Least-squares system identification
measure input u(t) and output y(t) for t = 0, . . . , N of an unknown system
(block diagram: input u(t) → unknown system → output y(t))
example (N = 70):
(figure: measured input u(t) and output y(t) for t = 0, . . . , 70)
system identification problem: find reasonable model for system based
on measured I/O data u, y
Linear least-squares 8-15
moving average model
y_model(t) = h_0 u(t) + h_1 u(t − 1) + h_2 u(t − 2) + · · · + h_n u(t − n)

where y_model(t) is the model output
• a simple and widely used model
• predicted output is a linear combination of current and n previous inputs
• h_0, . . . , h_n are the parameters of the model
• called a moving average (MA) model with n delays
least-squares identification: choose the model that minimizes the error
E = ( Σ_{t=n}^N (y_model(t) − y(t))² )^{1/2}
Linear least-squares 8-16
formulation as a linear least-squares problem:
E = ( Σ_{t=n}^N (h_0 u(t) + h_1 u(t − 1) + · · · + h_n u(t − n) − y(t))² )^{1/2} = ‖Ax − b‖
A =
⎡ u(n)      u(n − 1)  u(n − 2)  · · ·  u(0)     ⎤
⎢ u(n + 1)  u(n)      u(n − 1)  · · ·  u(1)     ⎥
⎢ u(n + 2)  u(n + 1)  u(n)      · · ·  u(2)     ⎥
⎢    ⋮         ⋮         ⋮                ⋮    ⎥
⎣ u(N)      u(N − 1)  u(N − 2)  · · ·  u(N − n) ⎦

x =
⎡ h_0 ⎤
⎢ h_1 ⎥
⎢ h_2 ⎥
⎢  ⋮  ⎥
⎣ h_n ⎦
,  b =
⎡ y(n)     ⎤
⎢ y(n + 1) ⎥
⎢ y(n + 2) ⎥
⎢    ⋮     ⎥
⎣ y(N)     ⎦
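in numpy this identification step might look as follows (a sketch; fit_ma is a hypothetical helper, and u, y are assumed to be arrays of length N + 1):

    import numpy as np

    def fit_ma(u, y, n):
        # build A and b as above: row t - n holds u(t), u(t-1), ..., u(t-n)
        # for t = n, ..., N, and b holds y(n), ..., y(N)
        N = len(u) - 1
        A = np.column_stack([u[n - k : N + 1 - k] for k in range(n + 1)])
        b = y[n : N + 1]
        h, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizes ||Ah - b||
        return h

    # example usage: h = fit_ma(u, y, n=7) gives the 8 coefficients h_0, ..., h_7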
Linear least-squares 8-17
example (I/O data of page 8-15) with n = 7: least-squares solution is
h_0 = 0.0240,  h_1 = 0.2819,  h_2 = 0.4176,  h_3 = 0.3536,
h_4 = 0.2425,  h_5 = 0.4873,  h_6 = 0.2084,  h_7 = 0.4412
(figure: solid line: y(t), the actual output; dashed line: y_model(t))
Linear least-squares 8-18
model order selection: how large should n be?
(figure: relative error E/‖y‖ versus model order n)
• suggests using largest possible n for smallest error
• much more important question: how good is the model at predicting
new data (i.e., not used to calculate the model)?
Linear least-squares 8-19
model validation: test model on a new data set (from the same system)
(figures: validation input ū(t) and output ȳ(t); relative prediction error versus n for the validation data and the modeling data)
• for n too large the predictive
ability of the model becomes
worse!
• validation data suggest n = 10
Linear least-squares 8-20
for n = 50 the actual and predicted outputs on system identification and
model validation data are:
(figures: left, solid line y(t) and dashed line y_model(t) for the I/O set used to compute the model; right, solid line ȳ(t) and dashed line ȳ_model(t) for the model validation I/O set)
loss of predictive ability when n is too large is called overfitting or
overmodeling
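a sketch of how the validation comparison might be computed (not from the slides; relative_error is a hypothetical helper, with fit_ma as on page 8-17):

    import numpy as np

    def relative_error(u, y, h):
        # relative prediction error E / ||y|| of an MA model with
        # coefficients h on I/O data u, y (possibly new data)
        n = len(h) - 1
        N = len(u) - 1
        A = np.column_stack([u[n - k : N + 1 - k] for k in range(n + 1)])
        r = A @ h - y[n : N + 1]
        return np.linalg.norm(r) / np.linalg.norm(y[n : N + 1])

    # fit on the modeling data (u, y), evaluate on validation data (u_v, y_v):
    # errors = [relative_error(u_v, y_v, fit_ma(u, y, n)) for n in range(50)]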
Linear least-squares 8-21
Outline
• definition
• examples and applications
• solution of a least-squares problem, normal equations
Geometric interpretation of a LS problem
minimize ‖Ax − b‖²

A is m × n with columns a_1, . . . , a_n

• ‖Ax − b‖ is the distance of b to the vector

Ax = x_1 a_1 + x_2 a_2 + · · · + x_n a_n

• the solution x_ls gives the linear combination of the columns of A closest to b
• Ax_ls is the projection of b on the range of A
Linear least-squares 8-22
example
A =
⎡ 1  −1 ⎤
⎢ 1   2 ⎥
⎣ 0   0 ⎦
,  b =
⎡ 1 ⎤
⎢ 4 ⎥
⎣ 2 ⎦

(figure: b and its projection Ax_ls = 2a_1 + a_2 on the plane spanned by a_1 and a_2)
least-squares solution x_ls:

Ax_ls =
⎡ 1 ⎤
⎢ 4 ⎥
⎣ 0 ⎦
,  x_ls =
⎡ 2 ⎤
⎣ 1 ⎦
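a quick numerical check of this picture (a numpy sketch, not part of the slides):

    import numpy as np

    A = np.array([[1.0, -1.0],
                  [1.0, 2.0],
                  [0.0, 0.0]])
    b = np.array([1.0, 4.0, 2.0])

    x_ls = np.linalg.solve(A.T @ A, A.T @ b)
    print(x_ls)                  # [2. 1.]
    print(A @ x_ls)              # [1. 4. 0.], the projection of b on range(A)
    print(A.T @ (A @ x_ls - b))  # [0. 0.]: the residual is orthogonal to range(A)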
Linear least-squares 8-23
The solution of a least-squares problem
if A is left-invertible, then

x_ls = (A^T A)^{−1} A^T b

is the unique solution of the least-squares problem

minimize ‖Ax − b‖²
• in other words, if x ≠ x_ls, then ‖Ax − b‖² > ‖Ax_ls − b‖²
• recall from page 4-25 that A^T A is positive definite and that (A^T A)^{−1} A^T is a left-inverse of A
Linear least-squares 8-24
proof
we show that ‖Ax − b‖² > ‖Ax_ls − b‖² for x ≠ x_ls:
‖Ax − b‖² = ‖A(x − x_ls) + (Ax_ls − b)‖²
          = ‖A(x − x_ls)‖² + ‖Ax_ls − b‖²
          > ‖Ax_ls − b‖²
• 2nd step follows from A(x − x_ls) ⊥ (Ax_ls − b):

(A(x − x_ls))^T (Ax_ls − b) = (x − x_ls)^T (A^T A x_ls − A^T b) = 0
• 3rd step follows from the zero nullspace property of A:

x ≠ x_ls  =⇒  A(x − x_ls) ≠ 0
Linear least-squares 8-25
The normal equations
(A^T A) x = A^T b
if A is left-invertible:
• least-squares solution can be found by solving the normal equations
• n equations in n variables with a positive definite coefficient matrix
• can be solved using Cholesky factorization
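for example, a minimal Python sketch (assuming numpy and scipy are available; this is one possible implementation, not the course's reference code):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def lstsq_normal_eq(A, b):
        # form and solve the normal equations (A^T A) x = A^T b
        # via a Cholesky factorization of the positive definite A^T A
        c, low = cho_factor(A.T @ A)
        return cho_solve((c, low), A.T @ b)

note that forming A^T A explicitly can degrade accuracy when A is ill-conditioned, which is one reason the practical algorithms mentioned on page 8-3 are treated separately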
Linear least-squares 8-26