Undergraduate Text

Published on June 2016
Azmy S. Ackleh, University of Louisiana at Lafayette
R. Baker Kearfott, University of Louisiana at Lafayette
Applied Numerical
Methods
Techniques, Software, and Applications from
Biology, Engineering, Physics, Statistics, and
Operations Research
CRC PRESS
Boca Raton London New York Washington, D.C.
Dedication
To my wife Howayda,
my children Aseal and Wedad,
and my parents Sima’an and Wedad–A.S.A.
To my wife Ruth,
my daughter Frances,
and my mother Edith–R.B.K.
Preface
The purpose of this book is to introduce people to underlying techniques, methods, and software in common use today throughout scientific computing, to enable additional study of these tools, or to use these tools in scientific research, mathematical modeling, or engineering analysis and design. The book has elements of our graduate-level textbook, but we have omitted some of the less commonly used topics and theory and we have given many more illustrative examples of the basic concepts. We have omitted many proofs, most of which can be found in our graduate-level text or as exercises in that text, but we have supplied additional perspective relating the theory to applications and implementations on the computer. We have also integrated the use of matlab more closely into the text itself, and we have supplied examples of use of the methods in mathematical biology, engineering, physics, statistics, and operations research, as well as many simple, illustrative examples.
We have used notes for this book to teach a one-semester course attended by a mixture of undergraduate students in computer science, physics, mathematics and statistics, and in various subfields of engineering, as well as some beginning mathematics and engineering graduate students.
Contents
List of Figures xv
List of Tables xvii
1 Mathematical Review and Computer Arithmetic 1
1.1 Mathematical Review . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Intermediate Value Theorem, Mean Value Theorems,
and Taylor’s Theorem . . . . . . . . . . . . . . . . . . 1
1.1.2 Big “O” Notation . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Convergence Rates . . . . . . . . . . . . . . . . . . . . 7
1.2 Computer Arithmetic . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Floating Point Arithmetic and Rounding Error . . . . 11
1.2.2 Practicalities and the IEEE Floating Point Standard . 18
1.3 Interval Computations . . . . . . . . . . . . . . . . . . . . . 24
1.3.1 Interval Arithmetic . . . . . . . . . . . . . . . . . . . . 25
1.3.2 Application of Interval Arithmetic: Examples . . . . . 29
1.4 Programming Environments . . . . . . . . . . . . . . . . . . 30
1.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2 Numerical Solution of Nonlinear Equations of One Variable 39
2.1 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2 The Fixed Point Method . . . . . . . . . . . . . . . . . . . . 47
2.3 Newton’s Method (Newton-Raphson Method) . . . . . . . . 54
2.4 The Univariate Interval Newton Method . . . . . . . . . . . 56
2.5 The Secant Method . . . . . . . . . . . . . . . . . . . . . . . 60
2.6 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3 Linear Systems of Equations 69
3.1 Matrices, Vectors, and Basic Properties . . . . . . . . . . . . 70
3.2 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . 79
3.2.1 The Gaussian Elimination Algorithm . . . . . . . . . . 81
3.2.2 The LU decomposition . . . . . . . . . . . . . . . . . 84
3.2.3 Determinants and Inverses . . . . . . . . . . . . . . . . 86
3.2.4 Pivoting in Gaussian Elimination . . . . . . . . . . . . 88
3.2.5 Systems with a Special Structure . . . . . . . . . . . . 92
3.3 Roundoﬀ Error and Conditioning . . . . . . . . . . . . . . . 100
3.3.1 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.3.2 Condition Numbers . . . . . . . . . . . . . . . . . . . 105
3.3.3 Roundoﬀ Error in Gaussian Elimination . . . . . . . . 110
3.3.4 Interval Bounds . . . . . . . . . . . . . . . . . . . . . . 111
3.4 Orthogonal Decomposition (QR Decomposition) . . . . . . . 116
3.4.1 Properties of Orthogonal Matrices . . . . . . . . . . . 116
3.4.2 Least Squares and the QR Decomposition . . . . . . . 117
3.5 Iterative Methods for Solving Linear Systems . . . . . . . . . 121
3.5.1 The Jacobi Method . . . . . . . . . . . . . . . . . . . 123
3.5.2 The Gauss–Seidel Method . . . . . . . . . . . . . . . . 124
3.5.3 Successive Overrelaxation . . . . . . . . . . . . . . . . 125
3.5.4 Convergence of Iterative Methods . . . . . . . . . . . . 126
3.5.5 The Interval Gauss–Seidel Method . . . . . . . . . . . 130
3.6 The Singular Value Decomposition . . . . . . . . . . . . . . . 133
3.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4 Approximating Functions and Data 145
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.2 Taylor Polynomial Approximations . . . . . . . . . . . . . . 145
4.3 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . 146
4.3.1 The Vandermonde System . . . . . . . . . . . . . . . . 146
4.3.2 The Lagrange Form . . . . . . . . . . . . . . . . . . . 148
4.3.3 The Newton Form . . . . . . . . . . . . . . . . . . . . 150
4.3.4 An Error Formula for the Interpolating Polynomial . . 153
4.3.5 Optimal Points of Interpolation: Chebyshev Points . . 156
4.4 Piecewise Polynomial Interpolation . . . . . . . . . . . . . . 159
4.4.1 Piecewise Linear Interpolation . . . . . . . . . . . . . 159
4.4.2 Cubic Spline Interpolation . . . . . . . . . . . . . . . . 163
4.5 Approximation Other Than by Interpolation . . . . . . . . . 171
4.5.1 Least Squares Approximation . . . . . . . . . . . . . . 172
4.5.2 Minimax Approximation . . . . . . . . . . . . . . . . . 172
4.5.3 Sum of Absolute Values Approximation . . . . . . . . 175
4.5.4 Weighted Fits . . . . . . . . . . . . . . . . . . . . . . . 177
4.6 Approximation Other Than by Polynomials . . . . . . . . . . 179
4.7 Interval (Rigorous) Bounds on the Errors . . . . . . . . . . . 182
4.8 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5 Eigenvalue-Eigenvector Computation 191
5.1 Facts About Eigenvalues and Eigenvectors . . . . . . . . . . 191
5.2 The Power Method . . . . . . . . . . . . . . . . . . . . . . . 196
5.3 Other Methods for Eigenvalues and Eigenvectors . . . . . . . 202
5.3.1 The Inverse Power Method . . . . . . . . . . . . . . . 202
5.3.2 The QR Method . . . . . . . . . . . . . . . . . . . . . 204
5.3.3 Jacobi Diagonalization (Jacobi Method) . . . . . . . . 205
5.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6 Numerical Diﬀerentiation and Integration 211
6.1 Numerical Diﬀerentiation . . . . . . . . . . . . . . . . . . . . 211
6.1.1 Derivation of Formulas . . . . . . . . . . . . . . . . . . 211
6.1.2 Error Analysis . . . . . . . . . . . . . . . . . . . . . . 212
6.2 Automatic (Computational) Diﬀerentiation . . . . . . . . . . 215
6.2.1 The Forward Mode . . . . . . . . . . . . . . . . . . . . 216
6.2.2 The Reverse Mode . . . . . . . . . . . . . . . . . . . . 218
6.2.3 Implementation of Automatic Diﬀerentiation . . . . . 220
6.3 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . 221
6.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . 221
6.3.2 Newton-Cotes Formulas . . . . . . . . . . . . . . . . . 222
6.3.3 Gaussian Quadrature . . . . . . . . . . . . . . . . . . 223
6.3.4 More General Integrals . . . . . . . . . . . . . . . . . . 227
6.3.5 Error Terms . . . . . . . . . . . . . . . . . . . . . . . . 229
6.3.6 Changes of Variables . . . . . . . . . . . . . . . . . . . 233
6.3.7 Composite Quadrature . . . . . . . . . . . . . . . . . . 234
6.3.8 Adaptive Quadrature . . . . . . . . . . . . . . . . . . 238
6.3.9 Multiple Integrals, Singular Integrals, and Infinite Intervals . . . 242
6.3.10 Interval Bounds . . . . . . . . . . . . . . . . . . . . . . 246
6.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
7 Initial Value Problems for Ordinary Diﬀerential Equations 253
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.2 Euler’s method . . . . . . . . . . . . . . . . . . . . . . . . . . 254
7.3 Higher-Order and Systems of Diﬀerential Equations . . . . . 256
7.4 Higher-Order Taylor Series Methods . . . . . . . . . . . . . . 259
7.5 Runge–Kutta Methods . . . . . . . . . . . . . . . . . . . . . 261
7.6 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.7 Adaptive Step Controls . . . . . . . . . . . . . . . . . . . . . 266
7.8 Multistep, Implicit, and Predictor-Corrector Methods . . . . 269
7.9 Stiﬀ Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.9.1 Stiﬀ Systems and Linear Systems . . . . . . . . . . . . 273
7.9.2 Stability of Stiﬀ Systems . . . . . . . . . . . . . . . . . 277
7.9.3 Methods for Stiﬀ Systems . . . . . . . . . . . . . . . . 281
7.10 Application to Parameter Estimation in Diﬀerential Equations 282
7.11 Application for Solving an SIRS Epidemic Model . . . . . . . 283
7.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
8 Numerical Solution of Systems of Nonlinear Equations 291
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
8.2 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . 294
8.3 Multidimensional Fixed Point Iteration . . . . . . . . . . . . 298
8.4 Multivariate Interval Newton Methods . . . . . . . . . . . . 303
8.5 Quasi-Newton (Multivariate Secant) Methods . . . . . . . . 307
8.6 Nonlinear Least Squares . . . . . . . . . . . . . . . . . . . . . 310
8.7 Methods for Finding All Solutions . . . . . . . . . . . . . . . 317
8.7.1 Homotopy Methods . . . . . . . . . . . . . . . . . . . 317
8.7.2 Branch and Bound Methods . . . . . . . . . . . . . . . 317
8.8 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
8.9 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
References 329
Index 333
List of Figures
1.1 Illustration of the Intermediate Value Theorem. . . . . . . . . 2
1.2 An example ﬂoating point system: β = 10, t = 1, and m = 1. 12
2.1 Example for the Intermediate Value Theorem applied to roots
of a function. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 Graph of $e^x + x$ for Example 2.1. . . . . . . . . . . . . . . . . 41
2.3 Example of when the method of bisection cannot be applied. 42
2.4 Example of monotonic convergence of ﬁxed point iteration. . 52
2.5 Illustration of two iterations of Newton’s method. . . . . . . . 55
2.6 Examples of divergence of Newton’s method. On the left, the
sequence diverges; on the right, the sequence oscillates. . . . . 55
2.7 Geometric interpretation of the secant method. . . . . . . . . 61
4.1 An example of a piecewise linear function. . . . . . . . . . . . 159
4.2 Graphs of the “hat” functions $\varphi_i(x)$. . . . . . . . . . . . . . . 161
4.3 B-spline basis functions. . . . . . . . . . . . . . . . . . . . . . 164
5.1 Illustration of Gerschgorin discs for Example 5.2. . . . . . . . 195
6.1 Illustration of the total error (roundoff plus truncation) bound
in forward difference quotient approximation to $f'$. . . . . . . 213
7.1 Actual solutions to the stiﬀ ODE system of Example 7.12. . . 276
List of Tables
1.1 Parameters for IEEE arithmetic . . . . . . . . . . . . . . . . . 20
1.2 Machine constants for IEEE arithmetic . . . . . . . . . . . . . 21
2.1 Convergence of the interval Newton method with $f(x) = x^2 - 2$. 59
3.1 Condition numbers of some Hilbert matrices . . . . . . . . . . 109
3.2 Iterates of the Jacobi and Gauss–Seidel methods, for Example 3.32 . . . 126
4.1 Error factors $K$ and $M$ in polynomial approximations $f(x) = p(x) + K(x)M(f; x)$. . . . 183
6.1 Weights and sample points: Gauss–Legendre quadrature . . . 227
6.2 Error terms seen so far . . . . . . . . . . . . . . . . . . . . . . 230
6.3 Some quadrature formula error terms . . . . . . . . . . . . . . 232
Chapter 1
Mathematical Review and Computer
Arithmetic
1.1 Mathematical Review
The tools of scientific, engineering, and operations research computing are firmly based in the calculus. In particular, formulating and solving mathematical models in these areas involves approximation of quantities, such as integrals, derivatives, solutions to differential equations, and solutions to systems of equations, first seen in a calculus course. Indeed, techniques from such a course are the basis of much of scientific computation. We review these techniques here, with particular emphasis on how we will use them.
In addition to basic calculus techniques, scientific computing involves approximation of the real number system by decimal numbers with a fixed number of digits in their representation. Except for certain research-oriented systems, computer number systems today for this purpose are floating point systems, and almost all such floating point systems in use today adhere to the IEEE 754-2008 floating point standard. We describe floating point numbers and the floating point standard in this chapter, paying particular attention to consequences and pitfalls of its use.
Third, programming and software tools are used in scientific computing. Considering how commonly it is used, ease of programming and debugging, documentation, and packages accessible from it, we have elected to use matlab throughout this book. We introduce the basics of matlab in this chapter.
1.1.1 Intermediate Value Theorem, Mean Value Theorems,
and Taylor’s Theorem
Throughout, $C^n[a, b]$ will denote the set of real-valued functions $f$ defined on the interval $[a, b]$ such that $f$ and its derivatives, up to and including its $n$-th derivative $f^{(n)}$, are continuous on $[a, b]$.
THEOREM 1.1
(Intermediate value theorem) If $f \in C[a, b]$ and $k$ is any number between $m = \min_{a \le x \le b} f(x)$ and $M = \max_{a \le x \le b} f(x)$, then there exists a number $c$ in $[a, b]$ for which $f(c) = k$ (Figure 1.1).
[Figure: graph of $y = f(x)$ on $[a, b]$, with $m$, $k$, and $M$ marked on the $y$-axis and $a$, $c$, and $b$ marked on the $x$-axis.]
FIGURE 1.1: Illustration of the Intermediate Value Theorem.
Example 1.1
Consider $f(x) = e^x - x - 2$. Using a computational device (such as a calculator) on which we trust the approximation of $e^x$ to be accurate, we compute $f(0) = -1$ and $f(2) \approx 3.3891$. We know $f$ is continuous, since it is a sum of continuous functions. Since 0 is between $f(0)$ and $f(2)$, the Intermediate Value Theorem tells us there is a point $c \in [0, 2]$ such that $f(c) = 0$. At such a $c$, $e^c = c + 2$.
THEOREM 1.2
(Mean value theorem for integrals) Let $f$ be continuous and $w$ be Riemann integrable¹ on $[a, b]$ and suppose that $w(x) \ge 0$ for $x \in [a, b]$. Then there exists a point $c$ in $[a, b]$ such that
\[
\int_a^b w(x) f(x)\,dx = f(c) \int_a^b w(x)\,dx.
\]
¹ This means that the limit of the Riemann sums exists. For example, $w$ may be continuous, or $w$ may have a finite number of breaks.
Example 1.2
Suppose we want bounds on
\[
\int_0^1 x^2 e^{-x^2}\,dx.
\]
With $w(x) = x^2$ and $f(x) = e^{-x^2}$, the Mean Value Theorem for integrals tells us that
\[
\int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx
\]
for some $c \in [0, 1]$, so
\[
\frac{1}{3e} \le \int_0^1 x^2 e^{-x^2}\,dx = e^{-c^2} \int_0^1 x^2\,dx \le \frac{1}{3}.
\]
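The bounds in Example 1.2 can be checked numerically. The following is a minimal sketch, not taken from the text: the helper `midpoint_rule` and the choice of a composite midpoint rule with 10,000 subintervals are assumptions made only for illustration.

```python
import math

def midpoint_rule(f, a, b, n=10_000):
    """Composite midpoint rule (an illustrative quadrature choice)."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Integrand of Example 1.2: w(x) f(x) = x^2 * exp(-x^2) on [0, 1].
integral = midpoint_rule(lambda x: x**2 * math.exp(-x**2), 0.0, 1.0)

# Bounds from the mean value theorem for integrals.
lower = 1.0 / (3.0 * math.e)
upper = 1.0 / 3.0

print(f"{lower:.6f} <= {integral:.6f} <= {upper:.6f}")
```

The computed value should fall strictly between the two bounds, since $e^{-c^2}$ lies between $e^{-1}$ and 1 for $c \in [0, 1]$.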
The following is extremely important in scientific and engineering computing.
THEOREM 1.3
(Taylor’s theorem) Suppose that $f \in C^{n+1}[a, b]$. Let $x_0 \in [a, b]$. Then for any $x \in [a, b]$, $f(x) = P_n(x) + R_n(x)$, where
\[
P_n(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)(x - x_0)^n}{n!}
= \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k,
\]
and
\[
R_n(x) = \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)(x - t)^n\,dt \quad \text{(integral form of remainder).}
\]
Furthermore, there is a $\xi = \xi(x)$ between $x_0$ and $x$ with
\[
R_n(x) = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n + 1)!} \quad \text{(Lagrange form of remainder).}
\]
PROOF Recall the integration by parts formula $\int u\,dv = uv - \int v\,du$. Thus,
\[
\begin{aligned}
f(x) - f(x_0) &= \int_{x_0}^{x} f'(t)\,dt \qquad (\text{let } u = f'(t),\ v = t - x,\ dv = dt) \\
&= f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt \qquad (\text{let } u = f''(t),\ dv = (x - t)\,dt) \\
&= f'(x_0)(x - x_0) - \left. \frac{(x - t)^2}{2} f''(t) \right|_{x_0}^{x} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt \\
&= f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt.
\end{aligned}
\]
Continuing this procedure,
\[
\begin{aligned}
f(x) &= f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt \\
&= P_n(x) + R_n(x).
\end{aligned}
\]
Now consider $R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt$ and assume that $x_0 < x$ (same argument if $x_0 > x$). Then, by Theorem 1.2,
\[
R_n(x) = f^{(n+1)}(\xi(x)) \int_{x_0}^{x} \frac{(x - t)^n}{n!}\,dt = f^{(n+1)}(\xi(x)) \frac{(x - x_0)^{n+1}}{(n + 1)!},
\]
where $\xi$ is between $x_0$ and $x$ and thus, $\xi = \xi(x)$.
Example 1.3
Approximate $\sin(x)$ by a polynomial $p(x)$ such that $|\sin(x) - p(x)| \le 10^{-16}$ for $-0.1 \le x \le 0.1$.
For Example 1.3, Taylor polynomials about $x_0 = 0$ are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for $n$ even, Taylor’s theorem gives

n     P_n                                                R_n
2     $x$                                                $-\frac{x^3}{3!}\cos(c_2)$
4     $x - \frac{x^3}{3!}$                               $\frac{x^5}{5!}\cos(c_4)$
6     $x - \frac{x^3}{3!} + \frac{x^5}{5!}$              $-\frac{x^7}{7!}\cos(c_6)$
...   ...                                                ...
n     -                                                  $(-1)^{n/2}\frac{x^{n+1}}{(n+1)!}\cos(c_n)$

Observing that $|\cos(c_n)| \le 1$, we see that
\[
|R_n(x)| \le \frac{|x|^{n+1}}{(n + 1)!}.
\]
We may thus form the following table.

n     bound on error $R_n$
2     $1.67 \times 10^{-4}$
4     $8.33 \times 10^{-8}$
6     $1.98 \times 10^{-11}$
8     $2.76 \times 10^{-15}$
10    $2.51 \times 10^{-19}$

Thus, a polynomial with the required accuracy for $x \in [-0.1, 0.1]$ is
\[
p(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}.
\]
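The error-bound table of Example 1.3 can be reproduced with a few lines of code. This is a minimal sketch; the function names `taylor_sin` and `remainder_bound` are hypothetical helpers introduced here, not part of the text.

```python
import math

def taylor_sin(x, n):
    """Taylor polynomial P_n of sin about x0 = 0 (only odd-degree terms appear)."""
    return sum((-1) ** (k // 2) * x**k / math.factorial(k)
               for k in range(1, n + 1, 2))

def remainder_bound(x_max, n):
    """Bound |R_n(x)| <= |x|^(n+1) / (n+1)! for |x| <= x_max, using |cos| <= 1."""
    return abs(x_max) ** (n + 1) / math.factorial(n + 1)

# Bounds on [-0.1, 0.1] for the even values of n used in the example.
bounds = {n: remainder_bound(0.1, n) for n in (2, 4, 6, 8, 10)}
for n, bound in bounds.items():
    print(n, f"{bound:.2e}")

# Smallest even n whose bound meets the 1e-16 tolerance.
n_required = min(n for n, bound in bounds.items() if bound <= 1e-16)
print("required n:", n_required)
```

The bound for $n = 8$ is still above $10^{-16}$, which is why the example needs the degree-9 polynomial $P_{10}$.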
An important special case of Taylor’s theorem is obtained with $n = 0$ (that is, directly from the Fundamental Theorem of Calculus).
THEOREM 1.4
(Mean value theorem) Suppose $f \in C^1[a, b]$, $x \in [a, b]$, and $y \in [a, b]$ (and, without loss of generality, $x \le y$). Then there is a $c \in [x, y] \subseteq [a, b]$ such that
\[
f(y) - f(x) = f'(c)(y - x).
\]
Example 1.4
Suppose $f(1) = 1$ and $|f'(x)| \le 2$ for $x \in [1, 2]$. What are an upper bound and a lower bound on $f(2)$?
The mean value theorem tells us that
\[
f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c)
\]
for some $c \in (1, 2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
\[
1 - 2 = -1 \le f(2) \le 1 + 2 = 3.
\]
1.1.2 Big “O” Notation
We study “rates of growth” and “rates of decrease” of errors. For example, if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
\[
e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi},
\]
where $\xi$ is some unknown quantity between 0 and $h$. Although we don’t know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly-varying constant $e^{\xi}$. The big “O” and little “o” notation are used to describe and keep track of this approximate proportionality.
DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon > 0$ and a $C$ such that
\[
|E(h)| \le C|h|^k
\]
for all $|h| \le \epsilon$.
The “O” denotes “order.” For example, if $f(h) = O(h^2)$, we say that “$f$ exhibits order 2 convergence to 0 as $h$ tends to 0.”
Example 1.5
$E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.
PROOF By Taylor’s Theorem,
\[
e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^{\xi}
\]
for some $\xi$ between 0 and $h$. Thus,
\[
E(h) = e^h - 1 - h \le h^2 \left( \frac{e^1}{2} \right), \quad \text{and } E(h) \ge 0
\]
for $|h| \le 1$, that is, $\epsilon = 1$ and $C = e/2$ work.
Example 1.6
Show that $\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| = O(h)$ for $x, x + h \in [a, b]$, assuming that $f$ has two continuous derivatives at each point in $[a, b]$.
PROOF
\[
\begin{aligned}
\left| \frac{f(x + h) - f(x)}{h} - f'(x) \right|
&= \left| \frac{f(x) + f'(x)h + \int_{x}^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \\
&= \frac{1}{h} \left| \int_{x}^{x+h} (x + h - t) f''(t)\,dt \right|
\le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch.
\end{aligned}
\]
1.1.3 Convergence Rates
DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C|x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7
A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
\[
x_{k+1} = \frac{x_k}{2} + \frac{a}{2x_k}.
\]
We have
\[
\begin{aligned}
x_{k+1} - \sqrt{a} &= \frac{x_k}{2} + \frac{a}{2x_k} - \sqrt{a} \\
&= x_k - \frac{x_k^2 - a}{2x_k} - \sqrt{a} \\
&= (x_k - \sqrt{a}) - (x_k - \sqrt{a}) \frac{x_k + \sqrt{a}}{2x_k} \\
&= (x_k - \sqrt{a}) \left( 1 - \frac{x_k + \sqrt{a}}{2x_k} \right) \\
&= (x_k - \sqrt{a}) \frac{x_k - \sqrt{a}}{2x_k} \\
&= \frac{1}{2x_k} (x_k - \sqrt{a})^2 \approx \frac{1}{2\sqrt{a}} (x_k - \sqrt{a})^2
\end{aligned}
\]
for $x_k$ near $\sqrt{a}$, thus showing that the convergence rate is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, and starting with $x_0 = 2$, we obtain the following table.

k    $x_k$                 $x_k - \sqrt{2}$            $\frac{x_k - \sqrt{2}}{(x_{k-1} - \sqrt{2})^2}$
0    2                     $0.5858 \times 10^{0}$      -
1    1.5                   $0.8579 \times 10^{-1}$     0.2500
2    1.416666666666667     $0.2453 \times 10^{-2}$     0.3333
3    1.414215686274510     $0.2123 \times 10^{-5}$     0.3529
4    1.414213562374690     $0.1594 \times 10^{-11}$    0.3535
5    1.414213562373095     $0.2204 \times 10^{-15}$    -
Comparing each $x_k$ with $\sqrt{2} \approx 1.414213562373095$, this table illustrates that the number of correct digits roughly doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
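The iteration of Example 1.7 is easy to replay in double precision. The following is a minimal sketch and is not the authors' own script; the function name `sqrt_iteration` is a hypothetical helper introduced for illustration.

```python
import math

def sqrt_iteration(a, x0, steps):
    """Iterate x_{k+1} = x_k / 2 + a / (2 x_k), returning all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x / 2 + a / (2 * x))
    return xs

xs = sqrt_iteration(2.0, 2.0, 4)
errors = [x - math.sqrt(2.0) for x in xs]

# Ratio e_k / e_{k-1}^2, which should approach 1 / (2 sqrt(2)).
ratios = [errors[k] / errors[k - 1] ** 2 for k in range(1, len(errors))]

for k, x in enumerate(xs):
    ratio = f"{ratios[k - 1]:.4f}" if k >= 1 else "-"
    print(k, f"{x:.15f}", f"{errors[k]:.4e}", ratio)
```

Only four steps are taken here because, as noted above, the error ratio stops being meaningful once the error reaches the level of double precision roundoff.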
Example 1.8
As an example of linear convergence, consider the iteration
\[
x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5},
\]
which converges to $\sqrt{2}$. We obtain the following table.

k     $x_k$                 $x_k - \sqrt{2}$            $\frac{x_k - \sqrt{2}}{x_{k-1} - \sqrt{2}}$
0     2                     $0.5858 \times 10^{0}$      -
1     1.428571428571429     $0.1436 \times 10^{-1}$     $0.2451 \times 10^{-1}$
2     1.416909620991254     $0.2696 \times 10^{-2}$     0.1878
3     1.414728799831946     $0.5152 \times 10^{-3}$     0.1911
4     1.414312349239392     $0.9879 \times 10^{-4}$     0.1917
5     1.414232514607664     $0.1895 \times 10^{-4}$     0.1918
6     1.414217198786659     $0.3636 \times 10^{-5}$     0.1919
7     1.414214260116949     $0.6977 \times 10^{-6}$     0.1919
8     1.414213696254626     $0.1339 \times 10^{-6}$     0.1919
...   ...                   ...                         ...
19    1.414213562373097     $0.1554 \times 10^{-14}$    -
Here, the constant C in the linear convergence, to four signiﬁcant digits,
appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a
factor of 5 each iteration. We can think of this as obtaining one more correct
base-5 digit on each iteration.
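The linear convergence in Example 1.8 can be observed the same way. This is a minimal sketch with a hypothetical helper `linear_iteration`; the comparison value $1 - 2\sqrt{2}/3.5$ is the derivative of the iteration function at $\sqrt{2}$, a standard fixed point fact that the text develops later rather than here.

```python
import math

def linear_iteration(x0, steps):
    """Iterate x_{k+1} = x_k - x_k^2 / 3.5 + 2 / 3.5, returning all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - x**2 / 3.5 + 2 / 3.5)
    return xs

xs_lin = linear_iteration(2.0, 8)
errs = [x - math.sqrt(2.0) for x in xs_lin]

# Ratio e_k / e_{k-1}; for linear convergence it settles near a constant C < 1.
ratios_lin = [errs[k] / errs[k - 1] for k in range(1, len(errs))]

# Derivative of g(x) = x - x^2/3.5 + 2/3.5 at sqrt(2), the expected limit of the ratios.
limit = 1 - 2 * math.sqrt(2.0) / 3.5

for k in range(1, len(xs_lin)):
    print(k, f"{xs_lin[k]:.15f}", f"{errs[k]:.4e}", f"{ratios_lin[k - 1]:.4f}")
print("expected limiting ratio:", f"{limit:.4f}")
```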
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error
are:
1. Method (algorithm or truncation) error. This is the error due to ap-
proximations made in the numerical method.
2. Rounding error. This is the error made due to the ﬁnite number of digits
available on a computer.
Example 1.9
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6), if $f \in C^2[a, b]$, then
\[
f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{1}{h} \int_{x}^{x+h} f''(t)(x + h - t)\,dt
\]
and
\[
\left| \frac{1}{h} \int_{x}^{x+h} f''(t)(x + h - t)\,dt \right| \le ch.
\]
Thus, $f'(x) \approx (f(x + h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Now consider $f(x) = \ln x$ and approximate $f'(3) \approx \frac{\ln(3 + h) - \ln 3}{h}$ for $h$ small using a calculator having 11 digits. The following results were obtained.

$h$          $\frac{\ln(3+h) - \ln(3)}{h}$    Error $= \frac{1}{3} - \frac{\ln(3+h) - \ln(3)}{h} = O(h)$
$10^{-1}$    0.3278982                        $5.44 \times 10^{-3}$
$10^{-2}$    0.332779                         $5.54 \times 10^{-4}$
$10^{-3}$    0.3332778                        $5.55 \times 10^{-5}$
$10^{-4}$    0.333328                         $5.33 \times 10^{-6}$
$10^{-5}$    0.333330                         $3.33 \times 10^{-6}$
$10^{-6}$    0.333300                         $3.33 \times 10^{-5}$
$10^{-7}$    0.333                            $3.33 \times 10^{-4}$
$10^{-8}$    0.33                             $3.33 \times 10^{-3}$
$10^{-9}$    0.3                              $3.33 \times 10^{-2}$
$10^{-10}$   0.0                              $3.33 \times 10^{-1}$

One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10. (That is, the method error dominates.) However, starting with $h = 0.00001$, the error increases. (The error due to a finite number of digits, i.e., roundoff error, dominates.)
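The same experiment can be run in IEEE double precision instead of on an 11-digit calculator. The sketch below is an illustration under that assumption, so its numbers differ from the table: with roughly 16 digits, the turnaround point moves to a smaller $h$. The name `forward_difference` is a hypothetical helper.

```python
import math

def forward_difference(f, x, h):
    """Forward difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# Absolute error in approximating f'(3) = 1/3 for f = ln, with h = 10^-k.
errors_fd = {
    k: abs(1.0 / 3.0 - forward_difference(math.log, 3.0, 10.0 ** -k))
    for k in range(1, 15)
}

for k, err in errors_fd.items():
    print(f"h = 1e-{k:02d}   error = {err:.3e}")
```

For large $h$ the error shrinks by about a factor of 10 per row (method error dominates); for very small $h$ it grows again (roundoff error dominates).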
There are two possible ways to reduce rounding error:
1. The method error can be reduced by using a more accurate method. This allows larger $h$ to be used, thus avoiding roundoff error. Consider
\[
f'(x) = \frac{f(x + h) - f(x - h)}{2h} + \{\text{error}\}, \quad \text{where } \{\text{error}\} \text{ is } O(h^2).
\]

$h$      $\frac{\ln(3+h) - \ln(3-h)}{2h}$    error
0.1      0.3334568                           $1.24 \times 10^{-4}$
0.01     0.3333345                           $1.23 \times 10^{-6}$
0.001    0.3333333                           $1.91 \times 10^{-8}$

The error decreases by a factor of 100 as $h$ is decreased by a factor of 10.
2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
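The $O(h^2)$ behavior of the central difference quotient can be checked in double precision. This is a minimal sketch with a hypothetical helper `central_difference`; because it carries about 16 digits rather than 11, the error at $h = 0.001$ is not yet contaminated by roundoff as it is in the table.

```python
import math

def central_difference(f, x, h):
    """Central difference quotient (f(x + h) - f(x - h)) / (2 h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Absolute error in approximating f'(3) = 1/3 for f = ln.
errors_cd = {
    h: abs(1.0 / 3.0 - central_difference(math.log, 3.0, h))
    for h in (0.1, 0.01, 0.001)
}

for h, err in errors_cd.items():
    print(f"h = {h:<6}  error = {err:.3e}")

# Reducing h by a factor of 10 should reduce the error by about 100.
print("error ratio:", errors_cd[0.1] / errors_cd[0.01])
```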
To fully understand and avoid roundoﬀ error, we should study some de-
tails of how computers and calculators represent and work with approximate
numbers.
1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ be a positive integer, the base of the computer system. (Usually, $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base representation
\[
x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)\,\beta^m = \pm q \beta^m,
\]
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$ and $0 \le \alpha_i \le \beta - 1$ for $i > 1$.
On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t)\,\beta^m$, where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)
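Example 1.10
(binary) $\beta = 2$
\[
x^* = (0.1011)_2 \times 2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal)}.
\]
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = 1010.0001\ 1001\ 1001 \ldots$ ($\beta = 2$). If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\ 1001\ 1001\ 1001\ 1001)_2 \times 2^4$.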
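Remark 1.1 can be observed directly in a language that uses binary floating point. The sketch below assumes IEEE double precision (53 binary digits rather than the $t = 24$ of the remark) and uses exact rational arithmetic to measure the representation error of 10.1.

```python
from fractions import Fraction

# Exact rational value of the double precision number nearest to 10.1.
stored = Fraction(10.1)

# The real number 10.1 as an exact fraction.
exact = Fraction(101, 10)

# Representation error fl(x) - x, computed exactly.
representation_error = stored - exact

print((10.1).hex())                      # binary mantissa shown in hexadecimal
print(float(abs(representation_error)))  # size of the representation error
```

The hexadecimal output shows the repeating bit pattern of the remark, cut off at the available number of digits.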
Question: Given a real number x, how do we deﬁne a ﬂoating point number
ﬂ(x) in F, such that ﬂ(x) is close to x?
On modern machines, one of the following four ways is used to approximate
a real number x by a machine-representable number ﬂ(x).
round down: ﬂ(x) = x ↓, the nearest machine representable number to the
real number x that is less than or equal to x
round up: ﬂ(x) = x ↑, the nearest machine number to the real number x
that is greater than or equal to x.
round to nearest: ﬂ(x) is the nearest machine number to the real number
x.
round to zero, or “chopping”: ﬂ(x) is the nearest machine number to the
real number x that is closer to 0 than x. The term “chopping” is because
we simply “chop” the expansion of the real number, that is, we simply
ignore the digits in the expansion of x beyond the t-th one.
The default on modern systems is usually round to nearest, although chopping
is faster or requires less circuitry. Round down and round up may be used,
but with care, to produce results from a string of computations that are
guaranteed to be less than or greater than the exact result.
Example 1.11
$\beta = 10$, $t = 5$, $x = 0.12345666 \times 10^7$. Then
\[
\mathrm{fl}(x) = 0.12345 \times 10^7 \quad \text{(chopping)},
\]
\[
\mathrm{fl}(x) = 0.12346 \times 10^7 \quad \text{(rounded to nearest)}.
\]
(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)
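Example 1.11 can be emulated with Python's `decimal` module, which provides base-10 arithmetic with a configurable number of digits. This is a sketch under the assumption that `ROUND_DOWN` (truncation toward zero) models chopping and `ROUND_HALF_EVEN` models round to nearest; the two differ from other rounding rules only in how ties are broken, and no tie occurs here.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

x = "0.12345666E7"

# t = 5 decimal digits, chopping (round toward zero).
chopped = Context(prec=5, rounding=ROUND_DOWN).create_decimal(x)

# t = 5 decimal digits, round to nearest.
nearest = Context(prec=5, rounding=ROUND_HALF_EVEN).create_decimal(x)

print("chopping:        ", chopped)
print("round to nearest:", nearest)
```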
See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, . . . , $(0.9) \times 10^1$, $0.1 \times 10^2$.
[Figure: number line showing the successive floating point numbers between $\beta^{m-1} = 1$ and $\beta^m = 10^1$, with spacing $\beta^{m-t} = 10^0 = 1$.]
FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.
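Example 1.12
Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the “rounding to nearest” definition of fl, we emulate the operations a machine would do, as follows:
\[
a \leftarrow 0.410 \times 10^{0}, \quad b \leftarrow 0.135 \times 10^{-3}, \quad c \leftarrow 0.431 \times 10^{-3},
\]
and
\[
\begin{aligned}
\mathrm{fl}(b + c) &= \mathrm{fl}(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.566 \times 10^{-3}) \\
&= 0.566 \times 10^{-3},
\end{aligned}
\]
so
\[
\begin{aligned}
\mathrm{fl}(a + 0.566 \times 10^{-3}) &= \mathrm{fl}(0.410 \times 10^{0} + 0.566 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^{0} + 0.000566 \times 10^{0}) \\
&= \mathrm{fl}(0.410566 \times 10^{0}) \\
&= 0.411 \times 10^{0}.
\end{aligned}
\]
On the other hand,
\[
\begin{aligned}
\mathrm{fl}(a + b) &= \mathrm{fl}(0.410 \times 10^{0} + 0.135 \times 10^{-3}) \\
&= \mathrm{fl}(0.410000 \times 10^{0} + 0.000135 \times 10^{0}) \\
&= \mathrm{fl}(0.410135 \times 10^{0}) \\
&= 0.410 \times 10^{0},
\end{aligned}
\]
so
\[
\begin{aligned}
\mathrm{fl}(0.410 \times 10^{0} + c) &= \mathrm{fl}(0.410 \times 10^{0} + 0.431 \times 10^{-3}) \\
&= \mathrm{fl}(0.410 \times 10^{0} + 0.000431 \times 10^{0}) \\
&= \mathrm{fl}(0.410431 \times 10^{0}) \\
&= 0.410 \times 10^{0} \ne 0.411 \times 10^{0}.
\end{aligned}
\]
Thus, the associative law of addition does not hold for floating point arithmetic with “round to nearest.” Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.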
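The two orders of addition in Example 1.12 can be emulated with a 3-digit decimal context. This is a minimal sketch; using Python's `decimal` module with `ROUND_HALF_EVEN` as a stand-in for the example's 3-digit machine is an assumption made for illustration.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 3-digit decimal arithmetic with rounding to nearest.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

a = Decimal("0.410")
b = Decimal("0.000135")
c = Decimal("0.000431")

# Each ctx.add rounds its result to 3 significant digits, like fl(.) in the text.
left = ctx.add(a, ctx.add(b, c))    # a + (b + c)
right = ctx.add(ctx.add(a, b), c)   # (a + b) + c

print("a + (b + c) =", left)
print("(a + b) + c =", right)
```

Adding the two small numbers first lets their sum become large enough to affect the last retained digit of $a$.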
The following error bound is useful in some analyses.
THEOREM 1.5
\[
|x - \mathrm{fl}(x)| \le \frac{1}{2} |x| \beta^{1-t} p,
\]
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 $\delta = \frac{p}{2} \beta^{1-t}$ is called the unit roundoff error.
Let $\epsilon = \frac{\mathrm{fl}(x) - x}{x}$. Then $\mathrm{fl}(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$. With this, we have the following.
THEOREM 1.6
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[
\mathrm{fl}(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}.
\]
Roundoﬀ error that accumulates as a result of a sequence of arithmetic
operations can be analyzed using this theorem. Such an analysis is called
forward error analysis.
Because of the properties of ﬂoating point arithmetic, it is unreasonable to
demand strict tolerances when the exact result is too large.
Example 1.13
Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4 \pi$ with a computed value $x$ such that $|10^4 \pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4 \pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4 \pi - x| = 15.926 \ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4 \pi - x| < 10^{-2}$.
The error $|10^4 \pi - x|$ in this example is called the absolute error in approximating $10^4 \pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \frac{x - x^*}{x} \right|$ is called the relative error.
For example,
\[
\left| \frac{x - \mathrm{fl}(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error).}
\]
1.2.1.1 Expression Evaluation and Condition Numbers
We now examine some common situations in which roundoﬀ error can be-
come large, and explain how to avoid many of these situations.
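Example 1.14
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$, $y = 0.6399 \times 10^5$. Then
\[
\mathrm{fl}(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \epsilon_1), \quad \epsilon_1 \approx -3.28 \times 10^{-4}, \quad |\epsilon_1| < \delta,
\]
and
\[
\mathrm{fl}(xy) = 0.3708 \times 10^{10} = (xy)(1 + \epsilon_2), \quad \epsilon_2 \approx -5.95 \times 10^{-5}, \quad |\epsilon_2| < \delta.
\]
(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)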
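The relative errors $\epsilon_1$ and $\epsilon_2$ of Example 1.14 can be computed by comparing 4-digit results with exact ones. This is a minimal sketch using Python's `decimal` module; the 4-digit context stands in for the example's machine, and the default 28-digit context is assumed to be exact enough for the reference values.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# 4-digit decimal arithmetic with rounding to nearest (p = 1).
ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)
delta = Decimal("0.0005")   # unit roundoff: (1/2) * 10^(1 - 4)

x = Decimal("57950")        # 0.5795 x 10^5
y = Decimal("63990")        # 0.6399 x 10^5

fl_add = ctx.add(x, y)          # rounded sum
fl_mul = ctx.multiply(x, y)     # rounded product

# Plain Decimal operators use the default 28-digit context, exact for these values.
eps1 = (fl_add - (x + y)) / (x + y)
eps2 = (fl_mul - (x * y)) / (x * y)

print("fl(x + y) =", fl_add, " eps1 =", f"{eps1:.3e}")
print("fl(x * y) =", fl_mul, " eps2 =", f"{eps2:.3e}")
```

Both relative errors stay below $\delta$, as Theorem 1.6 states.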
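Example 1.15
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), suppose $x_1 = 10000$ and
$x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\[ \begin{aligned}
\mathrm{fl}(x_1 + x_2) &= 10000, \\
\mathrm{fl}(x_1 + x_2 + x_3) &= 10000, \\
&\;\;\vdots \\
\mathrm{fl}\Bigl( \sum_{i=1}^{1001} x_i \Bigr) &= 10000,
\end{aligned} \]
when we sum forward from $x_1$. But going backwards,
\[ \begin{aligned}
\mathrm{fl}(x_{1001} + x_{1000}) &= 2, \\
\mathrm{fl}(x_{1001} + x_{1000} + x_{999}) &= 3, \\
&\;\;\vdots \\
\mathrm{fl}\Bigl( \sum_{i=1001}^{1} x_i \Bigr) &= 11000,
\end{aligned} \]
which is the correct sum.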
This example illustrates the point that large relative errors occur when
a large number of small numbers is added to a large number,
or when a very large number of small, almost equal numbers is added. To avoid
such large relative errors, one can sum from the smallest number to the largest
number. However, this will not work if the numbers are all approximately
equal. In such cases, one possibility is to group the numbers into sets of two
adjacent numbers, summing two almost equal numbers together. One then
groups those results into sets of two and sums these together, continuing until
the total sum is reached. In this scheme, two almost equal numbers are always
being summed, and the large relative error from repeatedly summing a small
number to a large number is avoided.
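The effect of summation order can be reproduced with a short Python sketch. It assumes the standard `decimal` module with 4 significant digits and round to nearest as a model of the arithmetic in Example 1.15; the function names `running_sum` and `pairwise_sum` are illustrative, and the pairwise scheme is written as a recursive halving, which is one way (not necessarily the only one) to realize the grouping described above.

```python
from decimal import Decimal, Context

ctx = Context(prec=4)  # beta = 10, t = 4, round to nearest (an assumption)

def running_sum(values):
    """Add the values one after another, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

def pairwise_sum(values):
    """Sum by recursively splitting the list in half, so that the two
    operands of each addition have comparable size."""
    n = len(values)
    if n == 1:
        return ctx.plus(values[0])
    mid = n // 2
    return ctx.add(pairwise_sum(values[:mid]), pairwise_sum(values[mid:]))

# Data of Example 1.15: one large number followed by 1000 ones.
xs = [Decimal(10000)] + [Decimal(1)] * 1000
forward = running_sum(xs)              # large number first
backward = running_sum(reversed(xs))   # small numbers first

# Many equal numbers: sorting cannot help, but pairwise summation does.
ones = [Decimal(1)] * 20000
ones_running = running_sum(ones)
ones_pairwise = pairwise_sum(ones)

print(forward, backward, ones_running, ones_pairwise)
```

In this model the running sum of the 20000 ones stalls once the partial sum reaches 10000, whereas every partial sum in the pairwise scheme is exactly representable with four digits.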
Example 1.16
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy).
Then $x_2 - x_1 \approx \mathrm{fl}(x_2) - \mathrm{fl}(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (\mathrm{fl}(x_2) - \mathrm{fl}(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237, \]
that is, a relative error of 23.7%.
This example illustrates that large relative errors can occur when
two nearly equal numbers are subtracted on a computer. Sometimes,
an algorithm can be modiﬁed to reduce rounding error occurring from this
source, as the following example illustrates.
Example 1.17
Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared
with $|4ac|$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped
arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but the exact roots are
$x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$,
so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
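Let's change the algorithm. Assume $b \ge 0$ (can always make $b \ge 0$). Then
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \left( \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} \right)
       = \frac{4ac}{2a\left(-b - \sqrt{b^2 - 4ac}\right)}
       = \frac{-2c}{b + \sqrt{b^2 - 4ac}}, \]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before).} \]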
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
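The same cancellation appears in IEEE double precision when $b^2$ dominates $|4ac|$ strongly enough. The following Python sketch is an illustration under assumed values ($a = c = 1$, $b = 10^8$, chosen here and not taken from the text); the function names are hypothetical, and both functions assume $b \ge 0$ and real roots.

```python
import math

def roots_naive(a, b, c):
    """Textbook formula for both roots; x1 suffers from cancellation."""
    s = math.sqrt(b * b - 4.0 * a * c)
    return (-b + s) / (2.0 * a), (-b - s) / (2.0 * a)

def roots_stable(a, b, c):
    """Rewritten formula for x1 (assumes b >= 0 and real roots);
    x2 is computed as before."""
    s = math.sqrt(b * b - 4.0 * a * c)
    return -2.0 * c / (b + s), (-b - s) / (2.0 * a)

a, b, c = 1.0, 1.0e8, 1.0          # b^2 is huge compared with |4ac|
x1_reference = -1.0e-8             # small root, accurate to about 1e-16 relative

x1_naive, _ = roots_naive(a, b, c)
x1_stable, _ = roots_stable(a, b, c)

rel_err_naive = abs((x1_naive - x1_reference) / x1_reference)
rel_err_stable = abs((x1_stable - x1_reference) / x1_reference)
print(x1_naive, rel_err_naive)
print(x1_stable, rel_err_stable)
```

The rewritten formula adds two numbers of the same sign in the denominator, so no digits are lost to cancellation.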
Let us now consider error in function evaluation. Consider a single valued
function $f(x)$ and let $x^* = \mathrm{fl}(x)$ be the floating point approximation of $x$.
The machine therefore evaluates $f(x^*) = f(\mathrm{fl}(x))$, which is only an
approximation to $f(x)$. The perturbation in $f(x)$ for small perturbations
in $x$ can be computed via Taylor's formula. This is illustrated in the next
theorem.
THEOREM 1.7
The relative error in functional evaluation is
\[ \left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|. \]
PROOF The linear Taylor approximation of $f(x^*)$ about $x$ for small
values of $|x - x^*|$ is given by $f(x^*) \approx f(x) + f'(x)(x^* - x)$. Rearranging the
terms, dividing by $f(x)$, and multiplying and dividing by $x$ immediately yields
the result.
This leads us to the following deﬁnition.
DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) := \left| \frac{x f'(x)}{f(x)} \right|. \]
The condition number describes how large the relative error in function
evaluation is with respect to the relative error in the machine representation
of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the
function at $x$.
Example 1.18
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \cdot \dfrac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.
Example 1.19
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x^* = 2$ and grows without bound as $x \to 2$. Hence the
function $f(x)$ is numerically unstable
and ill-conditioned for values of $x$ close to 2.
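Examples 1.18 and 1.19 can be checked numerically. The Python sketch below evaluates the condition number from Definition 1.5 and compares it with the observed ratio of relative errors in Theorem 1.7. The helper names and the perturbation size $10^{-8}$ are assumptions made for illustration.

```python
import math

def condition_number(f, dfdx, x):
    """kappa_f(x) = |x f'(x) / f(x)| from Definition 1.5."""
    return abs(x * dfdx(x) / f(x))

def observed_amplification(f, x, rel_pert=1.0e-8):
    """Ratio of the relative change in f to the relative change in x
    for a small relative perturbation of x (Theorem 1.7)."""
    x_star = x * (1.0 + rel_pert)
    rel_in = abs((x - x_star) / x)
    rel_out = abs((f(x) - f(x_star)) / f(x))
    return rel_out / rel_in

def sqrt_f(x):
    return math.sqrt(x)

def sqrt_df(x):
    return 0.5 / math.sqrt(x)

def shifted_f(x):
    return math.sqrt(x - 2.0)

def shifted_df(x):
    return 0.5 / math.sqrt(x - 2.0)

kappa_sqrt = condition_number(sqrt_f, sqrt_df, 3.0)            # Example 1.18
kappa_shifted = condition_number(shifted_f, shifted_df, 2.001) # Example 1.19, x near 2

print(kappa_sqrt, observed_amplification(sqrt_f, 3.0))
print(kappa_shifted, observed_amplification(shifted_f, 2.001))
```

For $x = 2.001$ the formula of Example 1.19 gives $\kappa_f \approx 1000$, so about three decimal digits are lost in evaluating $f$.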
REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply
$|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$) then it is more useful
to consider the relation between absolute errors than relative errors. The
condition number then becomes $|f'(x)/f(x)|$.
REMARK 1.3 Generally, if a numerical approximation $\tilde{z}$ to a quantity
$z$ is computed, the relative error is related to the number of digits after the
decimal point that are correct. For example if $z = 0.0000123453$ and $\tilde{z} =
0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing
$z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$
to the nearest number with five digits in its mantissa, all of those digits are
correct, whereas, if we do the same with six digits, the sixth digit is not
correct. Significant digits is the more logical way to talk about accuracy in
a floating point computation where we are interested in relative error, rather
than “number of digits after the decimal point,” which can have a different
meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal
point.)
1.2.2 Practicalities and the IEEE Floating Point Standard
Prior to 1985, diﬀerent machines used diﬀerent word lengths and diﬀerent
bases, and diﬀerent machines rounded, chopped, or did something else to form
the internal representation ﬂ(x) for real numbers x. For example, IBM main-
frames generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal
digits total (for the base, sign, and exponent) in “single precision” numbers
and 16 hexadecimal digits total in “double precision” numbers. Machines such
as the Univac 1108 and Honeywell Multics systems used base β = 2 and 36
binary digits (or “bits”) total in single precision numbers and 72 binary digits
total in double precision numbers. An unusual machine designed at Moscow
State University from 1955-1965, the “Setun” even used base-3 (β = 3, or
“ternary”) numbers. Some computers had 32 bits total in single precision
numbers and 64 bits total in double precision numbers, while some “super-
computers” (such as the Cray-1) had 64 bits total in single precision numbers
and 128 bits total in double precision numbers.
Some hand-held calculators in existence today (such as some Texas Instruments
calculators) can be viewed as implementing decimal (base 10, β = 10)
arithmetic, say, with L = −999 and U = 999, and t = 14 digits in the mantissa.
Except for the Setun (the value of whose ternary digits corresponded to
“positive,” “negative,” and “neutral” in circuit elements or switches), digital
computers are mostly based on binary switches or circuit elements (that is,
“on” or “off”), so the base β is usually 2 or a power of 2. For example, the
IBM hexadecimal digit could be viewed as a group of 4 binary digits.²
Older ﬂoating point implementations did not even always ﬁt exactly into
the model we have previously described. For example, if x is a number in the
system, then −x may not have been a number in the system, or, if x were a
number in the system, then 1/x may have been too large to be representable
in the system.
To promote predictability, portability, reliability, and rigorous error bounding
in floating point computations, the Institute of Electrical and Electronics
Engineers (IEEE) and American National Standards Institute (ANSI) published
a standard for binary floating point arithmetic in 1985: IEEE/ANSI
754-1985: Standard for Binary Floating Point Arithmetic, often referenced as
“IEEE-754,” or simply “the IEEE standard.”³ Almost all computers in existence
today, including personal computers and workstations based on Intel,
AMD, Motorola, etc. chips, implement most of the IEEE standard.
In this standard, β = 2, 32 bits total are used in a single precision number
(an “IEEE single”), and 64 bits total are used for a double precision number
(“IEEE double”). In a single precision number, 1 bit is used for the sign, 8 bits
are used for the exponent, and 23 bits are used for the mantissa (with the
implicit leading bit, this gives t = 24). In double precision numbers, 1 bit is
used for the sign, 11 bits are used for the exponent, and 52 bits are used for
the mantissa (t = 53). Thus, for single precision numbers, the stored
exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted
from this, to get an exponent between −127 and 128. In IEEE numbers,
the minimum and maximum exponent are used to denote special symbols
(such as infinity and “unnormalized” numbers), so the exponent in single
precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The
mantissa for single precision numbers represents numbers between $2^0 = 1$
and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision
numbers is, effectively, between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the
mantissa for double precision numbers represents numbers between $2^0 = 1$
and $\sum_{i=0}^{52} 2^{-i} \approx 2$.
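The single precision layout described above can be inspected directly. The following Python sketch uses the standard `struct` module to split a single precision number into its sign, stored exponent, and stored mantissa bits. The function names are hypothetical, and the reconstruction formula assumes a normalized number (stored exponent between 1 and 254) with an implicit leading 1 and an exponent offset of 127.

```python
import struct

def decode_single(x):
    """Return (sign, stored_exponent, fraction) of x as an IEEE single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                     # 1 bit
    stored_exponent = (bits >> 23) & 0xFF # 8 bits
    fraction = bits & 0x7FFFFF            # 23 stored mantissa bits
    return sign, stored_exponent, fraction

def value_of_normalized(sign, stored_exponent, fraction):
    """Rebuild the value of a normalized single (implicit leading 1)."""
    mantissa = 1.0 + fraction / 2.0**23
    return (-1.0) ** sign * mantissa * 2.0 ** (stored_exponent - 127)

for x in (1.0, -0.15625, 6.0):
    fields = decode_single(x)
    print(x, fields, value_of_normalized(*fields))
```

For example, $-0.15625 = -1.25 \times 2^{-3}$ has sign bit 1 and stored exponent $-3 + 127 = 124$.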
Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.
In many numerical computations, such as solving the large linear systems
arising from partial differential equation models, more digits or a larger
exponent range is required than is available with IEEE single precision. For
this reason, many numerical analysts at present have adopted IEEE double
precision as the default precision. For example, underlying computations in
the popular computational environment matlab are done in IEEE double
precision.

TABLE 1.1: Parameters for IEEE arithmetic
precision   β   L       U      t
single      2   -126    127    24
double      2   -1022   1023   53

² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of
certain ambiguous points, provides certain extensions, and specifies a standard for decimal
arithmetic.
IEEE arithmetic provides four ways of defining fl(x), that is, four “rounding
modes,” namely, “round down,” “round up,” “round to nearest,” and “round
to zero.” The four elementary operations +, −, ×, and /, as well as the square
root, must be such that $\mathrm{fl}(x \odot y)$ is implemented for all four rounding modes,
for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.
The default mode (if the rounding mode is not explicitly set) is normally
“round to nearest,” to give an approximation after a long string of compu-
tations that is hopefully near the exact value. If the mode is set to “round
down” and a string of computations is done, then the result is less than or
equal to the exact result. Similarly, if the mode is set to “round up,” then
the result of a string of computations is greater than or equal to the exact
result. In this way, mathematically rigorous bounds on an exact result can be
obtained. (This technique must be used astutely, since naive use could result
in bounds that are too large to be meaningful.)
Several parameters more directly related to numerical computations than
L, U, and t are associated with any floating point number system. These are
HUGE: the largest representable number in the floating point system;
TINY: the smallest positive (normalized) representable number in the floating
point system;
$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to
1, gives something other than 1 when using the rounding mode “round
to nearest.”
These so-called “machine constants” appear in Table 1.2 for the IEEE single
and IEEE double precision number systems.
For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up
the question of what happens when the result of a computation has absolute
value less than the smallest number representable in the system, or has abso-
lute value greater than the largest number representable in the system. In the
ﬁrst case, an underﬂow occurs, while, in the second case, an overﬂow occurs.
In floating point computations, it is usually (but not always) reasonable to
replace the result of an underflow by 0, but it is usually more problematical
when an overflow occurs. Many systems prior to the IEEE standard replaced
an underflow by 0 but stopped when an overflow occurred.

TABLE 1.2: Machine constants for IEEE arithmetic
Precision   HUGE                                                      TINY                                        $\epsilon_m$
single      $(2 - 2^{-23}) \cdot 2^{127} \approx 3.40 \cdot 10^{38}$      $2^{-126} \approx 1.18 \cdot 10^{-38}$      $2^{-24} + 2^{-45} \approx 5.96 \cdot 10^{-8}$
double      $(2 - 2^{-52}) \cdot 2^{1023} \approx 1.79 \cdot 10^{308}$    $2^{-1022} \approx 2.23 \cdot 10^{-308}$    $2^{-53} + 2^{-105} \approx 1.11 \cdot 10^{-16}$
The IEEE standard specifies representations for special numbers ∞, −∞,
+0, −0, and NaN, where the latter represents “not a number.” The standard
specifies that computations do not stop when an overflow or underflow occurs,
or when quantities such as $\sqrt{-1}$, $1/0$, $-1/0$, etc. are encountered (although
many programming languages by default or optionally do stop). For example,
the result of an overflow is set to ∞, whereas the result of $\sqrt{-1}$ is set to NaN,
and computation continues. The standard also specifies “gradual underflow,”
that is, setting the result to a “denormalized” number, or a number in the
floating point format whose first digit in the mantissa is equal to 0.
Computation rules for these special numbers, such as NaN × any number = NaN,
∞ × any positive normalized number = ∞, allow such “nonstop” arithmetic.
Although the IEEE nonstop arithmetic is useful in many contexts, the nu-
merical analyst should be aware of it and be cautious in interpreting results.
In particular, algorithms may not behave as expected if many intermediate
results contain ∞ or NaN, and the accuracy is less than expected when denor-
malized numbers are used. In fact, many programming languages, by default
or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE
nonstop arithmetic with an option.
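Python is one example of a language that mixes both behaviors, which the sketch below illustrates. It assumes the standard CPython floating point type (an IEEE double): overflow in a product silently produces ∞ and NaN propagates, but division by zero and the square root of a negative number stop with an exception by default. The helper `stops` is a hypothetical name introduced only for this illustration.

```python
import math

inf = math.inf
nan = math.nan

overflowed = 1.0e308 * 10.0           # overflow in a product is set to infinity
print(overflowed == inf)
print(inf * 3.0, inf + inf)            # infinity times or plus a positive number
print(math.isnan(nan * 2.0))           # NaN times any number is NaN
print(math.isnan(inf - inf))           # an undefined result becomes NaN
print(nan == nan)                      # NaN compares unequal even to itself

def stops(operation):
    """Return True if calling operation() raises instead of continuing."""
    try:
        operation()
        return False
    except (ZeroDivisionError, ValueError, OverflowError):
        return True

print(stops(lambda: 1.0 / 0.0))        # Python raises ZeroDivisionError
print(stops(lambda: math.sqrt(-1.0)))  # Python raises ValueError
```

A program that relies on nonstop arithmetic in such a language therefore has to check which operations continue and which raise.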
Example 1.20
IEEE double precision floating point arithmetic underlies most computations
in matlab. (This is true even if only four decimal digits are displayed.) One
obtains the machine epsilon with the function eps, one obtains TINY with the
function realmin, and one obtains HUGE with the function realmax. (Note that
matlab's eps is the spacing between 1 and the next larger floating point
number, $2^{-52}$, which is roughly twice the $\epsilon_m$ of Table 1.2.) Observe
the following matlab dialog:
>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
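For readers without matlab, roughly the same dialog can be reproduced in Python, whose floats are IEEE doubles in the standard CPython implementation. This sketch assumes the standard `sys.float_info` structure. Products are used instead of the power operator because Python's `**` raises an exception on overflow rather than returning infinity.

```python
import sys

eps = sys.float_info.epsilon   # spacing between 1 and the next larger double
tiny = sys.float_info.min      # smallest positive normalized double (TINY)
huge = sys.float_info.max      # largest finite double (HUGE)

print(eps, tiny, huge)
print(1.0 / tiny, 1.0 / huge)          # 1/TINY < HUGE, and 1/HUGE < TINY
print(huge * huge, tiny * tiny)        # overflow to inf, underflow to 0
print((1.0 + eps) - 1.0 == eps)        # 1 + eps is a new number
print(1.0 + eps / 2.0 == 1.0)          # 1 + eps/2 rounds back to 1
```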
Example 1.21
(Illustration of underflow and overflow) Suppose, for the purposes of
illustration, we have a system with $\beta = 10$, $t = 2$ and one digit in the exponent, so
that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$,
and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then
both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating
point number in the system to $N$ is $0.14 \times 10^7$, well within range. However,
$x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.
In older systems, an overflow usually would result in stopping the
computation, while in IEEE arithmetic, the result would be assigned the symbol
“Infinity.” The result of adding “Infinity” to “Infinity” then taking the square
root would be “Infinity,” so that $N$ would be assigned “Infinity.” Similarly,
if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable
machine number, causing an “underflow.” On older systems, the result is
usually set to 0. On IEEE systems, if “gradual underflow” is switched on, the
result either becomes a denormalized number, with less than full accuracy,
or is set to 0; without gradual underflow on IEEE systems, the result is set
to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the
closest floating point number in the system is $0.14 \times 10^{-5}$, well within range.
To avoid this type of catastrophic underflow and overflow in the computation
of $N$, we may use the following scheme.
1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.
3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
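The three steps above can be sketched in Python as follows. The inputs $10^{200}$ and $10^{-200}$ are assumptions chosen so that the squares overflow and underflow in IEEE double precision, in analogy with $10^{6}$ and $10^{-6}$ in the toy system; the guard for $s = 0$ is an addition not present in the scheme as stated.

```python
import math

def norm_naive(x1, x2):
    """sqrt(x1^2 + x2^2) computed directly; the squares may overflow or underflow."""
    return math.sqrt(x1 * x1 + x2 * x2)

def norm_scaled(x1, x2):
    """The scaled scheme: divide by the larger magnitude before squaring."""
    s = max(abs(x1), abs(x2))            # step 1
    if s == 0.0:                         # both inputs are zero (added guard)
        return 0.0
    eta1, eta2 = x1 / s, x2 / s          # step 2
    return s * math.sqrt(eta1 * eta1 + eta2 * eta2)  # step 3

print(norm_naive(1.0e200, 1.0e200), norm_scaled(1.0e200, 1.0e200))
print(norm_naive(1.0e-200, 1.0e-200), norm_scaled(1.0e-200, 1.0e-200))
```

Python's standard library also provides `math.hypot` for the same computation.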
1.2.2.1 Input and Output
For examining the output to large numerical computations arising from
mathematical models, plots, graphs, and movies comprised of such plots and
graphs are often preferred over tables of values. However, to develop such
models and study numerical algorithms, it is necessary to examine individual
numbers. Because humans are trained to comprehend decimal numbers more
easily than binary numbers, the binary format used in the machine is usually
converted to a decimal format for display or printing. In many programming
languages and environments (such as all versions of Fortran, C, C++, and
in matlab), the format is of a form similar to
$\pm d_1.d_2 d_3 \ldots d_m \mathrm{e}{\pm}\delta_1\delta_2\delta_3$, or
$\pm d_1.d_2 d_3 \ldots d_m \mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the “e” or “E” denotes the “exponent” of
10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are
usually also input either in a standard decimal form (such as 0.001) or in
this exponential format (such as 1.0e-3). (This notation originates from
the earliest computers, where the only output was a printer, and the printer
could only print numerical digits and the 26 upper case letters in the Roman
alphabet.)
Thus, for input, a decimal fraction needs to be converted to a binary floating
point number, while, for output, a binary floating point number needs
to be converted to a decimal fraction. This conversion necessarily is inexact.
For example, the exact decimal fraction 0.1 converts to the infinitely repeating
binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary
floating point system. The IEEE 754 standard specifies that the result of a
decimal to binary conversion, within a specified range of input formats, be
the nearest floating point number to the exact result,
and that, within a specified range of formats, a binary to decimal conversion
be the nearest number in the specified format (which depends on the number
m of decimal digits requested to be printed).
Thus, the number that one sees as output is usually not exactly the num-
ber that is represented in the computer. Furthermore, while the ﬂoating
point operations on binary numbers are usually implemented in hardware or
“ﬁrmware” independently of the software system, the decimal to binary and
binary to decimal conversions are usually implemented separately as part of
the programming language (such as Fortran, C, C++, Java, etc.) or software
system (such as matlab). The individual standards for these languages, if
there are any, may not specify accuracy for such conversions, and the lan-
guages sometimes do not conform to the IEEE standard. That is, the number
that one sees printed may not even be the closest number in that format to
the actual number.
This inexactness in conversion usually does not cause a problem, but may
cause much confusion in certain instances. In those instances (such as in
“debugging,” or ﬁnding programming blunders), one may need to examine
the binary numbers directly. One way of doing this is in an “octal,” or base-8
format, in which each digit (between 0 and 7) is interpreted as a group of
three binary digits, or in hexadecimal format (where the digits are 0-9, A, B,
C, D, E, F), in which each digit corresponds to a group of four binary digits.
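The gap between the printed decimal number and the stored binary number can be seen with the decimal fraction 0.1. The Python sketch below is an illustration that assumes CPython's IEEE double floats and the standard `decimal` and `fractions` modules; it prints the default decimal output, the exact value of the stored number, and its hexadecimal form.

```python
from decimal import Decimal
from fractions import Fraction

x = 0.1

print(repr(x))                          # decimal string chosen for output
print(Decimal(x))                       # exact value of the stored binary number
print(x.hex())                          # hexadecimal view, 4 bits per digit
print(Fraction(x) == Fraction(1, 10))   # the stored number is not exactly 1/10
print(0.1 + 0.2 == 0.3)                 # a typical source of confusion
```

The repeating hexadecimal digit 9 corresponds to the repeating bit group 1001 in the binary expansion of 0.1.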
1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos,
and exp, IEEE 754 speciﬁes that a “long” 80-bit register (with “guard digits”)
be available for intermediate computations. Furthermore, IEEE 754-2008, an
oﬃcial update to IEEE 754-1985, provides a list of functions it recommends be
implemented, and speciﬁes accuracy requirements (in terms of correct round-
ing), for those functions a programming language elects to implement.
REMARK 1.4 Alternative number systems, such as variable precision
arithmetic, multiple precision arithmetic, rational arithmetic, and combina-
tions of approximate and symbolic arithmetic have been investigated and
implemented. These have various advantages over the traditional ﬂoating
point arithmetic we have been discussing, but also have disadvantages, and
usually require more time, more circuitry, or both. Eventually, with the ad-
vance of computer hardware and better understanding of these alternative
systems, their use may become more ubiquitous. However, for the foreseeable
future, traditional ﬂoating point number systems will be the primary tool in
numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:
• to use ﬂoating point computations to compute mathematically rigor-
ous bounds on an exact result (and hence to rigorously bound roundoﬀ
error);
• to use ﬂoating point computations to compute mathematically rigorous
bounds on the ranges of functions over boxes.
In complicated traditional ﬂoating point algorithms, naive arrangement of in-
terval computations usually gives bounds that are too wide to be of practical
use. For this reason, interval computations have been ignored by many. How-
ever, used cleverly and where appropriate, interval computations are powerful,
and provide rigor and validation when other techniques cannot.
Interval computations are based on interval arithmetic.
1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be
considered as ordered pairs of real numbers. We can think of each interval as
representing the range of possible values of a quantity. The result of an
operation is then an interval that represents the range of all possible results
of the operation, as the first argument ranges over
all points in the first interval and the second argument ranges over all values
in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and
$\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
\[ \boldsymbol{x} \odot \boldsymbol{y} = \{ x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1} \]
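Interval arithmetic's usefulness derives from the fact that the mathematical
characterization in Equation (1.1) is equivalent to the following operational
definitions.
\[ \begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\ \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\ \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\ \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[ \frac{1}{\overline{x}},\ \frac{1}{\underline{x}} \right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2} \]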
The ranges of the four elementary interval arithmetic operations are
exactly the ranges of the corresponding real operations, but, if such operations
are composed, bounds on the ranges of real functions can be obtained. For
example, if
\[ f(x) = (x + 1)(x - 1), \tag{1.3} \]
then
\[ f([-2, 2]) = \bigl([-2, 2] + 1\bigr)\bigl([-2, 2] - 1\bigr) = [-1, 3][-3, 1] = [-9, 3], \]
which contains the exact range $[-1, 3]$.
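The operational definitions (1.2) translate almost line by line into code. The following Python sketch represents an interval as a pair `(lo, hi)` and uses ordinary floating point arithmetic without outward rounding, so it illustrates the formulas only and does not give rigorous enclosures. The function names are hypothetical.

```python
def iadd(x, y):
    """[x_lo + y_lo, x_hi + y_hi]"""
    return (x[0] + y[0], x[1] + y[1])

def isub(x, y):
    """[x_lo - y_hi, x_hi - y_lo]"""
    return (x[0] - y[1], x[1] - y[0])

def imul(x, y):
    """Minimum and maximum of the four end point products."""
    products = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(products), max(products))

def iinv(x):
    """[1/x_hi, 1/x_lo], defined only if the interval excludes 0."""
    if not (x[0] > 0 or x[1] < 0):
        raise ZeroDivisionError("interval contains 0")
    return (1.0 / x[1], 1.0 / x[0])

def idiv(x, y):
    """x times 1/y."""
    return imul(x, iinv(y))

# Evaluate f(x) = (x + 1)(x - 1) of (1.3) over [-2, 2].
x = (-2.0, 2.0)
one = (1.0, 1.0)
f_enclosure = imul(iadd(x, one), isub(x, one))
print(f_enclosure)
```

The computed enclosure reproduces the interval $[-9, 3]$ obtained by hand above.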
REMARK 1.5 In some definitions of interval arithmetic, division by
intervals containing 0 is defined, consistent with (1.1). For example,
\[ \frac{[1, 2]}{[-3, 4]} = \left[ -\infty, -\frac{1}{3} \right] \cup \left[ \frac{1}{4}, \infty \right] = \mathbb{R}^* \setminus \left( -\frac{1}{3}, \frac{1}{4} \right), \]
where $\mathbb{R}^*$ is the extended real number system,⁴ consisting of the real numbers
with the two additional numbers −∞ and ∞. This extended interval
arithmetic⁵ was originally invented by William Kahan⁶ for computations with
continued fractions, but has wider use than that. Although a closed system can
be defined for the sets arising from this extended arithmetic, typically, the
complements of intervals (i.e., the unions of two semi-infinite intervals) are
immediately intersected with intervals, to obtain zero, one, or two intervals.
Interval arithmetic can then proceed using (1.2).
⁴ also known as the two-point compactification of the real numbers
The power of interval arithmetic lies in its implementation on computers. In
particular, outwardly rounded interval arithmetic allows rigorous enclosures
for the ranges of operations and functions. This makes a qualitative diﬀerence
in scientiﬁc computations, since the results are now intervals in which the
exact result must lie. It also enables use of ﬂoating point computations for
automated theorem proving.
Outward rounding can be implemented on any machine that has downward
rounding and upward rounding, such as any machine that complies with the
IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$
is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward
rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the
machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call
the expansion of the interval from rounding the lower end point down and the
upper end point up roundout error.
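Python does not expose the IEEE rounding modes for its floats, so the sketch below emulates outward rounding instead of switching modes: each end point is computed in round to nearest and then moved one floating point number outward with `math.nextafter` (available in Python 3.9 and later). This is a swapped-in technique, slightly wider than true directed rounding, but it still encloses the exact sum because the rounded result lies within half a unit in the last place of the exact one. Exact rational arithmetic from `fractions` is used only to check the enclosure.

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Interval sum with end points pushed outward by one floating point number."""
    lo = math.nextafter(x[0] + y[0], -math.inf)  # emulates downward rounding
    hi = math.nextafter(x[1] + y[1], math.inf)   # emulates upward rounding
    return (lo, hi)

x = (0.1, 0.1)   # degenerate intervals holding the doubles nearest 0.1 and 0.2
y = (0.2, 0.2)
z = add_outward(x, y)

exact = Fraction(0.1) + Fraction(0.2)   # exact sum of the two stored doubles
print(z)
print(Fraction(z[0]) <= exact <= Fraction(z[1]))
```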
Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are
intervals, then
\[ \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but } \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4} \]
As a result, algebraic expressions that would be equivalent if real values are
substituted for the variables are not equivalent if interval values are used. For
example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we
write
\[ f(x) = x^2 - 1, \tag{1.5} \]
and suppose we provide a routine that computes an enclosure for the range
of $x^2$ that is the exact range to within roundoff error. Such a routine could
be as follows:
ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which
encloses the range of $x^2$.)
INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over
$\boldsymbol{x}$.
IF $\underline{x} \ge 0$ THEN
    RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and
    $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
    RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and
    $\underline{x}^2$ is computed with upward rounding.
ELSE
    1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is,
       compute $\underline{x}^2_l$ and $\underline{x}^2_u$ such that $\underline{x}^2_l$ and $\underline{x}^2_u$ are machine representable
       numbers and $\underline{x}^2 \in [\underline{x}^2_l, \underline{x}^2_u]$, and compute $\overline{x}^2_l$ and $\overline{x}^2_u$ such that $\overline{x}^2_l$ and $\overline{x}^2_u$ are
       machine representable numbers and $\overline{x}^2 \in [\overline{x}^2_l, \overline{x}^2_u]$.
    2. RETURN $[0, \max\{\underline{x}^2_u, \overline{x}^2_u\}]$.
END IF
END ALGORITHM 1.1.
⁵ There are small differences in current definitions of extended interval arithmetic. For
example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive
symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R\(−1/3, 1/4). See
[31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard
With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
\[ f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3], \]
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$.
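A possible Python rendering of Algorithm 1.1 is sketched below. As an assumption, directed rounding is emulated by computing in round to nearest and stepping one floating point number outward with `math.nextafter` (Python 3.9 and later), which is slightly wider than true downward and upward rounding. The clamp of the lower bound at 0 is an addition that uses the fact that squares are nonnegative. The names `sqr_interval` and `f_sue` are hypothetical.

```python
import math

def down(v):
    """One floating point number below v (emulated downward rounding)."""
    return math.nextafter(v, -math.inf)

def up(v):
    """One floating point number above v (emulated upward rounding)."""
    return math.nextafter(v, math.inf)

def sqr_interval(x):
    """Enclosure of the range of x^2 over the interval x = (lo, hi)."""
    lo, hi = x
    if lo >= 0.0:                       # interval lies right of 0
        return (max(0.0, down(lo * lo)), up(hi * hi))
    if hi <= 0.0:                       # interval lies left of 0
        return (max(0.0, down(hi * hi)), up(lo * lo))
    # interval contains 0 in its interior: the minimum of x^2 is exactly 0
    return (0.0, max(up(lo * lo), up(hi * hi)))

def f_sue(x):
    """f(x) = x^2 - 1 as in (1.5), with outward rounding of the subtraction."""
    s = sqr_interval(x)
    return (down(s[0] - 1.0), up(s[1] - 1.0))

print(f_sue((-2.0, 2.0)))
```

The result encloses the exact range $[-1, 3]$ and exceeds it only by the roundout error of a few units in the last place.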
In fact, this illustrates a general principle: If each variable in the expression
occurs only once, then interval arithmetic gives the exact range, to within
roundout error. We state this formally as
THEOREM 1.8
(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is
an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program
with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with
interval arithmetic. The algebraic expression or computer program can
contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$,
and $\log(x)$, etc., as long as the interval values of these functions contain their
range over the input intervals. Then
1. The interval value $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the
   interval vector (or box) $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.)
   have interval values that represent their exact ranges, and if each
   variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the
   values of $f$ obtained by interval arithmetic represent the exact ranges of
   $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once,
then overestimation of the range can occur due to interval dependency. For
example, when we evaluate our example function $f([-2, 2])$ according to (1.3),
the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the
second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$
is the exact range of $\tilde{f}(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent,
$x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.
We now present some deﬁnitions and theorems to clarify the practical con-
sequences of interval dependency.
DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so
that each variable occurs only once is called a single use expression, or SUE.
Fortunately, we do not need to transform every expression into a single use
expression for interval computations to be of value. In particular, the interval
dependency becomes less as the widths of the input intervals become smaller.
The following formal definition will help us to describe this precisely.
DEFINITION 1.7 Suppose an interval evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives
$[a, b]$ as a result interval, but the exact range $\{f(x_1, \ldots, x_n),\ x_i \in \boldsymbol{x}_i,\ 1 \le i \le n\}$
is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval
evaluation $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d)$.
For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$
over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have
THEOREM 1.9
Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through
$x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this
expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain
an interval enclosure $\boldsymbol{f}(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ to the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$.
Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
\[ E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\Bigl( \max_{1 \le i \le n} w(\boldsymbol{x}_i) \Bigr), \]
where $w(\boldsymbol{x})$ denotes the width of the interval $\boldsymbol{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments
to the function becomes smaller.
Interval evaluations as in Theorem 1.9 are termed ﬁrst-order interval ex-
tensions. It is not diﬃcult to obtain second-order extensions, where required.
(See Exercise ?? below.)
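The first-order behavior in Theorem 1.9 can be observed numerically for the running example $f(x) = (x + 1)(x - 1)$. The Python sketch below evaluates the expression over symmetric intervals $[-h, h]$ and compares the excess width of Definition 1.7 with the width $2h$. It uses plain floating point arithmetic without outward rounding, the exact-range formula assumes the interval contains 0, and all function names are hypothetical.

```python
def imul(x, y):
    """Interval product from the four end point products."""
    p = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(p), max(p))

def f_interval(x):
    """Interval evaluation of (x + 1)(x - 1) over x = (lo, hi)."""
    return imul((x[0] + 1.0, x[1] + 1.0), (x[0] - 1.0, x[1] - 1.0))

def f_exact_range(x):
    """Exact range of x^2 - 1 over (lo, hi), assuming lo <= 0 <= hi."""
    return (-1.0, max(x[0] ** 2, x[1] ** 2) - 1.0)

def excess_width(x):
    """E = (c - a) + (b - d) from Definition 1.7."""
    a, b = f_interval(x)
    c, d = f_exact_range(x)
    return (c - a) + (b - d)

for h in (2.0, 0.1, 0.01, 0.001):
    x = (-h, h)
    print(h, excess_width(x), excess_width(x) / (2.0 * h))
```

For $h < 1$ the excess width works out to $4h - h^2$, so its ratio to the width $2h$ tends to 2 as $h \to 0$, consistent with a bound proportional to the width.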
1.3.2 Application of Interval Arithmetic: Examples
We give one such example here.
Example 1.22
Using 4-digit decimal floating point arithmetic, compute an interval enclosure
for the first two digits of $e$, and prove that these two digits are correct.
Solution: The fifth degree Taylor polynomial representation for $e$ is
\[ e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi}, \]
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$
is an increasing function of $x$, then the error term is bounded by
\[ \left| \frac{1}{6!} e^{\xi} \right| \le \frac{3}{6!} < 0.005, \]
so this fifth-degree polynomial representation should be adequate. We will
evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$.
We obtain the following computation:
[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000]/[2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000]/[6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000]/[24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000]/[120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000]/[720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a
mathematical proof that e ∈ [2.716, 2.723].
Note:
1. These computations can be done automatically on a computer, as simply
as evaluating the function in ﬂoating point arithmetic. We will explain
some programming techniques for this in Chapter 6, Section 6.2.
2. The solution is illustrative. More sophisticated methods, such as argument
reduction, would be used in practice to bound values of $e^x$ more
accurately and with fewer operations.
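As a rough cross-check of Example 1.22, the following matlab sketch repeats the computation in ordinary double precision. It is not a proof, since no outward rounding is performed here, and the variable names (s, lo, hi, inside, consistent) are our own, chosen only for illustration.

```matlab
% Cross-check of Example 1.22 in double precision (no outward rounding,
% so this is a sanity check, not a rigorous enclosure).
k = 0:5;
s = sum(1 ./ factorial(k));      % 1 + 1 + 1/2! + ... + 1/5!
lo = s + 1/factorial(6);         % remainder term with e^xi replaced by 1
hi = s + 3/factorial(6);         % remainder term with e^xi replaced by 3
fprintf('Taylor bounds: [%.6f, %.6f]\n', lo, hi);
fprintf('exp(1)       :  %.6f\n', exp(1));
% The bounds should contain exp(1) and lie inside the enclosure
% [2.716, 2.723] obtained with 4-digit outward rounding.
inside = (lo <= exp(1)) && (exp(1) <= hi);
consistent = (2.716 <= lo) && (hi <= 2.723);
```

The double precision bounds are narrower than [2.716, 2.723] because the 4-digit computation widens the interval at every step to account for rounding.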
Proofs of the theorems, as well as greater detail, appear in various texts
on interval arithmetic. A good book on interval arithmetic is R. E. Moore’s
classic text [27] although numerous more recent monographs and reviews are
available. A World Wide Web search on the term “interval computations”
will lead to some of these.
A general introduction to interval computations is [26]. That work gives
not only a complete introduction, with numerous examples and explanation
of pitfalls, but also provides examples with intlab, a free matlab toolbox for
interval computations, and reference material for intlab. If you have mat-
lab available, we recommend intlab for the exercises in this book involving
interval computations.
1.4 Programming Environments
Modern scientiﬁc computing (with ﬂoating point numbers) is usually done
with high-level “imperative” (as opposed to “functional”) programming lan-
guages. Common programming environments in use for general scientiﬁc com-
puting today are Fortran (or FORmula TRANslation), C/C++, and matlab.
Fortran is the original such language, with its origins in the late 1950’s. There
is a large body of high-quality publicly available software in Fortran for com-
mon computations in numerical analysis and scientiﬁc computing. Such soft-
ware can be found, for example, on NETLIB, at
http://www.netlib.org/
Fortran has evolved over the years, becoming a modern, multi-faceted lan-
guage with the Fortran 2003 standard. Throughout, the emphasis by both
the standardization committee and suppliers of compilers for Fortran has been
features that simplify programming of solutions to large problems in numerical
analysis and scientiﬁc computing, and features that enable high performance,
especially on computers that can process vectors and matrices eﬃciently.
The “C” language, developed in conjunction with the Unix operating system,
was originally meant to be a higher-level language for designing
and accessing the operating system, but has become far more widely used since
then. C++, appearing in the late 1980’s, was the first widely available
language (see footnote 7) to allow the object-oriented programming paradigm.
In recent years, computer science departments have favored teaching C++ over
teaching Fortran, and Fortran has fallen out of favor in relative terms. However,
Fortran is still favored in certain large-scale applications such as fluid
dynamics (e.g. in weather prediction and similar simulations), and some courses
are still offered in it in engineering schools. Nevertheless, some people still
think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.
Reasonably high-quality compilers for both Fortran and C/C++ are avail-
able free of charge with Linux operating systems. Fortran 2003, largely im-
plemented in these compilers, has a standardized interface to C, so functions
written in C can be called from Fortran programs, and vice versa. These
compilers include interactive graphical-user-interface-oriented debuggers, such as
“insight,” available with the Linux operating system. Commercially available
compilation and debugging systems are also available under Windows.
The matlab® system has become increasingly popular over the last two
decades or so. matlab (or MATrix LABoratory) began in the early
1980’s as a National Science Foundation project, written by Cleve Moler in
FORTRAN 66, to provide an interactive environment for computing with matrices
and vectors, but has since evolved to be both an interactive environment
and a full-featured programming language. matlab is highly favored in courses
such as this because of the ease of programming, debugging, and general use
(such as graphing), and because of the numerous toolboxes, supplied by both
Mathworks (Cleve Moler’s company) and others, for many basic computing
tasks and applications. The main drawback to use of matlab in all scientific
computations is that the language is interpreted, that is, matlab translates
each line of a program to machine language each time that it executes the line.
This makes complicated programs that involve nested iterations much slower
(a factor of 60 or more) than comparable programs written in a compiled
language such as C or Fortran. However, functions compiled from Fortran or
C/C++ can be called from matlab. A common strategy has been to initially
develop algorithms in matlab, then translate all or part of the program to a
compilable language, as necessary for eﬃciency.
One perceived disadvantage of matlab is that it is proprietary. Undesirable
possible consequences are that it is not free, and there is no oﬃcial guarantee
that it will be available forever, unchanged. However, its use has become
so widespread in recent years that these concerns do not seem to be major.
Several projects including “Octave” and “Scilab” have produced free products
that partially support the matlab programming language. The most widely
distributed of these, “Octave,” is integrated into Linux systems. However,
the object-oriented features of Octave are rudimentary compared to those of
matlab, and some toolboxes, such as intlab (which we will mention later)
will not function with Octave.
Footnote 7: with others, including Fortran, to follow.
Alternative systems sometimes used for scientific computing are computer
algebra systems. Perhaps the most common of these are Mathematica® and
Maple®, while a free such system under development is “SAGE.” These systems
admit a different way of thinking about programming, termed functional
programming, in which rules are deﬁned and available at all times to automat-
ically simplify expressions that are presented to the system. (In contrast, in
imperative programming, a sequence of commands is executed one after the
other.) Although these systems have become comprehensive, they are based
in computations of a diﬀerent character, rather than in the ﬂoating point
computations and linear algebra involved in numerical analysis and scientiﬁc
computing.
We will use matlab in this book to illustrate the concepts, techniques, and
applications. With newer versions of matlab, a student can study how to
use the system and make programs largely by using the matlab help system.
The ﬁrst place to turn will be the “Getting started” demos, which in newer
versions are presented as videos. There are also many books devoted to use
of matlab. Furthermore, we will be giving examples throughout this book.
matlab programs can be written as matlab scripts and matlab func-
tions.
Example 1.23
The matlab script we used to produce the table following Example 1.7 (on
page 8) is:
a = 2;
x=2;
xold=x;
err_old = 1;
for k=0:10
k
x
err = x - sqrt(2);
err
ratio = err/err_old^2
err_old = err;
x = x/2 + 1/x;
end
Example 1.24
The matlab script we used to produce the table in Example 1.8 (on page 9)
is
format long
a = 2;
x=2;
xold=x;
err_old = 1;
for k=0:25
k
x
err = x - sqrt(2);
err
ratio = err/err_old
err_old = err;
x = x - x^2/3.5 + 2/3.5;
end
An excellent alternative textbook that focuses on matlab functions is
Cleve Moler’s Numerical Computing with Matlab [25]. An on-line version,
along with “m” files, etc., is currently available at
http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to
provide both accurate predictions and insight into practical problems. This
includes understanding, predicting, and managing or controlling the evolution
of ecological systems and epidemics; designing and constructing durable
but inexpensive bridges, buildings, roads, and water control structures;
understanding chemical and physical processes; designing chemical plants and
electronic components and systems; and minimizing costs or maximizing delivery
of products or services within companies and governments. To achieve these
goals, the numerical methods are a small part of the overall modeling process,
which can be viewed as consisting of the following steps.
Identify the problem: This is the ﬁrst step in translating an often vague
situation into a mathematical problem to be solved. What questions
must be answered and how can they be quantiﬁed?
Assumptions: Which factors are to be ignored and which are important?
The real world is usually signiﬁcantly more complicated than mathe-
matical models of it, and simpliﬁcations must be made, because some
factors are poorly understood, because there isn’t enough data to de-
termine some minor factors, or because it is not practical to accurately
solve the resulting equations unless the model is simpliﬁed. For exam-
ple, the theory of relativity and variations in the acceleration of gravity
due to the fact that the earth is not exactly round and due to the fact
that the density diﬀers from point to point on the surface of the earth
in principle will aﬀect the trajectory of a baseball as it leaves the bat.
However, such eﬀects can be ignored when we write down a model of
the trajectory of the baseball. On the other hand, we need to include
such eﬀects if we are measuring the change in distance between two
satellites in a tandem orbit to detect the location of mineral deposits on
the surface of the earth.
Construction: In this step, we actually translate the problem into mathe-
matical language.
Analysis: We solve the mathematical problem. Here is where the numerical
techniques in this book come into play. With more complicated models,
there is an interplay between the previous three steps and this solution
process: We may need to simplify the process to enable practical so-
lution. Also, presentation of the result is important here, to maximize
the usefulness of the results. In the early days of scientific computing,
printouts of numbers were used, but increasingly, results are presented as
two- and three-dimensional graphs and movies.
Interpretation: The numerical solution is compared to the original problem.
If it does not make sense, go back and reformulate the assumptions.
Validation: Compare the model to real data. For example, in climate mod-
els, the model might be used to predict climate changes in past years,
before it is used to predict future climate changes.
Note that there is an intrinsic error introduced in the modeling process
(such as when certain phenomena are ignored), that is outside the scope of
our study of numerical methods. Such error can only be measured indirectly
through the interpretation and validation steps. In the model solution process
(the “analysis” step), errors are also introduced. We have seen that such error
consists of approximation error and roundoff error. In a study of numerical
methods and numerical
analysis, we quantify and ﬁnd bounds on such errors. Although this may not
be the major source of error, it is important to know. Consequences of this
type of error might be that a good model is rejected, that incorrect conclu-
sions are deduced about the process being modeled, etc. The authors of this
book have personal experience with these events.
Errors in the modeling process can sometimes be quantiﬁed in the solution
process. If the model depends on parameters that are not known precisely, but
bounds on those parameters are known, knowledge of these bounds can some-
times be incorporated into the mathematical equations, and the set of possible
solutions can sometimes be computed or bounded. One tool that sometimes
works is interval arithmetic. Other tools, less mathematically deﬁnite but
applicable in diﬀerent situations, are statistical methods and computing solu-
tions to the model for many diﬀerent values of the parameter.
Throughout this book, we introduce applications from many areas.
Example 1.25
The formula for the net capacitance when two capacitors of values $x$ and $y$
are connected in series is
\[ z = \frac{xy}{x + y}. \]
Suppose the measured values of x and y are x = 1 and y = 2, respectively.
Estimate the range of possible values of z, given that the true values of x and
y are known to be within ±10% of the measured value.
In this example, the identiﬁcation, assumptions, and construction have al-
ready been done. (It is well known how capacitances in a linear electrical
circuit behave.) We are asked to analyze the error in the output of the com-
putation, due to errors in the data. We may proceed using interval arithmetic,
relying on the accuracy assumptions for the measured values. In particular,
these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug
these intervals into the expression for z, but we ﬁrst use Theorem 1.8, part (2)
as a guide to rewrite the expression for z so x and y only occur once. (We
do this so we obtain sharp bounds on the range, without overestimation.)
Dividing the numerator and denominator for $z$ by $xy$, we obtain
\[ z = \frac{1}{\frac{1}{x} + \frac{1}{y}}. \]
We use the intlab toolbox (see footnote 8) for matlab to evaluate z. We have
the following dialog in matlab’s command window.
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[ 0.8999, 1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[ 1.7999, 2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[ 0.5999, 0.7334]
>> format long
>> z
intval z =
[ 0.59999999999999, 0.73333333333334]
>>
Thus, the capacitance must lie between 0.5999 and 0.7334.
Footnote 8: If one has matlab, intlab is available free of charge for
non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Note that x and y are input as strings. This is to assure that roundoﬀ
errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal
binary format are taken into account. See [26] for more examples of the use
of intlab.
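If intlab is not available, the effect of rewriting the expression so that x and y occur only once can still be illustrated in plain matlab. The following is only a sketch under stated assumptions: it works with ordinary floating point endpoints and does no outward rounding, it relies on x and y being positive (so that interval products and quotients are determined by the endpoints), and the names naive_lo, sharp_lo, and so on are hypothetical, introduced just for this illustration.

```matlab
% Endpoints of the data intervals from Example 1.25.
xl = 0.9; xu = 1.1;
yl = 1.8; yu = 2.2;

% Naive interval evaluation of z = x*y/(x+y): x and y each occur twice.
% For positive intervals, [p,q]/[r,s] = [p/s, q/r].
naive_lo = (xl*yl) / (xu + yu);
naive_hi = (xu*yu) / (xl + yl);

% Rewritten form z = 1/(1/x + 1/y): x and y each occur once, and z is
% increasing in both x and y, so the range is attained at the endpoints.
z = @(x, y) 1 ./ (1./x + 1./y);
sharp_lo = z(xl, yl);
sharp_hi = z(xu, yu);

% Sampling many parameter values gives a non-rigorous inner estimate.
[X, Y] = meshgrid(linspace(xl, xu, 41), linspace(yl, yu, 41));
Z = z(X, Y);

fprintf('naive   : [%.4f, %.4f]\n', naive_lo, naive_hi);
fprintf('sharp   : [%.4f, %.4f]\n', sharp_lo, sharp_hi);
fprintf('sampled : [%.4f, %.4f]\n', min(Z(:)), max(Z(:)));
```

The naive bounds are valid but noticeably wider than the range, which is the overestimation that the rewriting avoids.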
1.6 Exercises
1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for
$-0.2 \le x \le 0.2$, where
\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \]
Note: sinc(x) = S(πx) = sin(πx)/(πx) is the “sinc” function (well-
known in signal processing, etc.).
(a) Show that your polynomial p satisfies the condition
$|\mathrm{sinc}(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$.
Hint: You can obtain polynomial approximations with error terms
for sinc(x) by writing down Taylor polynomials and corresponding
error terms for sin(x), then dividing these by x. This can be easier
than trying to diﬀerentiate sinc(x). For the proof part, you can
use, for example, the Taylor polynomial remainder formula or the
alternating series test and you can use interval arithmetic to obtain
bounds.
(b) Plot your polynomial approximation and sinc(x) on the same graph,
(i) over the interval [−0.2, 0.2],
(ii) over the interval [−3, 3],
(iii) over the interval [−10, 10].
2. Suppose f has a continuous third derivative. Show that
\[ \left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2). \]
3. Suppose f has a continuous fourth derivative. Show that
\[ \left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2). \]
4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer
arithmetic with rounding, show that
\[ \frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c} \]
when using this arithmetic.
5. Write down a formula relating the unit roundoff δ of Definition 1.3
(page 14) and the machine epsilon $\epsilon_m$ defined on page 20.
6. Store and run the following matlab script. What are your results?
What does the script compute? Can you say anything about the com-
puter arithmetic underlying matlab?
eps = 1;
x = 1+eps;
while(x~=1)
eps = eps/2;
x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1
7. Suppose, for illustration, we have a system with base β = 10, t = 3
decimal digits in the mantissa, and L = −9, U = 9 for the exponent.
For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this
system. Suppose also that “round to nearest” is used in this system.
(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let f(x) = sin(x) + 1.
i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format
for this toy system.
ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what
is the nearest machine number to the exact value of
f(0.0008) − f(0)?
iii. Compute fl(fl(f(0.0008)) − fl(f(0)))/fl(0.0008). Compare this
to the nearest machine number to the exact value of
(f(0.0008) − f(0))/0.0008 and to $f'(0)$.
8. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{2}$.
(a) Use four-digit decimal arithmetic with rounding to evaluate
f(100,000).
(b) Use the Mean Value Theorem to approximate f(x) in a form that
avoids the loss of signiﬁcant digits. Use this form to evaluate f(x)
for x = 100,000 once again.
(c) Compare the relative errors for the answers obtained in (a) and
(b).
9. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, $x > 1$, and discuss
any possible ill-conditioning.
10. Let $f(x) = (\sin(x))^2 + x/2$. Use interval arithmetic to prove that there
are no solutions to f(x) = 0 for x ∈ [−1, −0.8].
Chapter 2
Numerical Solution of Nonlinear
Equations of One Variable
In this chapter, we study methods for ﬁnding approximate solutions to the
equation f(x) = 0, where f is a real-valued function of a real variable. Some
classical examples include the equation $x - \tan x = 0$ that occurs in the
diffraction of light, or Kepler’s equation $x - b \sin x = 0$ used for calculating
planetary orbits. Other examples include transcendental equations such as
$f(x) = e^x + x = 0$ and algebraic equations such as
$x^7 + 4x^5 - 7x^2 + 6x + 3 = 0$.
2.1 Bisection Method
The bisection method is simple, reliable, and almost always can be applied,
but is generally not as fast as other methods. Note that, if y = f(x), then
f(x) = 0 corresponds to the point where the curve y = f(x) crosses the x-
axis. The bisection method is based on the following direct consequence of
the Intermediate Value Theorem.
THEOREM 2.1
Suppose that f ∈ C[a, b] and f(a)f(b) < 0. Then there is a z ∈ [a, b] such
that f(z) = 0. (See Figure 2.1.)
The method of bisection is simple to implement as illustrated in the following
algorithm:
ALGORITHM 2.1
(The bisection algorithm)
INPUT: An error tolerance $\epsilon$
OUTPUT: Either a point x that is within $\epsilon$ of a solution z or “failure to find
a sign change”
1. Find a and b such that f(a)f(b) < 0. (By Theorem 2.1, there is a
z ∈ [a, b] such that f(z) = 0.) (Return with “failure to find a sign
change” if such an interval cannot be found.)
2. Let $a_0 = a$, $b_0 = b$, $k = 0$.
3. Let $x_k = (a_k + b_k)/2$.
4. IF $f(x_k) f(a_k) > 0$ THEN
(a) $a_{k+1} \leftarrow x_k$,
(b) $b_{k+1} \leftarrow b_k$.
ELSE
(a) $b_{k+1} \leftarrow x_k$,
(b) $a_{k+1} \leftarrow a_k$.
END IF
5. IF $(b_k - a_k)/2 < \epsilon$ THEN
Stop, since $x_k$ is within $\epsilon$ of z. (See the explanation below.)
ELSE
(a) $k \leftarrow k + 1$.
(b) Return to step 3.
END IF
END ALGORITHM 2.1.
FIGURE 2.1: Example for the Intermediate Value Theorem applied to
roots of a function. (The curve y = f(x) crosses the x-axis at z, between a and b.)
Basically, in the method of bisection, the interval $[a_k, b_k]$ contains $z$ and
$b_k - a_k = (b_{k-1} - a_{k-1})/2$. The interval containing $z$ is reduced by a factor
of 2 at each iteration.
Note: In practice, when programming bisection, we usually do not store the
numbers $a_k$ and $b_k$ for all $k$ as the iteration progresses. Instead, we usually
store just two numbers a and b, replacing these by new values, as indicated
in Step 4 of our bisection algorithm (Algorithm 2.1).
FIGURE 2.2: Graph of $e^x + x$ for Example 2.1.
Example 2.1
$f(x) = e^x + x$, $f(0) = 1$, $f(-1) = -0.632$. Thus, $-1 < z < 0$. (There is
a unique zero, because $f'(x) = e^x + 1 > 0$ for all $x$.) Setting $a_0 = -1$ and
$b_0 = 0$, we obtain the following table of values.
k    a_k       b_k        x_k
0    −1        0          −1/2
1    −1        −1/2       −3/4
2    −3/4      −1/2       −0.625
3    −0.625    −0.500     −0.5625
4    −0.625    −0.5625    −0.59375
Thus z ∈ (−0.625, −0.5625); see Figure 2.2.
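The table in Example 2.1 can be reproduced with a few lines of matlab. The following is a minimal sketch of Steps 3 and 4 of Algorithm 2.1 with a fixed number of iterations and no stopping test; it stores only the two numbers a and b, and the formatting of the output is our own choice.

```matlab
% Bisection for f(x) = exp(x) + x on [-1, 0], printing a_k, b_k, x_k.
f = @(x) exp(x) + x;
a = -1;
b = 0;
fprintf('%2s %10s %10s %10s\n', 'k', 'a_k', 'b_k', 'x_k');
for k = 0:4
    x = (a + b)/2;                       % Step 3: midpoint
    fprintf('%2d %10.5f %10.5f %10.5f\n', k, a, b, x);
    if f(x)*f(a) > 0                     % Step 4: keep the sign change
        a = x;
    else
        b = x;
    end
end
```

After the loop, a and b hold the endpoints that would be used for k = 5, and f still changes sign between them.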
The method always works for f continuous, as long as a and b can be found
such that f(a)f(b) < 0 (and as long as we assume roundoﬀ error does not
cause us to incorrectly evaluate the sign of f(x)). However, consider y = f(x)
with f(x) ≥ 0 for every x, but f(z) = 0. There are no a and b such that
f(a)f(b) < 0. Thus, the method is not applicable to all problems in its
present form. (See Figure 2.3 for an example of a root that cannot be found
by bisection.)
FIGURE 2.3: Example of when the method of bisection cannot be applied.
Is there a way that we can know how many iterations to do for the method
of bisection without actually performing the test in Step 5 of Algorithm 2.1?
Simply examining how the widths of the intervals decrease leads us to the
following fact.
THEOREM 2.2
Suppose that $f \in C[a, b]$ and $f(a)f(b) < 0$. Then
\[ |x_k - z| \le \frac{b - a}{2^{k+1}}. \]
Thus, in the algorithm, if $\frac{1}{2}(b_k - a_k) = (b - a)/2^{k+1} < \epsilon$, then $|z - x_k| < \epsilon$.
Example 2.2
How many iterations are required to reduce the error to less than $10^{-6}$ if
$a = 0$ and $b = 1$?
Solution: We need $\frac{1}{2^{k+1}}(1 - 0) < 10^{-6}$. Thus, $2^{k+1} > 10^6$, or $k = 19$.
This example illustrates the preferred way of stopping the method of bisection.
Namely, if the method of bisection is programmed, it is preferable to
compute an integer N such that
\[ N > \frac{\log((b - a)/\epsilon)}{\log(2)} - 1, \]
and test k > N, rather than testing the length of the interval directly as in
Step 5 of Algorithm 2.1. One reason is that integer comparisons (comparing
k to N, or doing it implicitly in a programming language loop, such as the
matlab loop for k=1:N) are more efficient than floating point comparisons.
Another reason is that, if $\epsilon$ were chosen too small (such as smaller than
the distance between machine numbers near the solution z), the comparison
in Step 5 of Algorithm 2.1 would never hold in practice, and the algorithm
would never stop.
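For instance, the iteration count in Example 2.2 might be computed as follows. This is a sketch only: the variable name tol is our own (chosen to avoid hiding matlab's built-in eps), and ceil returns the smallest integer that is at least its argument, which differs from the strict inequality for N only when the argument happens to be exactly an integer.

```matlab
% Number of bisection steps needed for [a, b] = [0, 1] and tolerance 1e-6.
a = 0;
b = 1;
tol = 1e-6;
N = ceil(log((b - a)/tol)/log(2) - 1);
bound = (b - a)/2^(N + 1);     % error bound from Theorem 2.2 with k = N
fprintf('N = %d, error bound (b-a)/2^(N+1) = %.3e\n', N, bound);
```

With these data the computed N agrees with the value k = 19 found in Example 2.2.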
The following is an example of programming Algorithm 2.1 in matlab.
function [root,success] = bisect_method (a, b, eps, f)
%
% [root, success] = bisect_method (a, b, eps, f) returns the
% result of the method of bisection, with starting interval [a, b],
% tolerance eps, and with function defined by y = f(x). For example,
% suppose an m-file xsqm2.m is available in Matlab's working
% directory, with the following contents:
% function [y] = xsqm2(x)
% y = x^2-2;
% return
% Then, issuing
% [root,success] = bisect_method (1, 2, 1e-10, 'xsqm2')
% from Matlab's command window will cause an approximation to
% the square root of 2 that, in the absence of excessive roundoff
% error, has absolute error of at most 10^{-10}
% to be returned in the variable root, and success to be set to
% 'true'.
%
% success is set to 'false' if f(a) and f(b) have the same
% sign. success is also set to 'false' if the tolerance cannot be met.
% In either case, a message is printed, and the midpoint of the present
% interval is returned in the variable root.
error=b-a;
fa=feval(f,a);
fb=feval(f,b);
success = true;
% First, handle incorrect arguments --
if (fa*fb > 0)
    disp('Error: f(a)*f(b)>0');
    success = false;
    root = a + (b-a)/2;
    return
end
if (eps <=0)
    disp('Error: eps is less than or equal to 0')
    success = false;
    root = a + (b-a)/2;
    return
end
if (b < a)
    disp('Error: b < a')
    success = false;
    root = (a+b)/2;
    return
end
% Set N to be the smallest integer such that N iterations of bisection
% suffice to meet the tolerance --
N = ceil( log((b-a)/eps)/log(2) - 1 )
% This is where we actually do Algorithm 2.1 --
disp(' -----------------------------');
disp(' Error Estimate ');
disp(' -----------------------------');
for i=1:N
    x= a + (b-a)/2;
    fx=feval(f,x);
    if(fx*fa > 0)
        a=x;
    else
        b=x;
    end
    error=b-a;
    disp(sprintf(' %12.4e %12.4e', error, x));
end
% Finally, check to see if the tolerance was actually met. (With
% additional analysis of the minimum possible relative error
% (according to the distance between floating point
% numbers), unreasonable values of epsilon can be determined
% before the loop on i, avoiding unnecessary work.)
error = (b-a)/2;
root = a + (b-a)/2;
if (error > eps)
    disp('Error: epsilon is too small for tolerance to be met');
    success = false;
    return
end
This program includes practical considerations beyond the raw mathematical
operations in the algorithm. Observe the following.
1. The comments at the beginning of the program state precisely how
the function is used. In fact, within the matlab system, if the file
bisect_method.m contains this program within the working directory
or within matlab’s search path, and one issues the command help
bisect_method from the matlab command window, all of these comments
(those lines starting with “%”) prior to the first non-comment
line are printed to the command window.
2. There are statements to catch errors in the input arguments.
When developing such computer programs, it is wise to use a uniform style
in the comments, indentation of “if” blocks and “for” loops, etc. To a large
extent, matlab’s editor does indentation automatically, and automatically
highlights comments and syntax elements such as “if” and “for” in different
colors. It is also a good idea to identify the author and date programmed, as
well as the package (if any) to which the program belongs. This is done for
bisect_method.m in the version posted on the web page for the book, but is
not reproduced here, for brevity.
The above implementation, stored in a matlab “m-file,” is an example of
a matlab function, that is, an m-file that begins with a function statement.
In such a file, quantities that are to be returned must appear in the list in
brackets on the left of the “=,” while quantities that are input must appear in
the list in parentheses on the right of the statement. In a function m-file, the
only quantities from the command line that are available while the operations
within the m-file are being done are those passed in the parenthesized list on
the right, and the only quantities available to the command environment (or
other function) from which the function is called are the ones in the bracketed
list on the left. For example, consider the following dialog in the matlab
command window.
>> eps = 1e-16
eps =
1.0000e-016
>> [root,success] = bisect_method(1,2,1e-2,'xsqm2')
N =
6
-----------------------------
Error Estimate
-----------------------------
5.0000e-001 1.5000e+000
2.5000e-001 1.2500e+000
1.2500e-001 1.3750e+000
6.2500e-002 1.4375e+000
3.1250e-002 1.4063e+000
1.5625e-002 1.4219e+000
root =
1.4141
success =
1
>> N
??? Undefined function or variable 'N'.
>> eps
eps =
1.0000e-016
>>
Observe that N is not available within the environment calling bisect_method,
and the value of eps set in the command window is not available within
bisect_method. This contrasts with matlab m-files that do not begin with a
function statement. These files are termed matlab scripts. For example, the
script run_bisect_method.m might contain the following lines.
clear
a = 1
b = 2
eps = 1e-1
[root,success] = bisect_method(a, b, eps, 'xsqm2')
The clear command removes all quantities from the environment. Observe
now the following dialog in the matlab command window.
>> clear
>> a
??? Undefined function or variable 'a'.
>> b
??? Undefined function or variable 'b'.
>> run_bisect_method
a =
1
b =
2
eps =
0.1000
N =
3
-----------------------------
Error Estimate
-----------------------------
5.0000e-001 1.5000e+000
2.5000e-001 1.2500e+000
1.2500e-001 1.3750e+000
root =
1.4375
success =
1
>> a
a =
1
>> b
b =
2
>> eps
eps =
0.1000
>>
The reader is invited to use matlab’s help system to explore the other aspects
of the function bisect_method.
We end our present discussion of matlab programs with a note on the use
of the symbol “=.” In statements entered into the command line and in m-files,
= means “store the computed contents to the right of the = into the variable
represented by the symbol on the left.” In quantities printed by the matlab
system, “=” means “the value stored in the memory locations represented by
the printed symbol is approximately equal to the printed quantity.” Note that
this is significantly different from the meaning that a mathematician attaches
to the symbol. For example, the approximations might not be close enough
for our purposes to the intended value, due to roundoff error or other errors,
or even due to error in conversion from the internal binary form to the printed
decimal form.
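A short illustration of this distinction follows, assuming matlab's default short display format. The particular numbers 0.1, 0.2, and 0.3 are our own choice; they are convenient because none of them is exactly representable in binary.

```matlab
% Assignment stores the computed right-hand side in the variable x.
x = 0.1 + 0.2;
% The short display format shows 0.3000, which is only approximately
% the stored value.
disp(x)
% Comparing with the double nearest to 0.3 gives false (logical 0).
same = (x == 0.3);
disp(same)
% Printing more digits reveals the difference.
fprintf('%.17g\n', x);
```

Here the printed “0.3000” hides a difference in the last binary digits of the stored value.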
2.2 The Fixed Point Method
The so-called “fixed point method” is a very general way of viewing
computational processes involving equations, systems of equations, and equilibria.
We introduce it here, and will see it again when we study systems of linear
and nonlinear equations. It is also seen in more advanced studies of systems
of diﬀerential equations.
DEFINITION 2.1 z ∈ G is a ﬁxed point of g if g(z) = z.
REMARK 2.1 If f(x) = g(x) −x, then a ﬁxed point of g is a zero of f.
The fixed-point iteration method is defined by the following: For $x_0 \in G$,
\[ x_{k+1} = g(x_k) \quad \text{for } k = 0, 1, 2, \ldots. \]
Example 2.3
Suppose
\[ g(x) = \frac{1}{2}(x + 1). \]
Then, starting with $x_0 = 0$, fixed point iteration becomes
\[ x_{k+1} = \frac{1}{2}(x_k + 1), \]
and the first few iterates are $x_0 = 0$, $x_1 = 1/2$, $x_2 = 3/4$, $x_3 = 7/8$,
$x_4 = 15/16$, $\ldots$. We see that this iteration converges to $z = 1$.
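The iterates in Example 2.3 can be generated with a short loop. This is a minimal sketch; the number of iterations shown and the output format are arbitrary choices of ours.

```matlab
% Fixed point iteration x_{k+1} = g(x_k) for g(x) = (x + 1)/2, x_0 = 0.
g = @(x) (x + 1)/2;
x = 0;
for k = 0:4
    fprintf('x_%d = %g\n', k, x);
    x = g(x);
end
% After the loop, x holds the next iterate x_5.
```

Each step halves the distance to the fixed point z = 1, consistent with linear convergence.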
Example 2.4
If f is as in Example 2.1 on page 41, a corresponding g is $g(x) = -e^x$. We
can study fixed point iteration with this g with the following matlab dialog.
>> x = -0.5
x =
-0.5000
>> x = -exp(x)
x =
-0.6065
>> x = -exp(x)
x =
-0.5452
>> x = -exp(x)
x =
-0.5797
>> x = -exp(x)
x =
-0.5601
>> x = -exp(x)
x =
-0.5712
>> x = -exp(x)
x =
-0.5649
>>
(Here, we can recall the expression x = - exp(x) by simply pressing the
up-arrow button on the keyboard.) We observe a convergence in which the
approximation appears to alternate about the limit, but the convergence does
not appear to be quadratic.
An important question is: when does $\{x_k\}_{k=0}^{\infty}$ converge to $z$, a fixed point
of $g$? Fixed-point iteration does not always converge. Consider $g(x) = x^2$,
whose fixed points are $x = 0$ and $x = 1$. If $x_0 = 2$, then $x_{k+1} = x_k^2$, so $x_1 = 4$,
$x_2 = 16$, $x_3 = 256$, $\ldots$.
Although it is tempting to pose problems as ﬁxed point iteration, the ﬁxed
point iterates do not always converge. We talk about convergence of ﬁxed
point iteration in terms of Lipschitz constants.
DEFINITION 2.2 g satisfies a Lipschitz condition on G if there is a
Lipschitz constant L ≥ 0 such that
\[ |g(x) - g(y)| \le L|x - y| \quad \text{for all } x, y \in G. \tag{2.1} \]
If g satisfies (2.1) with 0 ≤ L < 1, g is said to be a contraction on the set
G. For differentiable functions, a common way of thinking about Lipschitz
constants is in terms of the derivative of g. For instance, it is not hard to
show (using the mean value theorem) that, if $g'$ is continuous and $|g'(x)| \le L$
for every x, then g satisfies a Lipschitz condition with Lipschitz constant L.
Basically, if $L < 1$ (or if $|g'| < 1$), then fixed point iteration converges. In
fact, in such instances,
\[ |x_{k+1} - z| = |g(x_k) - g(z)| \le L|x_k - z|, \]
so fixed point iteration is linearly convergent with convergence factor C = L.
(Later, we state conditions under which the convergence is faster than linear.)
This is embodied in the following theorem.
THEOREM 2.3
(Contraction Mapping Theorem in one variable) Suppose that g maps G into
itself (i.e., if x ∈ G then g(x) ∈ G) and g satisfies a Lipschitz condition with
0 ≤ L < 1 (i.e., g is a contraction on G). Then, there is a unique z ∈ G such
that z = g(z), and the sequence determined by $x_0 \in G$, $x_{k+1} = g(x_k)$,
$k = 0, 1, 2, \ldots$, converges to z, with error estimates
\[ |x_k - z| \le \frac{L^k}{1 - L} |x_1 - x_0|, \quad k = 1, 2, \ldots \tag{2.2} \]
\[ |x_k - z| \le \frac{L}{1 - L} |x_k - x_{k-1}|, \quad k = 1, 2, \ldots \tag{2.3} \]
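The error estimates (2.2) and (2.3) can be checked numerically on the iteration of Example 2.4. The following is a sketch under assumptions of our own: we take G = [−0.7, −0.4], on which $g(x) = -e^x$ is decreasing with g(G) ⊂ G, and we use $L = e^{-0.4}$ because $|g'(x)| = e^x \le e^{-0.4}$ on G. The reference value z is obtained simply by iterating many times.

```matlab
% Check the a priori bound (2.2) and a posteriori bound (2.3)
% for g(x) = -exp(x) on the assumed set G = [-0.7, -0.4].
g = @(x) -exp(x);
L = exp(-0.4);          % assumed Lipschitz constant on G
x0 = -0.5;

% Reference fixed point: iterate until changes are negligible.
z = x0;
for k = 1:200
    z = g(z);
end

x = zeros(1, 7);        % x(k+1) stores the iterate x_k
x(1) = x0;
for k = 1:6
    x(k+1) = g(x(k));
end

ok = true;
for k = 1:6
    err = abs(x(k+1) - z);
    apriori = L^k/(1 - L) * abs(x(2) - x(1));
    aposteriori = L/(1 - L) * abs(x(k+1) - x(k));
    fprintf('k=%d  error=%.2e  (2.2)=%.2e  (2.3)=%.2e\n', ...
            k, err, apriori, aposteriori);
    ok = ok && (err <= apriori) && (err <= aposteriori);
end
```

In this run both bounds hold at every step, and the a posteriori bound (2.3) is typically the tighter of the two once a few iterations have been done.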
Example 2.5
Suppose
\[ g(x) = -\frac{x^3}{6} + \frac{x^5}{120}, \]
and suppose we wish to find a Lipschitz constant for $g$ over the interval
$[-1/2, 1/2]$.
We will proceed by an interval evaluation of $g'$ over $[-1/2, 1/2]$. Since
$g'(x) = -x^2/2 + x^4/24$, we have
\[
\begin{aligned}
g'([-1/2, 1/2]) &\subseteq -\frac{1}{2}[-1/2, 1/2]^2 + \frac{1}{24}[-1/2, 1/2]^4 \\
&= -\frac{1}{2}[0, 1/4] + \frac{1}{24}[0, 1/16] = [-1/8, 0] + [0, 1/384] \\
&\subseteq [-0.125, 0] + [0, 0.002605] \subseteq [-0.125, 0.00261].
\end{aligned}
\]
Thus, since $|g'(x)| \le \max_{y \in [-0.125, 0.00261]} |y| = 0.125$, $g$ satisfies a Lipschitz
condition with Lipschitz constant 0.125.
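The bound from Example 2.5 can be sanity-checked numerically. The following is a minimal sketch, not a rigorous computation: it samples $g'$ on a grid (the grid size 1001 is an arbitrary choice made here) and compares the largest sampled value of $|g'|$ with the interval bound 0.125. Sampling can only suggest that a bound holds; the interval evaluation is what proves it.

```matlab
% Sample g'(x) = -x^2/2 + x^4/24 on [-1/2, 1/2] and compare with the
% interval bound 0.125 from Example 2.5. (Illustrative check only.)
gprime = @(x) -x.^2/2 + x.^4/24;
xs = linspace(-0.5, 0.5, 1001);     % grid includes both end points
maxabs = max(abs(gprime(xs)));      % largest sampled |g'(x)|
fprintf('max sampled |g''| = %.6f, interval bound = %.6f\n', maxabs, 0.125);
```

The sampled maximum is attained at the end points and is slightly smaller than 0.125, which illustrates that an interval evaluation gives a valid, but not necessarily sharp, bound.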
If $g$ is a contraction for all real numbers $x$, then the hypotheses of the
contraction mapping theorem are automatically satisfied, and fixed point
iteration converges for any starting point $x_0$. (That is, the domain $G$ can be
taken to be the set of all real numbers.) On the other hand, if $G$ must be
restricted (such as if $g$ is not a contraction everywhere or if $g$ is not defined
everywhere), then, to be assured that fixed point iteration converges, we need
to know that $g$ maps $G$ into itself. Two ways of verifying this are given by
the following two theorems.
THEOREM 2.4
Let $\rho > 0$ and $G = [c - \rho, c + \rho]$. Suppose that $g$ is a contraction on $G$ with
Lipschitz constant $L$, $0 \le L < 1$, and
\[ |g(c) - c| \le (1 - L)\rho. \]
Then $g$ maps $G$ into itself.
THEOREM 2.5
Assume that $z$ is a solution of $x = g(x)$, $g'(x)$ is continuous in an interval
about $z$, and $|g'(z)| < 1$. Then $g$ is a contraction in a sufficiently small
interval about $z$, and $g$ maps this interval into itself. Thus, provided $x_0$ is
picked sufficiently close to $z$, the iterates will converge.
Example 2.6
Let
\[ g(x) = \frac{x}{2} + \frac{1}{x}. \]
Can we show that the fixed point iteration $x_{k+1} = g(x_k)$ converges for any
starting point $x_0 \in [1, 2]$? We will use Theorem 2.4 and Theorem 2.3 to show
convergence. In particular, $g'(x) = 1/2 - 1/x^2$. Evaluating $g'(x)$ over $[1, 2]$
with interval arithmetic, we obtain
\[
\begin{aligned}
g'([1, 2]) &\subseteq \frac{1}{2} - \frac{1}{[1, 2]^2}
= \left[\frac{1}{2}, \frac{1}{2}\right] - \frac{1}{[1, 4]}
= \left[\frac{1}{2}, \frac{1}{2}\right] - \left[\frac{1}{4}, 1\right] \\
&= \left[\frac{1}{2}, \frac{1}{2}\right] + \left[-1, -\frac{1}{4}\right]
= \left[-\frac{1}{2}, \frac{1}{4}\right].
\end{aligned}
\]
Thus, since $g'(x) \in g'([1, 2]) \subseteq [-\frac{1}{2}, \frac{1}{4}]$ for every $x \in [1, 2]$,
\[ |g'(x)| \le \max_{y \in [-\frac{1}{2}, \frac{1}{4}]} |y| = \frac{1}{2} \]
for every $x \in [1, 2]$. Thus, $g$ is a contraction on $[1, 2]$ with $L = 1/2$. Furthermore,
letting $\rho = 1/2$ and $c = 3/2$, $|g(3/2) - 3/2| = 1/12 \le 1/4 = (1 - L)\rho$. Thus, by
Theorem 2.4, $g$ maps $[1, 2]$ into $[1, 2]$. Therefore, we can conclude from
Theorem 2.3 that the fixed point iteration converges for any starting point
$x_0 \in [1, 2]$ to the unique fixed point $z = g(z)$.
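The check in Example 2.6 is short enough to reproduce in matlab. The sketch below evaluates the hypothesis of Theorem 2.4 with $c = 3/2$, $\rho = 1/2$, and $L = 1/2$, and then runs a few iterations from $x_0 = 1$; the starting point and the iteration count of 6 are arbitrary choices made here for illustration. Solving $z = z/2 + 1/z$ by hand gives $z = \sqrt{2}$, which the iterates can be compared against.

```matlab
% Hypothesis of Theorem 2.4 for g(x) = x/2 + 1/x on G = [1, 2].
g = @(x) x/2 + 1./x;
c = 3/2;  rho = 1/2;  L = 1/2;      % center, radius, Lipschitz constant
lhs = abs(g(c) - c);                % |g(c) - c|
rhs = (1 - L)*rho;                  % (1 - L)*rho
maps_into_itself = (lhs <= rhs);    % true if Theorem 2.4 applies
fprintf('|g(c)-c| = %.6f, (1-L)*rho = %.6f\n', lhs, rhs);

% A few fixed point iterations from an (arbitrary) x0 in [1, 2].
x = 1;
for k = 1:6
    x = g(x);
    fprintf('%2d  %.15f\n', k, x);
end
```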
Of course, it may be relatively easy to verify that $|g'| < 1$, after which we
may actually try fixed point iteration to see if it stays in the domain and
converges. In fact, in Theorem 2.4, we essentially do one iteration of fixed
point iteration and compare the change to the size of the region.
Example 2.7
Let $g(x) = 4 + \frac{1}{3}\sin 2x$ and $x_{k+1} = 4 + \frac{1}{3}\sin 2x_k$. Observing that
\[ |g'(x)| = \left|\frac{2}{3}\cos 2x\right| \le \frac{2}{3} \]
for all $x$ shows that $g$ is a contraction on all of $\mathbb{R}$, so we can take $G = \mathbb{R}$. Then
$g : G \to G$ and $g$ is a contraction on $\mathbb{R}$. Thus, for any $x_0 \in \mathbb{R}$, the iterations
$x_{k+1} = g(x_k)$ will converge to $z$, where $z = 4 + \frac{1}{3}\sin 2z$. For $x_0 = 4$, the
following values are obtained.
  $k$     $x_k$
   0     4
   1     4.3298
   2     4.2309
   ...   ...
  14     4.2615
  15     4.2615
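The iteration of Example 2.7 can be sketched in a few lines of matlab. The version below also evaluates the a posteriori estimate $\frac{L}{1-L}|x_k - x_{k-1}|$ from the Contraction Mapping Theorem with $L = 2/3$ and uses it as a stopping test. The tolerance and the iteration cap are illustrative assumptions, not values from the text.

```matlab
% Fixed point iteration x_{k+1} = g(x_k) for g(x) = 4 + sin(2x)/3,
% stopped when the a posteriori bound L/(1-L)*|x_k - x_{k-1}| is small.
g = @(x) 4 + sin(2*x)/3;
L = 2/3;            % Lipschitz constant, since |g'(x)| <= 2/3
tol = 1e-4;         % illustrative tolerance (assumption)
maxit = 100;        % illustrative iteration cap (assumption)
x = 4;              % starting point x_0
for k = 1:maxit
    xnew = g(x);
    bound = L/(1 - L)*abs(xnew - x);   % bound on |x_k - z|
    x = xnew;
    fprintf('%3d  %.4f  %.2e\n', k, x, bound);
    if bound < tol
        break
    end
end
```

Because the bound is guaranteed by the theorem, the final iterate is within the tolerance of the fixed point, without $z$ being known in advance.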
It is not hard to show that, if $-1 < -L \le g'(x) \le 0$ and fixed point
iterates stay within $G$, then fixed point iteration converges, with the iterates
$x_k$ alternately less than and greater than the fixed point $z = g(z)$. On the
other hand, if $0 \le g'(x) \le L < 1$ and the fixed point iterates stay within the
domain $G$, then the fixed point iterates $x_k$ converge monotonically to $z$. This
latter situation is illustrated in Figure 2.4.
There are conditions under which the convergence of fixed point iteration
is faster than linear. Recall that, if $\lim_{k \to \infty} x_k = z$ and $|x_{k+1} - z| \le c|x_k - z|^{\alpha}$, we say
$\{x_k\}$ converges to $z$ with rate of convergence $\alpha$. (We specify that $c < 1$ for
$\alpha = 1$.)
THEOREM 2.6
Assume that the iterations $x_{k+1} = g(x_k)$ converge to a fixed point $z$.
Furthermore, assume that $q$ is the first positive integer for which $g^{(q)}(z) \ne 0$ and,
if $q = 1$, then $|g'(z)| < 1$. Then the sequence $\{x_k\}$ converges to $z$ with order
$q$. (It is assumed that $g \in C^q(G)$, where $G$ contains $z$.)
FIGURE 2.4: Example of monotonic convergence of fixed point iteration.
Example 2.8
Let
\[ g(x) = \frac{x^2 + 6}{5} \]
and $G = [1, 2.3]$. Since $g'(x) = 2x/5$, the range of $g'$ is $\frac{2}{5}[1, 2.3] > 0$, so $g$
is monotonically increasing. Furthermore, $g(1) = 7/5$ and $g(2.3) = 2.258$, so
the exact range of $g$ over $[1, 2.3]$ is the interval $[1.4, 2.258] \subset [1, 2.3]$, that is, $g$
maps $G$ into $G$. Also,
\[ |g'(x)| = \left|\frac{2x}{5}\right| \le 0.92 < 1 \]
for $x \in G$. (Indeed, in this case, an interval evaluation gives $2[1, 2.3]/5 =
[0.4, 0.92]$, the exact range of $g'$ in this case, since $x$ occurs only once in the
expression for $g'$.) Theorem 2.3 then implies that there is a unique fixed point
$z \in G$. It is easy to see that the fixed point is $z = 2$. In addition, since
$g'(z) = \frac{4}{5} \ne 0$, there is a linear rate of convergence. Inspecting the values in
the following table, notice that the convergence is not fast.
  $k$    $x_k$
   0    2.2
   1    2.168
   2    2.140
   3    2.116
   4    2.095
Example 2.9
Let
\[ g(x) = \frac{x}{2} + \frac{2}{x} = \frac{x^2 + 4}{2x}, \]
a function similar in form to the one in Example 2.6. It can be shown that if
$0 < x_0 < 2$, then $x_1 > 2$. Also, $x_k > x_{k+1} > 2$ when $x_k > 2$. Thus, $\{x_k\}$ is a
monotonically decreasing sequence bounded below by 2 and hence is convergent.
Thus, for any $x_0 \in (0, \infty)$, the sequence $x_{k+1} = g(x_k)$ converges to $z = 2$.
Now consider the convergence rate. We have that
\[ g'(x) = \frac{1}{2} - \frac{2}{x^2}, \]
so $g'(2) = 0$, and
\[ g''(x) = \frac{4}{x^3}, \]
so $g''(2) \ne 0$. By Theorem 2.6, the convergence is quadratic, and as indicated
in the following table, the convergence is rapid.
  $k$    $x_k$
   0    2.2
   1    2.00909
   2    2.00002
   3    2.00000000
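The difference between linear and quadratic convergence in Examples 2.8 and 2.9 can be observed directly from error ratios. The following sketch, which assumes the known fixed point $z = 2$ and the common starting point $x_0 = 2.2$, prints $|x_{k+1} - z|/|x_k - z|$ for the first function and $|x_{k+1} - z|/|x_k - z|^2$ for the second. Only three steps are taken, since after that the quadratically convergent iterates reach the limits of double precision and the ratios become meaningless.

```matlab
% Error ratios for the iterations of Examples 2.8 and 2.9 (z = 2).
g1 = @(x) (x.^2 + 6)/5;    % Example 2.8: g'(2) = 4/5 (linear)
g2 = @(x) x/2 + 2./x;      % Example 2.9: g'(2) = 0   (quadratic)
z = 2;
x1 = 2.2;  x2 = 2.2;
for k = 1:3
    e1 = abs(x1 - z);  e2 = abs(x2 - z);
    x1 = g1(x1);       x2 = g2(x2);
    r1 = abs(x1 - z)/e1;      % approaches |g'(z)| = 0.8
    r2 = abs(x2 - z)/e2^2;    % approaches |g''(z)|/2 = 0.25
    fprintf('%d  linear ratio %.4f   quadratic ratio %.4f\n', k, r1, r2);
end
```

The first ratio drifts slowly toward $4/5$, while the second settles near $1/4$, consistent with orders 1 and 2 in Theorem 2.6.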
Example 2.10
Let
\[ g(x) = \frac{3}{8}x^4 - 4. \]
There is a unique positive fixed point $z = 2$. However, $g'(x) = \frac{3}{2}x^3$, so $g'(2) = 12$,
and we cannot conclude linear convergence. Indeed, the fixed point iterations
converge only if $x_0 = 2$. If $x_0 > 2$, then $x_1 > x_0 > 2$, $x_2 > x_1 > x_0 > 2$, $\ldots$.
Similarly, if $x_0 < 2$, it can be verified that, for some $k$, $x_k < 0$, after which
$x_{k+1} > 2$, and we are in the same situation as if $x_0 > 2$. That is, fixed point
iterations diverge unless $x_0 = 2$.
Example 2.11
Consider again $g$ from Example 2.8. Starting with $x_0 = 2.2$, how many
iterations would be required to obtain the fixed point $z$ with $|x_k - z| < 10^{-16}$?
Can this number of iterations be computed before actually doing
the iterations?
We can use the bound
\[ |x_k - z| \le \frac{L^k}{1 - L}\,|x_1 - x_0| \]
from the Contraction Mapping Theorem (Theorem 2.3, on page 49). The mean
value theorem gives
\[ x_{k+1} - z = g'(c_k)(x_k - z), \]
but the smallest bound we know on $|g'(c_k)|$ (and hence the smallest $L$ in the
formula) is $L = 0.92$. We also compute $x_1 = \left((2.2)^2 + 6\right)/5 = 2.168$, and
$|x_1 - x_0| = 0.032$.
Therefore,
\[ |x_k - z| \le \frac{0.92^k}{1 - 0.92}\,0.032 = 0.4\,(0.92)^k. \]
Solving
\[ 0.4\,(0.92)^k < 10^{-16} \]
for $k$ gives
\[ k > \frac{\log(2.5 \times 10^{-16})}{\log(0.92)} \approx 430.9. \]
Thus, 431 iterations would suffice to achieve, roughly, IEEE double precision
accuracy.
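The a priori estimate can also be computed directly by solving $\frac{L^k}{1-L}|x_1 - x_0| < \text{tol}$ for $k$. The sketch below does this for the data of this example; the variable names are chosen here for illustration. It then verifies that the computed $k$ is the smallest integer satisfying the inequality.

```matlab
% A priori iteration count from the bound L^k/(1-L)*|x1 - x0| < tol.
g = @(x) (x.^2 + 6)/5;      % g from Example 2.8
L = 0.92;                   % Lipschitz constant on [1, 2.3]
x0 = 2.2;
d = abs(g(x0) - x0);        % |x_1 - x_0|
tol = 1e-16;
% L^k < tol*(1-L)/d; dividing by log(L) < 0 reverses the inequality.
k = ceil(log(tol*(1 - L)/d)/log(L));
fprintf('|x1 - x0| = %.3f, iterations needed by the bound: %d\n', d, k);
```

Since the bound uses the worst-case value $L = 0.92$ over all of $G$, it is typically pessimistic; the iterates themselves contract at a rate closer to $g'(2) = 0.8$ near the fixed point.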
2.3 Newton’s Method (Newton-Raphson Method)
We now return to the problem: given $f(x)$, find $z$ such that $f(z) = 0$.
Newton’s iteration for finding approximate solutions to this problem has the
form
\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \quad \text{for } k = 0, 1, 2, \ldots. \tag{2.4} \]
REMARK 2.2 Newton’s method is a special fixed-point method with
$g(x) = x - f(x)/f'(x)$.
Figure 2.5 illustrates the geometric interpretation of Newton’s method. To
find $x_{k+1}$, the tangent line to the curve at the point $(x_k, f(x_k))$ is followed to
the $x$-axis. The tangent line is $y - f(x_k) = f'(x_k)(x - x_k)$. Thus, at $y = 0$,
$x = x_k - f(x_k)/f'(x_k) = x_{k+1}$.
Newton’s method is quadratically convergent, and is therefore fast when
compared to a typical linearly convergent fixed point method. However,
Newton’s method may diverge if $x_0$ is not sufficiently close to a root $z$ at which
$f(z) = 0$. To see this, study Figure 2.6.
Another conceptually useful way of deriving Newton’s method is using
Taylor’s formula. We have
\[ 0 = f(z) = f(x_k) + f'(x_k)(z - x_k) + \frac{(z - x_k)^2}{2} f''(\xi_k), \]
where $\xi_k$ is between $z$ and $x_k$. Thus, assuming that $(z - x_k)^2$ is small,
\[ z \approx x_k - \frac{f(x_k)}{f'(x_k)}. \]
Hence, when $x_{k+1} = x_k - f(x_k)/f'(x_k)$, we would expect $x_{k+1}$ to be closer to
$z$ than $x_k$.
FIGURE 2.5: Illustration of two iterations of Newton’s method.
FIGURE 2.6: Examples of divergence of Newton’s method. On the left,
the sequence diverges; on the right, the sequence oscillates.
The quadratic convergence rate of Newton’s method can be inferred from
Theorem 2.6 by analyzing Newton’s method as a fixed point iteration.
Consider
\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = g(x_k). \]
Observe that $g(z) = z$ and, provided $f'(z) \ne 0$,
\[ g'(z) = 1 - \frac{f'(z)}{f'(z)} + \frac{f(z) f''(z)}{(f'(z))^2} = 0, \]
since $f(z) = 0$, and, usually, $g''(z) \ne 0$. Thus, the quadratic convergence follows
from Theorem 2.6.
Example 2.12
Let $f(x) = x + e^x$. Compare bisection, simple fixed-point iteration, and
Newton’s method.
• Newton’s method:
\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = x_k - \frac{x_k + e^{x_k}}{1 + e^{x_k}} = \frac{(x_k - 1)e^{x_k}}{1 + e^{x_k}}. \]
• Fixed-Point (one form): $x_{k+1} = -e^{x_k} = g(x_k)$.
  $k$    $x_k$ (Bisection,      $x_k$ (Fixed-Point)    $x_k$ (Newton’s)
         $a = -1$, $b = 0$)
   0    -0.5                 -1.0                  -1.0
   1    -0.75                -0.367879             -0.537883
   2    -0.625               -0.692201             -0.566987
   3    -0.5625              -0.500474             -0.567143
   4    -0.59375             -0.606244             -0.567143
   5    -0.578125            -0.545396             -0.567143
  10    -0.566895            -0.568429             -0.567143
  20    -0.567143            -0.567148             -0.567143
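The first rows of this comparison can be regenerated with a short script. The sketch below is one possible arrangement, assuming the bracket $[-1, 0]$ for bisection and the starting value $-1$ for the other two methods, as in the table; the variable names are illustrative.

```matlab
% Bisection, fixed point iteration, and Newton's method for f(x) = x + exp(x).
f  = @(x) x + exp(x);
fp = @(x) 1 + exp(x);      % derivative of f
a = -1;  b = 0;            % bisection bracket, f(a) < 0 < f(b)
xf = -1;                   % fixed point iterate, g(x) = -exp(x)
xn = -1;                   % Newton iterate
for k = 0:5
    xb = (a + b)/2;        % bisection midpoint
    fprintf('%2d  %10.6f  %10.6f  %10.6f\n', k, xb, xf, xn);
    if f(a)*f(xb) <= 0     % root lies in [a, xb]
        b = xb;
    else                   % root lies in [xb, b]
        a = xb;
    end
    xf = -exp(xf);
    xn = xn - f(xn)/fp(xn);
end
```

Each printed row shows the three iterates for the same $k$, so the slow, steady halving of bisection can be compared with the rapid settling of the Newton column.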
2.4 The Univariate Interval Newton Method
A simple application of the ideas behind Newton’s method and the Mean
Value Theorem leads to a mathematically rigorous computation of the zeros
of a function $f$. In particular, suppose $\boldsymbol{x} = [\underline{x}, \overline{x}]$ is an interval, and suppose
that there is a $z \in \boldsymbol{x}$ with $f(z) = 0$. Let $\check{x}$ be any point in $\boldsymbol{x}$ (such as the midpoint
of $\boldsymbol{x}$). Then the Mean Value Theorem (page 5) gives, for some $\xi$ between $\check{x}$ and $z$,
\[ 0 = f(\check{x}) + f'(\xi)(z - \check{x}). \tag{2.5} \]
Solving (2.5) for $z$, then applying the fundamental theorem of interval
arithmetic (page 27) gives
\[ z = \check{x} - \frac{f(\check{x})}{f'(\xi)} \in \check{x} - \frac{f(\check{x})}{\boldsymbol{f}'(\boldsymbol{x})} = N(f; \boldsymbol{x}, \check{x}). \tag{2.6} \]
We thus have the following.
THEOREM 2.7
Any solution $z \in \boldsymbol{x}$ of $f(x) = 0$ must also be in $N(f; \boldsymbol{x}, \check{x})$.
We call $N(f; \boldsymbol{x}, \check{x})$ the univariate interval Newton operator.
The interval Newton operator forms the basis of a fixed-point type of
iteration of the form
\[ \boldsymbol{x}_{k+1} \leftarrow N(f; \boldsymbol{x}_k, \check{x}_k) \quad \text{for } k = 1, 2, \ldots. \]
The interval Newton method is similar in many ways to the traditional
Newton–Raphson method of Section 2.3 (page 54), but provides a way to use
ﬂoating point arithmetic (with upward and downward roundings) to provide
rigorous upper and lower bounds on exact solutions. We now discuss existence
and uniqueness properties of the interval Newton method. In addition to
providing bounds on any solutions within a given region, the interval Newton
method has the following property.
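THEOREM 2.8
Suppose $f \in C(\boldsymbol{x}) = C([\underline{x}, \overline{x}])$, $\check{x} \in \boldsymbol{x}$, and $N(f; \boldsymbol{x}, \check{x}) \subseteq \boldsymbol{x}$. Then there is an
$x^* \in \boldsymbol{x}$ such that $f(x^*) = 0$. Furthermore, this $x^*$ is unique.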
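A formal algorithm for the interval Newton method is as follows.
ALGORITHM 2.2
(The univariate interval Newton method)
INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$, $f : \boldsymbol{x} \subset \mathbb{R} \to \mathbb{R}$, a maximum number of iterations $N$, and
a stopping tolerance $\epsilon$.
OUTPUT: Either
1. “solution does not exist within the original $\boldsymbol{x}$”, or
2. a new interval $\boldsymbol{x}^*$ such that any $x^* \in \boldsymbol{x}$ with $f(x^*) = 0$ has $x^* \in \boldsymbol{x}^*$,
and one of:
(a) “existence and uniqueness verified” and “tolerance met,”
(b) “existence and uniqueness not verified,”
(c) “solution does not exist,” or
(d) “existence and uniqueness verified” but “tolerance not met.”
1. $k \leftarrow 1$.
2. “existence and uniqueness verified” ← “false.”
3. “solution does not exist” ← “false.”
4. DO WHILE $k \le N$.
(a) $\check{x} \leftarrow (\underline{x} + \overline{x})/2$.
(b) IF $\check{x} \notin \boldsymbol{x}$ THEN RETURN.
(c) $\tilde{\boldsymbol{x}} \leftarrow N(f; \boldsymbol{x}, \check{x})$.
(d) IF $\tilde{\boldsymbol{x}} \subseteq \boldsymbol{x}$ (that is, if $\underline{\tilde{x}} \ge \underline{x}$ and $\overline{\tilde{x}} \le \overline{x}$) THEN
“existence and uniqueness verified” ← “true.”
(e) IF $\tilde{\boldsymbol{x}} \cap \boldsymbol{x} = \emptyset$ (that is, if $\overline{\tilde{x}} < \underline{x}$ or $\underline{\tilde{x}} > \overline{x}$) THEN
i. “solution does not exist” ← “true.”
ii. RETURN.
(f) IF $w(\tilde{\boldsymbol{x}}) < \epsilon$ THEN
i. $\boldsymbol{x}^* \leftarrow \tilde{\boldsymbol{x}}$.
ii. “tolerance met” ← “true.”
iii. RETURN.
END IF
(g) $\boldsymbol{x} \leftarrow \boldsymbol{x} \cap \tilde{\boldsymbol{x}}$.
(h) $k \leftarrow k + 1$.
END DO
5. “tolerance met” ← “false.”
6. RETURN.
END ALGORITHM 2.2.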
Notes:
1. The interval Newton method generally becomes stationary. (That is, the
end points of x can be proven to not change, under certain assumptions
on the machine arithmetic.) However, it is good general programming
practice to enforce an upper limit on the total number of iterations of
any iterative process, to avoid problems arising from slow convergence,
etc.
2. In Step 4a of Algorithm 2.2, the midpoint is computed approximately,
and it occasionally occurs (when the interval is very narrow), that the
machine approximation lies outside the interval. Thus, we need to check
for this possibility.
3. Although $f$ is evaluated at a point in the expression
\[ \check{x} - \frac{f(\check{x})}{\boldsymbol{f}'(\boldsymbol{x})} \]
for $N(f; \boldsymbol{x}, \check{x})$, the machine must evaluate $f$ with interval arithmetic to
take account of rounding error. (That is, we start the computations with
the degenerate interval $[\check{x}, \check{x}]$.) Otherwise, the results are not
mathematically rigorous.
Similar to the traditional Newton–Raphson method, the interval Newton
method exhibits quadratic convergence. (This is common knowledge.) An
example of a speciﬁc theorem along these lines is
THEOREM 2.9
(Quadratic convergence of the interval Newton method) Suppose $f : \boldsymbol{x} \to \mathbb{R}$,
suppose $f \in C(\boldsymbol{x})$ and $f' \in C(\boldsymbol{x})$, and suppose there is an $x^* \in \boldsymbol{x}$ such that
$f(x^*) = 0$. Suppose further that $\boldsymbol{f}'$ is a first order or higher order interval
extension of $f'$ in the sense of Theorem 1.9 (on page 28). Then, for the initial
width $w(\boldsymbol{x})$ sufficiently small,
\[ w(N(f; \boldsymbol{x}, \check{x})) = O\left(w(\boldsymbol{x})^2\right). \]
We will not give a proof of Theorem 2.9 here, although Theorem 2.9 is a
special case of Theorem 6.3, page 222 in [20]. We will illustrate this quadratic
convergence with
Example 2.13
(Taken from [22].) Apply the interval Newton method $\boldsymbol{x} \leftarrow N(f; \boldsymbol{x}, \check{x})$,
$\check{x} \leftarrow \mathrm{fl}((\underline{x} + \overline{x})/2)$, to
\[ f(x) = x^2 - 2, \]
starting with $\boldsymbol{x} = [1, 2]$ and $\check{x} = 1.5$. The results for Example 2.13 appear in
Table 2.1. Here,
\[ \delta_k = \frac{w(\boldsymbol{x}_k)}{\max\left\{\max_{y \in \boldsymbol{x}_k}\{|y|\},\, 1\right\}} \]
is a scaled version of the width $w(\boldsymbol{x}_k)$, and $\rho_k = \max_{y \in \boldsymbol{f}(\boldsymbol{x}_k)}\{|y|\}$. The
displayed decimal intervals have been rounded out from the corresponding binary
intervals.
TABLE 2.1: Convergence of the interval Newton method with
$f(x) = x^2 - 2$.
  $k$   $\boldsymbol{x}_k$                                      $\delta_k$                 $\rho_k$
  0    [1.00000000000000, 2.00000000000000]    $5.00 \times 10^{-1}$     $2.00 \times 10^{0}$
  1    [1.37499999999999, 1.43750000000001]    $4.35 \times 10^{-2}$     $1.09 \times 10^{-1}$
  2    [1.41406249999999, 1.41441761363637]    $2.51 \times 10^{-4}$     $5.77 \times 10^{-4}$
  3    [1.41421355929452, 1.41421356594718]    $4.70 \times 10^{-9}$     $1.01 \times 10^{-8}$
  4    [1.41421356237309, 1.41421356237310]    $4.71 \times 10^{-16}$    $1.33 \times 10^{-15}$
  5    [1.41421356237309, 1.41421356237310]    $4.71 \times 10^{-16}$    $1.33 \times 10^{-15}$
  6    [1.41421356237309, 1.41421356237310]    $4.71 \times 10^{-16}$    $1.33 \times 10^{-15}$
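The mechanics of the interval Newton iteration for $f(x) = x^2 - 2$ on $[1, 2]$ can be sketched without an interval toolbox, by storing the end points in two ordinary variables. This is an illustration only and is not mathematically rigorous: it uses ordinary floating point arithmetic with no outward rounding, so it does not give guaranteed bounds the way intlab would. It also relies on $f'(x) = 2x$ being increasing and positive on the interval, so that $\boldsymbol{f}'(\boldsymbol{x}) = [2\underline{x}, 2\overline{x}]$ does not contain zero. The width tolerance is an arbitrary choice made here.

```matlab
% Non-rigorous sketch of the interval Newton method for f(x) = x^2 - 2.
% The interval x = [lo, hi] is kept as two doubles; there is NO outward
% rounding, so the computed end points are not guaranteed bounds.
lo = 1;  hi = 2;
for k = 1:10
    m  = (lo + hi)/2;                 % midpoint (the point "x check")
    fm = m^2 - 2;                     % f at the midpoint
    dlo = 2*lo;  dhi = 2*hi;          % f'(x) = [2*lo, 2*hi], with lo > 0
    q = [fm/dlo, fm/dhi];             % end points of fm / f'(x)
    nlo = m - max(q);                 % N(f; x, m) = m - fm / f'(x)
    nhi = m - min(q);
    lo = max(lo, nlo);                % x <- x intersect N(f; x, m)
    hi = min(hi, nhi);
    fprintf('%d  [%.14f, %.14f]\n', k, lo, hi);
    if hi - lo < 1e-12                % stop before rounding error dominates
        break
    end
end
```

The loop is stopped once the width is tiny because, without directed rounding, further steps can produce an interval that excludes the root or is empty. That failure mode is exactly what the outward rounding in a rigorous implementation prevents.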
2.5 The Secant Method
Under certain circumstances, $f$ may have a continuous derivative, but it
may not be possible to explicitly compute it. This is less true now than in
the past, because techniques of automatic differentiation (or “computational
differentiation”), such as we explain in Section 6.2, page 215, have been
developed, have become more widely available, and are used in practice. However,
there are still various situations involving black box functions $f$. In “black
box” functions, $f$ is evaluated by some external procedure (such as a software
system provided by someone other than its user), in which one supplies the
input $x$, and the output $f(x)$ is returned, but the user (or the designer of the
method for finding points $x^*$ with $f(x^*) = 0$) does not have access to the internal
workings, so that $f'$ cannot be easily computed. In such cases, methods that
converge more rapidly than the method of bisection, but that do not require
evaluation of $f'$, are useful.
Example 2.14
Suppose we wish to find a zero of
\[ f(x) = e^{-ax} - g\left(\cos x + \frac{a}{x}\sin x + \ln x\right), \]
where
\[ g(x) = \frac{1}{1 + x^2} + 3x^2 + 5x + h(5 + e^x + \cos x), \]
\[ h(x) = e^{2x^2}(1 + x + x^2), \]
and $a$ is a constant.
Problems as complicated as this are not uncommon. Prior to widespread
use of automatic differentiation, applying Newton’s method to this problem
was quite difficult because it would have been difficult and time-consuming
to calculate $f'(x_k)$ at each iteration. Automatic differentiation is now an
option for many problems of this type. However, in certain situations, such
as applying the shooting method to solution of boundary-value problems (see
the discussion in Chapter 10), $f'$ cannot be directly computed and the secant
method is useful. In this section, we will assume that $f'$ cannot be computed,
and we will treat $f$ as a “black-box” function.
In the secant method, $f'(x_k)$ is approximated by
\[ f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}. \]
The secant method thus has the form
\[ x_{k+1} = x_k - f(x_k)\left(\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}\right). \tag{2.7} \]
If $f(x_k)$ and $f(x_{k+1})$ have opposite signs, then, as with the bisection method,
there must be an $x^*$ between $x_k$ and $x_{k+1}$ for which $f(x^*) = 0$.
For the secant method, we need starting values $x_0$ and $x_1$. However, only
one evaluation of the function $f$ is required at each iteration, since $f(x_{k-1})$ is
known from the previous iteration.
Geometrically (see Figure 2.7), to obtain $x_{k+1}$, the secant to the curve
through $(x_{k-1}, f(x_{k-1}))$ and $(x_k, f(x_k))$ is followed to the $x$-axis.
FIGURE 2.7: Geometric interpretation of the secant method.
Interestingly, the convergence rate of the secant method is faster than linear
but slower than quadratic.
THEOREM 2.10
(Convergence of the secant method) Let $G$ be a subset of $\mathbb{R}$ containing a zero
$z$ of $f(x)$. Assume $f \in C^2(G)$ and there exists an $M \ge 0$ such that
\[ M = \frac{\max_{x \in G} |f''(x)|}{2 \min_{x \in G} |f'(x)|}. \]
Let $x_0$ and $x_1$ be two initial guesses to $z$ and let
\[ K_\epsilon(z) = (z - \epsilon, z + \epsilon) \subseteq G, \]
where $\epsilon = \delta/M$ and $\delta < 1$. Let $x_0, x_1 \in K_\epsilon(z)$. Then, the iterates $x_2, x_3, x_4, \ldots$
remain in $K_\epsilon(z)$ and converge to $z$ with error
\[ |x_k - z| \le \frac{1}{M}\,\delta^{\left(\frac{1 + \sqrt{5}}{2}\right)^k}. \]
Note that $(1 + \sqrt{5})/2 \approx 1.618$, a fractional order of convergence between 1
and 2. For Newton’s method, $|x_k - z| \le q^{2^k}$ with $q < 1$.
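As an illustration of the secant iteration, here is a minimal matlab sketch. The function $f(x) = x + e^x$ from Example 2.12 stands in for a black-box function, and the starting values, tolerance, and iteration cap are assumptions made for this illustration. Note that each pass through the loop evaluates $f$ only once.

```matlab
% Secant method sketch; f is treated as a black box (no derivative used).
f = @(x) x + exp(x);       % stand-in for a black-box function
x0 = -1;  x1 = 0;          % two starting values (illustrative choice)
f0 = f(x0);  f1 = f(x1);
tol = 1e-12;               % stop when |f(x_k)| < tol (assumption)
maxit = 50;                % iteration cap (assumption)
for k = 1:maxit
    if f1 == f0
        break              % horizontal secant line: cannot continue
    end
    x2 = x1 - f1*(x1 - x0)/(f1 - f0);   % secant update
    x0 = x1;  f0 = f1;                  % reuse the previous value of f
    x1 = x2;  f1 = f(x1);               % the only new evaluation of f
    fprintf('%2d  %.12f  %.2e\n', k, x1, f1);
    if abs(f1) < tol
        break
    end
end
```

The printed residuals shrink faster than those of a linearly convergent method but more slowly than Newton's, in line with the order of about 1.618 stated in the theorem.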
2.6 Software
The matlab function bisect_method we presented, as well as a matlab
function for Newton’s method, are available from the web page for the
graduate version of this book, namely at
http://interval.louisiana.edu/Classical-and-Modern-NA/
A step of the interval Newton method is implemented with the matlab
function i_newton_step_no_fp.m, explained in [26], and available from the web
page for the book, at http://www.siam.org/books/ot110. This function
uses intlab (see page 35) for the interval arithmetic and for automatically
computing derivatives of $f$.
Additional techniques for root-ﬁnding, such as ﬁnding complex roots and
ﬁnding all roots of polynomials, appear in the graduate version of this book
[1]. One common computation is ﬁnding all of the roots of a polynomial
equation p(x) = 0. The matlab function roots accepts an array containing
the coeﬃcients of the polynomial, and returns the roots of that polynomial.
For example, we might have the following matlab dialog.
>> c = [1,1,1]
c =
1 1 1
>> r = roots(c)
r =
-0.5000 + 0.8660i
-0.5000 - 0.8660i
>> c = [1 5 4]
c =
1 5 4
>> r = roots(c)
r =
-4
-1
>>
The first computation computes approximations to the roots of the polynomial
$p(x) = x^2 + x + 1$, namely, approximations to $-1/2 \pm (\sqrt{3}/2)i$, while the second
computation computes the roots of the polynomial $p(x) = x^2 + 5x + 4$, namely
$x = -4$ and $x = -1$.
matlab also contains a function fzero for ﬁnding zeros of more general
equations f(x) = 0. A matlab dialog with its use is
>> x = fzero(@(x) exp(x)+x,-0.5)
x =
-0.5671
>>
(Compare this with Example 2.1.) Various examples, as well as explanations
of the underlying algorithms, are available within the matlab help system
for roots and fzero.
NETLIB (at http://www.netlib.org/) contains various software packages
in Fortran, C, etc. for computing roots of polynomial equations and other
equations.
Software for finding verified bounds on all solutions is also available. See
[26] for an introduction to some of the techniques. intlab has the function
verifypoly for finding certified bounds on the roots of polynomials. There
is also a general function verifynlss in intlab for finding verified bounds on
solutions of nonlinear systems of equations.
Generally, ﬁnding a root of an equation is a computation done as part
of an overall modeling or simulation process. It is usually advantageous to
use polished programs within the chosen system (matlab, a programming
language such as Fortran or C++, etc.). However, in developing specialized
packages, if the function f has special properties, one can take advantage of
these. It may also be eﬃcient in certain cases to program directly the simple
methods described in this chapter, if one is certain of their convergence in the
context of their use.
2.7 Applications
The problem of ﬁnding x such that f(x) = 0 arises frequently when trying
to solve for equilibrium solutions (constant solutions) of diﬀerential equation
models in many ﬁelds including biology, engineering and physics. Here, we
focus on a model from population biology. To this end, consider the general
population model given by
\[ \frac{dx}{dt} = f(x)\,x = (b(x) - d(x))\,x. \tag{2.8} \]
Here, $x(t)$ is the population density at time $t$. The function $b(x)$ is the
density-dependent birth rate and $d(x)$ is the density-dependent death rate. Thus, $f(x)$
is the density-dependent growth rate of the population. An important problem
in population biology is analyzing the solution behavior of such dynamical
models. A first step in such analysis is often finding the equilibrium solutions.
Clearly, these solutions satisfy $dx/dt = 0$, which is equivalent to either $x = 0$
(the trivial solution, usually referred to as the extinction equilibrium) or
$f(x) = 0$ (values $x$ at which the growth rate equals zero).
To focus on a concrete example, assume the birth rate is of Ricker type,
$b(x) = e^{-x}$, and the mortality rate is a linear function given by $d(x) = 2x$.
This implies that the growth rate is given by $f(x) = e^{-x} - 2x$. To find the
unique positive equilibrium we need to solve the equation $e^{-x} - 2x = 0$. Using
Newton’s method, as implemented in the following matlab function,
function [x_star,success] = newton (x0, f, f_prime, eps, maxitr)
%
% [x_star,success] = newton(x0,f,f_prime,eps,maxitr)
% does iterations of Newton's method for a single variable,
% using x0 as initial guess, f (a function handle, or a character
% string giving an m-file name) as function, and f_prime (also a
% function handle or m-file name) as the derivative of f.
% For example, suppose an m-file xsqm2.m is available in Matlab's working
% directory, with the following contents:
% function [y] = xsqm2(x)
% y = x^2-2;
% return
% and an m-file xsqm2_prime.m is also available, with the following
% contents:
% function [y] = xsqm2_prime(x)
% y = 2*x;
% return
% Then, issuing
% [x_star,success] = newton(1.5, 'xsqm2', 'xsqm2_prime', 1e-10, 20)
% from Matlab's command window will cause an approximation to the square
% root of 2 to be stored in x_star.
% iteration stops successfully if |f(x)| < eps, and iteration
% stops unsuccessfully if maxitr iterations have been done
% without stopping successfully or if a zero derivative
% is encountered.
% On return:
% success = 1 if iteration stopped successfully, and
% success = 0 if iteration stopped unsuccessfully.
% x_star is set to the approximate solution to f(x) = 0
% if iteration stopped successfully, and x_star
% is set to x0 otherwise.
success = 0;
x = x0;
for i=1:maxitr
    fval = feval(f,x);
    if abs(fval) < eps
        % converged: print the final iterate and return it
        success = 1;
        disp(sprintf(' %10.0f %15.9f %15.9f ', i, x, fval));
        x_star = x;
        return;
    end
    fpval = feval(f_prime,x);
    if fpval == 0
        % zero derivative: stop unsuccessfully instead of dividing by zero
        x_star = x0;
        return;
    end
    disp(sprintf(' %10.0f %15.9f %15.9f ', i, x, fval));
    x = x - fval / fpval;
end
% maxitr iterations were done without meeting the tolerance
x_star = x0;
and the following matlab dialog
>> y = @(x) exp(-x)-2*x;
>> yp = @(x) -exp(-x)-2;
>> [x_star,success] = newton(0,y,yp,1e-10,40)
we obtain the following table of iterations for the solution
1 0.000000000
2 0.333333333
3 0.351689332
4 0.351733711
5 0.351733711
Thus, x = 0.351733711 is the unique positive equilibrium of this model.
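As an independent cross-check of this equilibrium, one can hand the same equation to matlab's built-in fzero. The sketch below assumes the bracket $[0, 1]$, which is valid because $f(0) = 1 > 0$ and $f(1) = e^{-1} - 2 < 0$; the variable names are illustrative.

```matlab
% Cross-check of the positive equilibrium of dx/dt = (exp(-x) - 2x) x.
f = @(x) exp(-x) - 2*x;        % growth rate f(x) = b(x) - d(x)
x_eq = fzero(f, [0, 1]);       % f changes sign on [0, 1]
residual = f(x_eq);            % should be close to zero
fprintf('equilibrium %.9f, residual %.2e\n', x_eq, residual);
```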
2.8 Exercises
1. Consider the method of bisection applied to f(x) = arctan(x), with
initial interval x = [−4.9, 5.1].
(a) Are the hypotheses under which the method of bisection converges
valid? If so, then how many iterations would it take to obtain the
solution to within an absolute error of $10^{-2}$?
(b) Apply Algorithm 2.1 with pencil and paper, until k = 5, arranging
your computations carefully so you gain some intuition into the
process.
2. Let f and x be as in Problem 1.
(a) Modify bisect_method so it prints $a_k$, $b_k$, $f(a_k)$, $f(b_k)$, and $f(x_k)$
for each step, so you can see what is happening. Hint: Deleting
the semicolon from the end of a matlab statement causes the value
assigned to the left of the statement to be printed, while a statement
consisting only of a variable name causes the value of that variable to
be printed. If you want more neatly printed quantities, study the
matlab functions disp and sprintf.
(b) Try to solve $f(x) = 0$ with $\epsilon = 10^{-2}$, $\epsilon = 10^{-4}$, $\epsilon = 10^{-8}$, $\epsilon = 10^{-16}$,
$\epsilon = 10^{-32}$, $\epsilon = 10^{-64}$, and $\epsilon = 10^{-128}$.
i. For each $\epsilon$, compute the $k$ at which the algorithm should stop.
ii. What behavior do you actually observe in the algorithm? Can
you explain this behavior?
3. Repeat Problem 2, but with $f(x) = x^2 - 2$ and initial interval $\boldsymbol{x} = [1, 2]$.
4. Use the program for the bisection method in Problem 2 to find an
approximation to $1000^{1/4}$ which is correct to within $10^{-5}$.
5. Consider $g(x) = x - \arctan(x)$.
(a) Perform 10 iterations of the fixed point method $x_{k+1} = g(x_k)$,
starting with $x = 5$, $x = -5$, $x = 1$, $x = -1$, and $x = 0.1$.
(b) What do you observe for the different starting points? What is $|g'|$
at each starting point, and how might this relate to the behavior
you observe?
6. It is desired to find the positive real root of the equation $x^3 + x^2 - 1 = 0$.
(a) Find an interval $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and a suitable fixed point iteration
function $g(x)$ to accomplish this. Verify all conditions of the contraction
mapping theorem.
(b) Find the minimum number of iterations $n$ needed so that the
absolute error in the $n$-th approximation to the root is at most $10^{-4}$.
Also, use the fixed-point iteration method (with the $g$ you
determined in part (a)) to determine this positive real root accurate to
within $10^{-4}$.
7. Find an approximation to $1000^{1/4}$ correct to within $10^{-5}$ using the fixed
point iteration method.
8. Consider $f(x) = \arctan(x)$. This function has a unique zero $z = 0$.
(a) Use a digital computer with double precision arithmetic to do
iterations of Newton’s method, starting with $x_0 = 0.5$, 1.0, 1.3, 1.4,
1.35, 1.375, 1.3875, 1.39375, 1.390625, 1.3921875. Iterate until one
of the following occurs:
• $|f(x)| \le 10^{-10}$,
• an operation exception occurs, or
• 20 iterations are completed.
(i) Describe the behavior you observe.
(ii) Explain the behavior you observe in terms of the graph of $f$.
(iii) Evidently, there is a point $p$ such that, if $x_0 > p$, then
Newton’s method diverges, and if $x_0 < p$, then Newton’s method
converges.
($\alpha$) What would happen if $x_0 = p$ exactly? Illustrate what
would happen on a graph of $f$.
($\beta$) Do you think we could choose $x_0 = p$ exactly in practice?
9. Let $f(x) = x^2 - a$.
(a) Write down and simplify the Newton’s method iteration equation
for $f(x) = 0$.
(b) For $a = 2$, form a table of 15 iterations of Newton’s method,
starting with $x_0 = 2$, $x_0 = 4$, $x_0 = 8$, $x_0 = 16$, $x_0 = 32$, and $x_0 = 64$.
(c) Explain your results in terms of the shape of the graph of $f$ and in
terms of the convergence theory in this section.
(d) Compare your analysis here to the analysis in Example 2.9 on
page 52.
10. Hint: The free intlab toolbox, mentioned on page 35, is recommended
for this problem.
(a) Let $f$ be as in Problem 8 of this set. Experiment with the interval
Newton method for this problem, and with various intervals that
contain zero. Try some intervals of the form $[-a, a]$ (with $\check{x} = 0$)
and other intervals of the form $[-a, b]$, $a > 0$, $b > 0$ and $a \ne b$.
Explain what you have found.
(b) Use the interval Newton method to prove that there exists a unique
solution to $f(x) = 0$ for $x \in [-1, 0]$, where $f(x) = x + e^x$.
(c) Iterate the interval Newton method to find as narrow bounds as
possible on the solution proven to exist in part 10b.
11. Repeat Exercise 8a, page 66, but with the secant method instead of
Newton’s method. (Use pairs of starting points $\{0.5, 1.0\}$, $\{1.0, 1.3\}$,
etc.)
12. Do three steps of Newton’s method, using complex arithmetic, for the
function $f(z) = z^2 + 1$, with starting guess $z_0 = 0.2 + 0.7i$. Although
you may use a computer program, you should show intermediate
results, including $z_k$, $f(z_k)$, and $f'(z_k)$. (Note: Newton’s method with
complex arithmetic can be viewed as a multivariate Newton method in
two variables; see Exercise 5 on page 324, in Section 8.2.)
Chapter 3
Linear Systems of Equations
The solution of linear systems of equations is an extremely important pro-
cess in scientiﬁc computing. Linear systems of equations directly serve as
mathematical models in many situations, while solution of linear systems of
equations is an important intermediary computation in the analysis of other
models, such as nonlinear systems of diﬀerential equations.
Example 3.1
Find $x_1$, $x_2$, and $x_3$ such that
\[
\begin{aligned}
x_1 + 2x_2 + 3x_3 &= -1, \\
4x_1 + 5x_2 + 6x_3 &= 0, \\
7x_1 + 8x_2 + 10x_3 &= 1.
\end{aligned}
\]
This chapter deals with the analysis and approximate solution of such
systems of equations with floating point arithmetic. We will study two direct
methods, Gaussian elimination (the LU decomposition) and the QR
decomposition, as well as iterative methods, such as the Gauss–Seidel method, for
solving such systems. (Computations in direct methods finish with a finite
number of operations, while iterative methods involve a limiting process, as
fixed point iteration does.) We will also study the singular value
decomposition, a powerful technique for obtaining information about linear systems of
equations, the mathematical models that give rise to such linear systems, and
the effects of errors in the data on the solution of such systems.
The process of dealing with linear systems of equations comprises the sub-
ject numerical linear algebra. Before studying the actual solution of linear
systems, we introduce (or review) underlying commonly used notation and
facts.
3.1 Matrices, Vectors, and Basic Properties
The coefficients of $x_1$, $x_2$, and $x_3$ in Example 3.1 can be written as an array
of numbers
\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{pmatrix}, \]
which we call a matrix. The horizontal lines of numbers are the rows, while
the vertical lines are the columns. In the example, we say “A is a 3 by 3
matrix,” meaning that it has 3 rows and 3 columns. (If a matrix B had two
rows and 5 columns, for example, we would say “B is a 2 by 5 matrix.”)
In numerical linear algebra, the variables $x_1$, $x_2$, and $x_3$ as in Example 3.1
are typically represented in a matrix
\[ x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} \]
with 3 rows and 1 column, as is the set of right members of the equations:
\[ b = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}; \]
$x$ and $b$ are called column vectors.
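To make the notation concrete, the matrix $A$ and the vector $b$ of Example 3.1 can be entered in matlab and the system solved with the built-in backslash operator. This is only a preview sketch: the backslash operator is used here as a black box, and eliminating the unknowns by hand gives $x = (5/3, -4/3, 0)^T$, which the computed result can be compared against. The residual $Ax - b$ is also formed as a check.

```matlab
% Example 3.1 in matrix form: solve A x = b and check the residual.
A = [1 2 3; 4 5 6; 7 8 10];   % coefficient matrix
b = [-1; 0; 1];               % right members, as a column vector
x = A\b;                      % matlab's built-in linear solve
r = A*x - b;                  % residual; should be near zero
disp(x');
fprintf('norm of residual: %.2e\n', norm(r));
```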
Often, the system of linear equations will have real coefficients, but the
coefficients will sometimes be complex. If the system has n variables in the
column vector x, and the variables are assumed to be real, we say that $x \in \mathbb{R}^n$.
If B is an m by n matrix whose entries are real numbers, we say $B \in \mathbb{R}^{m \times n}$.
If vector x has n complex coefficients, we say $x \in \mathbb{C}^n$, and if an m by n matrix
B has complex coefficients, we say $B \in \mathbb{C}^{m \times n}$.
Systems such as in Example 3.1 can be written using matrices and vectors,
with the concept of matrix multiplication. We use upper case letters to denote
matrices and lower case letters without subscripts to denote vectors (which we
consider to be column vectors). We denote the element in the i-th row, j-th
column of a matrix A by $a_{ij}$, and we sometimes denote the entire matrix A
by $(a_{ij})$. The numbers that comprise the elements of matrices and vectors are
called scalars.
DEFINITION 3.1 If $A = (a_{ij})$, then $A^T = (a_{ji})$ and $A^H = (\overline{a}_{ji})$ denote
the transpose and conjugate transpose of A, respectively.
Example 3.2
On most computers in use today, the basic quantity in matlab is a matrix
whose entries are double precision ﬂoating point numbers according to the
IEEE 754 standard. Matrices are marked with square brackets “[” and “]”,
with commas or spaces separating the entries in a row and semicolons or the
end of a line separating the rows. The transpose of a matrix is obtained by
typing a single quotation mark (or apostrophe) after the matrix. Consider
the following matlab dialog.
>> A = [1 2 3;4 5 6;7 8 10]
A =
1 2 3
4 5 6
7 8 10
>> A'
ans =
1 4 7
2 5 8
3 6 10
>>
DEFINITION 3.2 If A is an $m \times n$ matrix and B is an $n \times p$ matrix,
then $C = AB$ where
\[
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
\]
for $i = 1, \dots, m$, $j = 1, \dots, p$. Thus, C is an $m \times p$ matrix.
Example 3.3
Continuing the matlab dialog from Example 3.2, we have
>> B = [-1 0 1
2 3 4
3 2 1]
B =
-1 0 1
2 3 4
3 2 1
>> C = A*B
C =
12 12 12
24 27 30
39 44 49
>>
(If the reader or student is not already comfortable with matrix multiplica-
tion, we suggest conﬁrming the above calculation by doing it with paper and
pencil.)
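The hand calculation suggested above can also be mirrored in a few lines of code. The following Python sketch is separate from the matlab dialog; the function name `matmul` and the list-of-lists representation of a matrix are assumptions made only for illustration. It applies the formula of Definition 3.2 term by term.

```python
def matmul(A, B):
    """Return C = AB with c_ij = sum over k of a_ik * b_kj (Definition 3.2)."""
    m, n, p = len(A), len(B), len(B[0])
    if any(len(row) != n for row in A):
        raise ValueError("A must have as many columns as B has rows")
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

# The matrices A and B from Examples 3.2 and 3.3.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
B = [[-1, 0, 1], [2, 3, 4], [3, 2, 1]]
C = matmul(A, B)
print(C)
```

The triple loop makes the cost visible: forming an m by p product with inner dimension n takes m·n·p multiplications.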
With matrix multiplication, we can write the linear system in Example 3.1
at the beginning of this chapter as
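\[
Ax = b, \quad
A = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix}, \quad
x = \begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix}, \quad
b = \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix}.
\]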
Matrix multiplication can be easily described in terms of the dot product :
DEFINITION 3.3 Suppose we have two real vectors
\[
v = \begin{pmatrix} v_1\\ v_2\\ \vdots\\ v_n \end{pmatrix}
\quad\text{and}\quad
w = \begin{pmatrix} w_1\\ w_2\\ \vdots\\ w_n \end{pmatrix}.
\]
Then the dot product $v \circ w$, also written $(v, w)$, of v and w is the matrix
product
\[
v^T w = \sum_{i=1}^{n} v_i w_i.
\]
If the vectors have complex components, the dot product is defined to be
\[
v^H w = \sum_{i=1}^{n} \overline{v}_i w_i,
\]
where $\overline{v}_i$ is the complex conjugate of $v_i$.
Dot products can also be deﬁned more generally and abstractly, and are
useful throughout pure and applied mathematics. However, our interest here
is the fact that many computations in scientiﬁc computing can be written in
terms of dot products, and most modern computers have special circuitry and
software to do dot products eﬃciently.
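As a small illustration of Definition 3.3, the Python sketch below computes $v^H w$; the function name `dot` is an assumption made for this example. For real vectors the conjugation has no effect, so the same function also gives $v^T w$.

```python
def dot(v, w):
    """Dot product v^H w; for real vectors this is the same as v^T w."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    # int, float and complex all provide conjugate() in Python
    return sum(vi.conjugate() * wi for vi, wi in zip(v, w))

# First row of A times first column of B from Example 3.3 gives c_11.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
B = [[-1, 0, 1], [2, 3, 4], [3, 2, 1]]
c11 = dot(A[0], [row[0] for row in B])

# A complex example: the first argument is conjugated.
z = dot([1j, 2], [1j, 1])
print(c11, z)
```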
Example 3.4
matlab represents the transpose of a vector V as V'. Also, if A is an n by
n matrix in matlab, the i-th row of A is accessed as A(i,:), while the j-th
column of A is accessed as A(:,j). Continuing Example 3.3, we have the
following matlab dialog, illustrating writing the product matrix C in terms
of dot products.
>> C = [ A(1,:)*B(:,1), A(1,:)*B(:,2), A(1,:)*B(:,3)
A(2,:)*B(:,1), A(2,:)*B(:,2), A(2,:)*B(:,3)
A(3,:)*B(:,1), A(3,:)*B(:,2), A(3,:)*B(:,3)]
C =
12 12 12
24 27 30
39 44 49
>>
Matrix inverses are also useful in describing linear systems of equations:
DEFINITION 3.4 Suppose A is an n by n matrix. (That is, suppose
A is square.) Then, $A^{-1}$ is the inverse of A if $A^{-1}A = AA^{-1} = I$, where I
is the n by n identity matrix, consisting of 1's on the diagonal and 0's in all
off-diagonal elements. If A has an inverse, then A is said to be nonsingular
or invertible.
Example 3.5
Continuing the matlab dialog from the previous examples, we have
>> Ainv = inv(A)
Ainv =
-0.66667 -1.33333 1.00000
-0.66667 3.66667 -2.00000
1.00000 -2.00000 1.00000
>> Ainv*A
ans =
1.00000 0.00000 -0.00000
0.00000 1.00000 -0.00000
-0.00000 0.00000 1.00000
>> A*Ainv
ans =
1.0000e+00 4.4409e-16 -4.4409e-16
1.1102e-16 1.0000e+00 -1.1102e-15
3.3307e-16 2.2204e-15 1.0000e+00
>> eye(3)
ans =
1 0 0
0 1 0
0 0 1
>> eye(3)-A*inv(A)
ans =
1.1102e-16 -4.4409e-16 4.4409e-16
-1.1102e-16 -8.8818e-16 1.1102e-15
-3.3307e-16 -2.2204e-15 2.3315e-15
>> eye(3)*B-B
ans =
0 0 0
0 0 0
0 0 0
>>
Above, observe that $IB = B$. Also observe that the computed value of
$I - AA^{-1}$ is not exactly the matrix consisting entirely of zeros, but is a
matrix whose entries are small multiples of the machine epsilon for IEEE
double precision floating point arithmetic.
Some matrices do not have inverses.
DEFINITION 3.5 A matrix that does not have an inverse is called a
singular matrix. A matrix that does have an inverse is said to be non-singular.
Singular matrices are analogous to the number zero when we are dealing
with a single equation in a single unknown. In particular, if A is nonsingular
and we have the system of equations $Ax = b$, it follows that $x = A^{-1}b$ (since
$A^{-1}(Ax) = (A^{-1}A)x = Ix = x = A^{-1}b$), just as if $ax = b$ with $a \neq 0$,
then $x = (1/a)b$.
Example 3.6
The matrix
\[
\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{pmatrix}
\]
is singular. However, if we use matlab to try to ﬁnd an inverse, we obtain:
>> A = [1 2 3;4 5 6;7 8 9]
A =
1 2 3
4 5 6
7 8 9
>> inv(A)
ans =
1.0e+016 *
-0.4504 0.9007 -0.4504
0.9007 -1.8014 0.9007
-0.4504 0.9007 -0.4504
>>
Observe that the matrix matlab gives for the inverse has large elements (on
the order of the reciprocal of the machine epsilon $\epsilon_m \approx 1.11 \times 10^{-16}$ times
the elements of A). This is due to roundoff error. This can be viewed as
analogous to trying to form $1/a$ when $a = 0$, but, due to roundoff error (such
as some cancellation error), a is a small number, on the order of the machine
epsilon $\epsilon_m$. We then have that $1/a$ is on the order of $1/\epsilon_m$.
The following two deﬁnitions and theorem clarify which matrices are sin-
gular and clarify the relationship between singular matrices and solution of
linear systems of equations involving those matrices.
DEFINITION 3.6 Let $\{v^{(i)}\}_{i=1}^{m}$ be m vectors. Then $\{v^{(i)}\}_{i=1}^{m}$ is said
to be linearly independent provided that, if $\sum_{i=1}^{m} \beta_i v^{(i)} = 0$, then $\beta_i = 0$ for
$i = 1, 2, \dots, m$.
Example 3.7
Let
\[
a_1 = \begin{pmatrix} 1\\ 2\\ 3 \end{pmatrix}, \quad
a_2 = \begin{pmatrix} 4\\ 5\\ 6 \end{pmatrix}, \quad\text{and}\quad
a_3 = \begin{pmatrix} 7\\ 8\\ 9 \end{pmatrix}
\]
be the rows of the matrix A from Example 3.6 (expressed as column vectors).
Then
\[
a_1 - 2a_2 + a_3 = 0,
\]
so $a_1$, $a_2$, and $a_3$ are linearly dependent. In particular, the third row of A is
two times the second row minus the first row.
DEFINITION 3.7 The rank of a matrix A, rank(A), is the maximum
number of linearly independent rows it possesses. It can be shown that this is
the same as the maximum number of linearly independent columns. If A is an
m by n matrix and $\operatorname{rank}(A) = \min\{m, n\}$, then A is said to be of full rank.
For example, if $m < n$ and the rows of A are linearly independent, then A is
of full rank.
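To make Definition 3.7 concrete, the following Python sketch counts the linearly independent rows of a small matrix by row reduction. It uses exact rational arithmetic (`fractions.Fraction`) so that roundoff cannot blur the distinction between a zero and a nonzero pivot; this choice, and the function name `rank`, are assumptions of the sketch and not how rank is computed in floating point practice.

```python
from fractions import Fraction

def rank(A):
    """Rank of A by exact row reduction over the rationals."""
    M = [[Fraction(x) for x in row] for row in A]
    m, n = len(M), len(M[0])
    r = 0  # number of pivot rows found so far
    for c in range(n):
        # look for a nonzero entry in column c at or below row r
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, m):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == m:
            break
    return r

singular = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # matrix of Example 3.6
nonsingular = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]  # matrix of Example 3.1
print(rank(singular), rank(nonsingular))
```

The matrix of Example 3.6 has only two independent rows, in agreement with the relation found in Example 3.7.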
The following theorem deals with rank, nonsingularity, and solutions to
systems of equations.
THEOREM 3.1
Let A be an $n \times n$ matrix ($A \in L(\mathbb{C}^n)$). Then the following are equivalent:
1. A is nonsingular.
2. $\det(A) \neq 0$, where $\det(A)$ is the determinant$^1$ of the matrix A.
3. The linear system $Ax = 0$ has only the solution $x = 0$.
4. For any $b \in \mathbb{C}^n$, the linear system $Ax = b$ has a unique solution.
5. The columns (and rows) of A are linearly independent. (That is, A is
of full rank, i.e. $\operatorname{rank}(A) = n$.)
When the matrices for a system of equations have special properties, we
can often use these properties to take short cuts in the computation to solve
corresponding systems of equations, or to know that roundoﬀ error will not
accumulate when solving such systems. Symmetry and positive deﬁniteness
are important properties for these purposes.
DEFINITION 3.8 If $A^T = A$, then A is said to be symmetric. If
$A^H = A$, then A is said to be Hermitian.
Example 3.8
If $A = \begin{pmatrix} 1 & 2 - i\\ 2 + i & 3 \end{pmatrix}$, then
$A^H = \begin{pmatrix} 1 & 2 - i\\ 2 + i & 3 \end{pmatrix}$, so A is Hermitian.
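The check in Example 3.8 can be sketched in Python, which has a built-in complex type (the imaginary unit is written `1j`). The helper names `conj_transpose` and `is_hermitian` are assumptions made for this illustration.

```python
def conj_transpose(A):
    """Return A^H: entry (j, i) of the result is the conjugate of a_ij."""
    rows, cols = len(A), len(A[0])
    return [[A[i][j].conjugate() for i in range(rows)] for j in range(cols)]

def is_hermitian(A):
    """True when A is square and equal to its conjugate transpose."""
    return len(A) == len(A[0]) and conj_transpose(A) == A

A = [[1, 2 - 1j], [2 + 1j, 3]]       # matrix of Example 3.8
S = [[1, 1j], [1j, 1]]               # symmetric, but not Hermitian
print(is_hermitian(A), is_hermitian(S))
```

The second matrix shows that, for complex entries, symmetry ($A^T = A$) and the Hermitian property ($A^H = A$) are different conditions.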
DEFINITION 3.9 If A is an n by n matrix with real entries, if $A^T = A$
and $x^T A x > 0$ for any $x \in \mathbb{R}^n$ except $x = 0$, then A is said to be symmetric
positive definite. If A is an n by n matrix with complex entries, if $A^H = A$
and $x^H A x > 0$ for $x \in \mathbb{C}^n$, $x \neq 0$, then A is said to be Hermitian positive
definite. Similarly, if $x^T A x \geq 0$ (for a real matrix A) or $x^H A x \geq 0$ (for
a complex matrix A) for every $x \neq 0$, we say that A is symmetric positive
semi-definite or Hermitian positive semi-definite, respectively.
$^1$ We will not give a formal definition of the determinant here, but we will use its properties.
Determinants are usually defined carefully in a linear algebra course. We explain a good
way of computing determinants in Section 3.2.3 on page 86. When the determinant
of a small matrix is computed symbolically, expansion by minors is often used.
Example 3.9
If $A = \begin{pmatrix} 4 & 1\\ 1 & 3 \end{pmatrix}$, then $A^T = A$, so A is symmetric.
Also,
\[
x^T A x = 4x_1^2 + 2x_1 x_2 + 3x_2^2 = 3x_1^2 + (x_1 + x_2)^2 + 2x_2^2 > 0 \quad\text{for } x \neq 0.
\]
Thus, A is symmetric positive definite.
Prior to studying actual methods for analyzing systems of linear equations,
we introduce the following concepts.
DEFINITION 3.10 If $v = (v_1, \dots, v_n)^T$ is a vector and $\lambda$ is a number,
we define scalar multiplication $w = \lambda v$ by $w_i = \lambda v_i$, that is, we multiply each
component of v by $\lambda$. We say that we have scaled v by $\lambda$. We can similarly
scale a matrix.
Example 3.10
Observe the following matlab dialog.
>> v = [1;-1;2]
v =
1
-1
2
>> lambda = 3
lambda =
3
>> lambda*v
ans =
3
-3
6
>>
DEFINITION 3.11 If A is an $n \times n$ matrix, a scalar $\lambda$ and a nonzero x
are an eigenvalue and eigenvector of A if $Ax = \lambda x$.
DEFINITION 3.12 $\rho(A) = \max_{1 \leq i \leq n} |\lambda_i|$, where $\{\lambda_i\}_{i=1}^{n}$ is the set of
eigenvalues of A, is called the spectral radius of A.
Example 3.11
The matlab function eig computes eigenvalues and eigenvectors. Consider
the following matlab dialog.
>> A = [1,2,3
4 5 6
7 8 10]
A =
1 2 3
4 5 6
7 8 10
>> [V,Lambda] = eig(A)
V =
-0.2235 -0.8658 0.2783
-0.5039 0.0857 -0.8318
-0.8343 0.4929 0.4802
Lambda =
16.7075 0 0
0 -0.9057 0
0 0 0.1982
>> A*V(:,1) - Lambda(1,1)*V(:,1)
ans =
1.0e-014 *
0.0444
-0.1776
0.1776
>> A*V(:,2) - Lambda(2,2)*V(:,2)
ans =
1.0e-014 *
0.0777
0.1985
0.0944
>> A*V(:,3) - Lambda(3,3)*V(:,3)
ans =
1.0e-014 *
0.0666
-0.0444
0.2109
>>
Note that the eigenvectors of the matrix A are stored in the columns of V, while
corresponding eigenvalues are stored in the diagonal entries of the diagonal
matrix Lambda. In this case, the spectral radius is ρ(A) ≈ 16.7075.
Although we won’t study computation of eigenvalues and eigenvectors until
Chapter 5, we refer to the concept in this chapter.
With these facts and concepts, we can now study the actual solution of
systems of equations on computers.
3.2 Gaussian Elimination
We can think of Gaussian elimination as a process of repeatedly adding
a multiple of one equation to another equation to transform the system of
equations into one that is easy to solve. We ﬁrst focus on these elementary
row operations.
DEFINITION 3.13 Consider a linear system of equations $Ax = b$, where
A is $n \times n$, and $b, x \in \mathbb{R}^n$. Elementary row operations on a system of linear
equations are of the following three types:
1. interchanging two equations,
2. multiplying an equation by a nonzero number,
3. adding to one equation a scalar multiple of another equation.
THEOREM 3.2
If system Bx = d is obtained from system Ax = b by a ﬁnite sequence of
elementary operations, then the two systems have the same solutions.
(A proof of Theorem 3.2 can be found in elementary texts on linear alge-
bra and can be done, for example, with Theorem 3.1 and using elementary
properties of determinants.)
The idea underlying Gaussian elimination is simple:
1. Subtract multiples of the first equation from the second through the
n-th equations to eliminate $x_1$ from the second through n-th equations.
2. Then, subtract multiples of the new second equation from the third
through n-th equations to eliminate $x_2$ from these. After this step, the
third through n-th equations contain neither $x_1$ nor $x_2$.
3. Continue this process until the resulting n-th equation contains only $x_n$,
the resulting (n − 1)-st equation contains only $x_n$ and $x_{n-1}$, etc.
4. Solve the resulting n-th equation for $x_n$.
5. Plug the value for $x_n$ into the resulting (n − 1)-st equation, and solve
that equation for $x_{n-1}$.
6. Continue this back-substitution process until we have solved for $x_1$ in
the first equation.
Example 3.12
We will apply this process to the system in Example 3.1. In illustrating
the process, we can write the original system and transformed systems as an
augmented matrix, with a number’s position in the matrix telling us to which
variable (or right-hand-side) and which equation it belongs. The original
system is thus written as
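\[
\left(\begin{array}{rrr|r} 1 & 2 & 3 & -1\\ 4 & 5 & 6 & 0\\ 7 & 8 & 10 & 1 \end{array}\right).
\]
We will use $\sim$ to denote that two systems of equations are equivalent, and we
will indicate below this symbol which multiples are subtracted: For example,
$R_3 \leftarrow R_3 - 2R_2$ would mean that we replace the third row (i.e. the third
equation) by the third equation minus two times the second equation. The
Gaussian elimination process then proceeds as follows.
\[
\left(\begin{array}{rrr|r} 1 & 2 & 3 & -1\\ 4 & 5 & 6 & 0\\ 7 & 8 & 10 & 1 \end{array}\right)
\underset{\substack{R_2 \leftarrow R_2 - 4R_1\\ R_3 \leftarrow R_3 - 7R_1}}{\sim}
\left(\begin{array}{rrr|r} 1 & 2 & 3 & -1\\ 0 & -3 & -6 & 4\\ 0 & -6 & -11 & 8 \end{array}\right)
\underset{R_3 \leftarrow R_3 - 2R_2}{\sim}
\left(\begin{array}{rrr|r} 1 & 2 & 3 & -1\\ 0 & -3 & -6 & 4\\ 0 & 0 & 1 & 0 \end{array}\right).
\]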
The transformed third equation now reads “$x_3 = 0$,” while the transformed
second equation reads “$-3x_2 - 6x_3 = 4$.” Plugging $x_3 = 0$ into the
transformed second equation thus gives
\[
x_2 = (4 + 6x_3)/(-3) = (4)/(-3) = -\frac{4}{3}.
\]
Similarly, plugging $x_3 = 0$ and $x_2 = -4/3$ into the transformed first equation
gives
\[
x_1 = (-1 - 2x_2 - 3x_3) = (-1 - 2(-4/3)) = 5/3.
\]
The solution vector is thus
\[
\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} = \begin{pmatrix} 5/3\\ -4/3\\ 0 \end{pmatrix}.
\]
We check by computing the residual:
\[
Ax - b =
\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix}
\begin{pmatrix} 5/3\\ -4/3\\ 0 \end{pmatrix}
- \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix}
= \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix} - \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix}
= \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix}.
\]
Example 3.13
If we had used floating point arithmetic in Example 3.12, 5/3 and −4/3
would not have been exactly representable, and the residual would not have
been exactly zero. In fact, a variant$^2$ of Gaussian elimination with
back-substitution is programmed in matlab and accessible with the backslash (\)
operator:
>> A = [1 2 3
4 5 6
7 8 10]
A =
1 2 3
4 5 6
7 8 10
>> b = [-1;0;1]
b =
-1
0
1
>> x = A\b
x =
1.6667
-1.3333
-0.0000
>> A*x-b
ans =
1.0e-015 *
0.2220
0.8882
0.8882
>>
3.2.1 The Gaussian Elimination Algorithm
Following the pattern in the examples we have presented, we can write down
the process in general. The system will be written
\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1,\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2,\\
&\;\;\vdots\\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n.
\end{aligned}
\]
$^2$ using partial pivoting, which we will see later
Now, the transformed matrix
\[
\begin{pmatrix} 1 & 2 & 3\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{pmatrix}
\]
from Example 3.12 (page 80) is termed an upper triangular matrix, since it has
zeros in all entries below the diagonal. The goal of Gaussian elimination is to
reduce A to an upper triangular matrix through a sequence of elementary row
operations as in Definition 3.13. We will call the transformed matrix before
working on the r-th column $A^{(r)}$, with associated right-hand-side vector $b^{(r)}$,
and we begin with $A^{(1)} = A = (a^{(1)}_{ij})$ and $b = b^{(1)} = (b^{(1)}_1, b^{(1)}_2, \dots, b^{(1)}_n)^T$,
with $A^{(1)}x = b^{(1)}$. The process can then be described as follows.
Step 1: Assume that $a^{(1)}_{11} \neq 0$. (Otherwise, the nonsingularity of A
guarantees that the rows of A can be interchanged in such a way that the new
$a^{(1)}_{11}$ is nonzero.) Let
\[
m_{i1} = \frac{a^{(1)}_{i1}}{a^{(1)}_{11}}, \quad 2 \leq i \leq n.
\]
Now multiply the first equation of $A^{(1)}x = b^{(1)}$ by $m_{i1}$ and subtract the
result from the i-th equation. Repeat this for each i, $2 \leq i \leq n$. As a
result, we obtain $A^{(2)}x = b^{(2)}$, where
\[
A^{(2)} = \begin{pmatrix}
a^{(1)}_{11} & a^{(1)}_{12} & \cdots & a^{(1)}_{1n}\\
0 & a^{(2)}_{22} & \cdots & a^{(2)}_{2n}\\
\vdots & \vdots & & \vdots\\
0 & a^{(2)}_{n2} & \cdots & a^{(2)}_{nn}
\end{pmatrix}
\quad\text{and}\quad
b^{(2)} = \begin{pmatrix} b^{(1)}_1\\ b^{(2)}_2\\ \vdots\\ b^{(2)}_n \end{pmatrix}.
\]
Step 2: We consider the $(n-1) \times (n-1)$ submatrix $\tilde{A}^{(2)}$ of $A^{(2)}$ defined
by $\tilde{A}^{(2)} = (a^{(2)}_{ij})$, $2 \leq i, j \leq n$. We eliminate the first column of $\tilde{A}^{(2)}$
in a manner identical to the procedure for $A^{(1)}$. The result is the system
$A^{(3)}x = b^{(3)}$ where $A^{(3)}$ has the form
\[
A^{(3)} = \begin{pmatrix}
a^{(1)}_{11} & a^{(1)}_{12} & a^{(1)}_{13} & \cdots & a^{(1)}_{1n}\\
0 & a^{(2)}_{22} & a^{(2)}_{23} & \cdots & a^{(2)}_{2n}\\
\vdots & 0 & a^{(3)}_{33} & \cdots & a^{(3)}_{3n}\\
\vdots & \vdots & \vdots & & \vdots\\
0 & 0 & a^{(3)}_{n3} & \cdots & a^{(3)}_{nn}
\end{pmatrix}.
\]
Steps 3 to n − 1: The process continues as above, where at the k-th stage
we have $A^{(k)}x = b^{(k)}$, $1 \leq k \leq n-1$, where
\[
A^{(k)} = \begin{pmatrix}
a^{(1)}_{11} & a^{(1)}_{12} & \cdots & a^{(1)}_{1k} & \cdots & a^{(1)}_{1n}\\
0 & a^{(2)}_{22} & \cdots & a^{(2)}_{2k} & \cdots & a^{(2)}_{2n}\\
\vdots & \ddots & \ddots & \vdots & & \vdots\\
0 & \cdots & 0 & a^{(k)}_{kk} & \cdots & a^{(k)}_{kn}\\
\vdots & & \vdots & \vdots & & \vdots\\
0 & \cdots & 0 & a^{(k)}_{nk} & \cdots & a^{(k)}_{nn}
\end{pmatrix}
\quad\text{and}\quad
b^{(k)} = \begin{pmatrix}
b^{(1)}_1\\ b^{(2)}_2\\ \vdots\\ b^{(k-1)}_{k-1}\\ b^{(k)}_k\\ \vdots\\ b^{(k)}_n
\end{pmatrix}.
\tag{3.1}
\]
For every i, $k + 1 \leq i \leq n$, the k-th equation is multiplied by
\[
m_{ik} = a^{(k)}_{ik} / a^{(k)}_{kk}
\]
and subtracted from the i-th equation. (We assume, if necessary, a row
is interchanged so that $a^{(k)}_{kk} \neq 0$.) After step $k = n - 1$, the resulting
system is $A^{(n)}x = b^{(n)}$ where $A^{(n)}$ is upper triangular.
On a computer, this algorithm can be programmed as:
ALGORITHM 3.1
(Gaussian elimination, forward phase)
INPUT: the n by n matrix A and the n-vector $b \in \mathbb{R}^n$.
OUTPUT: $A^{(n)}$ and $b^{(n)}$.
FOR k = 1, 2, ..., n − 1
    FOR i = k + 1, ..., n
        (a) $m_{ik} \leftarrow a_{ik}/a_{kk}$.
        (b) FOR j = k, k + 1, ..., n
                $a_{ij} \leftarrow a_{ij} - m_{ik} a_{kj}$.
            END FOR
        (c) $b_i \leftarrow b_i - m_{ik} b_k$.
    END FOR
END FOR
END ALGORITHM 3.1.
Note: In Step (b) of Algorithm 3.1, we need only do the loop for j = k + 1
to n, since we know that the resulting $a_{ik}$ will equal 0.
Back solving can be programmed as:
ALGORITHM 3.2
(Gaussian elimination, back solving phase)
INPUT: $A^{(n)}$ and $b^{(n)}$ from Algorithm 3.1.
OUTPUT: $x \in \mathbb{R}^n$ as a solution to $Ax = b$.
1. $x_n \leftarrow b_n / a_{nn}$.
2. FOR k = n − 1, n − 2, ..., 1
       $x_k \leftarrow \left( b_k - \sum_{j=k+1}^{n} a_{kj} x_j \right) / a_{kk}$.
   END FOR
END ALGORITHM 3.2.
Note: To solve $Ax = b$ using Gaussian elimination requires $\frac{1}{3}n^3 + O(n^2)$
multiplications and divisions. (See Exercise 4 on page 142.)
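Algorithms 3.1 and 3.2 translate almost line for line into code. The Python sketch below is one possible transcription, not the book's software: the function names are assumptions, indices run from 0 instead of 1, and exact rational arithmetic (`fractions.Fraction`) is used so that the result can be compared with the hand computation of Example 3.12. No row interchanges are attempted, so a zero pivot is reported as an error.

```python
from fractions import Fraction

def forward_eliminate(A, b):
    """Algorithm 3.1: reduce Ax = b to an upper triangular system."""
    n = len(A)
    A = [[Fraction(x) for x in row] for row in A]  # work on copies
    b = [Fraction(x) for x in b]
    for k in range(n - 1):
        if A[k][k] == 0:
            raise ZeroDivisionError("zero pivot; a row interchange is needed")
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]          # multiplier m_ik
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    return A, b

def back_solve(U, c):
    """Algorithm 3.2: solve the upper triangular system Ux = c."""
    n = len(U)
    x = [Fraction(0)] * n
    for k in range(n - 1, -1, -1):
        s = sum(U[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (c[k] - s) / U[k][k]
    return x

# The system of Example 3.1.
U, c = forward_eliminate([[1, 2, 3], [4, 5, 6], [7, 8, 10]], [-1, 0, 1])
x = back_solve(U, c)
print(x)
```

Replacing `Fraction` by `float` gives the floating point version, whose residual is then small but in general not exactly zero, as in Example 3.13.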
3.2.2 The LU decomposition
We now explain how the Gaussian elimination process we have just pre-
sented can be viewed as ﬁnding a lower triangular matrix L (i.e. a matrix
with zeros above the diagonal) and an upper triangular matrix U such that
A = LU. Assume ﬁrst that no row interchanges are performed in Gaussian
elimination. Let
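\[
M^{(1)} = \left(\begin{array}{c|c}
1 & 0 \;\cdots\; 0\\ \hline
\begin{matrix} -m_{21}\\ -m_{31}\\ \vdots\\ -m_{n1} \end{matrix} & I_{n-1}
\end{array}\right),
\]
where $I_{n-1}$ is the $(n-1) \times (n-1)$ identity matrix, and where $m_{i1}$, $2 \leq i \leq n$, are
defined in Gaussian elimination. Then, $A^{(2)} = M^{(1)}A^{(1)}$ and $b^{(2)} = M^{(1)}b^{(1)}$.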
At the r-th stage of the Gaussian elimination process,
\[
M^{(r)} = \left(\begin{array}{c|c|c}
I_{r-1} & 0 & 0 \;\cdots\; 0\\ \hline
0 & 1 & 0 \;\cdots\; 0\\ \hline
0 & \begin{matrix} -m_{r+1,r}\\ \vdots\\ -m_{n,r} \end{matrix} & I_{n-r}
\end{array}\right),
\tag{3.2}
\]
where the diagonal entry 1 shown explicitly lies in the r-th row. Also,
\[
(M^{(r)})^{-1} = \left(\begin{array}{c|c|c}
I_{r-1} & 0 & 0 \;\cdots\; 0\\ \hline
0 & 1 & 0 \;\cdots\; 0\\ \hline
0 & \begin{matrix} m_{r+1,r}\\ \vdots\\ m_{n,r} \end{matrix} & I_{n-r}
\end{array}\right),
\tag{3.3}
\]
where $m_{ir}$, $r + 1 \leq i \leq n$, are given in the Gaussian elimination process, and
$A^{(r+1)} = M^{(r)}A^{(r)}$ and $b^{(r+1)} = M^{(r)}b^{(r)}$. (Note: We are assuming here that
$a^{(r)}_{rr} \neq 0$ and no row interchanges are required.) Collecting the above results,
we obtain $A^{(n)}x = b^{(n)}$, where
\[
A^{(n)} = M^{(n-1)}M^{(n-2)} \cdots M^{(1)}A^{(1)}
\quad\text{and}\quad
b^{(n)} = M^{(n-1)}M^{(n-2)} \cdots M^{(1)}b^{(1)}.
\]
Recalling that $A^{(n)}$ is upper triangular and setting $A^{(n)} = U$, we have
\[
A = (M^{(n-1)}M^{(n-2)} \cdots M^{(1)})^{-1}U.
\tag{3.4}
\]
Example 3.14
Following the Gaussian elimination process from Example 3.12, we have
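\[
M^{(1)} = \begin{pmatrix} 1 & 0 & 0\\ -4 & 1 & 0\\ -7 & 0 & 1 \end{pmatrix}, \quad
M^{(2)} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -2 & 1 \end{pmatrix},
\]
\[
(M^{(1)})^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 4 & 1 & 0\\ 7 & 0 & 1 \end{pmatrix}, \quad
(M^{(2)})^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 2 & 1 \end{pmatrix},
\]
and $A = LU$, with
\[
L = (M^{(1)})^{-1}(M^{(2)})^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 4 & 1 & 0\\ 7 & 2 & 1 \end{pmatrix}, \quad
U = \begin{pmatrix} 1 & 2 & 3\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{pmatrix}.
\]
Applying the Gaussian elimination process to b can be viewed as solving
$Ly = b$ for y, then solving $Ux = y$ for x. Solving $Ly = b$ involves forming
$M^{(1)}b$, then forming $M^{(2)}(M^{(1)}b)$, while solving $Ux = y$ is simply the
back-substitution process.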
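The factorization in Example 3.14 can be reproduced with a short routine that stores each multiplier $m_{ik}$ in L while it reduces a copy of A to U. This Python sketch is an illustration under stated assumptions: the name `lu_no_pivot` is hypothetical, indices start at 0, exact rational arithmetic is used, and no row interchanges are made (so a zero pivot raises an error).

```python
from fractions import Fraction

def lu_no_pivot(A):
    """Return (L, U) with A = LU, L unit lower triangular, U upper triangular."""
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot; a row interchange is needed")
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]     # multiplier m_ik goes into L
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
L, U = lu_no_pivot(A)
print(L)
print(U)
```

Multiplying the two factors back together recovers A, which is a convenient check on any implementation.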
Note: The product of two lower triangular matrices is lower triangular and
the inverse of a nonsingular lower triangular matrix is lower triangular. Thus,
$L = (M^{(1)})^{-1}(M^{(2)})^{-1} \cdots (M^{(n-1)})^{-1}$ is lower triangular. Hence, $A = LU$,
i.e., A is expressed as a product of lower and upper triangular matrices. The
result is called the LU decomposition (also known as the LU factorization,
triangular factorization, or triangular decomposition) of A. The final matrices
L and U are given by:
\[
L = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0\\
m_{21} & 1 & 0 & \cdots & 0\\
m_{31} & m_{32} & 1 & \cdots & 0\\
\vdots & \vdots & & \ddots & \vdots\\
m_{n1} & m_{n2} & \cdots & m_{n,n-1} & 1
\end{pmatrix}
\quad\text{and}\quad
U = \begin{pmatrix}
a^{(1)}_{11} & a^{(1)}_{12} & \cdots & a^{(1)}_{1n}\\
0 & a^{(2)}_{22} & \cdots & a^{(2)}_{2n}\\
\vdots & \ddots & \ddots & \vdots\\
0 & \cdots & 0 & a^{(n)}_{nn}
\end{pmatrix}.
\tag{3.5}
\]
This decomposition can be so formed when no row interchanges are required.
Thus, the original problem Ax = b is transformed into LUx = b.
Note: Since computing the LU decomposition of A is done by Gaussian
elimination, it requires $O(n^3)$ operations. However, if L and U are already
available, computing y with $Ly = b$ then computing x with $Ux = y$ requires
only $O(n^2)$ operations.
Note: In some software, the multiplying factors, that is, the nonzero oﬀ-
diagonal elements of L, are stored in the locations of corresponding entries
of A that are made equal to zero, thus obviating the need for extra storage.
Eﬀectively, such software returns the elements of L and U in the same array
that was used to store A.
3.2.3 Determinants and Inverses
Usually, the solution x of a system of equations Ax = b is desired, and the
determinant det(A) is not of interest, even though one method of computing
x, Cramer’s rule, involves ﬁrst computing determinants. (In fact, computing
x with Gaussian elimination with back substitution is more eﬃcient than
using Cramer’s rule, and is deﬁnitely more practical for large n.) However,
occasionally the determinant of A is desired for other reasons. An eﬃcient
way of computing the determinant of a matrix is with Gaussian elimination.
If $A = LU$, then
\[
\det(A) = \det(L)\det(U) = \prod_{j=1}^{n} a^{(j)}_{jj}.
\]
(Using expansion by minors to compute the determinant requires O(n!) mul-
tiplications.)
Similarly, even though we could in principle compute $A^{-1}$, then compute
$x = A^{-1}b$, computing $A^{-1}$ is less efficient than applying Gaussian elimination
with back-substitution. However, if we need $A^{-1}$ for some other reason, we
can compute it relatively efficiently by solving n systems $Ax^{(j)} = e^{(j)}$ where
$e^{(j)}_i = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta function defined by
\[
\delta_{ij} = \begin{cases} 1 & \text{if } i = j,\\ 0 & \text{if } i \neq j. \end{cases}
\]
If $A = LU$, we perform n pairs of forward and backward solves, to obtain
\[
A^{-1} = (x^{(1)}, x^{(2)}, \dots, x^{(n)}).
\]
Example 3.15
In Example 3.14, for
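\[
A = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix},
\]
we used Gaussian elimination to obtain $A = LU$, with
\[
L = \begin{pmatrix} 1 & 0 & 0\\ 4 & 1 & 0\\ 7 & 2 & 1 \end{pmatrix}
\quad\text{and}\quad
U = \begin{pmatrix} 1 & 2 & 3\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{pmatrix}.
\]
Thus,
\[
\det(A) = u_{11}u_{22}u_{33} = (1)(-3)(1) = -3.
\]
We now compute $A^{-1}$: Using L and U to solve
\[
\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix} x^{(1)} = \begin{pmatrix} 1\\ 0\\ 0 \end{pmatrix}
\quad\text{gives}\quad
x^{(1)} = \begin{pmatrix} -2/3\\ -2/3\\ 1 \end{pmatrix},
\]
solving
\[
\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix} x^{(2)} = \begin{pmatrix} 0\\ 1\\ 0 \end{pmatrix}
\quad\text{gives}\quad
x^{(2)} = \begin{pmatrix} -4/3\\ 11/3\\ -2 \end{pmatrix},
\]
and solving
\[
\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix} x^{(3)} = \begin{pmatrix} 0\\ 0\\ 1 \end{pmatrix}
\quad\text{gives}\quad
x^{(3)} = \begin{pmatrix} 1\\ -2\\ 1 \end{pmatrix}.
\]
Thus,
\[
A^{-1} = \begin{pmatrix} -2/3 & -4/3 & 1\\ -2/3 & 11/3 & -2\\ 1 & -2 & 1 \end{pmatrix}, \quad
AA^{-1} = I = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}.
\]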
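The procedure of Example 3.15, one factorization followed by n pairs of triangular solves, can be sketched as follows. The function names are assumptions made for illustration, exact rational arithmetic is used so the output can be compared with the fractions above, and the factorization is done without row interchanges.

```python
from fractions import Fraction

def lu_no_pivot(A):
    """Factor A = LU (unit lower triangular L) without row interchanges."""
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]   # raises ZeroDivisionError on a zero pivot
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def solve_lu(L, U, b):
    """Solve Ly = b (forward), then Ux = y (backward)."""
    n = len(b)
    y = [Fraction(0)] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))  # L has unit diagonal
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

def inverse(A):
    """Columns of the inverse are the solutions of A x^(j) = e^(j)."""
    n = len(A)
    L, U = lu_no_pivot(A)
    cols = [solve_lu(L, U, [int(i == j) for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

Ainv = inverse([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
print(Ainv)
```

The factorization is computed once and reused for every right-hand side, which is where the saving over n separate eliminations comes from.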
3.2.4 Pivoting in Gaussian Elimination
In our explanation and examples of Gaussian elimination so far, we have
assumed that “no row interchanges are required.” In particular, we must have
$a_{kk} \neq 0$ in each step of Algorithm 3.1. Otherwise, we may need to do a “row
interchange,” that is, we may need to rearrange the order of the transformed
equations. We have two questions:
1. When can Gaussian elimination be performed without row interchanges?
2. If row interchanges are employed, can Gaussian elimination always be
performed?
THEOREM 3.3
(Existence of an LU factorization) Assume that the $n \times n$ matrix A is nonsingular.
Then $A = LU$ if and only if all the leading principal submatrices of A are
nonsingular.$^3$ Moreover, the LU decomposition is unique, if we require that
the diagonal elements of L are all equal to 1.
REMARK 3.1 Two important types of matrices that have nonsingular
leading principal submatrices are symmetric positive definite and strictly
diagonally dominant, i.e.,
\[
|a_{ii}| > \sum_{\substack{j=1\\ j \neq i}}^{n} |a_{ij}|, \quad\text{for } i = 1, 2, \dots, n.
\]
We now consider our second question, “If row interchanges are employed, can
Gaussian elimination be performed for any nonsingular A?” Switching the
rows of a matrix A can be done by multiplying A on the left by a permutation
matrix:
DEFINITION 3.14 A permutation matrix P is a matrix whose columns
consist of the n different vectors $e_j$, $1 \leq j \leq n$, in any order.
$^3$ The leading principal submatrices of A have the form
\[
\begin{pmatrix}
a_{11} & \cdots & a_{1k}\\
\vdots & \ddots & \vdots\\
a_{k1} & \cdots & a_{kk}
\end{pmatrix}
\quad\text{for } k = 1, 2, \dots, n.
\]
Example 3.16
\[
P = (e_1, e_3, e_4, e_2) = \begin{pmatrix}
1 & 0 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 1 & 0 & 0\\
0 & 0 & 1 & 0
\end{pmatrix}
\]
is a permutation matrix such that the first row of PA is the first row of A, the
second row of PA is the fourth row of A, the third row of PA is the second
row of A, and the fourth row of PA is the third row of A. Note that the
permutation of the columns of the identity matrix in P corresponds to the
permutation of the rows of A. For example,
\[
\begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix}
= \begin{pmatrix} 4 & 5 & 6\\ 7 & 8 & 10\\ 1 & 2 & 3 \end{pmatrix}.
\]
Thus, by proper choice of P, any two or more rows can be interchanged.
Note: det P = ±1, since P is obtained from I by row interchanges.
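The construction in Definition 3.14 and the row reordering in Example 3.16 can be tried out with the small Python sketch below. The function names and the 1-based `order` argument (chosen to match the notation $P = (e_1, e_3, e_4, e_2)$) are assumptions made for illustration.

```python
def permutation_matrix(order):
    """Build P = (e_{order[0]}, e_{order[1]}, ...) from 1-based column labels."""
    n = len(order)
    if sorted(order) != list(range(1, n + 1)):
        raise ValueError("order must be a permutation of 1..n")
    P = [[0] * n for _ in range(n)]
    for col, j in enumerate(order):
        P[j - 1][col] = 1          # column `col` of P is the unit vector e_j
    return P

def matmul(A, B):
    """Plain matrix product, used here to form PA."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P4 = permutation_matrix([1, 3, 4, 2])   # the 4 by 4 matrix of Example 3.16
P3 = permutation_matrix([3, 1, 2])      # columns e_3, e_1, e_2
A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
PA = matmul(P3, A)
print(P4)
print(PA)
```

In practice the product PA is never formed by multiplication; software records the permutation as a list of row indices, and the matrix form is mainly useful for analysis.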
Now, Gaussian elimination with row interchanges can be performed by the
following matrix operations:$^4$
\[
A^{(n)} = M^{(n-1)}P^{(n-1)}M^{(n-2)}P^{(n-2)} \cdots M^{(2)}P^{(2)}M^{(1)}P^{(1)}A,
\]
\[
b^{(n)} = M^{(n-1)}P^{(n-1)} \cdots M^{(2)}P^{(2)}M^{(1)}P^{(1)}b.
\]
It follows that $U = \hat{L}A^{(1)}$, where $\hat{L}$ is no longer lower triangular. However, if
we perform all the row interchanges first, at once, then
\[
M^{(n-1)} \cdots M^{(1)}PAx = M^{(n-1)}M^{(n-2)} \cdots M^{(1)}Pb,
\]
or
\[
\tilde{L}PAx = \tilde{L}Pb,
\]
so
\[
\tilde{L}PA = U.
\]
Thus,
\[
PA = \tilde{L}^{-1}U = LU.
\]
We can state these facts as follows.
$^4$ When implementing Gaussian elimination, we usually don't actually multiply full n by
n matrices together, since this is not efficient. However, viewing the process as matrix
multiplications has advantages when we analyze it.
THEOREM 3.4
If A is a nonsingular $n \times n$ matrix, then there is a permutation matrix P
such that $PA = LU$, where L is lower triangular and U is upper triangular.
(Note: $\det(PA) = \pm\det(A) = \det(L)\det(U)$.)
We now examine the actual operations we do to complete the Gaussian
elimination process with row interchanges (known as pivoting).
Example 3.17
Consider the system
\[
\begin{aligned}
0.0001x_1 + x_2 &= 1\\
x_1 + x_2 &= 2.
\end{aligned}
\]
The exact solution of this system is $x_1 \approx 1.00010$ and $x_2 \approx 0.99990$. Let
us solve the system using Gaussian elimination without row interchanges.
We will assume calculations are performed using three-digit rounding decimal
arithmetic. We obtain
\[
m_{21} \leftarrow \frac{a^{(1)}_{21}}{a^{(1)}_{11}} \approx 0.1 \times 10^{5},
\]
\[
a^{(2)}_{22} \leftarrow a^{(1)}_{22} - m_{21}a^{(1)}_{12} \approx 0.1 \times 10^{1} - 0.1 \times 10^{5} \approx -0.100 \times 10^{5}.
\]
Also, $b^{(2)} \approx (0.1 \times 10^{1}, -0.1 \times 10^{5})^T$, so the computed (approximate) upper
triangular system is
\[
\begin{aligned}
0.1 \times 10^{-3}x_1 + 0.1 \times 10^{1}x_2 &= 0.1 \times 10^{1},\\
-0.1 \times 10^{5}x_2 &= -0.1 \times 10^{5},
\end{aligned}
\]
whose solutions are $x_2 = 1$ and $x_1 = 0$. If instead, we first interchange the
equations so that $a^{(1)}_{11} = 1$, we find that $x_1 = x_2 = 1$, correct to the accuracy
used.
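The three-digit computation in Example 3.17 can be simulated with Python's `decimal` module, which rounds the result of every arithmetic operation to a chosen number of significant digits. The sketch below is an illustration only; the function name `solve_2x2` is an assumption, and the module's default rounding mode (round half to even) is taken as a stand-in for the "three-digit rounding" of the example.

```python
from decimal import Decimal, localcontext

def solve_2x2(a11, a12, a21, a22, b1, b2, digits=3):
    """Eliminate with pivot a11, then back-substitute, in `digits`-digit arithmetic."""
    with localcontext() as ctx:
        ctx.prec = digits                 # every operation below is rounded
        m21 = a21 / a11
        a22 = a22 - m21 * a12
        b2 = b2 - m21 * b1
        x2 = b2 / a22
        x1 = (b1 - a12 * x2) / a11
    return x1, x2

D = Decimal
# Equations in the original order: the pivot is the small number 0.0001.
no_interchange = solve_2x2(D("0.0001"), D(1), D(1), D(1), D(1), D(2))
# Equations interchanged first: the pivot is 1.
interchanged = solve_2x2(D(1), D(1), D("0.0001"), D(1), D(2), D(1))
print(no_interchange, interchanged)
```

With the small pivot, the multiplier $10^4$ swamps the other entries when results are rounded to three digits, and all information about $x_1$ is lost.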
Example 3.17 illustrates that small values of $a^{(r)}_{rr}$ in the r-th stage lead to
large values of the $m_{ir}$'s and may result in a loss of accuracy. Therefore, we
want the pivots $a^{(r)}_{rr}$ to be large.
Two common pivoting strategies are:
Partial pivoting: In partial pivoting, the $a^{(r)}_{ir}$ for $r \leq i \leq n$, in the
r-th column of $A^{(r)}$, are searched to find the element of largest absolute
value, and row interchanges are made to place that element in the pivot
position.
Full pivoting: In full pivoting, the pivot element is selected as the element
$a^{(r)}_{ij}$, $r \leq i, j \leq n$, of maximum absolute value among all elements of
the $(n - r + 1) \times (n - r + 1)$ submatrix of $A^{(r)}$. This strategy requires row and
column interchanges.
In theory, full pivoting is required in general to assure that the process does
not result in excessive roundoﬀ error. However, partial pivoting is adequate
in most cases. For some classes of matrices, no pivoting strategy is required
for a stable elimination procedure. For example, no pivoting is required for a
real symmetric positive deﬁnite matrix or for a strictly diagonally dominant
matrix [41].
We now present a formal algorithm for Gaussian elimination with partial
pivoting. In reading this algorithm, recall that
\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2\\
&\;\;\vdots\\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n.
\end{aligned}
\]
ALGORITHM 3.3
(Solution of a linear system of equations with Gaussian elimination with par-
tial pivoting and back-substitution)
INPUT: The n by n matrix A and right-hand-side vector b.
OUTPUT: An approximate solution$^5$ x to $Ax = b$.
FOR k = 1, 2, ..., n − 1
    1. Find $\ell$ such that $|a_{\ell k}| = \max_{k \leq j \leq n} |a_{jk}|$ ($k \leq \ell \leq n$).
    2. Interchange row k with row $\ell$:
           $c_j \leftarrow a_{kj}$, $a_{kj} \leftarrow a_{\ell j}$, $a_{\ell j} \leftarrow c_j$ for j = 1, 2, ..., n, and
           $d \leftarrow b_k$, $b_k \leftarrow b_\ell$, $b_\ell \leftarrow d$.
    3. FOR i = k + 1, ..., n
           (a) $m_{ik} \leftarrow a_{ik}/a_{kk}$.
           (b) FOR j = k, k + 1, ..., n
                   $a_{ij} \leftarrow a_{ij} - m_{ik}a_{kj}$.
               END FOR
           (c) $b_i \leftarrow b_i - m_{ik}b_k$.
       END FOR
END FOR
4. Back-substitution:
    (a) $x_n \leftarrow b_n/a_{nn}$ and
    (b) $x_k \leftarrow \left( b_k - \sum_{j=k+1}^{n} a_{kj}x_j \right) / a_{kk}$, for k = n − 1, n − 2, ..., 1.
END ALGORITHM 3.3.
$^5$ approximate because of roundoff error
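A direct transcription of Algorithm 3.3 into Python might look as follows. This is a sketch under stated assumptions: the function name is hypothetical, indices start at 0, ordinary double precision floats are used, and a zero pivot after the search is treated as evidence that the matrix is singular.

```python
def solve_partial_pivot(A, b):
    """Gaussian elimination with partial pivoting and back-substitution."""
    n = len(A)
    A = [[float(x) for x in row] for row in A]   # work on copies
    b = [float(x) for x in b]
    for k in range(n - 1):
        # Step 1: row index of the largest |a_ik| for i = k, ..., n-1.
        l = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[l][k] == 0.0:
            raise ValueError("matrix is singular")
        # Step 2: interchange rows k and l of A and of b.
        if l != k:
            A[k], A[l] = A[l], A[k]
            b[k], b[l] = b[l], b[k]
        # Step 3: eliminate column k below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    if A[n - 1][n - 1] == 0.0:
        raise ValueError("matrix is singular")
    # Step 4: back-substitution.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(A[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (b[k] - s) / A[k][k]
    return x

x = solve_partial_pivot([[1, 2, 3], [4, 5, 6], [7, 8, 10]], [-1, 0, 1])
print(x)
```

Swapping whole rows keeps the code close to the algorithm as stated; as Remark 3.2 notes, production software usually keeps an index vector instead of moving the data.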
REMARK 3.2 In Algorithm 3.3, the computations are arranged “seri-
ally,” that is, they are arranged so each individual addition and multiplica-
tion is done separately. However, it is eﬃcient on modern machines, that
have “pipelined” operations and usually also have more than one processor,
to think of the operations as being done on vectors. Furthermore, we don’t
necessarily need to change entire rows, but just keep track of a set of indices
indicating which rows are interchanged; for large systems, this saves a signif-
icant number of storage and retrieval operations. For views of the Gaussian
elimination process in terms of vector operations, see [16]. For an example of
software that takes account of the way machines are built, see [5].
REMARK 3.3 If U is the upper triangular matrix resulting from
Gaussian elimination with partial pivoting, we have
\[
\det(A) = (-1)^K \det(U) = (-1)^K a^{(1)}_{11}a^{(2)}_{22} \cdots a^{(n)}_{nn},
\]
where K is the number of row interchanges made.
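The formula in Remark 3.3 can be checked with a short routine that counts row interchanges while eliminating. The Python sketch below is an illustration; the name `det` is an assumption, and exact rational arithmetic is used so that the determinant of a singular matrix comes out as exactly zero.

```python
from fractions import Fraction

def det(A):
    """Determinant as (-1)^K times the product of the pivots."""
    n = len(A)
    M = [[Fraction(x) for x in row] for row in A]
    sign = 1                                  # tracks (-1)^K
    for k in range(n - 1):
        l = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[l][k] == 0:
            return Fraction(0)                # no nonzero pivot: A is singular
        if l != k:
            M[k], M[l] = M[l], M[k]
            sign = -sign                      # one more row interchange
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= m * M[k][j]
    result = Fraction(sign)
    for k in range(n):
        result *= M[k][k]
    return result

d1 = det([[1, 2, 3], [4, 5, 6], [7, 8, 10]])   # matrix of Example 3.15
d2 = det([[1, 2, 3], [4, 5, 6], [7, 8, 9]])    # matrix of Example 3.6
print(d1, d2)
```

For the first matrix two interchanges occur, so the sign factor is $+1$ and the product of the pivots already equals the determinant found in Example 3.15.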
3.2.5 Systems with a Special Structure
We now consider some special but commonly encountered kinds of matrices.
3.2.5.1 Symmetric, Positive Deﬁnite Matrices
We ﬁrst characterize positive deﬁnite matrices.
THEOREM 3.5
Let A be a real symmetric $n \times n$ matrix. Then A is positive definite if and
only if there exists an invertible lower triangular matrix L such that $A = LL^T$.
Furthermore, we can choose the diagonal elements of L, $\ell_{ii}$, $1 \leq i \leq n$, to be
positive numbers.
The decomposition with positive $\ell_{ii}$ is called the Cholesky factorization of
A. It can be shown that this decomposition is unique. L can be computed
using a variant of Gaussian elimination. Set $\ell_{11} = \sqrt{a_{11}}$ and $\ell_{j1} = a_{j1}/\sqrt{a_{11}}$
for $2 \leq j \leq n$. (Note that $x^T A x > 0$, and the choice $x = e_j$ implies that
$a_{jj} > 0$.) Then, for $i = 1, 2, 3, \dots, n$, set
\[
\ell_{ii} = \left( a_{ii} - \sum_{k=1}^{i-1} (\ell_{ik})^2 \right)^{1/2},
\]
\[
\ell_{ji} = \frac{1}{\ell_{ii}} \left( a_{ji} - \sum_{k=1}^{i-1} \ell_{ik}\ell_{jk} \right) \quad\text{for } i + 1 \leq j \leq n.
\]
If A is real symmetric and L can be computed in this way, then A is positive definite. (This is an efficient way to show positive definiteness.) To solve $Ax = b$ where A is real symmetric positive definite, L can be formed in this way, and the pair $Ly = b$ and $L^T x = y$ can be solved for x, analogously to the way we use the LU decomposition to solve a system.
Note: The multiplication and division count for Cholesky decomposition is $n^3/6 + O(n^2)$. Thus, for large n, about half as many multiplications and divisions are required as for standard Gaussian elimination.
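The formulas for $\ell_{ii}$ and $\ell_{ji}$ translate almost line for line into matlab. The following is a minimal sketch under the assumption that A is stored as a full symmetric matrix; it is not a substitute for the built-in chol. The small test matrix is chosen only for illustration, and the error message is our own wording.

```matlab
% Cholesky factorization A = L*L' computed column by column
% from the formulas for l_ii and l_ji.
A = [2 -1 0; -1 2 -1; 0 -1 2];     % illustrative symmetric matrix
n = size(A,1);
L = zeros(n);
for i = 1:n
    s = A(i,i) - sum(L(i,1:i-1).^2);
    if s <= 0
        % The square root would fail: A is not positive definite.
        error('Matrix is not positive definite.');
    end
    L(i,i) = sqrt(s);
    for j = i+1:n
        L(j,i) = (A(j,i) - L(i,1:i-1)*L(j,1:i-1)') / L(i,i);
    end
end
factor_error = norm(L*L' - A, inf);   % should be at roundoff level
```

For i = 1 the index range 1:i-1 is empty, so the sums vanish and the loop reproduces $\ell_{11} = \sqrt{a_{11}}$ and $\ell_{j1} = a_{j1}/\sqrt{a_{11}}$. Reaching the end without the error being raised is the practical test for positive definiteness mentioned above.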
Example 3.18
Consider solving approximately
$$x''(t) = -\sin(\pi t), \qquad x(0) = x(1) = 0.$$
One technique of approximately solving this equation is to replace $x''$ in the differential equation by
$$x''(t) \approx \frac{x(t+h) - 2x(t) + x(t-h)}{h^2}. \tag{3.6}$$
If we subdivide the interval [0, 1] into four subintervals, then the end points of these subintervals are $t_0 = 0$, $t_1 = 1/4$, $t_2 = 1/2$, $t_3 = 3/4$, and $t_4 = 1$. If we require the approximate differential equation with $x''$ replaced using (3.6) to be exact at $t_1$, $t_2$, and $t_3$ and take $h = 1/4$ to be the length of a subinterval, we obtain:
$$\text{at } t_1 = \tfrac{1}{4}: \quad \frac{x_2 - 2x_1 + x_0}{1/16} = -\sin(\pi/4),$$
$$\text{at } t_2 = \tfrac{1}{2}: \quad \frac{x_3 - 2x_2 + x_1}{1/16} = -\sin(\pi/2),$$
$$\text{at } t_3 = \tfrac{3}{4}: \quad \frac{x_4 - 2x_3 + x_2}{1/16} = -\sin(3\pi/4),$$
with $t_k = k/4$, $k = 0, 1, 2, 3, 4$. If we plug in $x_0 = 0$, $x_4 = 0$, we multiply both sides of each of these three equations by $-h^2 = -1/16$, and we write the equations in matrix form, we obtain
$$\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \frac{1}{16} \begin{pmatrix} \sin(\pi/4) \\ \sin(\pi/2) \\ \sin(3\pi/4) \end{pmatrix}.$$
The matrix for this system is symmetric. There is a matlab function chol
that performs a Cholesky factorization. We use it as follows:
>> A = [2 -1 0
-1 2 -1
0 -1 2]
A =
2 -1 0
-1 2 -1
0 -1 2
>> b = (1/16)*[sin(pi/4); sin(pi/2); sin(3*pi/4)]
b =
0.0442
0.0625
0.0442
>> L = chol(A)'
L =
1.4142 0 0
-0.7071 1.2247 0
0 -0.8165 1.1547
>> L*L'-A
ans =
1.0e-015 *
0.4441 0 0
0 -0.4441 0
0 0 0
>> y = L\b
y =
0.0312
0.0691
0.0871
>> x = L'\y
x =
0.0754
0.1067
0.0754
>> A\b
ans =
0.0754
0.1067
0.0754
>>
3.2.5.2 Tridiagonal Matrices
A tridiagonal matrix is a matrix of the form
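$$A = \begin{pmatrix}
a_1 & c_1 & 0 & \cdots & \cdots & 0 \\
b_2 & a_2 & c_2 & 0 & \cdots & 0 \\
0 & b_3 & a_3 & c_3 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & b_{n-1} & a_{n-1} & c_{n-1} \\
0 & \cdots & \cdots & 0 & b_n & a_n
\end{pmatrix}.$$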
For example, the matrix from Example 3.18 is tridiagonal. In many cases important in applications, A can be decomposed into a product of two bidiagonal matrices, that is,
$$A = LU = \begin{pmatrix}
\alpha_1 & 0 & \cdots & 0 \\
b_2 & \alpha_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & b_n & \alpha_n
\end{pmatrix}
\begin{pmatrix}
1 & \gamma_1 & \cdots & 0 \\
0 & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \gamma_{n-1} \\
0 & \cdots & 0 & 1
\end{pmatrix}. \tag{3.7}$$
In such cases, multiplying the matrices on the right of (3.7) together and
equating the resulting matrix entries with corresponding entries of A gives
the following variant of Gaussian elimination:
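$$\begin{aligned}
\alpha_1 &= a_1, \\
\gamma_1 &= c_1/\alpha_1, \\
\alpha_i &= a_i - b_i \gamma_{i-1}, \quad \gamma_i = c_i/\alpha_i \quad \text{for } i = 2, \ldots, n-1, \\
\alpha_n &= a_n - b_n \gamma_{n-1}.
\end{aligned} \tag{3.8}$$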
Thus, if $\alpha_i \ne 0$, $1 \le i \le n$, we can compute the decomposition (3.7). Furthermore, we can compute the solution to $Ax = f = (f_1, f_2, \ldots, f_n)^T$ by successively solving $Ly = f$ and $Ux = y$, i.e.,
$$\begin{aligned}
y_1 &= f_1/\alpha_1, \\
y_i &= (f_i - b_i y_{i-1})/\alpha_i \quad \text{for } i = 2, 3, \ldots, n, \\
x_n &= y_n, \\
x_j &= y_j - \gamma_j x_{j+1} \quad \text{for } j = n-1, n-2, \ldots, 1.
\end{aligned} \tag{3.9}$$
Suﬃcient conditions to guarantee the decomposition (3.7) are as follows.
THEOREM 3.6
Suppose the elements $a_i$, $b_i$, and $c_i$ of A satisfy $|a_1| > |c_1| > 0$, $|a_i| \ge |b_i| + |c_i|$, and $b_i c_i \ne 0$ for $2 \le i \le n-1$, and suppose $|a_n| > |b_n| > 0$. Then A is invertible and the $\alpha_i$'s are nonzero. (Consequently, the factorization (3.7) is possible.)
Note: It can be verified that solution of a linear system having tridiagonal coefficient matrix using (3.8) and (3.9) requires $(5n - 4)$ multiplications and divisions and $3(n - 1)$ additions and subtractions. (Recall that we need $n^3/3 + O(n^2)$ multiplications and divisions for Gaussian elimination.) Storage requirements are also drastically reduced to $3n$ locations, versus $n^2$ for a full matrix.
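The recurrences (3.8) and (3.9) can be sketched in matlab as follows, storing only the three diagonals as vectors. The variable names (a, b, c, f, alpha, gam) are our own choices, and the data are those of the 3 by 3 system in Example 3.18; the full matrix is formed at the end only to check the result.

```matlab
% Tridiagonal solve by (3.8) and (3.9), storing only three diagonals.
n = 3;
a = 2*ones(n,1);              % main diagonal a_1..a_n
b = [0; -ones(n-1,1)];        % subdiagonal b_2..b_n (b(1) is unused)
c = [-ones(n-1,1); 0];        % superdiagonal c_1..c_{n-1} (c(n) is unused)
f = (1/16)*[sin(pi/4); sin(pi/2); sin(3*pi/4)];

% Factorization (3.8)
alpha = zeros(n,1);
gam = zeros(n-1,1);
alpha(1) = a(1);
for i = 2:n
    gam(i-1) = c(i-1)/alpha(i-1);
    alpha(i) = a(i) - b(i)*gam(i-1);
end

% Forward and back substitution (3.9)
y = zeros(n,1);
y(1) = f(1)/alpha(1);
for i = 2:n
    y(i) = (f(i) - b(i)*y(i-1))/alpha(i);
end
x = zeros(n,1);
x(n) = y(n);
for j = n-1:-1:1
    x(j) = y(j) - gam(j)*x(j+1);
end

% Check against the full matrix (for illustration only)
Afull = diag(a) + diag(b(2:n),-1) + diag(c(1:n-1),1);
residual = norm(Afull*x - f, inf);
```

Each loop runs once over the unknowns, which is the source of the operation count proportional to n.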
Example 3.19
The matrix from Example 3.18 is tridiagonal, and satisfies the conditions in Theorem 3.6. This holds true if we form the linear system of equations in the same way as in Example 3.18, regardless of how small we make h, and how large the resulting system is. Thus, we may solve such systems with the forward substitution and back substitution algorithms represented by (3.8) and (3.9). If we want less truncation error in the approximation to the differential equation, we need to solve a larger system (with h smaller). It is more practical to do so with (3.8) and (3.9) than with the general Gaussian elimination algorithm, since the amount of work the computer has to do is proportional to $n$, rather than $n^3$.
3.2.5.3 Block Tridiagonal Matrices
We now consider brieﬂy block tridiagonal matrices, that is, matrices of the
form
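$$A = \begin{pmatrix}
A_1 & C_1 & 0 & \cdots & \cdots & 0 \\
B_2 & A_2 & C_2 & 0 & \cdots & 0 \\
0 & B_3 & A_3 & C_3 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & B_{n-1} & A_{n-1} & C_{n-1} \\
0 & \cdots & \cdots & 0 & B_n & A_n
\end{pmatrix},$$
where $A_i$, $B_i$, and $C_i$ are $m \times m$ matrices. Analogous to the tridiagonal case, we construct a factorization of the form
$$A = \begin{pmatrix}
\hat{A}_1 & 0 & \cdots & 0 \\
B_2 & \hat{A}_2 & \cdots & 0 \\
0 & \ddots & \ddots & \vdots \\
0 & \cdots & B_n & \hat{A}_n
\end{pmatrix}
\begin{pmatrix}
I & E_1 & \cdots & 0 \\
0 & I & \ddots & 0 \\
0 & \ddots & \ddots & E_{n-1} \\
0 & \cdots & 0 & I
\end{pmatrix}.$$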
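Provided the $\hat{A}_i$, $1 \le i \le n$, are nonsingular, we can compute:
$$\begin{aligned}
\hat{A}_1 &= A_1, \\
E_1 &= \hat{A}_1^{-1} C_1, \\
\hat{A}_i &= A_i - B_i E_{i-1} \quad \text{for } 2 \le i \le n, \\
E_i &= \hat{A}_i^{-1} C_i \quad \text{for } 2 \le i \le n-1.
\end{aligned}$$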
For efficiency, the $\hat{A}_i^{-1}$ are generally not computed, but instead, the columns of $E_i$ are computed by factoring $\hat{A}_i$ and solving a pair of triangular systems. That is, $\hat{A}_i E_i = C_i$ with $\hat{A}_i = L_i U_i$ becomes $L_i U_i E_i = C_i$.
Note: The number of operations for computing a block factorization of a block tridiagonal system is proportional to $nm^3$. This is significantly less than the number of operations, proportional to $(nm)^3$, for completing the general Gaussian elimination algorithm on the full $nm \times nm$ system, for m small relative to n. In such cases, tremendous savings are achieved by taking advantage of the zero elements.
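Now consider
$$Ax = b, \qquad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \qquad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix},$$
where $x_i, b_i \in \mathbb{R}^m$. Then, with the factorization $A = LU$, $Ax = b$ can be solved as follows: $Ly = b$, $Ux = y$, with
$$\begin{aligned}
\hat{A}_1 y_1 &= b_1, \\
\hat{A}_i y_i &= (b_i - B_i y_{i-1}) \quad \text{for } i = 2, \ldots, n, \\
x_n &= y_n, \\
x_j &= y_j - E_j x_{j+1} \quad \text{for } j = n-1, \ldots, 1.
\end{aligned}$$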
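The block recurrences can be sketched in matlab with cell arrays holding the blocks. The blocks below (n = 3 blocks of size m = 2) are made-up values chosen only for illustration, the cell-array names are our own, and the backslash operator stands in for "factor $\hat{A}_i$ and solve a pair of triangular systems."

```matlab
% Block tridiagonal solve: n = 3 diagonal blocks, each m-by-m with m = 2.
m = 2; n = 3;
Ad = {[4 1; 1 4], [4 1; 1 4], [4 1; 1 4]};   % diagonal blocks A_1..A_n
Bd = {[], -eye(m), -eye(m)};                 % B_2..B_n (Bd{1} is unused)
Cd = {-eye(m), -eye(m), []};                 % C_1..C_{n-1} (Cd{n} is unused)
b  = {[1; 2], [3; 4], [5; 6]};               % block right-hand side

% Block factorization: Ahat_1 = A_1, E_i = Ahat_i \ C_i,
% Ahat_{i+1} = A_{i+1} - B_{i+1}*E_i.
Ahat = cell(1,n); E = cell(1,n-1);
Ahat{1} = Ad{1};
for i = 1:n-1
    E{i} = Ahat{i} \ Cd{i};          % solves Ahat_i * E_i = C_i
    Ahat{i+1} = Ad{i+1} - Bd{i+1}*E{i};
end

% Forward substitution L*y = b, then back substitution U*x = y
y = cell(1,n); x = cell(1,n);
y{1} = Ahat{1} \ b{1};
for i = 2:n
    y{i} = Ahat{i} \ (b{i} - Bd{i}*y{i-1});
end
x{n} = y{n};
for j = n-1:-1:1
    x{j} = y{j} - E{j}*x{j+1};
end

% Check against the assembled full matrix (for illustration only)
Afull = [Ad{1} Cd{1} zeros(m); Bd{2} Ad{2} Cd{2}; zeros(m) Bd{3} Ad{3}];
residual = norm(Afull*[x{1}; x{2}; x{3}] - [b{1}; b{2}; b{3}], inf);
```

In a production code one would factor each $\hat{A}_i$ once and reuse the factors for both $E_i$ and $y_i$; the sketch solves twice for brevity.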
Block tridiagonal systems arise in various applications, such as in equilibrium models for diffusion processes in two and three variables, a simple prototype of which is the equation
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -f(x, y),$$
when we approximate the partial derivatives in a manner similar to how we approximated $x''$ in Example 3.18. In that case, not only is the overall system block tridiagonal, but, depending on how we order the equations and variables, the individual matrices $A_i$, $B_i$, and $C_i$ are tridiagonal, or contain mostly zeros. Taking advantage of these facts is absolutely necessary to achieve the desired accuracy in the approximation to the solutions of certain models.
3.2.5.4 Banded Matrices
A generalization of a tridiagonal matrix arising in many applications is a
banded matrix. Such matrices have non-zero elements only on the diagonal
and p entries above and below the diagonal. For example, p = 1 for a tridi-
agonal matrix. The number p is called the semi-bandwidth of the matrix.
Example 3.20
$$\begin{pmatrix}
3 & -1 & 1 & 0 & 0 \\
-1 & 3 & -1 & 1.1 & 0 \\
0.9 & -1 & 3 & -1 & 1.1 \\
0 & 1.1 & -1 & 3 & -1 \\
0 & 0 & 0.9 & -1 & 3
\end{pmatrix}$$
is a banded matrix with semi-bandwidth equal to 2.
Provided Gaussian elimination without pivoting is applicable, banded matrices may be stored and solved analogously to tridiagonal matrices. In particular, we may store the matrix in $2p + 1$ vectors, and we may use an algorithm similar to (3.8) and (3.9), based on the general Gaussian elimination algorithm (Algorithm 3.1 on page 83), but with the loop on i having an upper bound equal to $\min\{k + p, n\}$, rather than n, and with the $a_{i,j}$ replaced by appropriate references to the n by $2p + 1$ matrix in which the non-zero entries are stored. It is advantageous to handle a matrix as a banded matrix when its dimension n is large relative to p.
3.2.5.5 General Sparse Matrices
Numerous applications, such as models of communications and transporta-
tion networks, give rise to matrices most of whose elements are zero, but do
not have an easily usable structure such as a block or banded structure. Ma-
trices most of whose elements are zero are called sparse matrices. Matrices
that are not sparse are called dense or full . Special, more sophisticated vari-
ants of Gaussian elimination, as well as iterative methods, which we treat
later in Section 3.5, may be used for sparse matrices.
Several different schemes are used to store sparse matrices. One such scheme is to store two integer vectors r and c and one floating point vector v, such that the number of entries in r, c, and v is the total number of non-zero elements in the matrix; $r_i$ gives the row index of the i-th non-zero element, $c_i$ gives the corresponding column index, and $v_i$ gives the value.
Example 3.21
$$\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
-3 & 0 & 0 & 0 & 1 \\
-2 & -1 & 0 & 1.1 & 0 \\
0 & 0 & 0 & 5 & -1 \\
7 & -8 & 0 & 0 & 0
\end{pmatrix}$$
may be stored with the vectors
$$r = (2, 3, 5, 3, 5, 1, 3, 4, 2, 4)^T, \quad c = (1, 1, 1, 2, 2, 3, 4, 4, 5, 5)^T, \quad \text{and} \quad v = (-3, -2, 7, -1, -8, 1, 1.1, 5, 1, -1)^T.$$
Note that there are 25 entries in this matrix, but only 10 nonzero entries.
There is a question concerning whether or not a particular matrix should
be considered to be sparse, rather than treated as dense. In particular, if the
matrix has some elements that are zero, but many are not, it may be more
eﬃcient to treat the matrix as dense. This is because there is extra overhead
in the algorithms used to solve the systems with matrices that are stored as
sparse, and the elimination process can cause ﬁll-in, the introduction of non-
zeros in the transformed matrix into elements that were zero in the original
matrix. Whether a matrix should be considered to be sparse or not depends
on the application, the type of computer used to solve the system, etc. Sparse
systems that have a banded or block structure are more eﬃciently treated
with special algorithms for banded or block systems than with algorithms for
general sparse matrices.
There is extensive support for sparse matrices in matlab. This is detailed in
matlab’s help system. One method of describing a sparse matrix in matlab
is as we have done in Example 3.21.
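As a brief sketch of this, the vectors r, c, and v of Example 3.21 can be passed directly to matlab's sparse function, and full and find convert back. The variable names S, F, r2, c2, and v2 are our own.

```matlab
% Build the matrix of Example 3.21 from its (row, column, value) triples.
r = [2 3 5 3 5 1 3 4 2 4];
c = [1 1 1 2 2 3 4 4 5 5];
v = [-3 -2 7 -1 -8 1 1.1 5 1 -1];
S = sparse(r, c, v, 5, 5);   % 5-by-5 sparse matrix
F = full(S);                 % the same matrix in dense storage
[r2, c2, v2] = find(S);      % recovers the triples, column by column
number_of_nonzeros = nnz(S);
```

Since the triples in Example 3.21 are listed column by column, find returns them in the same order in which they were supplied.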
3.3 Roundoﬀ Error and Conditioning
On page 17, we deﬁned the condition number of a function in terms of the
ratio of the relative error in the function value to the relative error in its argu-
ment. Also, in Example 1.17 on page 16, we saw that one way of computing
a quantity can lead to a large relative error in the result, while another way
leads to an accurate result; that is, one algorithm can be numerically unstable
while another is stable.
Similar concepts hold for solutions to systems of linear equations. For ex-
ample, Example 3.17 on page 90 illustrated that Gaussian elimination without
partial pivoting can be numerically unstable for a system of equations where
Gaussian elimination with partial pivoting is stable. We also have a concept of
condition number of a matrix, which relates the relative change of components
of x to changes in the elements of the matrix A and right-hand-side vector
b for the system. To understand the most commonly used type of condition
number of a matrix, we introduce norms.
3.3.1 Norms
We use norms to describe errors in vectors and convergence of sequences of
vectors.
DEFINITION 3.15 A function that assigns a non-negative real number $\|v\|$ to a vector v is called a norm, provided it has the following properties.
1. $\|u\| \ge 0$.
2. $\|u\| = 0$ if and only if $u = 0$.
3. $\|\lambda u\| = |\lambda| \, \|u\|$ for $\lambda \in \mathbb{R}$ (or $\lambda \in \mathbb{C}$ if v is a complex vector).
4. $\|u + v\| \le \|u\| + \|v\|$ (triangle inequality).
Consider $V = \mathbb{C}^n$, the vector space of n-tuples of complex numbers. Note that $x \in \mathbb{C}^n$ has the form $x = (x_1, x_2, \ldots, x_n)^T$. Also,
$$x + y = (x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n)^T$$
and
$$\lambda x = (\lambda x_1, \lambda x_2, \ldots, \lambda x_n)^T.$$
Important norms on $\mathbb{C}^n$ are:
(a) $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$: the $\ell_\infty$ or max norm (for $z = a + ib$, $|z| = \sqrt{a^2 + b^2} = \sqrt{z \bar{z}}$.)
(b) $\|x\|_1 = \sum_{i=1}^{n} |x_i|$: the $\ell_1$ norm
(c) $\|x\|_2 = \Bigl( \sum_{i=1}^{n} |x_i|^2 \Bigr)^{1/2}$: the $\ell_2$ norm (Euclidean norm)
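(d) Scaled versions of the above norms, where we define $\|v\|_a = \|(a_1 v_1, a_2 v_2, \ldots, a_n v_n)^T\|$, that is, each component of v is scaled by the corresponding component of $a = (a_1, a_2, \ldots, a_n)^T$, with $a_i > 0$ for $1 \le i \le n$.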
A useful property relating the Euclidean norm and the dot product is:
THEOREM 3.7
(the Cauchy–Schwarz inequality)
$$|v \circ w| = |v^T w| \le \|v\|_2 \, \|w\|_2.$$
We now introduce a concept and notation for describing errors in compu-
tations involving vectors.
DEFINITION 3.16 The distance from u to v is defined as $\|u - v\|$.
The following concept and associated theorem are worth keeping in mind,
since they hint that, in many cases, it is not so important from the point of
view of size of the error, which norm we choose to describe the error in a
vector.
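DEFINITION 3.17 Two norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ are called equivalent if there exist positive constants $c_1$ and $c_2$ such that, for all x,
$$c_1 \|x\|_\alpha \le \|x\|_\beta \le c_2 \|x\|_\alpha.$$
Hence, also,
$$\frac{1}{c_2} \|x\|_\beta \le \|x\|_\alpha \le \frac{1}{c_1} \|x\|_\beta.$$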
THEOREM 3.8
Any two norms on $\mathbb{C}^n$ are equivalent.
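The following are the constants associated with the 1-, 2-, and ∞-norms:
$$\begin{aligned}
&\text{(a)} \quad \|x\|_\infty \le \|x\|_2 \le \sqrt{n} \, \|x\|_\infty, \\
&\text{(b)} \quad \frac{1}{\sqrt{n}} \|x\|_1 \le \|x\|_2 \le \|x\|_1, \\
&\text{(c)} \quad \frac{1}{n} \|x\|_1 \le \|x\|_\infty \le \|x\|_1.
\end{aligned} \tag{3.10}$$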
The above relations are sharp in the sense that vectors can be found for which the inequalities are actually equalities. Thus, in a sense, the 1-, 2-, and ∞-norms of vectors become "less equivalent," the larger the dimension n of the vector space.
Example 3.22
The matlab function norm computes norms of vectors. Consider the follow-
ing dialog.
>> x = [1;1;1;1;1]
x =
1
1
1
1
1
>> norm(x,1)
ans =
5
>> norm(x,2)
ans =
2.2361
>> norm(x,inf)
ans =
1
>> n=1000;
>> for i=1:n;x(i)=1;end;
>> norm(x,1)
ans =
1000
>> norm(x,2)
ans =
31.6228
>> norm(x,inf)
ans =
1
>>
This illustrates that, for a vector all of whose entries are equal to 1, the second
inequality in (3.10)(a) is an equation, the ﬁrst inequality in (b) is an equation,
and the ﬁrst inequality in (c) is an equation.
To discuss the condition number of the matrix, we use the concept of the
norm of a matrix. In the following, A and B are arbitrary square matrices
and λ is a complex number.
DEFINITION 3.18 A matrix norm is a real-valued function of A, denoted by $\|\cdot\|$, satisfying:
1. $\|A\| \ge 0$.
2. $\|A\| = 0$ if and only if $A = 0$.
3. $\|\lambda A\| = |\lambda| \, \|A\|$.
4. $\|A + B\| \le \|A\| + \|B\|$.
5. $\|AB\| \le \|A\| \, \|B\|$.
REMARK 3.4 In contrast to vector norms, we have an additional ﬁfth
property, referred to as a submultiplicative property, dealing with the norm
of the product of two matrices.
Example 3.23
The quantity
$$\|A\|_E = \Bigl( \sum_{i,j=1}^{n} |a_{ij}|^2 \Bigr)^{1/2}$$
is called the Frobenius norm. Since the Frobenius norm is the Euclidean
norm of the matrix when the matrix is viewed to be a single vector formed
by concatenating its columns (or rows), the Frobenius norm is a norm. It is
also possible to prove that the Frobenius norm is a matrix norm.
To relate norms of matrices to errors in the solution of linear systems, we
relate vector norms to matrix norms:
DEFINITION 3.19 A matrix norm $\|A\|$ and a vector norm $\|x\|$ are called compatible if for all vectors x and matrices A we have $\|Ax\| \le \|A\| \, \|x\|$.
REMARK 3.5 A consequence of the Cauchy–Schwarz inequality is that $\|Ax\|_2 \le \|A\|_E \, \|x\|_2$, i.e., the Euclidean norm $\|\cdot\|_E$ for matrices is compatible with the $\ell_2$-norm $\|\cdot\|_2$ for vectors.
In fact, every vector norm has associated with it a sharply deﬁned compat-
ible matrix norm:
DEFINITION 3.20 Given a vector norm $\|\cdot\|$, we define a natural or induced matrix norm associated with it as
$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \tag{3.11}$$
It is straightforward to show that an induced matrix norm satisfies the five properties required of a matrix norm. Also, from the definition of induced norm, an induced matrix norm is compatible with the given vector norm, that is,
$$\|A\| \, \|x\| \ge \|Ax\| \quad \text{for all } x \in \mathbb{C}^n. \tag{3.12}$$
REMARK 3.6 Definition 3.20 is equivalent to
$$\|A\| = \sup_{\|y\| = 1} \|Ay\|,$$
since
$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} = \sup_{x \ne 0} \left\| A \frac{x}{\|x\|} \right\| = \sup_{\|y\| = 1} \|Ay\|$$
(letting $y = x/\|x\|$).
We now present explicit expressions for $\|A\|_\infty$, $\|A\|_1$, and $\|A\|_2$.
THEOREM 3.9
(Formulas for common induced matrix norms)
(a) $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$ = {maximum absolute row sum}.
(b) $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|$ = {maximum absolute column sum}.
(c) $\|A\|_2 = \sqrt{\rho(A^H A)}$, where $\rho(M)$ is the spectral radius of the matrix M, that is, the maximum absolute value of an eigenvalue of M.
(We will study eigenvalues and eigenvectors in Chapter 5. This spectral radius plays a fundamental role in a more advanced study of matrix norms. In particular $\rho(A) \le \|A\|$ for any square matrix A and any matrix norm, and, for any square matrix A and any $\epsilon > 0$, there is a matrix norm $\|\cdot\|$ such that $\|A\| \le \rho(A) + \epsilon$.)
Note that $\|A\|_2$ is not equal to the Frobenius norm.
Example 3.24
The norm function in matlab gives the induced matrix norm when its ar-
gument is a matrix. With the matrix A as in Example 3.13 (on page 81),
consider the following matlab dialog (edited for brevity):
>> A
A =
1 2 3
4 5 6
7 8 10
>> x'
ans = 1 1 1
>> norm(A,1)
ans = 19
>> norm(x,1)
ans = 3
>> norm(A*x,1)
ans = 46
>> norm(A,1)*norm(x,1)
ans = 57
>> norm(A,2)
ans = 17.4125
>> norm(x,2)
ans = 1.7321
>> norm(A*x,2)
ans = 29.7658
>> norm(A,2)*norm(x,2)
ans = 30.1593
>> norm(A,inf)
ans = 25
>> norm(x,inf)
ans = 1
>> norm(A*x,inf)
ans = 25
>> norm(A,inf)*norm(x,inf)
ans = 25
>>
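The formulas of Theorem 3.9 can also be evaluated directly and compared with the values returned by norm. The following sketch uses the matrix A of Example 3.24; the variable names are our own, and the Frobenius norm is included only to illustrate that it differs from $\|A\|_2$.

```matlab
% Induced matrix norms computed from the formulas of Theorem 3.9.
A = [1 2 3; 4 5 6; 7 8 10];
row_sum_norm  = max(sum(abs(A), 2));          % (a): maximum absolute row sum
col_sum_norm  = max(sum(abs(A), 1));          % (b): maximum absolute column sum
spectral_norm = sqrt(max(abs(eig(A'*A))));    % (c): sqrt of spectral radius of A^H*A
frobenius_norm = sqrt(sum(abs(A(:)).^2));     % Frobenius norm, for comparison
differences = [row_sum_norm  - norm(A,inf), ...
               col_sum_norm  - norm(A,1), ...
               spectral_norm - norm(A,2)];
```

The entries of differences should be zero up to roundoff, while frobenius_norm is slightly larger than spectral_norm for this matrix.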
We are now prepared to discuss condition numbers of matrices.
3.3.2 Condition Numbers
We begin with the following:
DEFINITION 3.21 If the solution x of Ax = b changes drastically when
A or b is perturbed slightly, then the system Ax = b is called ill-conditioned.
Because rounding errors are unavoidable with floating point arithmetic, much accuracy can be lost during Gaussian elimination for ill-conditioned systems. In fact, the final computed solution may be considerably different from the exact solution.
Example 3.25
An ill-conditioned system is
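$$Ax = \begin{pmatrix} 1 & 0.99 \\ 0.99 & 0.98 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1.99 \\ 1.97 \end{pmatrix}, \quad \text{whose exact solution is } x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
However,
$$Ax = \begin{pmatrix} 1.989903 \\ 1.970106 \end{pmatrix} \quad \text{has solution } x = \begin{pmatrix} 3 \\ -1.0203 \end{pmatrix}.$$
Thus, a change of
$$\delta b = \begin{pmatrix} -0.000097 \\ 0.000106 \end{pmatrix} \quad \text{produces a change } \delta x = \begin{pmatrix} 2.0000 \\ -2.0203 \end{pmatrix}.$$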
We ﬁrst study the phenomenon of ill-conditioning, then study roundoﬀ error
in Gaussian elimination. We begin with
THEOREM 3.10
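Let $\|\cdot\|_\beta$ be an induced matrix norm. Let x be the solution of $Ax = b$ with A an $n \times n$ invertible complex matrix. Let $x + \delta x$ be the solution of
$$(A + \delta A)(x + \delta x) = b + \delta b. \tag{3.13}$$
Assume that
$$\|\delta A\|_\beta \, \|A^{-1}\|_\beta < 1. \tag{3.14}$$
Then
$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \kappa_\beta(A) \bigl( 1 - \|\delta A\|_\beta \, \|A^{-1}\|_\beta \bigr)^{-1} \left( \frac{\|\delta b\|_\beta}{\|b\|_\beta} + \frac{\|\delta A\|_\beta}{\|A\|_\beta} \right), \tag{3.15}$$
where
$$\kappa_\beta(A) = \|A\|_\beta \, \|A^{-1}\|_\beta$$
is defined to be the condition number of the matrix A with respect to norm $\|\cdot\|_\beta$. There exist perturbations $\delta x$ and $\delta b$ for which (3.15) holds with equality. That is, inequality (3.15) is sharp.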
(We supply a proof of Theorem 3.10 in [1].)
The condition number $\kappa_\beta(A) \ge 1$ for any induced matrix norm and any matrix A, since
$$1 = \|I\|_\beta = \|A^{-1} A\|_\beta \le \|A^{-1}\|_\beta \, \|A\|_\beta = \kappa_\beta(A).$$
Example 3.26
Consider the system of equations from Example 3.25, and the following mat-
lab dialog.
>> A = [1 0.99
0.99 0.98]
A =
1.0000 0.9900
0.9900 0.9800
>> norm(A,1)*norm(inv(A),1)
ans =
3.9601e+004
>> b = [1.99;1.97]
b =
1.9900
1.9700
>> x = A\b
x =
1.0000
1.0000
>> btilde = [1.989903;1.980106]
btilde =
1.9899
1.9801
>> xtilde = A\btilde
xtilde =
102.0000
-101.0203
>> norm(x-xtilde,1)/norm(x,1)
ans =
101.5102
>> sol_error = norm(x-xtilde,1)/norm(x,1)
sol_error =
101.5102
>> data_error = norm(b-btilde,1)/norm(b,1)
data_error =
0.0026
>> data_error * cond(A,1)
ans =
102.0326
>> cond(A,1)
ans =
3.9601e+004
>> cond(A,2)
ans =
3.9206e+004
>> cond(A,inf)
ans =
3.9601e+004
>>
This illustrates the deﬁnition of the condition number, as well as the fact that
the relative error in the norm of the solution can be estimated by the relative
error in the norms of the matrix and the right-hand-side vector multiplied
by the condition number of the matrix. Also, in this two-dimensional case,
the condition numbers in the 1-, 2-, and ∞-norms do not diﬀer by much.
The actual errors in the computed solutions A\b and A\btilde are small relative to the displayed digits, in this case.
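If $\delta A = 0$, we have
$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \kappa_\beta(A) \frac{\|\delta b\|_\beta}{\|b\|_\beta},$$
and if $\delta b = 0$, then
$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \frac{\kappa_\beta(A)}{1 - \|A^{-1}\|_\beta \, \|\delta A\|_\beta} \, \frac{\|\delta A\|_\beta}{\|A\|_\beta}.$$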
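The bound for the case $\delta A = 0$ can be checked numerically with the data of Example 3.25. This is only a sketch: the ∞-norm is our choice of norm, and the variable names are our own.

```matlab
% Check ||dx||/||x|| <= kappa(A)*||db||/||b|| for the system of Example 3.25.
A  = [1 0.99; 0.99 0.98];
b  = [1.99; 1.97];
db = [-0.000097; 0.000106];        % perturbation of the right-hand side
x  = A\b;                          % approximately [1; 1]
dx = A\(b + db) - x;               % approximately [2; -2.0203]
rel_change = norm(dx,inf)/norm(x,inf);
bound = cond(A,inf)*norm(db,inf)/norm(b,inf);
```

Here rel_change is about 2.02 and bound is about 2.11, so for this particular perturbation the bound is nearly attained.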
Note: In solving systems using Gaussian elimination with partial pivoting, we can use the condition number as a rule of thumb in estimating the number of digits correct in the solution. For example, if double precision arithmetic is used, errors in storing the matrix into internal binary format and in each step of the Gaussian elimination process are on the order of $10^{-16}$. If the condition number is $10^{4}$, then we might expect 16 − 4 = 12 digits to be correct in the solution. In many cases, this is close. (For more foolproof bounds on the error, interval arithmetic techniques can sometimes be used.)
Note: For a unitary matrix U, i.e., $U^H U = I$, we have $\kappa_2(U) = 1$. Such a matrix is called perfectly conditioned, since $\kappa_\beta(A) \ge 1$ for any β and A.
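A classic example of an ill-conditioned matrix is the Hilbert matrix of order n:
$$H_n = \begin{pmatrix}
1 & \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n} \\
\frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \cdots & \frac{1}{n+1} \\
\vdots & \vdots & \vdots & & \vdots \\
\frac{1}{n} & \frac{1}{n+1} & \frac{1}{n+2} & \cdots & \frac{1}{2n-1}
\end{pmatrix}.$$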
Hilbert matrices and matrices that are approximately Hilbert matrices occur in approximation of data and functions. Condition numbers for some Hilbert matrices appear in Table 3.1.
TABLE 3.1: Condition numbers of some Hilbert matrices
n               3            5            6             8             16              32              64
$\kappa_2(H_n)$   $5 \times 10^{2}$   $5 \times 10^{5}$   $15 \times 10^{6}$   $15 \times 10^{9}$   $2.0 \times 10^{22}$   $4.8 \times 10^{46}$   $3.5 \times 10^{95}$
The reader may verify entries in this table, using the following matlab dialog as an example.
>> hilb(3)
ans =
1.0000 0.5000 0.3333
0.5000 0.3333 0.2500
0.3333 0.2500 0.2000
>> cond(hilb(3))
ans =
524.0568
>> cond(hilb(3),2)
ans =
524.0568
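The first few entries of Table 3.1 can be reproduced in a loop, as in the following sketch (the variable names ns and kappa are our own). We restrict attention to small n: for n = 16 and larger, the Hilbert matrix is so ill-conditioned that a condition number computed in double precision arithmetic cannot itself be trusted, so those table entries cannot be confirmed this way.

```matlab
% 2-norm condition numbers of small Hilbert matrices.
ns = [3 5 6 8];
kappa = zeros(size(ns));
for idx = 1:numel(ns)
    kappa(idx) = cond(hilb(ns(idx)));    % cond uses the 2-norm by default
    fprintf('n = %2d   cond_2(H_n) = %.4e\n', ns(idx), kappa(idx));
end
```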
REMARK 3.7 Consider $Ax = b$. Ill-conditioning combined with rounding errors can have a disastrous effect in Gaussian elimination. Sometimes, the conditioning can be improved (κ decreased) by scaling the equations. A common scaling strategy is to row equilibrate the matrix A by choosing a diagonal matrix D, such that premultiplying A by D causes $\max_{1 \le j \le n} |a_{ij}| = 1$ for $i = 1, 2, \ldots, n$. Thus, $DAx = Db$ becomes the scaled system with maximum elements in each row of DA equal to unity. (This procedure is generally recommended before Gaussian elimination with partial pivoting is employed [19]. However, there is no guarantee that equilibration with partial pivoting will not suffer greatly from effects of roundoff error.)
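Row equilibration takes only a few lines in matlab. In this sketch the matrix is a made-up example with one badly scaled row, and the variable names are our own.

```matlab
% Row equilibration: scale each row so its largest entry has magnitude 1.
A = [1e10 1e10; -1 1];             % illustrative matrix with a badly scaled row
d = 1 ./ max(abs(A), [], 2);       % reciprocal of the largest entry in each row
D = diag(d);
DA = D*A;                          % scaled matrix, here close to [1 1; -1 1]
cond_before = cond(A);
cond_after  = cond(DA);
row_maxima  = max(abs(DA), [], 2); % each should equal 1 up to roundoff
```

For this matrix the condition number drops from about $10^{10}$ to about 1. The right-hand side must be scaled by the same D before solving.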
Example 3.27
The condition number does not give the entire story in Gaussian elimination.
In particular, if we multiply an entire equation by a non-zero number, this
changes the condition number of the matrix, but does not have an eﬀect on
Gaussian elimination. Consider the following matlab dialog.
>> A = [1 1
-1 1]
A =
1 1
-1 1
>> cond(A)
ans =
1.0000
>> A(1,:) = 1e16*A(1,:)
A =
1.0e+016 *
1.0000 1.0000
-0.0000 0.0000
>> cond(A)
ans =
1.0000e+016
>>
However, the strange scaling in the ﬁrst row of the matrix will not cause
serious roundoﬀ error when Gaussian elimination proceeds with ﬂoating point
arithmetic, if the right-hand-sides are scaled accordingly.
3.3.3 Roundoﬀ Error in Gaussian Elimination
Consider the solution of $Ax = b$. On a computer, elements of A and b are represented by floating point numbers. Solving this linear system on a computer only produces an approximate solution $\hat{x}$.
There are two kinds of rounding error analysis. In backward error analysis, one shows that the computed solution $\hat{x}$ is the exact solution of a perturbed system of the form $(A + F)\hat{x} = b$. (See, for example, [30] or [42].) Since $b = Ax$, we then have
$$Ax - A\hat{x} = F\hat{x},$$
that is,
$$x - \hat{x} = A^{-1} F \hat{x},$$
from which we obtain
$$\frac{\|x - \hat{x}\|_\infty}{\|\hat{x}\|_\infty} \le \|A^{-1}\|_\infty \, \|F\|_\infty = \kappa_\infty(A) \frac{\|F\|_\infty}{\|A\|_\infty}. \tag{3.16}$$
Thus, assuming that we have estimates for $\kappa_\infty(A)$ and $\|F\|_\infty$, we can use (3.16) to estimate the error $\|x - \hat{x}\|_\infty$.
In forward error analysis, one keeps track of roundoff error at each step of the elimination procedure. Then, $x - \hat{x}$ is estimated in some norm in terms of, for example, A, $\kappa(A)$, and $\theta = \frac{p}{2} \beta^{1-t}$ [37, 38].
The analyses are lengthy and are not given here. The results, however, are useful to understand. Basically, it is shown that
$$\frac{\|F\|_\infty}{\|A\|_\infty} \le c_n \, g \, \theta, \tag{3.17}$$
where
$c_n$ is a constant that depends on the size of the $n \times n$ matrix A,
g is a growth factor, $g = \dfrac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}$, and
θ is the unit roundoff error, $\theta = \frac{p}{2} \beta^{1-t}$.
Note: Using backward error analysis, $c_n = 1.01 n^3 + 5(n + 1)^2$, and using forward error analysis, $c_n = \frac{1}{6}(n^3 + 15n^2 + 2n - 12)$.
Note: The growth factor g depends on the pivoting strategy: $g \le 2^{n-1}$ for partial pivoting (this bound cannot be improved, since $g = 2^{n-1}$ for certain matrices), while $g \le n^{1/2} \bigl( 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)} \bigr)^{1/2}$ for full pivoting. (Wilkinson conjectured that this can be improved to $g \le n$.) For example, for n = 100, $g \le 2^{99} \approx 10^{30}$ for partial pivoting and $g \le 3300$ for full pivoting.
Note: Thus, by (3.16) and (3.17), the relative error $\|x - \hat{x}\|_\infty / \|\hat{x}\|_\infty$ depends directly on $\kappa_\infty(A)$, θ, $n^3$, and the pivoting strategy.
REMARK 3.8 The factor of $2^{n-1}$ discouraged numerical analysts in the 1950's from using Gaussian elimination, and spurred study of iterative methods for solving linear systems. However, it was found that, for most matrices, the growth factor is much less, and Gaussian elimination with partial pivoting is usually practical.
3.3.4 Interval Bounds
In many instances, it is practical to obtain rigorous bounds on the solution x to a linear system $Ax = b$. The algorithm is a modification of the general Gaussian elimination algorithm (Algorithm 3.1) and back substitution (Algorithm 3.2), as follows.
ALGORITHM 3.4
(Interval bounds for the solution to a linear system)
INPUT: The n by n matrix A and n-vector $b \in \mathbb{R}^n$.
OUTPUT: an interval vector x such that the exact solution to $Ax = b$ must be within the bounds x.
1. Use Algorithm 3.1 and Algorithm 3.2 (that is, Gaussian elimination with back substitution, or any other technique) and floating point arithmetic to compute an approximation Y to $A^{-1}$.
2. Use interval arithmetic, with directed rounding, to compute interval enclosures to YA and Yb. That is,
(a) $\tilde{A} \leftarrow YA$ (computed with interval arithmetic),
(b) $\tilde{b} \leftarrow Yb$ (computed with interval arithmetic).
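3. FOR $k = 1, 2, \ldots, n-1$ (forward phase using interval arithmetic)
FOR $i = k+1, \ldots, n$
(a) $m_{ik} \leftarrow \tilde{a}_{ik} / \tilde{a}_{kk}$.
(b) $\tilde{a}_{ik} \leftarrow [0, 0]$.
(c) FOR $j = k+1, \ldots, n$
$\tilde{a}_{ij} \leftarrow \tilde{a}_{ij} - m_{ik} \tilde{a}_{kj}$.
END FOR
(d) $\tilde{b}_i \leftarrow \tilde{b}_i - m_{ik} \tilde{b}_k$.
END FOR
END FOR
4. $x_n \leftarrow \tilde{b}_n / \tilde{a}_{nn}$.
5. FOR $k = n-1, n-2, \ldots, 1$ (back substitution)
$x_k \leftarrow \Bigl( \tilde{b}_k - \sum_{j=k+1}^{n} \tilde{a}_{kj} x_j \Bigr) / \tilde{a}_{kk}$.
END FOR
END ALGORITHM 3.4.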
Note: We can explicitly set $\tilde{a}_{ik}$ to zero without loss of mathematical rigor, even though, using interval arithmetic, $\tilde{a}_{ik} - m_{ik} \tilde{a}_{kk}$ may not be exactly [0, 0]. In fact, this operation does not even need to be done, since we need not reference $\tilde{a}_{ik}$ in the back substitution process.
Note: Obtaining the rigorous bounds x in Algorithm 3.4 is more costly than computing an approximate solution with floating point arithmetic using Gaussian elimination with back substitution, because an approximate inverse Y must explicitly be computed to precondition the system. However, both computations take $O(n^3)$ operations for general systems.
The following theorem clariﬁes why we may use Algorithm 3.4 to obtain
mathematically rigorous bounds.
THEOREM 3.11
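Define the solution set to $\tilde{\boldsymbol{A}} x = \tilde{\boldsymbol{b}}$ to be
$$\Sigma(\tilde{\boldsymbol{A}}, \tilde{\boldsymbol{b}}) = \bigl\{ x \mid \tilde{A} x = \tilde{b} \text{ for some } \tilde{A} \in \tilde{\boldsymbol{A}} \text{ and } \tilde{b} \in \tilde{\boldsymbol{b}} \bigr\},$$
where boldface denotes an interval matrix or vector: $\tilde{\boldsymbol{A}}$ and $\tilde{\boldsymbol{b}}$ are the interval enclosures computed in Step 2 of Algorithm 3.4. If $Ax^* = b$, then $x^* \in \Sigma(\tilde{\boldsymbol{A}}, \tilde{\boldsymbol{b}})$. Furthermore, if $\boldsymbol{x}$ is the output to Algorithm 3.4, then $\Sigma(\tilde{\boldsymbol{A}}, \tilde{\boldsymbol{b}}) \subseteq \boldsymbol{x}$.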
For facts enabling a proof of Theorem 3.11, see [29] or other references on
interval analysis.
Example 3.28
$$\begin{pmatrix}
3.3330 & 15920. & -10.333 \\
2.2220 & 16.710 & 9.612 \\
1.5611 & 5.1791 & 1.6852
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 15913. \\ 28.544 \\ 8.4254 \end{pmatrix}$$
For this problem, $\kappa_\infty(A) \approx 16000$ and the exact solution is $x = [1, 1, 1]^T$. We will use matlab (providing IEEE double precision floating point arithmetic) to compute Y, and we will use the intlab interval arithmetic toolbox for matlab (based on IEEE double precision). Rounded to 14 decimal digits (as matlab displays it), we obtain
$$Y \approx \begin{pmatrix}
-0.00012055643706 & -0.14988499865822 & 0.85417095741675 \\
0.00006278655296 & 0.00012125786211 & -0.00030664438576 \\
-0.00008128244868 & 0.13847464088044 & -0.19692507695527
\end{pmatrix}.$$
Using outward rounding in both the computation and the decimal display, we obtain
$$\tilde{A} \subseteq \begin{pmatrix}
[1.00000000000000, 1.00000000000000] & [-0.00000000000012, -0.00000000000011] & [-0.00000000000001, -0.00000000000000] \\
[0.00000000000000, 0.00000000000001] & [1.00000000000000, 1.00000000000001] & [-0.00000000000001, -0.00000000000000] \\
[0.00000000000000, 0.00000000000001] & [0.00000000000013, 0.00000000000014] & [0.99999999999999, 1.00000000000001]
\end{pmatrix},$$
and
$$\tilde{b} \subseteq \begin{pmatrix}
[0.99999999999988, 0.99999999999989] \\
[1.00000000000000, 1.00000000000001] \\
[1.00000000000013, 1.00000000000014]
\end{pmatrix}.$$
Completing the remainder of Algorithm 3.4 then gives
$$x^* \in \boldsymbol{x} \subseteq \begin{pmatrix}
[0.99999999999999, 1.00000000000001] \\
[0.99999999999999, 1.00000000000001] \\
[0.99999999999999, 1.00000000000001]
\end{pmatrix}.$$
The actual matlab dialog is as follows:
>> format long
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum (e.g. [ 3.14 , 3.15 ])
>> x = interval_Gaussian_elimination(A,b)
x =
1.000000000000000
1.000000000000000
1.000000000000001
>> IA = [intval(3.3330) intval(15920.) intval(-10.333)
intval(2.2220) intval(16.710) intval(9.612)
intval(1.5611) intval(5.1791) intval(1.6852)]
intval IA =
1.0e+004 *
Columns 1 through 2
[ 0.00033330000000, 0.00033330000001] [ 1.59200000000000, 1.59200000000000]
[ 0.00022219999999, 0.00022220000000] [ 0.00167100000000, 0.00167100000001]
[ 0.00015610999999, 0.00015611000000] [ 0.00051791000000, 0.00051791000001]
Column 3
[ -0.00103330000001, -0.00103330000000]
[ 0.00096120000000, 0.00096120000001]
[ 0.00016851999999, 0.00016852000001]
>> Ib = [intval(15913.);intval(28.544);intval(8.4254)]
intval Ib =
1.0e+004 *
[ 1.59130000000000, 1.59130000000000]
[ 0.00285440000000, 0.00285440000001]
[ 0.00084253999999, 0.00084254000000]
>> YA = Y*IA
intval YA =
Columns 1 through 2
[ 0.99999999999999, 1.00000000000001] [ -0.00000000000100, -0.00000000000099]
[ -0.00000000000001, 0.00000000000001] [ 1.00000000000000, 1.00000000000001]
[ -0.00000000000001, 0.00000000000001] [ 0.00000000000013, 0.00000000000014]
Column 3
[ 0.00000000000000, 0.00000000000001]
[ -0.00000000000001, -0.00000000000000]
[ 0.99999999999999, 1.00000000000001]
>> Yb = Y*Ib
intval Yb =
[ 0.99999999999900, 0.99999999999901]
[ 1.00000000000000, 1.00000000000001]
[ 1.00000000000013, 1.00000000000014]
>> x = interval_Gaussian_elimination(A,b)
x =
1.000000000000000
1.000000000000000
1.000000000000001
Here, we need to use the intlab function intval to convert the decimal
strings representing the matrix and right-hand side vector elements to small
intervals containing the actual decimal values. This is because, even though
the original system did not have interval entries, the elements cannot all be
represented exactly as binary ﬂoating point numbers, so we must enclose
the exact values in ﬂoating point intervals to be certain that the bounds we
compute contain the actual solution. This is not necessary in computing the
ﬂoating point preconditioning matrix Y , since Y need not be an exact inverse.
The function interval_Gaussian_elimination, not a part of intlab, is as
follows:
function [x] = interval_Gaussian_elimination(A, b)
% [x] = interval_Gaussian_elimination(A, b)
% returns the result of Algorithm 3.4 in the book.
% The matrix A and vector b should be intervals,
% although they may be point intervals (i.e. of width zero).
n = length(b);
% Precondition with the floating point inverse of the midpoint matrix.
Y = inv(mid(A));
Atilde = Y*A;
btilde = Y*b;
% Forward elimination in interval arithmetic (no pivoting).
for k=1:n
for i=k+1:n
m_ik = Atilde(i,k)/Atilde(k,k);
for j=k+1:n
Atilde(i,j) = Atilde(i,j) - m_ik*Atilde(k,j);
end
btilde(i) = btilde(i) - m_ik*btilde(k);
end
end
% Back substitution in interval arithmetic.
x(n) = btilde(n)/Atilde(n,n);
for k=n-1:-1:1
x(k) = btilde(k);
for j=k+1:n
x(k) = x(k) - Atilde(k,j)*x(j);
end
x(k) = x(k)/Atilde(k,k);
end
% Return the enclosure as a column vector.
x = x';
Note: There are various ways of using interval arithmetic to obtain rigorous
bounds on the solution set to linear systems of equations. Some of these are
related mathematically to the interval Newton method introduced in §2.4 on
page 56, while others are related to the iterative techniques we discuss later in
this section. The eﬀectiveness and practicality of a particular such technique
depend on the condition of the system, and whether the entries in the matrix
A and right hand side vector b are points to start, or whether there are larger
uncertainties in them (that is, whether or not these coeﬃcients are wide or
narrow intervals). A good theoretical reference is [29] and some additional
practical detail is given in our monograph [20].
We now consider another method for computing the solution of a linear
system Ax = b. This method is particularly appropriate for various statistical
computations, such as least squares ﬁts, when there are more equations than
unknowns.
3.4 Orthogonal Decomposition (QR Decomposition)
This method for computing the solution of Ax = b is based on orthogonal
decomposition, also known as the QR decomposition or QR factorization. In
addition to solving linear systems, the QR factorization is also useful in least
squares problems and eigenvalue computations.
We will use the following concept heavily in this section, as well as when
we study the singular value decomposition.
DEFINITION 3.22 Two vectors $u$ and $v$ are called orthogonal provided
the dot product $u \circ v = 0$. A set of vectors $v^{(i)}$ is said to be
orthonormal, provided $v^{(i)} \circ v^{(j)} = \delta_{ij}$, where
$\delta_{ij}$ is the Kronecker delta function
\[
\delta_{ij} =
\begin{cases}
1 & \text{if } i = j, \\
0 & \text{if } i \neq j.
\end{cases}
\]
A square matrix $Q$ whose columns are orthonormal vectors is called an
orthogonal matrix.
In QR decompositions, we compute an orthogonal matrix $Q$ and an upper
triangular matrix $R$ (see footnote 8) such that $A = QR$. Advantages of the
QR decomposition include the fact that systems involving an upper triangular
matrix $R$ can be solved by back substitution, the fact that $Q$ is perfectly
conditioned (with condition number in the 2-norm equal to 1), and the fact
that the solution to $Qy = b$ is $y = Q^T b$.
There are several ways of computing QR-decompositions. These are de-
tailed, for example, in our graduate-level text [1]. Here, we focus on the
properties of the decomposition and its use.
Note: The QR decomposition is not unique. Hence, diﬀerent software may
come up with diﬀerent QR decompositions for the same matrix.
3.4.1 Properties of Orthogonal Matrices
The following properties, easily provable, make the QR decomposition
a numerically stable way of dealing with systems of equations.
Footnote 8: also known as a “right triangular” matrix. This is the reason for
the notation “R”.
THEOREM 3.12
Suppose $Q$ is an orthogonal matrix. Then $Q$ has the following properties.
1. $Q^T Q = I$, that is, $Q^T = Q^{-1}$. Thus, solving the system $Qy = b$
can be done with a matrix multiplication (see footnote 9).
2. $\|Q\|_2 = \|Q^T\|_2 = 1$.
3. Hence, $\kappa_2(Q) = 1$, where $\kappa_2(Q)$ is the condition number of
$Q$ in the 2-norm. That is, $Q$ is perfectly conditioned with respect to the
2-norm (and working with systems of equations involving $Q$ will not lead to
excessive roundoff error accumulation).
4. $\|Qx\|_2 = \|x\|_2$ for every $x \in \mathbb{R}^n$. Hence
$\|QA\|_2 = \|A\|_2$ for every $n$ by $n$ matrix $A$.
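The properties in Theorem 3.12 can be checked numerically. The following
matlab sketch is only an illustration: the test matrix, the vectors, and the
variable names are assumptions made here, and the computed quantities agree
with the theorem only up to roundoff.

```matlab
% Illustrative check of Theorem 3.12 (test data chosen arbitrarily).
A = [4 1 2; 1 3 0; 2 0 5];
[Q, R] = qr(A);                      % Q is 3 by 3 orthogonal, R upper triangular
orth_err = norm(Q'*Q - eye(3));      % property 1: Q'*Q = I up to roundoff
normQ    = norm(Q, 2);               % property 2: the 2-norm of Q
condQ    = cond(Q, 2);               % property 3: the 2-norm condition number
x = [1; -2; 3];
len_err  = abs(norm(Q*x) - norm(x)); % property 4: lengths are preserved
b = [1; 2; 3];
y = Q'*b;                            % solves Q*y = b with one multiplication
solve_err = norm(Q*y - b);
```

Each of orth_err, len_err, and solve_err should be near machine precision,
while normQ and condQ should be close to 1.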
3.4.2 Least Squares and the QR Decomposition
Overdetermined linear systems (with more equations than unknowns) occur
frequently in data fitting, in mathematical modeling and statistics. For
example, we may have data of the form $\{(t_i, y_i)\}_{i=1}^{m}$, and we wish
to model the dependence of $y$ on $t$ by a linear combination of $n$ basis
functions $\{\varphi_j\}_{j=1}^{n}$, that is,
\[
y \approx f(t) = \sum_{i=1}^{n} x_i \varphi_i(t), \tag{3.18}
\]
where $m > n$. Setting $f(t_i) = y_i$, $1 \le i \le m$, gives the
overdetermined linear system
\[
\begin{pmatrix}
\varphi_1(t_1) & \varphi_2(t_1) & \cdots & \varphi_n(t_1) \\
\varphi_1(t_2) & \varphi_2(t_2) & \cdots & \varphi_n(t_2) \\
\vdots & \vdots & & \vdots \\
\varphi_1(t_m) & \varphi_2(t_m) & \cdots & \varphi_n(t_m)
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \tag{3.19}
\]
that is,
\[
Ax = b, \quad \text{where } A \in L(\mathbb{R}^n, \mathbb{R}^m),\
a_{ij} = \varphi_j(t_i), \text{ and } b_i = y_i. \tag{3.20}
\]
Perhaps the most common way of fitting data is with least squares, in which
we find $x^*$ such that
\[
\tfrac{1}{2}\|Ax^* - b\|_2^2 = \min_{x \in \mathbb{R}^n} \varphi(x),
\quad \text{where } \varphi(x) = \tfrac{1}{2}\|Ax - b\|_2^2. \tag{3.21}
\]
(Note that $x^*$ minimizes the 2-norm of the residual vector $r(x) = Ax - b$,
since the function $g(u) = u^2$ is increasing for $u \ge 0$.)
Footnote 9: With the usual way of multiplying matrices, this is $n^2$
multiplications, more than with back-substitution, but still $O(n^2)$.
Furthermore, it can be done with $n$ dot products, something that is
efficient on many machines.
The naive way of finding $x^*$ is to set the gradient $\nabla\varphi(x) = 0$
and simplify. Doing so gives the normal equations:
\[
A^T A x = A^T b. \tag{3.22}
\]
(See Exercise 11 on page 143.) However, the normal equations tend to be
very ill-conditioned. For example, if $m = n$,
$\kappa_2(A^T A) = \kappa_2(A)^2$. Fortunately, the least squares solution
$x^*$ may be computed with a QR decomposition. In particular,
\[
\|Ax - b\|_2 = \|QRx - b\|_2 = \|Q^T(QRx - b)\|_2 = \|Rx - Q^T b\|_2.
\]
(Above, we used $\|Ux\|_2 = \|x\|_2$ when $U$ is orthogonal.) However, since
$R$ is upper triangular and its last $m - n$ rows are zero,
\[
\|Rx - Q^T b\|_2^2
= \sum_{i=1}^{n} \Biggl( \Bigl( \sum_{j=i}^{n} r_{ij} x_j \Bigr) - (Q^T b)_i \Biggr)^2
+ \sum_{i=n+1}^{m} (Q^T b)_i^2. \tag{3.23}
\]
Observe now:
1. All $m$ terms in the sum in (3.23) are nonnegative.
2. The first $n$ terms can be made exactly zero (when the diagonal entries
$r_{ii}$ are nonzero, that is, when $A$ has full column rank).
3. The last $m - n$ terms are constant.
Therefore,
\[
\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 = \sum_{i=n+1}^{m} (Q^T b)_i^2,
\]
and the minimizer $x^*$ can be computed by backsolving the square triangular
system consisting of the first $n$ rows of $Rx = Q^T b$.
We summarize these computations in the following algorithm.
ALGORITHM 3.5
(Least squares fits with a QR decomposition)
INPUT: the $m$ by $n$ matrix $A$, $m \ge n$, and $b \in \mathbb{R}^m$.
OUTPUT: the least squares fit $x \in \mathbb{R}^n$ such that $\|Ax - b\|_2$
is minimized, as well as the residual norm $\|Ax - b\|_2$.
1. Compute $Q$ and $R$ such that $Q$ is an $m$ by $m$ orthogonal matrix, $R$
is an $m$ by $n$ upper triangular (or “right triangular”) matrix, and
$A = QR$.
2. Form $y = Q^T b$.
3. Solve the $n$ by $n$ upper triangular system $R_{1:n,1:n}\,x = y_{1:n}$
using Algorithm 3.2 (the back-substitution algorithm). Here, $R_{1:n,1:n}$
corresponds to $A^{(n)}$ and $y_{1:n}$ corresponds to $b^{(n)}$.
4. Set the residual norm $\|Ax - b\|_2$ to
$\sqrt{\sum_{i=n+1}^{m} y_i^2} = \|y_{n+1:m}\|_2$.
END ALGORITHM 3.5.
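The steps of Algorithm 3.5 can be sketched in matlab as follows. This is a
minimal illustration rather than a library routine: it uses the built-in qr
for step 1, writes out the back substitution of step 3 explicitly, and uses
the small data set that the text fits in Example 3.29. The variable names
(resid_norm, x_normal) are choices made here. The last line solves the normal
equations (3.22) only for comparison.

```matlab
% Sketch of Algorithm 3.5 on the data of Example 3.29.
A = [1 0 0; 1 1 1; 1 2 4; 1 3 9];
b = [1; 4; 5; 8];
[m, n] = size(A);
[Q, R] = qr(A);                  % step 1: A = Q*R, Q is m by m
y = Q'*b;                        % step 2
x = zeros(n, 1);                 % step 3: back substitution on R(1:n,1:n)
for k = n:-1:1
    x(k) = (y(k) - R(k,k+1:n)*x(k+1:n)) / R(k,k);
end
resid_norm = norm(y(n+1:m), 2);  % step 4
% For comparison only: the normal equations (3.22), whose matrix has the
% square of the condition number of A when A is square.
x_normal = (A'*A) \ (A'*b);
```

For well-conditioned data such as this, x and x_normal agree to many digits;
the QR route is preferred because it avoids forming $A^T A$.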
Example 3.29
Consider ﬁtting the data
t y
0 1
1 4
2 5
3 8
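in the least squares sense with a polynomial of the form
\[
p_2(x) = x_0 \varphi_0(x) + x_1 \varphi_1(x) + x_2 \varphi_2(x),
\]
where $\varphi_0(x) \equiv 1$, $\varphi_1(x) \equiv x$, and
$\varphi_2(x) \equiv x^2$. The overdetermined system (3.19) becomes
\[
\begin{pmatrix}
1 & 0 & 0 \\
1 & 1 & 1 \\
1 & 2 & 4 \\
1 & 3 & 9
\end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 4 \\ 5 \\ 8 \end{pmatrix}.
\]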
We use matlab to perform a QR decomposition and find the least squares
solution:
>> format short
>> clear x
>> A = [1 0 0
1 1 1
1 2 4
1 3 9]
A =
1 0 0
1 1 1
1 2 4
1 3 9
>> b = [1;4;5;8]
b =
1
4
5
8
>> [Q,R] = qr(A)
Q =
-0.5000 0.6708 0.5000 0.2236
-0.5000 0.2236 -0.5000 -0.6708
-0.5000 -0.2236 -0.5000 0.6708
-0.5000 -0.6708 0.5000 -0.2236
R =
-2.0000 -3.0000 -7.0000
0 -2.2361 -6.7082
0 0 2.0000
0 0 0
>> Qtb = Q'*b;
>> x(3) = Qtb(3)/R(3,3);
>> x(2) = (Qtb(2) - R(2,3)*x(3))/R(2,2);
>> x(1) = (Qtb(1) - R(1,2)*x(2) - R(1,3)*x(3))/R(1,1)
x =
1.2000 2.2000 0.0000
>> x=x';
>> resid = A*x - b
resid =
0.2000
-0.6000
0.6000
-0.2000
>> tt = linspace(0,3);
>> yy = x(1) + x(2)*tt + x(3)*tt.^2;
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(A(:,2),b,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(tt,yy)
>> [m,n] = size(A);
>> y = Q'*b
y =
-9.0000
-4.9193
0.0000
-0.8944
>> x = R(1:n,1:n)\y(1:n)
x =
1.2000
2.2000
0.0000
>> resid_norm = norm(y(n+1:m),2)
resid_norm =
0.8944
>> norm(A*x-b,2)
ans =
0.8944
>>
This dialog results in the following plot, illustrating the data points as stars
and the quadratic fit (which in this case happens to be linear) as a blue curve.
[Figure: the four data points (stars) and the fitted curve, for t from 0 to 3
on the horizontal axis and values from 1 to 8 on the vertical axis.]
Note that the fit does not pass exactly through any of the data points: the
residuals have magnitude 0.2 at the first and fourth points and 0.6 at the
second and third. (The portion of the dialog following the plot commands
illustrates alternative views of the computation of x and the residual norm.)
Although working with the QR decomposition is a stable process, care
should be taken when computing Q and R. We discuss actually computing Q
and R in [1].
We now turn to iterative techniques for linear systems of equations.
3.5 Iterative Methods for Solving Linear Systems
Here, we study iterative solution of linear systems
\[
Ax = b, \quad \text{i.e.} \quad \sum_{k=1}^{n} a_{jk} x_k = b_j,
\quad j = 1, 2, \ldots, n. \tag{3.24}
\]
Example 3.30
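Consider Example 3.18 (on page 93), where we replaced a second derivative
in a differential equation by a difference approximation, to obtain the system
\[
\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \frac{1}{16}
\begin{pmatrix} \sin(\pi/4) \\ \sin(\pi/2) \\ \sin(3\pi/4) \end{pmatrix}.
\]
In other words, the equations are
\[
\begin{aligned}
2x_1 - x_2 &= \tfrac{1}{16}\sin(\tfrac{\pi}{4}), \\
-x_1 + 2x_2 - x_3 &= \tfrac{1}{16}\sin(\tfrac{\pi}{2}), \\
-x_2 + 2x_3 &= \tfrac{1}{16}\sin(\tfrac{3\pi}{4}).
\end{aligned}
\]
Solving the first equation for $x_1$, the second equation for $x_2$, and the
third equation for $x_3$, we obtain
\[
\begin{aligned}
x_1 &= \tfrac{1}{2}\bigl( \tfrac{1}{16}\sin(\tfrac{\pi}{4}) + x_2 \bigr), \\
x_2 &= \tfrac{1}{2}\bigl( \tfrac{1}{16}\sin(\tfrac{\pi}{2}) + x_1 + x_3 \bigr), \\
x_3 &= \tfrac{1}{2}\bigl( \tfrac{1}{16}\sin(\tfrac{3\pi}{4}) + x_2 \bigr),
\end{aligned}
\]
which can be written in matrix form as
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} 0 & \tfrac12 & 0 \\ \tfrac12 & 0 & \tfrac12 \\ 0 & \tfrac12 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
+ \frac{1}{32}
\begin{pmatrix} \sin(\tfrac{\pi}{4}) \\ \sin(\tfrac{\pi}{2}) \\ \sin(\tfrac{3\pi}{4}) \end{pmatrix},
\]
that is,
\[
x = Gx + c, \tag{3.25}
\]
with
\[
x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad
G = \begin{pmatrix} 0 & \tfrac12 & 0 \\ \tfrac12 & 0 & \tfrac12 \\ 0 & \tfrac12 & 0 \end{pmatrix},
\quad \text{and} \quad
c = \frac{1}{32}
\begin{pmatrix} \sin(\tfrac{\pi}{4}) \\ \sin(\tfrac{\pi}{2}) \\ \sin(\tfrac{3\pi}{4}) \end{pmatrix}.
\]
Equation (3.25) can form the basis of an iterative method:
\[
x^{(k+1)} = Gx^{(k)} + c. \tag{3.26}
\]
Starting with $x^{(0)} = (0, 0, 0)^T$, we obtain the following in matlab: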
>> x = [0,0,0]'
x =
0
0
0
>> G = [0 1/2 0
1/2 0 1/2
0 1/2 0]
G =
0 0.5000 0
0.5000 0 0.5000
0 0.5000 0
>> c = (1/32)*[sin(pi/4); sin(pi/2); sin(3*pi/4)]
c =
0.0221
0.0313
0.0221
>> x = G*x + c
x =
0.0221
0.0313
0.0221
>> x = G*x + c
x =
0.0377
0.0533
0.0377
>> x = G*x + c
x =
0.0488
0.0690
0.0488
>> x = G*x + c
x =
0.0566
0.0800
0.0566
>> x = G*x + c
x =
0.0621
0.0878
0.0621
>> x = G*x + c
x =
0.0660
0.0934
0.0660
>> x = G*x + c
x =
0.0688
0.0973
0.0688
>> x = G*x + c
x =
0.0707
0.1000
0.0707
>> x = G*x + c
x =
0.0721
0.1020
0.0721
>>
Comparing with the solution in Example 3.18, we see that the components
of x tend to the components of the solution to Ax = b as we iterate (3.26).
This is an example of an iterative method (namely, the Jacobi method) for
solving the system of equations Ax = b.
Good references for iterative solution of linear systems are [23, 30, 39, 44].
Why may we wish to solve (3.24) iteratively? Suppose that n = 10,000 or
more, which is not unreasonable for many problems. Then A has $10^8$
elements, making it difficult to store A or to solve (3.24) directly using,
for example, Gaussian elimination.
To discuss iterative techniques involving vectors and matrices, we use:
DEFINITION 3.23 A sequence of vectors $\{x^k\}_{k=1}^{\infty}$ is said to
converge to a vector $x \in \mathbb{C}^n$ if and only if
$\|x^k - x\| \to 0$ as $k \to \infty$ for some norm $\|\cdot\|$.
Definition 3.23 implies that a sequence of vectors
$\{x^k\} \subset \mathbb{R}^n$ (or $\subset \mathbb{C}^n$) converges to $x$
if and only if $x_i^k \to x_i$ as $k \to \infty$ for all $i$.
Note: Iterates defined by (3.26) can be viewed as fixed point iterates that
under certain conditions converge to the fixed point.
DEFINITION 3.24 The iterative method defined by (3.26) is called
convergent if, for all initial values $x^{(0)}$, we have
$x^{(k)} \to A^{-1}b$ as $k \to \infty$.
We now take a closer look at the Jacobi method, as well as the related
Gauss–Seidel method and SOR method.
3.5.1 The Jacobi Method
We can think of the Jacobi method illustrated in the above example in
matrix form as follows. Let L be the lower triangular part of the matrix A,
U the upper triangular part, and D the diagonal part.
Example 3.31
In Example 3.30,
\[
L = \begin{pmatrix} 0 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix}, \quad
U = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & -1 \\ 0 & 0 & 0 \end{pmatrix},
\quad \text{and} \quad
D = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.
\]
Then the Jacobi method may be written in matrix form as
\[
G = -D^{-1}(L + U) \equiv J. \tag{3.27}
\]
$J$ is called the iteration matrix for the Jacobi method. The iterative method
becomes:
\[
x^{(k+1)} = -D^{-1}(L + U)x^{(k)} + D^{-1}b, \quad k = 0, 1, 2, \ldots \tag{3.28}
\]
Generally, one uses the following equations to solve for $x^{(k+1)}$:
\[
\begin{aligned}
& x_i^{(0)} \text{ is given,} \\
& x_i^{(k+1)} = \frac{1}{a_{ii}} \Biggl( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k)}
  - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \Biggr),
\end{aligned} \tag{3.29}
\]
for $k \ge 0$ and $1 \le i \le n$ (where a sum is absent if its lower limit
on $j$ is larger than its upper limit). Equations (3.29) are easily programmed.
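As an illustration of how (3.29) might be programmed, the following matlab
sketch applies the componentwise Jacobi update to the system of Example 3.30.
The fixed iteration count num_iter and the variable names are assumptions
made here; a practical code would instead stop when successive iterates are
sufficiently close.

```matlab
% Jacobi iteration (3.29) for the system of Example 3.30.
A = [2 -1 0; -1 2 -1; 0 -1 2];
b = (1/16)*[sin(pi/4); sin(pi/2); sin(3*pi/4)];
n = length(b);
x = zeros(n, 1);                  % x^(0)
num_iter = 50;                    % hypothetical fixed iteration count
for k = 1:num_iter
    x_new = zeros(n, 1);
    for i = 1:n
        s = b(i);
        for j = [1:i-1, i+1:n]    % all j except j = i
            s = s - A(i,j)*x(j);  % uses only the old iterate x^(k)
        end
        x_new(i) = s / A(i,i);
    end
    x = x_new;                    % x^(k+1) replaces x^(k) only after the sweep
end
resid = norm(A*x - b);
```

Note that two vectors, x and x_new, are kept during each sweep.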
3.5.2 The Gauss–Seidel Method
We now discuss the Gauss–Seidel method, or successive relaxation method.
If in the Jacobi method, we use the new values of $x_j$ as they become
available, then
\[
\begin{aligned}
& x_i^{(0)} \text{ is given,} \\
& x_i^{(k+1)} = \frac{1}{a_{ii}} \Biggl( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)}
  - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \Biggr),
\end{aligned} \tag{3.30}
\]
for $k \ge 0$ and $1 \le i \le n$. (We continue to assume that
$a_{ii} \neq 0$ for $i = 1, 2, \ldots, n$.) The iterative method (3.30) is
called the Gauss–Seidel method, and can be written in matrix form with
\[
G = -(L + D)^{-1}U \equiv \mathcal{G},
\]
so
\[
x^{(k+1)} = -(L + D)^{-1}Ux^{(k)} + (L + D)^{-1}b \quad \text{for } k \ge 0. \tag{3.31}
\]
Note: The Gauss–Seidel method only requires storage of
\[
\bigl( x_1^{(k+1)}, x_2^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, x_i^{(k)},
x_{i+1}^{(k)}, \ldots, x_n^{(k)} \bigr)^T
\]
to compute $x_i^{(k+1)}$. The Jacobi method requires storage of $x^{(k)}$ as
well as $x^{(k+1)}$. Also, the Gauss–Seidel method generally converges faster.
This gives an advantage to the Gauss–Seidel method. However, on some machines,
separate rows of the Jacobi iteration equation may be processed simultaneously
in parallel, while the Gauss–Seidel method requires the coordinates be
processed sequentially (with the equations in some specified order).
Example 3.32
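\[
\begin{pmatrix} 2 & 1 \\ -1 & 3 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 2 \end{pmatrix},
\quad \text{that is,} \quad
\begin{aligned}
2x_1 + x_2 &= 3 \\
-x_1 + 3x_2 &= 2.
\end{aligned}
\]
(The exact solution is $x_1 = x_2 = 1$.) The Jacobi and Gauss–Seidel methods
have the forms
\[
\text{Jacobi:} \quad
\left\{
\begin{aligned}
x_1^{(k+1)} &= \tfrac{3}{2} - \tfrac{1}{2} x_2^{(k)} \\
x_2^{(k+1)} &= \tfrac{2}{3} + \tfrac{1}{3} x_1^{(k)}
\end{aligned}
\right\},
\]
\[
\text{Gauss–Seidel:} \quad
\left\{
\begin{aligned}
x_1^{(k+1)} &= \tfrac{3}{2} - \tfrac{1}{2} x_2^{(k)} \\
x_2^{(k+1)} &= \tfrac{2}{3} + \tfrac{1}{3} x_1^{(k+1)}
\end{aligned}
\right\}.
\]
The results in Table 3.2 are obtained with $x^{(0)} = (0, 0)^T$. Observe that
the Gauss–Seidel method converges roughly twice as fast as the Jacobi method.
This behavior is provable.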
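The two iterations of Example 3.32 can be run side by side with a few lines
of matlab. The sketch below is only illustrative (the names xj and xg and the
printing format are choices made here); it prints the iterates for
k = 1, ..., 9, which can be compared with Table 3.2.

```matlab
% Jacobi and Gauss-Seidel iterates for Example 3.32, starting from (0,0)^T.
xj = [0; 0];                        % Jacobi iterate
xg = [0; 0];                        % Gauss-Seidel iterate
for k = 1:9
    % Jacobi: both components use the previous iterate.
    xj = [3/2 - xj(2)/2; 2/3 + xj(1)/3];
    % Gauss-Seidel: the second component uses the new first component.
    xg(1) = 3/2 - xg(2)/2;
    xg(2) = 2/3 + xg(1)/3;
    fprintf('%d  %6.3f %6.3f   %6.3f %6.3f\n', k, xj(1), xj(2), xg(1), xg(2));
end
```

The only difference between the two updates is which value of the first
component is used when the second component is computed.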
TABLE 3.2: Iterates of the Jacobi and Gauss–Seidel methods, for Example 3.32

  k   x_1^(k) Jacobi   x_2^(k) Jacobi   x_1^(k) G–S   x_2^(k) G–S
  0   0                0                0             0
  1   1.5              0.667            1.5           1.167
  2   1.167            1.167            0.917         0.972
  3   0.917            1.056            1.014         1.005
  4   0.972            0.972            0.998         0.999
  5   1.014            0.991            1.000         1.000
  6   1.005            1.005
  7   0.998            1.002
  8   0.999            0.999
  9   1.000            1.000

3.5.3 Successive Overrelaxation
We now describe Successive OverRelaxation (SOR). In the SOR method,
one computes $x_i^{(k+1)}$ to be a weighted mean of $x_i^{(k)}$ and the
Gauss–Seidel iterate for that element. Specifically, for $\sigma \neq 0$ a
real parameter, the SOR method is given by
\[
\begin{aligned}
& x_i^{(0)} \text{ is given,} \\
& x_i^{(k+1)} = (1 - \sigma) x_i^{(k)} + \frac{\sigma}{a_{ii}}
  \Biggl( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)}
  - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \Biggr),
\end{aligned} \tag{3.32}
\]
for $1 \le i \le n$ and for $k \ge 0$. The parameter $\sigma$ is called a
relaxation factor. If $\sigma < 1$, we call $\sigma$ an underrelaxation factor
and if $\sigma > 1$, we call $\sigma$ an overrelaxation factor. Note that if
$\sigma = 1$, the Gauss–Seidel method is obtained.
Note: For certain classes of matrices and certain $\sigma$ between 1 and 2,
the SOR method converges faster than the Gauss–Seidel method.
We can write (3.32) in the matrix form:
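\[
\Bigl( L + \frac{1}{\sigma} D \Bigr) x^{(k+1)}
= -\Bigl[ U + \Bigl( 1 - \frac{1}{\sigma} \Bigr) D \Bigr] x^{(k)} + b \tag{3.33}
\]
for $k = 0, 1, 2, \ldots$, with $x^{(0)}$ given. Thus,
\[
G = (\sigma L + D)^{-1} \bigl[ (1 - \sigma) D - \sigma U \bigr] \equiv \mathcal{S}_\sigma,
\]
and
\[
x^{(k+1)} = \mathcal{S}_\sigma x^{(k)} + \Bigl( L + \frac{1}{\sigma} D \Bigr)^{-1} b. \tag{3.34}
\]
The matrix $\mathcal{S}_\sigma$ is called the SOR matrix. Note that
$\sigma = 1$ gives $\mathcal{G}$, the Gauss–Seidel matrix.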
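The following matlab sketch forms the SOR matrix for the coefficient matrix
of Example 3.30 and confirms numerically that $\sigma = 1$ reproduces the
Gauss–Seidel matrix. The helper names sor_matrix and rho and the trial value
$\sigma = 1.17$ are assumptions chosen here purely for illustration.

```matlab
% SOR iteration matrix (from (3.33)-(3.34)) for the matrix of Example 3.30.
A = [2 -1 0; -1 2 -1; 0 -1 2];
D = diag(diag(A));
L = tril(A, -1);                 % strictly lower triangular part
U = triu(A, 1);                  % strictly upper triangular part
sor_matrix = @(sigma) (sigma*L + D) \ ((1 - sigma)*D - sigma*U);
GS = -(L + D) \ U;               % Gauss-Seidel matrix
diff_at_one = norm(sor_matrix(1) - GS);   % sigma = 1 gives Gauss-Seidel
rho = @(G) max(abs(eig(G)));              % spectral radius
rho_gs  = rho(GS);
rho_sor = rho(sor_matrix(1.17));          % hypothetical trial value of sigma
```

For this matrix, rho_sor comes out smaller than rho_gs, consistent with the
note that suitable $\sigma$ between 1 and 2 can speed up convergence.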
A classic reference on iterative methods, and the SOR method in particular,
is [44].
3.5.4 Convergence of Iterative Methods
The general iteration equation (3.26) (on page 122) gives
\[
x^{(k+1)} = Gx^{(k)} + c \quad \text{and} \quad x^{(k)} = Gx^{(k-1)} + c.
\]
Subtracting these equations and using properties of vector addition and
matrix-vector multiplication gives
\[
x^{(k+1)} - x^{(k)} = G(x^{(k)} - x^{(k-1)}). \tag{3.35}
\]
Furthermore, similar rearrangements give
\[
(I - G)(x^{(k)} - x) = x^{(k)} - x^{(k+1)} \tag{3.36}
\]
because
\[
x = Gx + c \quad \text{and} \quad x^{(k+1)} = Gx^{(k)} + c.
\]
Combining (3.35) and (3.36) gives
\[
x^{(k)} - x = -(I - G)^{-1} G (x^{(k)} - x^{(k-1)})
= -(I - G)^{-1} G^2 (x^{(k-1)} - x^{(k-2)}),
\]
and taking norms gives
\[
\begin{aligned}
\|x^{(k)} - x\| &\le \|(I - G)^{-1}\| \, \|G\| \, \|x^{(k)} - x^{(k-1)}\| \\
&= \|(I - G)^{-1}\| \, \|G\| \, \|G(x^{(k-1)} - x^{(k-2)})\| \\
&\le \|(I - G)^{-1}\| \, \|G\|^2 \, \|x^{(k-1)} - x^{(k-2)}\| \\
&\;\;\vdots \\
&\le \|(I - G)^{-1}\| \, \|G\|^k \, \|x^{(1)} - x^{(0)}\|.
\end{aligned}
\]
It is not hard to show that, for any induced matrix norm with $\|G\| < 1$,
\[
\|(I - G)^{-1}\| \le \frac{1}{1 - \|G\|}.
\]
Therefore,
\[
\|x^{(k)} - x\| \le \frac{\|G\|^k}{1 - \|G\|} \, \|x^{(1)} - x^{(0)}\|. \tag{3.37}
\]
The practical importance of this error estimate is that we can expect linear
convergence of our iterative method when $\|G\| < 1$.
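The bound (3.37) can be compared with the actual error for the Jacobi
iteration of Example 3.30. In the sketch below, the starting vector
$x^{(0)} = (1, 0, 0)^T$ and the flag ok are assumptions made for this
illustration; the 2-norm is used throughout.

```matlab
% Compare the error with the a priori bound (3.37), Jacobi matrix of Example 3.30.
G = [0 1/2 0; 1/2 0 1/2; 0 1/2 0];
c = (1/32)*[sin(pi/4); sin(pi/2); sin(3*pi/4)];
x_exact = (eye(3) - G) \ c;
normG = norm(G);                        % 2-norm of G, about 0.7071
x0 = [1; 0; 0];                         % hypothetical starting vector
x1 = G*x0 + c;
x = x0;
ok = true;
for k = 1:10
    x = G*x + c;
    err   = norm(x - x_exact);
    bound = normG^k / (1 - normG) * norm(x1 - x0);
    ok = ok && (err <= bound);          % (3.37) should hold at every step
    fprintf('k = %2d  error = %.3e  bound = %.3e\n', k, err, bound);
end
```

Both columns decrease by a factor of roughly $\|G\|$ per step, which is the
linear convergence described above.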
Example 3.33
We revisit Example 3.30, with the following matlab dialog:
>> x = [0,0,0]'
x =
0
0
0
>> G = [0 1/2 0
1/2 0 1/2
0 1/2 0]
G =
0 0.5000 0
0.5000 0 0.5000
0 0.5000 0
>> c = (1/32)*[sin(pi/4); sin(pi/2); sin(3*pi/4)]
c =
0.0221
0.0313
0.0221
>> exact_solution = (eye(3)-G)\c
exact_solution =
0.0754
0.1067
0.0754
>> normG = norm(G)
normG =
0.7071
>> for i=1:5;
old_norm = norm(x-exact_solution);
x = G*x + c;
new_norm = norm(x-exact_solution);
ratio = new_norm/old_norm
end
ratio =
0.7071
ratio =
0.7071
ratio =
0.7071
ratio =
0.7071
ratio =
0.7071
>> x
x =
0.0621
0.0878
0.0621
>>
We thus see linear convergence with the Jacobi method, with convergence
factor $\|G\| \approx 0.7071$, just as we discussed in Section 1.1.3 (page 7)
and our study of the fixed point method for solving a single nonlinear
equation (Section 2.2, starting on page 47).
Example 3.34
We examine the norm of the iteration matrix for the Gauss–Seidel method
for Example 3.30:
>> L = [0 0 0
-1 0 0
0 -1 0]
L =
0 0 0
-1 0 0
0 -1 0
>> U = [0 -1 0
0 0 -1
0 0 0]
U =
0 -1 0
0 0 -1
0 0 0
>> D = [2 0 0
0 2 0
0 0 2]
D =
2 0 0
0 2 0
0 0 2
>> GS = -inv(L+D)*U
GS =
0 0.5000 0
0 0.2500 0.5000
0 0.1250 0.2500
>> norm(GS)
ans =
0.6905
We see that this norm is less than the norm of the iteration matrix for
the Jacobi method, so we may expect the Gauss–Seidel method to converge
somewhat faster.
The error estimates hold if $\|\cdot\|$ is any norm. Furthermore, it is
possible to prove the following.
THEOREM 3.13
Suppose
\[
\rho(G) < 1,
\]
where $\rho(G)$ is the spectral radius of $G$, that is,
\[
\rho(G) = \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } G \}.
\]
Then the iterative method
\[
x^{(k+1)} = Gx^{(k)} + c
\]
converges.
In particular, the Jacobi method and Gauss–Seidel method for matrices
of the form in Example 3.30 both converge, although $\|G\|$ becomes nearer
to 1 (and hence, the convergence is slower), the finer we subdivide the
interval [0, 1] (and hence the larger n becomes). There is some theory
relating the spectral radius of various iteration matrices, and matrices
arising from discretizations such as in Example 3.30 have been analyzed
extensively.
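The slowdown for larger n can be observed directly. The following matlab
sketch (illustrative only; the sizes 3, 10, and 50 are chosen arbitrarily)
builds the n by n matrix with 2 on the diagonal and -1 on the two adjacent
diagonals, as in Example 3.30, and computes the spectral radii of its Jacobi
and Gauss–Seidel iteration matrices.

```matlab
% Spectral radii of the Jacobi and Gauss-Seidel matrices as n grows.
for n = [3 10 50]
    A = 2*eye(n) - diag(ones(n-1,1), 1) - diag(ones(n-1,1), -1);
    D = diag(diag(A));
    J = -D \ (A - D);                  % Jacobi matrix -D^{-1}(L+U)
    rho_J = max(abs(eig(J)));
    GS = -tril(A) \ triu(A, 1);        % Gauss-Seidel matrix -(L+D)^{-1}U
    rho_GS = max(abs(eig(GS)));
    fprintf('n = %2d  rho(J) = %.4f  rho(GS) = %.4f\n', n, rho_J, rho_GS);
end
```

For this family the Jacobi spectral radius is $\cos(\pi/(n+1))$, which
approaches 1 as n increases, so both methods converge but ever more slowly.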
One criterion that is easy to check is diagonal dominance, as deﬁned in
Remark 3.1 on page 88:
THEOREM 3.14
Suppose
\[
|a_{ii}| \ge \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}|,
\quad \text{for } i = 1, 2, \ldots, n,
\]
and suppose that the inequality is strict for at least one $i$. Then the
Jacobi method and Gauss–Seidel method for $Ax = b$ converge.
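The hypothesis of Theorem 3.14 is simple to test in matlab. The sketch below
checks it for the matrix of Example 3.30; the variable names are choices made
here for illustration.

```matlab
% Check the hypothesis of Theorem 3.14 for the matrix of Example 3.30.
A = [2 -1 0; -1 2 -1; 0 -1 2];
d = abs(diag(A));                    % |a_ii|
offdiag = sum(abs(A), 2) - d;        % sum over j ~= i of |a_ij|
weakly_dominant  = all(d >= offdiag);
strict_somewhere = any(d > offdiag);
hypothesis_holds = weakly_dominant && strict_somewhere;
```

Here the inequality is strict in the first and last rows and holds with
equality in the middle row.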
We present a more detailed analysis in [1].
3.5.5 The Interval Gauss–Seidel Method
The interval Gauss–Seidel method is an alternative method (see footnote 10)
for using floating point arithmetic to obtain mathematically rigorous lower
and upper bounds to the solution to a system of linear equations. The
interval Gauss–Seidel method has several advantages, especially when there
are uncertainties in the right-hand-side vector $b$ that are represented in
the form of relatively wide intervals $[\underline{b}_i, \overline{b}_i]$,
and when there are also uncertainties
$[\underline{a}_{ij}, \overline{a}_{ij}]$ in the coefficients of the matrix
$A$. That is, we assume that the matrix is
$\mathbf{A} \in \mathbb{IR}^{n \times n}$, $\mathbf{b} \in \mathbb{IR}^{n}$,
and we wish to find an interval vector (or “box”) $\mathbf{x}$ that bounds
\[
\Sigma(\mathbf{A}, \mathbf{b}) = \{ x \mid Ax = b \text{ for some }
A \in \mathbf{A} \text{ and some } b \in \mathbf{b} \}, \tag{3.38}
\]
where $\mathbb{IR}^{n \times n}$ denotes the set of all $n$ by $n$ matrices
whose entries are intervals, $\mathbb{IR}^{n}$ denotes the set of all
$n$-vectors whose entries are intervals, and $A \in \mathbf{A}$ means that
each element of the point matrix $A$ is contained in the corresponding
element of the interval matrix $\mathbf{A}$ (and similarly for
$b \in \mathbf{b}$).
The interval Gauss–Seidel method is similar to the point Gauss–Seidel
method as deﬁned in (3.30) on page 124, except that, for general systems,
we almost always precondition. In particular, let
$\tilde{\mathbf{A}} = Y\mathbf{A}$ and $\tilde{\mathbf{b}} = Y\mathbf{b}$,
where $Y$ is a preconditioning matrix. We then have the preconditioned system
\[
Y\mathbf{A}x = Y\mathbf{b}, \quad \text{i.e.} \quad
\tilde{\mathbf{A}}x = \tilde{\mathbf{b}}. \tag{3.39}
\]
We have
THEOREM 3.15
(The solution set for the preconditioned system contains the solution set for
the original system.)
$\Sigma(\mathbf{A}, \mathbf{b}) \subseteq \Sigma(Y\mathbf{A}, Y\mathbf{b})
= \Sigma(\tilde{\mathbf{A}}, \tilde{\mathbf{b}})$.
This theorem is a fairly straightforward consequence of the subdistributivity
(Equation (1.4) on page 26) of interval arithmetic. For a proof of this and
other facts concerning interval linear systems, see, for example, [29].
Analogously to the noninterval version of Gauss–Seidel iteration (3.30), the
interval Gauss–Seidel method is given as
\[
\mathbf{x}_i^{(k+1)} \leftarrow \frac{1}{\tilde{\mathbf{a}}_{ii}}
\Biggl( \tilde{\mathbf{b}}_i
- \sum_{j=1}^{i-1} \tilde{\mathbf{a}}_{ij} \mathbf{x}_j^{(k+1)}
- \sum_{j=i+1}^{n} \tilde{\mathbf{a}}_{ij} \mathbf{x}_j^{(k)} \Biggr) \tag{3.40}
\]
for $i = 1, 2, \ldots, n$, where a sum is interpreted to be absent if its
lower index is greater than its upper index, and with $\mathbf{x}_i^{(0)}$
given for $i = 1, 2, \ldots, n$.
Footnote 10: to the interval version of Gaussian elimination of Section 3.3.4
on page 111
REMARK 3.9 As with the interval version of Gaussian elimination
(Algorithm 3.4 on page 111), a common preconditioner $Y$ for the interval
Gauss–Seidel method is the inverse midpoint matrix
$Y = (m(\mathbf{A}))^{-1}$, where $m(\mathbf{A})$ is the matrix whose
elements are midpoints of corresponding elements of the interval matrix
$\mathbf{A}$. However, when the elements of $\mathbf{A}$ have particularly
large widths, specially designed preconditioners (see footnote 11) may be
more appropriate.
REMARK 3.10 Point iterative methods are often preconditioned. However,
using an inverse of a point matrix $A$ as the preconditioner $Y$ leads to
$YA \approx I$, where $I$ is the identity matrix, so the system will already
have been solved (except for, possibly, iterative refinement). Moreover, such
point iterative methods are usually employed for very large systems of
equations, with matrices with “0” for many elements. Although the elements
that are 0 need not be stored, the inverse generally does not have 0’s in any
of its elements [13], so it may be impractical to even store the inverse, let
alone compute it (see footnote 12). Thus, special approximations are used for
these preconditioners (see footnote 13). Preconditioners for the point
Gauss–Seidel method, conjugate gradient method (explained in our graduate
text [1]), etc. are often viewed as operators that increase the separation
between the largest eigenvalue of $A$ and the remaining eigenvalues of $A$,
rather than computing an approximate inverse.
The following theorem tells us that the interval Gauss–Seidel method can
be used to prove existence and uniqueness of a solution of a system of linear
equations.
THEOREM 3.16
Suppose (3.40) is used, starting with initial interval vector
$\mathbf{x}^{(0)}$, and obtaining interval vector $\mathbf{x}^{(k)}$ after a
number of iterations. Then, if $\mathbf{x}^{(k)} \subseteq \mathbf{x}^{(0)}$,
for each $A \in \mathbf{A}$ and each $b \in \mathbf{b}$, there is an
$x \in \mathbf{x}^{(k)}$ such that $Ax = b$.
The proof of Theorem 3.16 can be found in many places, such as in [20] or
[29].
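Footnote 11: See [20, Chapter 3].
Footnote 12: Of course, the inverse could be computed one row at a time, but
this may still be impractical for large systems.
Footnote 13: Much work has appeared in the research literature on such
preconditioners.
Example 3.35
Consider $\mathbf{A}x = \mathbf{b}$, where
\[
\mathbf{A} = \begin{pmatrix} [0.99, 1.01] & [1.99, 2.01] \\ [2.99, 3.01] & [3.99, 4.01] \end{pmatrix},
\quad
\mathbf{b} = \begin{pmatrix} [-1.01, -0.99] \\ [0.99, 1.01] \end{pmatrix},
\quad
\mathbf{x}^{(0)} = \begin{pmatrix} [-10, 10] \\ [-10, 10] \end{pmatrix}.
\]
Then (see footnote 14),
\[
m(\mathbf{A}) = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},
\quad
Y \approx m(\mathbf{A})^{-1} = \begin{pmatrix} -2.0 & 1.0 \\ 1.5 & -0.5 \end{pmatrix},
\]
\[
\tilde{\mathbf{A}} = Y\mathbf{A} \subseteq
\begin{pmatrix} [0.97, 1.03] & [-0.03, 0.03] \\ [-0.02, 0.02] & [0.98, 1.02] \end{pmatrix},
\quad
\tilde{\mathbf{b}} = Y\mathbf{b} \subseteq
\begin{pmatrix} [2.97, 3.03] \\ [-2.02, -1.98] \end{pmatrix}.
\]
We then have
\[
\mathbf{x}_1^{(1)} \leftarrow \frac{1}{[0.97, 1.03]}
\Bigl( [2.97, 3.03] - [-0.03, 0.03]\,[-10, 10] \Bigr) \subseteq [2.5922, 3.4330],
\]
\[
\mathbf{x}_2^{(1)} \leftarrow \frac{1}{[0.98, 1.02]}
\Bigl( [-2.02, -1.98] - [-0.02, 0.02]\,[2.5922, 3.4330] \Bigr) \subseteq [-2.1313, -1.8738].
\]
If we continue this process, we eventually obtain
\[
\mathbf{x}^{(4)} = ([2.8215, 3.1895], [-2.1264, -1.8786])^T,
\]
which, to four significant figures, is the same as $\mathbf{x}^{(3)}$. Thus,
we have found mathematically rigorous bounds on the set of all solutions to
$Ax = b$ such that $A \in \mathbf{A}$ and $b \in \mathbf{b}$.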
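The first sweep of Example 3.35 can be imitated in plain matlab, without
intlab, by storing each interval as a pair [lower, upper]. The helpers isub,
imul, and idiv below are hypothetical names introduced only for this sketch,
and, unlike intlab, they perform NO outward rounding, so the computed bounds
are illustrative rather than mathematically rigorous.

```matlab
% One sweep of the interval Gauss-Seidel method (3.40) for Example 3.35.
% Intervals are stored as [lower, upper]; no outward rounding is done.
isub = @(a, b) [a(1) - b(2), a(2) - b(1)];
imul = @(a, b) [min([a(1)*b(1), a(1)*b(2), a(2)*b(1), a(2)*b(2)]), ...
                max([a(1)*b(1), a(1)*b(2), a(2)*b(1), a(2)*b(2)])];
idiv = @(a, b) imul(a, [1/b(2), 1/b(1)]);   % assumes 0 is not in b

% Preconditioned data from Example 3.35 (entries of A-tilde and b-tilde).
At = {[0.97 1.03], [-0.03 0.03]; [-0.02 0.02], [0.98 1.02]};
bt = {[2.97 3.03]; [-2.02 -1.98]};
x  = {[-10 10]; [-10 10]};                  % x^(0)

n = 2;
for i = 1:n
    s = bt{i};
    for j = [1:i-1, i+1:n]
        s = isub(s, imul(At{i,j}, x{j}));   % x{j} is already updated for j < i
    end
    x{i} = idiv(s, At{i,i});
end
```

After the sweep, x{1} and x{2} agree with the enclosures of the first
iteration in the example to about four significant figures; repeating the
loop gives the later iterates.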
In Example 3.35, uncertainties of ±0.01 are present in each element of the
matrix and right-hand-side vector. Although the bounds produced with the
preconditioned interval Gauss–Seidel method are not guaranteed to be the
tightest possible with these uncertainties, they will be closer to the tightest
possible when the uncertainties are smaller.
Convergence of the interval Gauss–Seidel method is related closely to con-
vergence of the point Gauss–Seidel method, through the concept of diagonal
dominance. We give a hint of this convergence theory here.
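Footnote 14: These computations were done with the aid of intlab, a matlab
toolbox available free of charge for non-commercial use.
DEFINITION 3.25 If $\mathbf{a} = [\underline{a}, \overline{a}]$ is an
interval, then the magnitude of $\mathbf{a}$ is defined to be
\[
\mathrm{mag}(\mathbf{a}) = \max\{ |\underline{a}|, |\overline{a}| \}.
\]
Similarly, the mignitude of $\mathbf{a}$ is defined to be
\[
\mathrm{mig}(\mathbf{a}) = \min_{a \in \mathbf{a}} |a|.
\]
Given the matrix $\tilde{\mathbf{A}}$ with entries $\mathbf{a}_{ij}$, form the
matrix $H = (h_{ij})$ such that
\[
h_{ij} =
\begin{cases}
\mathrm{mag}(\mathbf{a}_{ij}) & \text{if } i \neq j, \\
\mathrm{mig}(\mathbf{a}_{ij}) & \text{if } i = j.
\end{cases}
\]
Then, basically, the interval Gauss–Seidel method will be convergent if $H$ is
diagonally dominant.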
For a careful review of convergence theory for the interval Gauss–Seidel
method and other interval methods for linear systems, see [29]. Also, see [32].
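As a small illustration of Definition 3.25, the following matlab sketch
forms $H$ for the preconditioned matrix of Example 3.35, with the interval
entries stored as separate arrays of lower and upper bounds (the names lo and
hi are choices made here), and tests $H$ for strict diagonal dominance.

```matlab
% Build H from Definition 3.25 for the preconditioned matrix of Example 3.35.
lo = [0.97 -0.03; -0.02 0.98];       % lower bounds of the entries
hi = [1.03  0.03;  0.02 1.02];       % upper bounds of the entries
mag = max(abs(lo), abs(hi));
mig = min(abs(lo), abs(hi));
mig(lo <= 0 & hi >= 0) = 0;          % an interval containing 0 has mignitude 0
n = size(lo, 1);
H = mag;
H(logical(eye(n))) = mig(logical(eye(n)));   % mignitudes on the diagonal
d = diag(H);
dominant = all(d > sum(H, 2) - d);   % strict diagonal dominance by rows
```

For this example $H$ has diagonal entries 0.97 and 0.98 and off-diagonal
entries 0.03 and 0.02, so it is strongly diagonally dominant.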
3.6 The Singular Value Decomposition
The singular value decomposition, which we will abbreviate “SVD,” is not
always the most eﬃcient way of analyzing a linear system, but is extremely
ﬂexible, and is sometimes used in signal processing (smoothing), sensitivity
analysis, statistical analysis, etc., especially if a large amount of information
about the numerical properties of the system is desired. The major libraries
for programmers (e.g. Lapack) and software systems (e.g. matlab, Mathe-
matica) have facilities for computing the SVD. The SVD is often used in the
same context as a QR factorization, but the component matrices in an SVD
are computed with an iterative technique related to techniques for computing
eigenvalues and eigenvectors (in Chapter 5 of this book).
The following theorem deﬁnes the SVD.
THEOREM 3.17
Let $A$ be an $m$ by $n$ real matrix, but otherwise arbitrary. Then there are
orthogonal matrices $U$ and $V$ and an $m$ by $n$ matrix
$\Sigma = [\Sigma_{ij}]$ such that $\Sigma_{ij} = 0$ for $i \neq j$,
$\Sigma_{ii} = \sigma_i \ge 0$ for $1 \le i \le p = \min\{m, n\}$, and
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p$, such that
\[
A = U \Sigma V^T.
\]
For a proof and further explanation, see G. W. Stewart, Introduction to
Matrix Computations [35] or G. H. Golub (see footnote 15) and C. F. van Loan,
Matrix Computations [16].
Note: The SVD for a particular matrix is not necessarily unique.
Note: The SVD is defined similarly for complex matrices
$A \in L(\mathbb{C}^n, \mathbb{C}^m)$.
Footnote 15: Gene Golub, a famous numerical analyst, a professor of Computer
Science and, for many years, department chairman, at Stanford University,
invented the efficient algorithm used today for computing the singular value
decomposition.
REMARK 3.11 A simple algorithm to find the singular-value decomposition
is: (1) find the nonzero eigenvalues of $A^T A$, i.e., $\lambda_i$,
$i = 1, 2, \ldots, r$, (2) find the orthogonal eigenvectors of $A^T A$ and
arrange them in the $n \times n$ matrix $V$, (3) form the $m \times n$ matrix
$\Sigma$ with diagonal entries $\sigma_i = \sqrt{\lambda_i}$, (4) let
$u_i = \sigma_i^{-1} A v_i$, $i = 1, 2, \ldots, r$, and compute $u_i$,
$i = r + 1, r + 2, \ldots, m$ using Gram-Schmidt orthogonalization. However,
a well-known efficient method for computing the SVD is the Golub-Reinsch
algorithm [36] which employs Householder bidiagonalization and a variant of
the QR method.
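The simple algorithm of Remark 3.11 can be sketched in matlab as follows, for
a small 3 by 2 test matrix. This is an illustration only: the rank tolerance
is an assumption made here, and the built-in null is used in place of
Gram-Schmidt to complete $U$ to an orthogonal matrix. In practice one would
call svd, since forming $A^T A$ can lose accuracy.

```matlab
% Remark 3.11 carried out for a 3 by 2 matrix (illustration only).
A = [1 2; 3 4; 5 6];
[m, n] = size(A);
[V, Lambda] = eig(A'*A);                  % steps (1) and (2)
[lambda, idx] = sort(diag(Lambda), 'descend');
V = V(:, idx);                            % order eigenvectors to match
sigma = sqrt(lambda);                     % step (3)
r = sum(sigma > 1e-12*sigma(1));          % numerical rank (hypothetical tolerance)
U = zeros(m, m);
for i = 1:r
    U(:, i) = A*V(:, i) / sigma(i);       % step (4)
end
% Complete U with an orthonormal basis of the orthogonal complement;
% null() is used here instead of Gram-Schmidt.
U(:, r+1:m) = null(U(:, 1:r)');
Sigma = zeros(m, n);
Sigma(1:n, 1:n) = diag(sigma);
recon_err = norm(U*Sigma*V' - A);
```

The singular values in sigma can be compared with svd(A); the signs of the
columns of U and V may differ, since the decomposition is not unique.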
Example 3.36
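Let
\[
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}.
\]
Then
\[
U \approx \begin{pmatrix}
-0.2298 & 0.8835 & 0.4082 \\
-0.5247 & 0.2408 & -0.8165 \\
-0.8196 & -0.4019 & 0.4082
\end{pmatrix},
\quad
\Sigma \approx \begin{pmatrix} 9.5255 & 0 \\ 0 & 0.5143 \\ 0 & 0 \end{pmatrix},
\quad \text{and} \quad
V \approx \begin{pmatrix} -0.6196 & -0.7849 \\ -0.7849 & 0.6196 \end{pmatrix}
\]
is a singular value decomposition of $A$. This approximate singular value
decomposition was obtained with the following matlab dialog.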
>> A = [1 2;3 4;5 6]
A =
1 2
3 4
5 6
>> [U,Sigma,V] = svd(A)
U =
-0.2298 0.8835 0.4082
-0.5247 0.2408 -0.8165
-0.8196 -0.4019 0.4082
Sigma =
9.5255 0
0 0.5143
0 0
V =
-0.6196 -0.7849
-0.7849 0.6196
>> U*Sigma*V'
ans =
1.0000 2.0000
3.0000 4.0000
5.0000 6.0000
>>
Note: If $A = U \Sigma V^T$ represents a singular value decomposition of $A$,
then, for $\tilde{A} = A^T$, $\tilde{A} = V \Sigma^T U^T$ represents a
singular value decomposition for $\tilde{A}$.
DEFINITION 3.26 The vectors $V(:, i)$, $1 \le i \le p$ are called the right
singular vectors of $A$, while the corresponding $U(:, i)$ are called the left
singular vectors of $A$ corresponding to the singular values $\sigma_i$.
The singular values are like eigenvalues, and the singular vectors are like
eigenvectors. In fact, we have
THEOREM 3.18
Let the $n$ by $n$ matrix $A$ be symmetric and positive definite. Let
$\{\lambda_i\}_{i=1}^{n}$ be the eigenvalues of $A$, ordered so that
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and let $v_i$ be the
eigenvector corresponding to $\lambda_i$. Furthermore, choose the $v_i$ so
$\{v_i\}_{i=1}^{n}$ is an orthonormal set, and form
$V = [v_1, \cdots, v_n]$ and
$\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_n)$. Then
$A = V \Lambda V^T$ represents a singular value decomposition of $A$.
This theorem follows directly from the deﬁnition of the SVD. We also have
THEOREM 3.19
Let the $n$ by $n$ matrix $A$ be invertible, and let $A = U \Sigma V^T$
represent a singular value decomposition of $A$. Then the 2-norm condition
number of $A$ is $\kappa_2(A) = \sigma_1 / \sigma_n$.
Thus, the condition number of a matrix is obtainable directly from the
SVD, but the SVD gives us more useful information about the sensitivity of
solutions than just that single number, as we’ll see shortly.
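Theorem 3.19 is easy to confirm numerically. The following matlab sketch
(variable names chosen here for illustration) computes the ratio of the
largest to the smallest singular value for the matrix of Example 3.30 and
compares it with the built-in cond.

```matlab
% Theorem 3.19: the 2-norm condition number from the singular values.
A = [2 -1 0; -1 2 -1; 0 -1 2];     % invertible matrix from Example 3.30
s = svd(A);                        % singular values in decreasing order
kappa_svd = s(1) / s(end);         % sigma_1 / sigma_n
kappa_builtin = cond(A, 2);        % built-in 2-norm condition number
```

The two values should agree to within roundoff.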
The singular value decomposition is related directly to the Moore–Penrose
pseudo-inverse. In fact, the pseudo-inverse can be deﬁned directly in terms
of the singular value decomposition.
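DEFINITION 3.27 Let $A \in L(\mathbb{R}^n, \mathbb{R}^m)$, let $A = U\Sigma V^T$
represent a singular value decomposition of $A$, and assume $r \le p$ is such
that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$, and
$\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_p = 0$. Then the Moore–Penrose
pseudo-inverse of $A$ is defined to be
$$A^+ = V\Sigma^+ U^T,$$
where $\Sigma^+ = \left(\Sigma^+_{ij}\right) \in L(\mathbb{R}^m, \mathbb{R}^n)$
is such that
$$\Sigma^+_{ij} = 0 \ \text{ if } i \ne j \text{ or } i > r, \quad\text{and}\quad
\Sigma^+_{ii} = 1/\sigma_i \ \text{ if } 1 \le i \le r.$$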
Part of the power of the singular value decomposition comes from the fol-
lowing.
THEOREM 3.20
Suppose $A \in L(\mathbb{R}^n, \mathbb{R}^m)$ and we wish to find approximate
solutions to $Ax = b$, where $b \in \mathbb{R}^m$. Then,
• If $Ax = b$ is inconsistent, then $x = A^+b$ represents the least squares
solution of minimum 2-norm.
• If $Ax = b$ is consistent (but possibly underdetermined), then $x = A^+b$
represents the solution of minimum 2-norm.
• In general, $x = A^+b$ represents the least squares solution to $Ax = b$ of
minimum norm.
The proof of Theorem 3.20 is left as an exercise (on page 144).
REMARK 3.12 If $m < n$, one would expect the system to be underdetermined
but full rank. In that case, $A^+b$ gives the solution $x$ such that $\|x\|_2$
is minimum; however, if the system were also inconsistent, then there would be
many least squares solutions, and $A^+b$ would be the least squares solution of
minimum norm. Similarly, if $m > n$, one would expect there to be a single least
squares solution; however, if the rank of $A$ is $r < p = n$, then there would
be many such least squares solutions, and $A^+b$ would be the least squares
solution of minimum norm.
Example 3.37
Consider $Ax = b$, where
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
\quad\text{and}\quad
b = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}.$$
Then
$$U \approx \begin{pmatrix} -0.2148 & 0.8872 & 0.4082 \\ -0.5206 & 0.2496 & -0.8165 \\ -0.8263 & -0.3879 & 0.4082 \end{pmatrix},
\quad
\Sigma \approx \begin{pmatrix} 16.8481 & 0 & 0 \\ 0 & 1.0684 & 0 \\ 0 & 0 & 0.0000 \end{pmatrix},$$
$$V \approx \begin{pmatrix} -0.4797 & -0.7767 & -0.4082 \\ -0.5724 & -0.0757 & 0.8165 \\ -0.6651 & 0.6253 & -0.4082 \end{pmatrix},
\quad\text{and}\quad
\Sigma^+ \approx \begin{pmatrix} 0.0594 & 0 & 0 \\ 0 & 0.9360 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
Since $\sigma_3 = 0$, we note that the system is not of full rank, so it could
be either inconsistent or underdetermined. We compute
$x \approx [0.9444, 0.1111, -0.7222]^T$, and we obtain (see footnote 16)
$\|Ax - b\|_2 \approx 2.5 \times 10^{-15}$. Thus, $Ax = b$, although apparently
underdetermined, is apparently consistent, and $x$ represents that solution of
$Ax = b$ which has minimum 2-norm.
As with other methods for computing solutions, we usually do not form the
pseudo-inverse $A^+$ to compute $A^+b$, but we use the following.
ALGORITHM 3.6
(Computing $A^+b$)
INPUT:
(a) the $m$ by $n$ matrix $A \in L(\mathbb{R}^n, \mathbb{R}^m)$,
(b) the right-hand-side vector $b \in \mathbb{R}^m$,
(c) a tolerance $\epsilon$ such that a singular value $\sigma_i$ is considered
to be equal to 0 if $\sigma_i/\sigma_1 < \epsilon$.
OUTPUT: an approximation $x$ to $A^+b$.
1. Compute the SVD of $A$, that is, compute approximations to
$U \in L(\mathbb{R}^m)$, $\Sigma \in L(\mathbb{R}^n, \mathbb{R}^m)$, and
$V \in L(\mathbb{R}^n)$ such that $A = U\Sigma V^T$.
2. $p \leftarrow \min\{m, n\}$.
3. $r \leftarrow p$.
4. FOR $i = 1$ to $p$.
IF $\sigma_i/\sigma_1 > \epsilon$ THEN
$\sigma_i^+ \leftarrow 1/\sigma_i$.
ELSE
i. $r \leftarrow i - 1$.
ii. EXIT FOR
END IF
END FOR
5. Compute $w = (w_1, \cdots, w_r)^T \in \mathbb{R}^r$,
$w \leftarrow U(:, 1:r)^T b$, where $U(:, 1:r) \in \mathbb{R}^{m \times r}$ is
the matrix whose columns are the first $r$ columns of $U$.
6. FOR $i = 1$ to $r$: $w_i \leftarrow \sigma_i^+ w_i$.
[Footnote 16: The computations in Example 3.37 were done using matlab, and were
thus done in IEEE double precision. The digits displayed here are the results
from that computation, rounded to four significant decimal digits with matlab's
intrinsic display routines.]
7. $x \leftarrow \sum_{i=1}^{r} w_i V(:, i)$.
END ALGORITHM 3.6.
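The steps of Algorithm 3.6 can be sketched as a short matlab script. This is only an illustration under stated assumptions, not code from the text: the variable names (tol, sigma_plus, and so on) and the tolerance value 1e-10 are our own choices, and the data are those of Example 3.37.

```matlab
% Sketch of Algorithm 3.6 applied to the data of Example 3.37.
% The tolerance below is an assumed value chosen for illustration.
A   = [1 2 3; 4 5 6; 7 8 9];
b   = [-1; 0; 1];
tol = 1e-10;

[U, S, V] = svd(A);          % step 1: A = U*S*V'
sigma = diag(S);             % singular values, largest first
p = min(size(A));            % step 2
r = p;                       % step 3
sigma_plus = zeros(p, 1);
for i = 1:p                  % step 4: invert only the "nonzero" sigma_i
    if sigma(i)/sigma(1) > tol
        sigma_plus(i) = 1/sigma(i);
    else
        r = i - 1;
        break
    end
end
w = U(:, 1:r)' * b;          % step 5
w = sigma_plus(1:r) .* w;    % step 6
x = V(:, 1:r) * w;           % step 7: x approximates A^+ b
```

For this data the script should stop with r = 2, and x should agree with the vector reported in Example 3.37 to the digits shown there.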
REMARK 3.13 Ill-conditioning (i.e., sensitivity to roundoff error) in the
computations in Algorithm 3.6 occurs when small singular values $\sigma_i$ are
used. For example, suppose $\sigma_i/\sigma_1 \approx 10^{-6}$, and there is an
error $\delta U(:, i)$ in the vector $b$, that is, we work with
$\tilde{b} = b + \delta U(:, i)$ (that is, we perturb $b$ by $\delta$ in the
direction of $U(:, i)$). Then, instead of $A^+b$, we obtain
$$A^+(b + \delta U(:, i)) = A^+b + A^+\delta U(:, i) = A^+b + \delta \frac{1}{\sigma_i} V(:, i). \qquad (3.41)$$
Thus, the norm of the error $\delta U(:, i)$ is magnified by $1/\sigma_i$. Now,
if, in addition, $b$ happened to be in the direction of $U(:, 1)$, that is,
$b = \delta_1 U(:, 1)$, then
$\|A^+b\|_2 = \|\delta_1 (1/\sigma_1) V(:, 1)\|_2 = (1/\sigma_1)\|b\|_2$. Thus,
the relative error, in this case, would be magnified by $\sigma_1/\sigma_i$.
In view of Remark 3.13, we are led to consider modifying the problem
slightly to reduce the sensitivity to roundoff error. For example, suppose that
we are data fitting, with $m$ data points $(t_i, y_i)$ (as in Section 3.4 on
page 117), and $A$ is the matrix as in Equation (3.19), where $m \gg n$. Then we
assume there is some error in the right-hand-side vector $b$. However, since
$\{U(:, i)\}$ forms an orthonormal basis for $\mathbb{R}^m$,
$$b = \sum_{i=1}^{m} \beta_i U(:, i) \quad\text{for some coefficients } \{\beta_i\}_{i=1}^{m}.$$
Therefore, $U^T b = (\beta_1, \ldots, \beta_m)^T$, and we see that $x$ will be
more sensitive to changes in the components $\beta_i$ of $b$ with larger indices
$i$, since these are divided by the smaller singular values. If we know that
typical errors in the data are on the order of $\varepsilon$, then, intuitively,
it makes sense not to use components of $b$ in which the magnification of errors
will be larger than that. That is, it makes sense in such cases to choose the
tolerance $\epsilon = \varepsilon$ in Algorithm 3.6.
Use of $\epsilon \ne 0$ in Algorithm 3.6 can be viewed as replacing the smallest
singular values of the matrix $A$ by 0. In the case that $A \in L(\mathbb{R}^n)$
is square and only $\sigma_n$ is replaced by zero, this amounts to replacing an
ill-conditioned matrix $A$ by a matrix that is exactly singular. One (of many
possible) theorems dealing with this replacement process is
THEOREM 3.21
Suppose $A$ is an $n$ by $n$ matrix, and suppose we replace $\sigma_n \ne 0$ in
the singular value decomposition of $A$ by 0, then form
$\tilde{A} = U\tilde{\Sigma}V^T$, where $A = U\Sigma V^T$ represents the singular
value decomposition of $A$, and
$\tilde{\Sigma} = \mathrm{diag}(\sigma_1, \cdots, \sigma_{n-1}, 0)$. Then
$$\|A - \tilde{A}\|_2 = \min_{\substack{B \in L(\mathbb{R}^n) \\ \mathrm{rank}(B) < n}} \|A - B\|_2.$$
Suppose now that $\tilde{A}$ has been obtained from $A$ by replacing the
smallest singular values of $A$ by 0, so the nonzero singular values of
$\tilde{A}$ are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$, and define
$x = \tilde{A}^+ b$. Then, perturbations of size $\|\Delta b\|$ in $b$ result in
perturbations of size at most $(\sigma_1/\sigma_r)\|\Delta b\|$ in $x$. This
prompts us to define a generalization of condition number as follows.
DEFINITION 3.28 Let $A$ be an $m$ by $n$ matrix with $m$ and $n$ arbitrary,
and assume the nonzero singular values of $A$ are
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. Then the generalized
condition number of $A$ is $\sigma_1/\sigma_r$.
Example 3.38
Consider
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{pmatrix},$$
whose singular value decomposition is approximately
$$U \approx \begin{pmatrix} 0.2093 & 0.9644 & 0.1617 \\ 0.5038 & 0.0353 & -0.8631 \\ 0.8380 & -0.2621 & 0.4785 \end{pmatrix},
\quad
\Sigma \approx \begin{pmatrix} 17.4125 & 0 & 0 \\ 0 & 0.8752 & 0 \\ 0 & 0 & 0.1969 \end{pmatrix},$$
and
$$V \approx \begin{pmatrix} 0.4647 & -0.8333 & 0.2995 \\ 0.5538 & 0.0095 & -0.8326 \\ 0.6910 & 0.5528 & 0.4659 \end{pmatrix}.$$
Suppose we want to solve the system $Ax = b$, where $b = [1, -1, 1]^T$, but that,
due to noise in the data, we do not wish to deal with any system of equations
with condition number equal to 25 or greater. How can we describe the set of
solutions, based on the best information we can obtain from the noisy data?
We first observe that $\kappa_2(A) = \sigma_1/\sigma_n \approx 88.4483$.
However, $\sigma_1/\sigma_2 \approx 19.8963 < 25$. We may thus form a new matrix
$\tilde{A} = U\tilde{\Sigma}V^T$, where $\tilde{\Sigma}$ is obtained from
$\Sigma$ by replacing $\sigma_3$ by 0. This is equivalent to projecting $A$ onto
the set of singular matrices according to Theorem 3.21. We then use
Algorithm 3.6 (applied to $\tilde{A}$) to determine $x$ as $x = \tilde{A}^+ b$.
We obtain $x \approx (-0.6205, 0.0245, 0.4428)^T$. Thus, to within the accuracy
of $1/25 = 4\%$, we can only determine that the solution lies along the line
$$\begin{pmatrix} -0.6205 \\ 0.0245 \\ 0.4428 \end{pmatrix} + y_3 V_{:,3}, \quad y_3 \in \mathbb{R}.$$
This technique is a common type of analysis in data fitting. The parameter
$y_3$ (or multiple parameters, in case of higher-order rank deficiency) needs to
be chosen through other information available with the application.
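The truncation used in Example 3.38 can be reproduced with a few lines of matlab. The following is a sketch under our own naming (kappa_max, gen_cond, and dropped are illustrative names, not from the text); it keeps only the singular values whose ratio to $\sigma_1$ stays below the chosen bound of 25 and relies on svd returning the singular values in decreasing order.

```matlab
% Sketch: truncated-SVD solution for the data of Example 3.38.
A = [1 2 3; 4 5 6; 7 8 10];
b = [1; -1; 1];
kappa_max = 25;                    % largest condition number we accept

[U, S, V] = svd(A);
sigma = diag(S);

% Number of leading singular values with sigma_1/sigma_i < kappa_max.
r = sum(sigma(1)./sigma < kappa_max);

% x = (A-tilde)^+ b, built from the first r singular triplets only.
x = V(:, 1:r) * ((U(:, 1:r)' * b) ./ sigma(1:r));

gen_cond = sigma(1)/sigma(r);      % generalized condition number of A-tilde
dropped  = V(:, r+1:end);          % directions the data cannot determine
```

Here r should come out as 2, so dropped holds the single direction $V_{:,3}$ along which the solution is left undetermined.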
3.7 Applications
Consider the following difference equation model [3], which describes the
dynamics of a population divided into three stages.
$$\begin{cases}
J(t+1) = (1 - \gamma_1) s_1 J(t) + b B(t) \\
N(t+1) = \gamma_1 s_1 J(t) + (1 - \gamma_2) s_2 N(t) \\
B(t+1) = \gamma_2 s_2 N(t) + s_3 B(t)
\end{cases} \qquad (3.42)$$
The variables $J(t)$, $N(t)$ and $B(t)$ represent the number of juveniles,
non-breeders, and breeders, respectively, at time $t$. The parameter $b > 0$ is
the birth rate, while $\gamma_1, \gamma_2 \in (0, 1)$ represent the fraction (in
one time unit) of juveniles that become non-breeders and non-breeders that
become breeders, respectively. Parameters $s_1, s_2, s_3 \in (0, 1)$ are the
survival rates of juveniles, non-breeders and breeders, respectively.
To analyze the model numerically, we let $b = 0.6$, $\gamma_1 = 0.8$,
$\gamma_2 = 0.7$, $s_1 = 0.7$, $s_2 = 0.8$, $s_3 = 0.9$. Also notice the model
can be written as
$$\begin{pmatrix} J(t+1) \\ N(t+1) \\ B(t+1) \end{pmatrix} =
\begin{pmatrix} 0.14 & 0 & 0.6 \\ 0.56 & 0.24 & 0 \\ 0 & 0.56 & 0.9 \end{pmatrix}
\begin{pmatrix} J(t) \\ N(t) \\ B(t) \end{pmatrix}$$
or, in matrix form,
$$X(t+1) = AX(t),$$
where $X(t) = (J(t), N(t), B(t))^T$, and
$$A = \begin{pmatrix} 0.14 & 0 & 0.6 \\ 0.56 & 0.24 & 0 \\ 0 & 0.56 & 0.9 \end{pmatrix}.$$
Suppose we know all the eigenvectors $v_i$, $i = 1, 2, 3$, and their associated
eigenvalues $\lambda_i$, $i = 1, 2, 3$, of the matrix $A$. If the eigenvectors
are linearly independent, then, from linear algebra, any initial vector $X(0)$
can be expressed as a linear combination of the eigenvectors,
$$X(0) = c_1 v_1 + c_2 v_2 + c_3 v_3,$$
and then
$$\begin{aligned}
X(1) = AX(0) &= A(c_1 v_1 + c_2 v_2 + c_3 v_3) \\
&= c_1 A v_1 + c_2 A v_2 + c_3 A v_3 \\
&= c_1 \lambda_1 v_1 + c_2 \lambda_2 v_2 + c_3 \lambda_3 v_3.
\end{aligned}$$
Applying the same techniques, we get
$$\begin{aligned}
X(2) = AX(1) &= A(c_1 \lambda_1 v_1 + c_2 \lambda_2 v_2 + c_3 \lambda_3 v_3) \\
&= c_1 \lambda_1^2 v_1 + c_2 \lambda_2^2 v_2 + c_3 \lambda_3^2 v_3.
\end{aligned}$$
Continuing the above will lead to the general solution of the population
dynamical model (3.42):
$$X(t) = \sum_{i=1}^{3} c_i \lambda_i^t v_i.$$
Now, to compute the eigenvalues and eigenvectors of $A$, we could simply type
the following in the matlab command window:
>> A=[0.14 0 0.6; 0.56 0.24 0; 0 0.56 0.9]
A =
0.1400 0 0.6000
0.5600 0.2400 0
0 0.5600 0.9000
>> [v,lambda]=eig(A)
v =
-0.1989 + 0.5421i -0.1989 - 0.5421i 0.4959
0.6989 0.6989 0.3160
-0.3728 - 0.1977i -0.3728 + 0.1977i 0.8089
lambda =
0.0806 + 0.4344i 0 0
0 0.0806 - 0.4344i 0
0 0 1.1188
From the result, we see that the spectral radius of $A$ is
$\lambda_3 = 1.1188$ and its corresponding eigenvector is
$v_3 = (0.4959, 0.3160, 0.8089)^T$. Since the other two eigenvalues have modulus
less than 1, their contributions decay, and hence, for large $t$,
$X(t) = A^t X(0) \approx c_3 (1.1188)^t v_3$. This shows the population size
will increase geometrically as time increases.
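The geometric growth can also be observed by iterating the model directly. The following sketch assumes a hypothetical initial population of 10 individuals in each stage and 60 time steps; both values are our own choices for illustration.

```matlab
% Sketch: iterate X(t+1) = A*X(t) and compare the observed growth factor
% with the dominant eigenvalue. X and T are assumed illustrative values.
A = [0.14 0 0.6; 0.56 0.24 0; 0 0.56 0.9];
X = [10; 10; 10];                 % hypothetical initial (J, N, B)
T = 60;
total = zeros(T+1, 1);
total(1) = sum(X);
for t = 1:T
    X = A * X;
    total(t+1) = sum(X);
end
growth = total(end) / total(end-1);   % observed one-step growth factor
rho = max(abs(eig(A)));               % spectral radius of A
stage_fractions = X / sum(X);         % long-run stage distribution
```

After enough steps, growth should agree with rho, and stage_fractions should be proportional to the eigenvector $v_3$ reported above.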
3.8 Exercises
1. Let
$$A = \begin{pmatrix} 5 & -2 \\ -4 & 7 \end{pmatrix}.$$
Find $\|A\|_1$, $\|A\|_\infty$, $\|A\|_2$, and $\rho(A)$. Verify that
$\rho(A) \le \|A\|_1$, $\rho(A) \le \|A\|_\infty$ and $\rho(A) \le \|A\|_2$.
2. Show that back solving for Gaussian elimination (that is, show that
completion of Algorithm 3.2) requires $(n^2 + n)/2$ multiplications and
divisions and $(n^2 - n)/2$ additions and subtractions.
3. Consider Example 3.14 (on page 85).
(a) Fill in the details of the computations. In particular, by multiplying
the matrices together, show that $M_1^{-1}$ and $M_2^{-1}$ are as stated, that
$A = LU$, and that $L = M_1^{-1} M_2^{-1}$.
(b) Solve $Ax = b$ as mentioned in the example, by first solving $Ly = b$,
then solving $Ux = y$. (You may use matlab, but print the entire
dialog.)
4. Show that performing the forward phase of Gaussian elimination for
$Ax = b$ (that is, completing Algorithm 3.1) requires
$\frac{1}{3} n^3 + O(n^2)$ multiplications and divisions.
5. Show that the inverse of a nonsingular lower triangular matrix is lower
triangular.
6. Explain why A = LU, where L and U are as in Equation (3.5) on
page 86.
7. Verify the details in Example 3.15 by actually computing the solutions
to the three linear systems, and by multiplying $A$ and $A^{-1}$. (If you use
matlab, print the details.)
8. Program the tridiagonal version of Gaussian elimination represented by
equations (3.8) and (3.9) on page 95. Use your program to approximately solve
$$u'' = -1, \quad u(0) = u(1) = 0$$
using the technique from Example 3.18 (on page 93), with $h = 1/4$, $1/8$,
$1/64$, and $1/4096$. Compare with the exact solution
$u(x) = \frac{1}{2} x(1 - x)$.
9. Store the matrices from Problem 8 in matlab’s sparse matrix format,
and solve the systems from Problem 8 in matlab, using the sparse
matrix format. Compare with the results you obtained from your tridi-
agonal system solver.
10. Let
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{pmatrix}
\quad\text{and}\quad
b = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}.$$
(a) Compute $\kappa_\infty(A)$ approximately.
(b) Use floating point arithmetic with $\beta = 10$ and $t = 3$ (3-digit
decimal arithmetic), rounding-to-nearest, and Algorithms 3.1 and 3.2
to find an approximation to the solution $x$ to $Ax = b$.
(c) Execute Algorithm 3.4 by hand, using $t = 3$, $\beta = 10$, and outwardly
rounded interval arithmetic (and rounding-to-nearest for computing $Y$).
(d) Find the exact solution to $Ax = b$ by hand.
(e) Compare the results you have obtained.
11. Derive the normal equations (3.22) from (3.21).
12. Let
$$A = \begin{pmatrix} 2 & 1 & 1 \\ 4 & 4 & 1 \\ 6 & -5 & 8 \end{pmatrix}.$$
(a) Find the LU factorization of $A$, such that $L$ is lower triangular and
$U$ is unit upper triangular.
(b) Perform forward solving and then back solving to find a solution $x$ for
the system of equations $Ax = b = [4\ 7\ 15]^T$.
13. Find the Cholesky factorization of
$$A = \begin{pmatrix} 1 & -1 & 2 \\ -1 & 5 & 4 \\ 2 & 4 & 29 \end{pmatrix}.$$
Also explain why $A$ is positive definite.
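14. Let
$$A = \begin{pmatrix} 0.1\alpha & 0.1\alpha \\ 1.0 & 1.5 \end{pmatrix}.$$
Determine $\alpha$ such that $\kappa_\infty(A)$, the condition number in the
induced $\infty$-norm, is minimized.
15. Let $A$ be the $n \times n$ lower triangular matrix with elements
$$a_{ij} = \begin{cases} 1 & \text{if } i = j, \\ -1 & \text{if } i = j + 1, \\ 0 & \text{otherwise.} \end{cases}$$
Determine the condition number of $A$ using the matrix norm $\|\cdot\|_\infty$.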
16. Consider the matrix system $Au = b$ given by
$$\begin{pmatrix}
\frac{1}{2} & 0 & 0 & 0 \\
\frac{1}{4} & \frac{1}{2} & 0 & 0 \\
\frac{1}{8} & \frac{1}{4} & \frac{1}{2} & 0 \\
\frac{1}{16} & \frac{1}{8} & \frac{1}{4} & \frac{1}{2}
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{pmatrix} =
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 1 \end{pmatrix}.$$
(a) Determine $A^{-1}$ by hand.
(b) Determine the infinity-norm condition number of the matrix $A$.
(c) Let $\tilde{u}$ be the solution when the right-hand side vector $b$ is
perturbed to $\tilde{b} = (1.01\ 0\ 0\ 0.99)^T$. Estimate
$\|u - \tilde{u}\|_\infty$, without computing $\tilde{u}$.
17. Complete the computations, to check that $x^{(4)}$ is as given in
Example 3.35 on page 131. (You may use intlab. Also see the code
gauss_seidel_step.m available from
http://www.siam.org/books/ot110.)
18. Repeat Example 3.28, but with the interval Gauss–Seidel method, instead
of interval Gaussian elimination, starting with $x_i^{(0)} = [-10, 10]$,
$1 \le i \le 3$. Compare the results.
19. Let $A$ be the $n \times n$ tridiagonal matrix with
$$a_{ij} = \begin{cases} 4 & \text{if } i = j, \\ -1 & \text{if } i = j + 1 \text{ or } i = j - 1, \\ 0 & \text{otherwise.} \end{cases}$$
Prove that the Gauss–Seidel and Jacobi methods converge for this matrix.
20. Consider the linear system
$$\begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} 7 \\ 10 \end{pmatrix}.$$
Using the starting vector $x^{(0)} = (0, 0)^T$, carry out two iterations of the
Gauss–Seidel method to solve the system.
21. Prove Theorem 3.20 on page 136. (Hint: You may need to consider
various cases. In any case, you’ll probably want to use the properties of
orthogonal matrices, as in the proof of Theorem 3.19.)
22. Given $U$, $\Sigma$, and $V$ as given in Example 3.37 (on page 136), compute
$A^+b$ by using Algorithm 3.6. How does the $x$ that you obtain compare
with the $x$ reported in Example 3.37?
23. Find the singular value decomposition of the matrix
$$A = \begin{pmatrix} 1 & 2 \\ 1 & 1 \\ 1 & 3 \end{pmatrix}.$$
Chapter 4
Approximating Functions and Data
4.1 Introduction
A fundamental task in scientiﬁc computing is to approximate a function or
data set by a simpler function. For example, to evaluate functions such as
sin, cos, exp, etc., developers of a programming language or even designers of
computer chip circuitry reduce computing the function value to a combination
of additions, subtractions, multiplications, divisions, comparisons, and table
look-up. We have seen approximation of a general function by a polynomial
in Example 1.3 on page 4. There, we approximated sin(x) to a specified
accuracy over a small interval by a Taylor polynomial.
Approximation of data sets by functions that are easy to evaluate occurs
throughout computer science (such as computer graphics), statistics, engineer-
ing, and the sciences. We have seen an example of this (approximating a data
set in the least squares sense by a polynomial of degree 2) in Example 3.29
on page 119.
In this chapter, we study several techniques for approximating functions and
data sets by polynomials, piecewise polynomials (functions deﬁned by diﬀerent
polynomials over diﬀerent subintervals), and trigonometric functions.
4.2 Taylor Polynomial Approximations
Recall from Chapter 1 (Taylor's Theorem, on page 3) that if $f \in C^n[a, b]$
and $f^{(n+1)}(x)$ exists on $[a, b]$, then for $x_0 \in [a, b]$ there exists a
$\xi(x)$ between $x_0$ and $x$ such that $f(x) = P_n(x) + R_n(x)$, where
$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k,$$
and
$$R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt
= \frac{f^{(n+1)}(\xi(x)) (x - x_0)^{n+1}}{(n+1)!}.$$
$P_n(x)$ is the Taylor polynomial of $f(x)$ about $x = x_0$ and $R_n(x)$ is the
remainder term. Taylor polynomials provide good approximations near $x = x_0$.
However, away from $x = x_0$, Taylor polynomials can be poor approximations.
In addition, Taylor series require smooth functions. Nonetheless, automatic
differentiation techniques, as explained in Section 6.2 on page 215, can be
used to obtain high-order derivatives for complicated but smooth functions.
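To illustrate how the quality of a Taylor polynomial depends on the distance from $x_0$, the following sketch evaluates the degree-5 Taylor polynomial of $\sin(x)$ about $x_0 = 0$ at one point near $x_0$ and one point far from it. The choice of function, degree, and evaluation points is ours, made only for illustration.

```matlab
% Sketch: degree-5 Taylor polynomial of sin about x0 = 0, evaluated
% near x0 (x = 0.1) and away from x0 (x = 3).
n  = 5;
x  = [0.1, 3];
dk = [0 1 0 -1];              % derivatives of sin at 0 cycle with period 4
P  = zeros(size(x));
for k = 0:n
    P = P + dk(mod(k, 4) + 1) / factorial(k) * x.^k;
end
err = abs(sin(x) - P);        % err(1) is near x0, err(2) is far from x0
```

The error at $x = 0.1$ is many orders of magnitude smaller than the error at $x = 3$, in line with the remainder term $R_n(x)$, which grows like $|x - x_0|^{n+1}$.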
4.3 Polynomial Interpolation
Given $n + 1$ data points $\{(x_i, y_i)\}_{i=0}^{n}$, polynomial interpolation
is the process of finding a polynomial $p_n(x)$ of degree $n$ or less such that
$p_n(x_i) = y_i$ for $i = 0, 1, \ldots, n$. We describe here several ways of
finding and representing the interpolating polynomial.
4.3.1 The Vandermonde System
Example 4.1
Consider the data set from Example 3.29, namely,

  i    x_i   y_i
  0     0     1
  1     1     4
  2     2     5
  3     3     8

However, instead of computing a least squares fit as in Example 3.29, we will
pass a polynomial through each of the data points exactly. The polynomial
must obey $p(x_i) = y_i$, $i = 0, 1, 2, 3$. Since there are four equations, we
expect that we should have four unknowns for the system of equations and
unknowns to be well-determined. If we write the polynomial in power form, that
is,
$$p_n(x) = a_0 + a_1 x + \cdots + a_n x^n, \qquad (4.1)$$
we see that $n = 3$ for there to be four unknowns $a_j$. Explicitly, the four
equations are thus
$$\begin{array}{ll}
p_3(x_0) = y_0: & a_0 = 1, \\
p_3(x_1) = y_1: & a_0 + a_1 + a_2 + a_3 = 4, \\
p_3(x_2) = y_2: & a_0 + 2a_1 + 4a_2 + 8a_3 = 5, \\
p_3(x_3) = y_3: & a_0 + 3a_1 + 9a_2 + 27a_3 = 8,
\end{array}$$
or, in matrix form:
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix} =
\begin{pmatrix} 1 \\ 4 \\ 5 \\ 8 \end{pmatrix}.$$
With a matlab dialog as in Example 3.29, we have:
>> A = [1 0 0 0
1 1 1 1
1 2 4 8
1 3 9 27]
A =
1 0 0 0
1 1 1 1
1 2 4 8
1 3 9 27
>> b = [1;4;5;8]
b =
1
4
5
8
>> a = A\b
a =
1.0000
5.3333
-3.0000
0.6667
>> tt = linspace(0,3);
>> yy = a(1) + a(2)*tt + a(3)*tt.^2 + a(4)*tt.^3;
>> axis([-0.1,3.1,0.9,8.1])
>> hold
>> plot(A(:,2),b,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(tt,yy)
This dialog results in the following plot of the data points and interpolating
polynomial.
[Figure: the four data points and the interpolating cubic, plotted for
0 ≤ x ≤ 3 with vertical axis from 1 to 8.]
The system of equations in Example 4.1 is called the Vandermonde system,
and the matrix is called a Vandermonde matrix. The general form for a
Vandermonde system is
$$Aa = \begin{pmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^n \\
1 & x_1 & x_1^2 & \cdots & x_1^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^n
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} =
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}. \qquad (4.2)$$
It can be shown that, if the points $\{x_i\}_{i=0}^{n}$ are distinct, the
corresponding Vandermonde matrix is non-singular, and it follows that the
coefficients of the interpolating polynomial are unique:
THEOREM 4.1
For any $n + 1$ distinct real numbers $x_0, x_1, \ldots, x_n$ and for arbitrary
real numbers $y_0, y_1, \ldots, y_n$, there exists a unique interpolating
polynomial of degree at most $n$ such that $p(x_j) = y_j$,
$j = 0, 1, \ldots, n$.
Although the interpolating polynomial is in general unique, the power
form (4.1) may not be the easiest form with which to work nor the most
numerically stable to evaluate in a particular application. We now study
some alternative forms.
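For general points, the Vandermonde system (4.2) can be assembled column by column rather than typed in by hand. The following is a minimal sketch using the data of Example 4.1; the variable names are our own.

```matlab
% Sketch: build and solve the Vandermonde system (4.2) for given data.
x = [0; 1; 2; 3];
y = [1; 4; 5; 8];
n = numel(x) - 1;
A = zeros(n + 1);
for j = 0:n
    A(:, j + 1) = x.^j;       % column j+1 holds x_i^j (note 0^0 = 1 in matlab)
end
a = A \ y;                    % coefficients a_0, ..., a_n of the power form
residual = max(abs(A*a - y)); % how well the interpolation conditions hold
```

The coefficients agree with those displayed in the dialog of Example 4.1, namely $a = (1, 16/3, -3, 2/3)^T$ up to rounding.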
4.3.2 The Lagrange Form
We first define a useful set of polynomials of degree $n$ denoted by
$\ell_0, \ell_1, \ldots, \ell_n$ for points $x_0, x_1, \ldots, x_n \in \mathbb{R}$
as
$$\ell_k(x) = \prod_{\substack{i=0 \\ i \ne k}}^{n} \frac{x - x_i}{x_k - x_i}, \quad k = 0, 1, \ldots, n. \qquad (4.3)$$
Notice that
(i) $\ell_k(x)$ is of degree $n$ for each $k = 0, 1, \ldots, n$.
(ii) $\ell_k(x_j) = \begin{cases} 0 & \text{if } j \ne k, \\ 1 & \text{if } j = k, \end{cases}$
that is, $\ell_k(x_j) = \delta_{kj}$.
Now let
$$p(x) = \sum_{k=0}^{n} y_k \ell_k(x).$$
Then
$$p(x_j) = \sum_{k=0}^{n} y_k \ell_k(x_j) = \sum_{k=0}^{n} y_k \delta_{jk} = y_j \quad\text{for } j = 0, 1, \ldots, n.$$
Thus,
$$p(x) = \sum_{k=0}^{n} y_k \ell_k(x)$$
is a polynomial of degree at most $n$ that passes through points $(x_j, y_j)$,
$j = 0, 1, 2, \ldots, n$. This is called the Lagrange form of the interpolating
polynomial, and the set of functions $\{\ell_k\}_{k=0}^{n}$ is called the
Lagrange basis for the space of polynomials of degree at most $n$ associated
with the set of points $\{x_i\}_{i=0}^{n}$.
Summarizing, we obtain the Lagrange form of the (unique) interpolating
polynomial:
$$p(x) = \sum_{k=0}^{n} y_k \ell_k(x) \quad\text{with}\quad
\ell_k(x) = \prod_{\substack{j=0 \\ j \ne k}}^{n} \frac{x - x_j}{x_k - x_j}. \qquad (4.4)$$
An important feature of the Lagrange basis is that it is collocating. The
salient property of a collocating basis is that the matrix of the system of
equations to be solved for the coefficients is the identity matrix. That is, the
matrix in the system of equations $\{p(x_i) = y_i\}_{i=0}^{n}$ to solve for the
$c_k$ in the representation
$$p(x) = \sum_{k=0}^{n} c_k \ell_k(x)$$
is
$$\begin{pmatrix}
\ell_0(x_0) & \ell_1(x_0) & \cdots & \ell_n(x_0) \\
\ell_0(x_1) & \ell_1(x_1) & \cdots & \ell_n(x_1) \\
\vdots & \vdots & & \vdots \\
\ell_0(x_n) & \ell_1(x_n) & \cdots & \ell_n(x_n)
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{pmatrix} =
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad (4.5)$$
and this matrix is the identity matrix. (Contrast this to the Vandermonde
matrix, where we use $x^k$ instead of $\ell_k(x)$. The Vandermonde matrix
becomes ill-conditioned for $n$ moderately sized, while the identity matrix is
perfectly conditioned; indeed, we need do no work to solve (4.5).)
Example 4.2
For the data set as in Example 4.1, that is, for

  i    x_i   y_i
  0     0     1
  1     1     4
  2     2     5
  3     3     8

we have
$$\begin{aligned}
\ell_0(x) &= \frac{(x-1)(x-2)(x-3)}{(0-1)(0-2)(0-3)} = -\frac{1}{6}(x-1)(x-2)(x-3), \\
\ell_1(x) &= \frac{x(x-2)(x-3)}{(1)(1-2)(1-3)} = \frac{1}{2}x(x-2)(x-3), \\
\ell_2(x) &= \frac{x(x-1)(x-3)}{(2)(2-1)(2-3)} = -\frac{1}{2}x(x-1)(x-3), \\
\ell_3(x) &= \frac{x(x-1)(x-2)}{(3)(3-1)(3-2)} = \frac{1}{6}x(x-1)(x-2),
\end{aligned}$$
and the interpolating polynomial in Lagrange form is
$$\begin{aligned}
p_3(x) = {} & 1\left(-\frac{1}{6}(x-1)(x-2)(x-3)\right) + 4\left(\frac{1}{2}x(x-2)(x-3)\right) \\
& + 5\left(-\frac{1}{2}x(x-1)(x-3)\right) + 8\left(\frac{1}{6}x(x-1)(x-2)\right).
\end{aligned}$$
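As a check on Example 4.2, the Lagrange form (4.4) can be evaluated directly from the data, without forming any coefficients. The sketch below evaluates it at the hypothetical point xe = 1.5, a value we chose for illustration.

```matlab
% Sketch: evaluate the Lagrange form (4.4) at a single point xe.
x  = [0 1 2 3];
y  = [1 4 5 8];
xe = 1.5;                     % assumed evaluation point
p  = 0;
for k = 1:numel(x)
    others = x([1:k-1, k+1:end]);                   % all nodes except x(k)
    ell_k  = prod((xe - others) ./ (x(k) - others)); % ell_k(xe)
    p = p + y(k) * ell_k;
end
```

The result is p = 4.5, the same value obtained from the power form $1 + \frac{16}{3}x - 3x^2 + \frac{2}{3}x^3$ of Example 4.1 at $x = 1.5$, as uniqueness of the interpolating polynomial requires.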
4.3.3 The Newton Form
Although the Lagrange polynomials form a collocating basis, useful in the-
ory for symbolically deriving formulas such as for numerical integration, the
Lagrange representation (4.4) is generally not used to numerically evaluate
interpolating polynomials. This is because
(1) it requires many operations to evaluate p(x) for many diﬀerent values
of x, and
(2) all ℓ
k
’s change if another point (x
n+1
, y
n+1
) is added.
These problems are alleviated in the Newton form of the interpolating poly-
nomial. To describe this form we use the following.
DEFINITION 4.1 $y[x_j, x_{j+1}] = \dfrac{y_{j+1} - y_j}{x_{j+1} - x_j}$ is the
first divided difference.
DEFINITION 4.2
$$y[x_j, x_{j+1}, \ldots, x_{j+k}] = \frac{y[x_{j+1}, \ldots, x_{j+k}] - y[x_j, \ldots, x_{j+k-1}]}{x_{j+k} - x_j}$$
is the $k$-th order divided difference (and is defined iteratively).
Consider first the linear interpolant through $(x_0, y_0)$ and $(x_1, y_1)$:
$$p_1(x) = y_0 + (x - x_0)\,y[x_0, x_1],$$
since $p_1(x)$ is of degree 1, $p_1(x_0) = y_0$, and
$$p_1(x_1) = y_0 + (x_1 - x_0)\frac{y_1 - y_0}{x_1 - x_0} = y_1.$$
Consider now the quadratic interpolant through $(x_0, y_0)$, $(x_1, y_1)$, and
$(x_2, y_2)$. We have
$$p_2(x) = p_1(x) + (x - x_0)(x - x_1)\,y[x_0, x_1, x_2],$$
since $p_2(x)$ is of degree 2, $p_2(x_0) = y_0$, $p_2(x_1) = y_1$, and
$$\begin{aligned}
p_2(x_2) &= p_1(x_2) + (x_2 - x_0)(x_2 - x_1)\frac{y[x_1, x_2] - y[x_0, x_1]}{x_2 - x_0} \\
&= y_0 + (x_2 - x_0)\frac{y_1 - y_0}{x_1 - x_0} + (x_2 - x_1)\frac{y_2 - y_1}{x_2 - x_1} - (x_2 - x_1)\frac{y_1 - y_0}{x_1 - x_0} \\
&= y_0 + y_2 - y_1 + \frac{1}{x_1 - x_0}\left[(x_2 - x_0 - x_2 + x_1)y_1 + (-x_2 + x_0 - x_1 + x_2)y_0\right] \\
&= y_2.
\end{aligned}$$
Continuing this process, one obtains:
$$\begin{aligned}
p_n(x) &= p_{n-1}(x) + (x - x_0)(x - x_1) \cdots (x - x_{n-1})\,y[x_0, x_1, \ldots, x_n] \\
&= y_0 + (x - x_0)\,y[x_0, x_1] + (x - x_0)(x - x_1)\,y[x_0, x_1, x_2] + \cdots \\
&\quad + \left(\prod_{i=0}^{n-1} (x - x_i)\right) y[x_0, \ldots, x_n].
\end{aligned}$$
This is called Newton's divided-difference formula for the interpolating
polynomial through the points $\{(x_j, y_j)\}_{j=0}^{n}$. This is a
computationally efficient form, because the divided differences can be rapidly
calculated using the following tabular arrangement, which is easily implemented
on a computer.
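$$\begin{array}{c|c|c|c|c|c}
j & x_j & y_j & y[x_j, x_{j+1}] & y[x_j, x_{j+1}, x_{j+2}] & y[x_j, x_{j+1}, x_{j+2}, x_{j+3}] \\ \hline
0 & x_0 & y_0 & & & \\
1 & x_1 & y_1 & \frac{y_1 - y_0}{x_1 - x_0} = y[x_0, x_1] & & \\
2 & x_2 & y_2 & \frac{y_2 - y_1}{x_2 - x_1} = y[x_1, x_2] & \frac{y[x_1, x_2] - y[x_0, x_1]}{x_2 - x_0} = y[x_0, x_1, x_2] & \\
3 & x_3 & y_3 & \frac{y_3 - y_2}{x_3 - x_2} = y[x_2, x_3] & \frac{y[x_2, x_3] - y[x_1, x_2]}{x_3 - x_1} = y[x_1, x_2, x_3] & \frac{y[x_1, x_2, x_3] - y[x_0, x_1, x_2]}{x_3 - x_0} \\
4 & x_4 & y_4 & \frac{y_4 - y_3}{x_4 - x_3} = y[x_3, x_4] & \frac{y[x_3, x_4] - y[x_2, x_3]}{x_4 - x_2} = y[x_2, x_3, x_4] & \frac{y[x_2, x_3, x_4] - y[x_1, x_2, x_3]}{x_4 - x_1}
\end{array}$$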
Example 4.3
Consider $y(x) = \displaystyle\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}t^2}\,dt$
(the standard normal distribution function), with the following divided
difference table.

  j   x_j   y_j      y[x_j,x_{j+1}]                 y[x_j,...,x_{j+2}]              y[x_j,...,x_{j+3}]
  0   1.4   0.9192
  1   1.6   0.9452   (0.9452-0.9192)/0.2 = 0.130
  2   1.8   0.9641   0.0945                         (0.0945-0.130)/0.4 = -0.08875
  3   2.0   0.9772   0.0655                         -0.0725                         (-0.0725+0.08875)/0.6 = 0.02708

Thus,
$$p_1(x) = 0.9192 + (x - 1.4)\,0.130$$
is the line through $(1.4, 0.9192)$ and $(1.6, 0.9452)$. Hence,
$y(1.65) \approx p_1(1.65) \approx 0.9517$. Also,
$$p_2(x) = 0.9192 + (x - 1.4)\,0.130 + (x - 1.4)(x - 1.6)(-0.08875)$$
is a quadratic polynomial through $(x_0, y_0)$, $(x_1, y_1)$, and $(x_2, y_2)$.
Hence, $y(1.65) \approx p_2(1.65) \approx 0.9506$. Finally,
$$p_3(x) = p_2(x) + (x - 1.4)(x - 1.6)(x - 1.8)(0.027083)$$
is the cubic polynomial through all four points and
$y(1.65) \approx p_3(1.65) \approx 0.9505$, which is accurate to four digits.
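The table in Example 4.3 can be generated in place with two nested loops, and the Newton form can then be evaluated by nested multiplication. The following sketch uses our own variable names; the array d is overwritten so that, at the end, d(j) holds $y[x_0, \ldots, x_{j-1}]$ (matlab indices start at 1).

```matlab
% Sketch: divided differences and Newton-form evaluation for Example 4.3.
x = [1.4 1.6 1.8 2.0];
y = [0.9192 0.9452 0.9641 0.9772];
n = numel(x);
d = y;                          % d is overwritten column by column
for k = 2:n
    for j = n:-1:k
        d(j) = (d(j) - d(j-1)) / (x(j) - x(j-k+1));
    end
end
% d now holds y[x0], y[x0,x1], y[x0,x1,x2], y[x0,x1,x2,x3].

xe = 1.65;                      % evaluation point used in the example
p  = d(n);
for j = n-1:-1:1                % nested (Horner-like) evaluation
    p = d(j) + (xe - x(j)) * p;
end
```

Working backwards through j lets each entry be overwritten without destroying values that are still needed, so only one array of length n is required.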
If the points are equally spaced, i.e., $x_{j+1} - x_j = h$ for all $j$,
Newton's divided difference formula can be simplified. (See, e.g., [7].) Define
the forward differences by $\Delta^0 y_j = y_j$ and
$\Delta^k y_j = \Delta^{k-1} y_{j+1} - \Delta^{k-1} y_j$. The resulting formula,
called Newton's forward difference formula, is
$$y[x_j, x_{j+1}, \ldots, x_{j+k}] = \frac{1}{k!\,h^k}\,\Delta^k y_j
= \frac{1}{k!\,h^k}\left(\Delta^{k-1} y_{j+1} - \Delta^{k-1} y_j\right). \qquad (4.6)$$
If the points $x_0, \ldots, x_n$ are reordered to $x_n, x_{n-1}, \ldots, x_0$,
then there is an analogous formula, called Newton's backward difference formula.
Example 4.4
In Example 4.1, the points are equally spaced. Since $h = 1$ here, the
computations become simply
$$k!\,y[x_j, x_{j+1}, \ldots, x_{j+k}] = \Delta^k y_j = \Delta^{k-1} y_{j+1} - \Delta^{k-1} y_j,$$
and the Newton forward difference table becomes

  j   x_j   y_j   y[x_j,x_{j+1}]   2y[x_j,x_{j+1},x_{j+2}]   6y[x_j,x_{j+1},x_{j+2},x_{j+3}]
  0    0     1         3                   -2                          4
  1    1     4         1                    2                          —
  2    2     5         3                    —                          —
  3    3     8         —                    —                          —

and the Newton form for the interpolating polynomial is
$$\begin{aligned}
p_3(x) &= 1 \cdot N_0(x) + 3N_1(x) + \frac{1}{2!}(-2)N_2(x) + \frac{1}{3!}\,4N_3(x) \\
&= 1 + 3x - x(x - 1) + \frac{2}{3}x(x - 1)(x - 2),
\end{aligned}$$
where $N_0(x) \equiv 1$, $N_1(x) \equiv x$, $N_2(x) \equiv x(x - 1)$, and
$N_3(x) \equiv x(x - 1)(x - 2)$.
An alternative viewpoint is that taken in Example 4.1, where we explicitly
form a system of equations:
$$\begin{array}{ll}
p_3(0) = 1: & d_0 N_0(0) + d_1 N_1(0) + d_2 N_2(0) + d_3 N_3(0) = 1, \\
p_3(1) = 4: & d_0 N_0(1) + d_1 N_1(1) + d_2 N_2(1) + d_3 N_3(1) = 4, \\
p_3(2) = 5: & d_0 N_0(2) + d_1 N_1(2) + d_2 N_2(2) + d_3 N_3(2) = 5, \\
p_3(3) = 8: & d_0 N_0(3) + d_1 N_1(3) + d_2 N_2(3) + d_3 N_3(3) = 8,
\end{array}$$
where $d_i = y[x_0, \ldots, x_i]$. In matrix form, this system is
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 2 & 2 & 0 \\ 1 & 3 & 6 & 6 \end{pmatrix}
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{pmatrix} =
\begin{pmatrix} 1 \\ 4 \\ 5 \\ 8 \end{pmatrix},$$
with solution equal to
$$\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{pmatrix} =
\begin{pmatrix} 1 \\ 3 \\ -1 \\ 2/3 \end{pmatrix} =
\begin{pmatrix} y[0] \\ y[0, 1] \\ y[0, 1, 2] \\ y[0, 1, 2, 3] \end{pmatrix}.$$
This example illustrates that the coefficient matrix for the Newton
interpolating polynomial is lower triangular. In fact, the forward-substitution
process for solving this lower triangular system results in the same
computations as computing the divided differences.
4.3.4 An Error Formula for the Interpolating Polynomial
We now consider the error in approximating a given function $f(x)$ by an
interpolating polynomial $p(x)$ that passes through the $n + 1$ points
$(x_j, f(x_j))$, $j = 0, 1, 2, \ldots, n$.
THEOREM 4.2
If $x_0, x_1, \ldots, x_n$ are $n + 1$ distinct points in $[a, b]$ and $f$ has
$n + 1$ continuous derivatives on the interval $[a, b]$, then for each
$x \in [a, b]$, there exists a number $\xi = \xi(x) \in (a, b)$ such that the
interpolating polynomial $p_n$ to $f$ through these points obeys
$$f(x) = p_n(x) + \frac{f^{(n+1)}(\xi(x)) \displaystyle\prod_{j=0}^{n} (x - x_j)}{(n + 1)!}. \qquad (4.7)$$
A proof of this theorem can be found in our graduate-level text [1] and
other references.
Formula (4.7) may be used analogously to the representation of a function as
a Taylor polynomial with remainder term (Taylor's theorem, on page 3).
Example 4.5
We will approximate $\sin(x)$ on $[-0.1, 0.1]$, as in Example 1.3 (on page 4),
except we will approximate by an interpolating polynomial with equally spaced
points. As in Example 1.3, we will find a degree of polynomial (i.e., a number
of equally spaced points) that will suffice to ensure that the error of
approximation is at most $10^{-16}$. If we use $n + 1$ such points,
$[-0.1, 0.1]$ will be divided into $n$ subintervals, and each subinterval will
have length $h = 0.2/n$. Furthermore, we have
$$|f(x) - p_n(x)| = \frac{\left|f^{(n+1)}(\xi(x))\right| \displaystyle\prod_{j=0}^{n} |x - x_j|}{(n + 1)!}
\le \frac{1}{(n + 1)!} \prod_{j=0}^{n} |x - x_j|,$$
since $f^{(n+1)}$ is either a sine or cosine. To bound the factor
$\prod_{j=0}^{n} |x - x_j|$, observe that, if $x \in [-0.1, 0.1]$, $x$ is in
some interval $[x_j, x_{j+1}]$ of length $h$, so the largest $|x - x_j|$ can be
is $h$. In adjacent intervals, the largest $|x - x_j|$ can be is $2h$, etc.
Observing this in the context of the product, one sees that, if we bound each
factor in the product in this way, the largest the product of the bounds can be
is
$$\prod_{j=0}^{n} |x - x_j| \le h(h)(2h)(3h) \cdots (nh) = n!\,h^{n+1} = n! \left(\frac{0.2}{n}\right)^{n+1}.$$
Thus,
$$|f(x) - p_n(x)| \le \frac{1}{(n + 1)!}\, n! \left(\frac{0.2}{n}\right)^{n+1}
= \frac{1}{n + 1} \left(\frac{0.2}{n}\right)^{n+1}.$$
We compute this bound for various $n$ with the following matlab dialog
(compressed for brevity).
>> for n=1:15
n,(1/(n+1))*(0.2/n)^(n+1)
end
n = 1, ans = 0.0200
n = 2, ans = 3.3333e-004
n = 3, ans = 4.9383e-006
n = 4, ans = 6.2500e-008
n = 5, ans = 6.8267e-010
n = 6, ans = 6.5321e-012
n = 7, ans = 5.5509e-014
n = 8, ans = 4.2386e-016
n = 9, ans = 2.9368e-018
.
.
.
We see that it is sufficient (see footnote 1) to take 9 subintervals,
corresponding to 10 equally spaced points, for the error to be at most
$10^{-16}$. (It is possible that a smaller $n$ would work, since the bounds we
substituted for the actual values may be overestimates.)
One would expect that, if we are approximating a function f by an inter-
polating polynomial, the graph of the interpolating polynomial will get closer
to the graph of the function as we take more and more points, and as h gets
smaller. However, this is not always the case.
Example 4.6
Consider Runge's function:
$$f(x) = \frac{1}{1 + x^2}.$$
We will compute and graph the interpolating polynomials (with a graph of
Runge's function itself) using 5, 9, and 17 equally spaced points in the
interval $[-5, 5]$. We use our matlab functions Lagrange_interp_poly_coeffs.m
and Lagrange_interp_poly_val.m, which we have posted on the web page
http://interval.louisiana.edu/Classical-and-Modern-NA/:
xpts = linspace(-5,5,200);
z1 = 1./(1+xpts.^2);
[a] = Lagrange_interp_poly_coeffs(4,'runge',-5,5);
z2 = Lagrange_interp_poly_val(a,xpts);
[a] = Lagrange_interp_poly_coeffs(8,'runge',-5,5);
z3 = Lagrange_interp_poly_val(a,xpts);
[a] = Lagrange_interp_poly_coeffs(16,'runge',-5,5);
z4 = Lagrange_interp_poly_val(a,xpts);
plot(xpts,z1,xpts,z2,xpts,z3,xpts,z4);
[Footnote 1 (to Example 4.5): Since floating point arithmetic was used, this is
not a mathematically rigorous proof. The expression could be evaluated using
interval arithmetic to make the result mathematically rigorous.]
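The result is as follows.
[Figure: Runge's function and its interpolating polynomials of degree 4, 8,
and 16 at equally spaced points, for −5 ≤ x ≤ 5; vertical axis from −16 to 2.]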
We see that, the higher the degree, the worse the approximation near the
ends of the interval. A clue to what is happening is given by the observations
that the $(n+1)$-st derivative $f^{(n+1)}(0)$ increases like $(n+1)!$ and that
the maximum of $\left|\prod_{j=0}^{n} (x - x_j)\right|$ also increases as $n$
increases.
A similar phenomenon as in Example 4.6 will occur if we try to pass a
high-degree interpolating polynomial through a large number of data points,
and there are small errors (such as measurement errors) in the values. The
mathematical eﬀect of the errors is the same as if the supposed underlying
function has very large higher-order derivatives.
In the next section, we examine a way of choosing the points $x_i$ to reduce
the error term in (4.7).
4.3.5 Optimal Points of Interpolation: Chebyshev Points
Reviewing the error estimate (4.7), we have
$$\max_{a \le x \le b} |f(x) - p(x)| \le \frac{1}{(n+1)!} \max_{x \in [a,b]} |f^{(n+1)}(x)| \, \max_{x \in [a,b]} \left| \prod_{j=0}^{n} (x - x_j) \right|. \tag{4.8}$$
In Example 4.6, we saw that the last factor does not tend to 0 as we increase
the number of points, if the points are equally spaced. In that example, we
saw that $|f(x) - p_n(x)|$ was largest if
THEOREM 4.3
(Chebyshev points) $\max_{x \in [-1,1]} \left| \prod_{i=0}^{n} (x - x_i) \right|$ is minimized on $[-1, 1]$ when
$$x_i = \cos\left( \frac{2i+1}{n+1} \cdot \frac{\pi}{2} \right), \quad 0 \le i \le n,$$
and the minimum value is $2^{-n}$. Furthermore, if we approximate $f(y)$, $y \in [a, b] \ne [-1, 1]$, and we take
$$y_i = \frac{b-a}{2}\, x_i + \frac{b+a}{2},$$
then
$$\max_{y \in [a,b]} \left| \prod_{i=0}^{n} (y - y_i) \right| = 2^{-2n-1} (b-a)^{n+1}.$$
Theorem 4.3 is based on the Chebyshev equi-oscillation property. The n + 1 points
$x_i$, $0 \le i \le n$, are the roots of the Chebyshev polynomial $T_{n+1}$, where
$$T_n(x) = \cos(n \arccos(x)). \tag{4.9}$$
We present a detailed explanation and proof of this theorem in our graduate
text [1].
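The minimum value in Theorem 4.3 can be checked numerically. The following is a minimal sketch, assuming n = 8 and a fine sampling grid on [−1, 1] (both are illustrative choices, not taken from the text); it compares the sampled maximum of $\left|\prod (x - x_i)\right|$ for Chebyshev points and for equally spaced points.

```matlab
% Compare max |prod (x - x_i)| on [-1,1] for Chebyshev and equally spaced points.
n = 8;                                     % degree (n+1 points); illustrative choice
idx = 0:n;
xc = cos((2*idx + 1)/(n + 1) * pi/2);      % Chebyshev points from Theorem 4.3
xe = linspace(-1, 1, n + 1);               % equally spaced points

xx = linspace(-1, 1, 20001);               % fine sampling grid (assumed resolution)
wc = ones(size(xx));
we = ones(size(xx));
for k = 1:n+1
    wc = wc .* (xx - xc(k));               % product for Chebyshev points
    we = we .* (xx - xe(k));               % product for equally spaced points
end
maxc = max(abs(wc));                       % should be close to 2^(-n)
maxe = max(abs(we));
fprintf('Chebyshev: %.6e   2^(-n): %.6e   equally spaced: %.6e\n', maxc, 2^(-n), maxe);
```

The sampled maximum for the Chebyshev points should agree with $2^{-n}$ up to rounding, while the equally spaced points give a noticeably larger value.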
Example 4.7
We will recompute the interpolating polynomials of degree 4, 8, and 16, as
in Example 4.6, except we use Chebyshev points instead of equally spaced
points. We modify Lagrange_interp_poly_coeffs.m to use the Chebyshev
points. To do so, one may simply replace the lines
for i =1:np1;
x(i) = a + (i-1)*h;
end
by
for i =1:np1;
t = cos((2*i-1)/np1 *pi/2);       % i-th Chebyshev point on [-1,1]
x(i) = 0.5 *((b-a)*t + (b+a));    % mapped to [a,b]
end
We obtain the following graph.
[Plot: Runge's function and its interpolating polynomials of degree 4, 8, and 16 at Chebyshev points; horizontal axis from −5 to 5, vertical axis from −0.2 to 1.2.]
With these better points of interpolation, there now appears to be convergence
towards the actual function as we increase n. (In fact, this can be formally
shown to be so.)
REMARK 4.1 We have seen through Example 4.6 (approximating Runge's
function) that the approximation to a function f does not necessarily get better
as we take more and more equally spaced points. However, with Example 4.7,
we saw that we could make the interpolating polynomial approximate
Runge's function as closely as we want by taking sufficiently many
Chebyshev points. Nonetheless, it can be shown that some other functions cannot
be approximated well by a high-degree interpolating polynomial with
Chebyshev points. In fact, it can be shown that, no matter how the interpolating
points are distributed, there exists a continuous function f for which
$$\max_{x \in [a,b]} |p_n(x) - f(x)| \to \infty \quad \text{as } n \to \infty.$$
This observation motivates us to consider other methods for approximating
functions using polynomials, in particular piecewise polynomial interpolation.
The piecewise polynomial concept, analogous to composite integration,
involves dividing the interval of approximation into subintervals and using a
different polynomial over each subinterval. We force the resulting piecewise
polynomial to have a particular degree of smoothness by requiring derivatives to
match between adjacent subintervals.
4.4 Piecewise Polynomial Interpolation
Piecewise polynomials are commonly used approximations. They are easy
to work with, they can provide good approximations, and they are widely used
in computer graphics. In addition, piecewise polynomials are employed, for
example, in ﬁnite element methods. Good references for piecewise polynomial
approximation are [24] and [34]. Here, we study two commonly used piecewise
interpolants: linear splines (piecewise linear interpolants) and cubic splines.
A uniﬁed treatment, based on elementary functional analysis, appears in our
graduate text [1].
4.4.1 Piecewise Linear Interpolation
As with polynomial interpolation, we start with a set of points $\{x_i\}_{i=0}^{n}$ that
subdivides the interval [a, b]:
$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b,$$
and we draw lines between the points $(x_i, f(x_i))$. This is the graph of the
piecewise linear interpolant to f (or, if we have a finite data set $\{(x_i, y_i)\}$, the
piecewise linear interpolant to the data). More formally, we have:
DEFINITION 4.3 The piecewise linear interpolant to the data $\{(x_i, y_i)\}$
is the function $\varphi(x)$ such that
1. $\varphi$ is linear on each $[x_i, x_{i+1}]$, and
2. $\varphi(x_i) = y_i$.
Graphically, ϕ may look as in Figure 4.1.
FIGURE 4.1: An example of a piecewise linear function.
Example 4.8
The piecewise linear interpolant to the data set as in Example 4.1, that is,
to the data set
i    x_i    y_i
0    0      1
1    1      4
2    2      5
3    3      8
is
$$\varphi(x) = \begin{cases} 1 + 3(x - 0) & \text{for } 0 \le x \le 1, \\ 4 + (x - 1) & \text{for } 1 \le x \le 2, \\ 5 + 3(x - 2) & \text{for } 2 \le x \le 3. \end{cases}$$
matlab has the function interp1 to do piecewise linear and other inter-
polants. We may thus use the following dialog:
>> x = [0,1,2,3]
x =
0 1 2 3
>> y = [1,4,5,8]
y =
1 4 5 8
>> xi = linspace(0,3);
>> yi = interp1(x,y,xi,'linear');
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(x,y,'LineStyle','none','Marker','*','MarkerEdgeColor','red','MarkerSize',15)
>> plot(xi,yi)
>>
This dialog produces the following plot.
[Plot: the four data points and their piecewise linear interpolant; horizontal axis from 0 to 3, vertical axis from 1 to 8.]
Analogously to the Lagrange functions for polynomial interpolation, there
is a commonly used collocating basis to represent piecewise linear interpolants.
This basis consists of the hat functions $\varphi_i(x)$, $0 \le i \le n$, defined as follows.
$$\varphi_0(x) = \begin{cases} \dfrac{x_1 - x}{x_1 - x_0}, & x_0 \le x \le x_1, \\ 0, & \text{otherwise}, \end{cases}$$
$$\varphi_n(x) = \begin{cases} \dfrac{x - x_{n-1}}{x_n - x_{n-1}}, & x_{n-1} \le x \le x_n, \\ 0, & \text{otherwise}, \end{cases}$$
$$\varphi_i(x) = \begin{cases} \dfrac{x - x_{i-1}}{x_i - x_{i-1}}, & x_{i-1} \le x \le x_i, \\ \dfrac{x_{i+1} - x}{x_{i+1} - x_i}, & x_i \le x \le x_{i+1}, \\ 0, & \text{otherwise}, \end{cases} \quad \text{for } 1 \le i \le n - 1.$$
These hat functions are depicted graphically in Figure 4.2.
FIGURE 4.2: Graphs of the "hat" functions $\varphi_i(x)$.
Example 4.9
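The hat functions for the abscissas $x_0 = 0$, $x_1 = 1$, $x_2 = 2$, $x_3 = 3$ are
$$\varphi_0(x) = \begin{cases} 1 - x & \text{for } 0 \le x \le 1, \\ 0 & \text{for } 1 \le x \le 3, \end{cases}$$
$$\varphi_1(x) = \begin{cases} x & \text{for } 0 \le x \le 1, \\ 2 - x & \text{for } 1 \le x \le 2, \\ 0 & \text{for } 2 \le x \le 3, \end{cases}$$
$$\varphi_2(x) = \begin{cases} 0 & \text{for } 0 \le x \le 1, \\ x - 1 & \text{for } 1 \le x \le 2, \\ 3 - x & \text{for } 2 \le x \le 3, \end{cases}$$
$$\varphi_3(x) = \begin{cases} 0 & \text{for } 0 \le x \le 2, \\ x - 2 & \text{for } 2 \le x \le 3. \end{cases}$$
In terms of these hat functions, the piecewise linear interpolant to the data
i    x_i    y_i
0    0      1
1    1      4
2    2      5
3    3      8
is
$$\varphi(x) = 1\varphi_0(x) + 4\varphi_1(x) + 5\varphi_2(x) + 8\varphi_3(x).$$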
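The hat-function representation can be checked against interp1. The sketch below assumes the uniform abscissas 0, 1, 2, 3 of this example, for which every hat function restricted to [0, 3] can be written as max(0, 1 − |t − x_i|/h); the helper name hat is hypothetical and not from the text.

```matlab
% Evaluate the piecewise linear interpolant as a combination of hat functions.
x = [0 1 2 3];  y = [1 4 5 8];
h = 1;                                        % uniform spacing of the abscissas
hat = @(t, xi) max(0, 1 - abs(t - xi)/h);     % hypothetical helper: hat centered at xi
t = linspace(0, 3, 301);
phi = zeros(size(t));
for k = 1:numel(x)
    phi = phi + y(k) * hat(t, x(k));          % y_k * phi_k(t)
end
err = max(abs(phi - interp1(x, y, t, 'linear')));
fprintf('max difference from interp1: %.2e\n', err);
```

Because the coefficient of each hat function is simply the data value $y_i$, no linear system needs to be solved.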
Here is an estimate for the error when using the piecewise linear interpolant
to approximate a function:
THEOREM 4.4
Let
$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b,$$
let $h = \max_i (x_{i+1} - x_i)$ denote the maximum length of a sub-interval, suppose
f has two continuous derivatives on the interval [a, b], and let $\varphi(x) = I_n(f)(x)$
denote the piecewise linear interpolant to f over this point set. Then,
$$\max_{x \in [a,b]} |f(x) - I_n f(x)| \le \frac{1}{8} h^2 \max_{x \in [a,b]} |f''(x)|.$$
PROOF We use the error term (4.7) (on page 154) for polynomial interpolation.
In particular, on each subinterval $[x_i, x_{i+1}]$, f is interpolated by a
degree-1 polynomial, so (4.7) gives
$$f(x) = I_n(f)(x) + \frac{f''(\xi(x)) (x - x_i)(x - x_{i+1})}{2}. \tag{4.10}$$
However, the quadratic
$$g(x) = (x - x_i)(x - x_{i+1})$$
has a vertex at $\check{x} = (x_i + x_{i+1})/2$, $g(x_i) = g(x_{i+1}) = 0$, and
$$g(\check{x}) = -\frac{(x_{i+1} - x_i)^2}{4} \ge -\frac{h^2}{4}. \tag{4.11}$$
Combining (4.10) and (4.11) gives
$$|f(x) - I_n(f)(x)| \le \frac{|f''(\xi(x))|}{2} \cdot \frac{h^2}{4},$$
from which the error bound follows.
Example 4.10
Consider f(x) = ln x on the interval [2, 4]. We want to find h that will
guarantee that the piecewise linear interpolant of f(x) on [2, 4] has an error
of at most $10^{-4}$. We will assume that $|x_{i+1} - x_i| = h$ for $0 \le i \le n - 1$. Then
$$\|f - I_n f\|_\infty \le \frac{h^2}{8} \|f''\|_\infty = \frac{h^2}{8} \max_{2 \le x \le 4} \left| \frac{1}{x^2} \right| = \frac{h^2}{32} \le 10^{-4}.$$
Thus $h^2 \le 32 \times 10^{-4}$, $h \le 0.056$, giving $n = (4 - 2)/h \ge 36$.
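The computation in Example 4.10 can be reproduced and checked numerically. This is a minimal sketch, assuming a sampling grid of 5001 points to estimate the actual error (the grid size and variable names are illustrative).

```matlab
% Reproduce Example 4.10 and measure the actual piecewise linear error.
f = @log;
a = 2;  b = 4;  tol = 1e-4;
M2 = 1/a^2;                        % max |f''(x)| = max 1/x^2 on [2,4]
hmax = sqrt(8*tol/M2);             % largest h with (h^2/8)*M2 <= tol
n = ceil((b - a)/hmax);            % number of equal subintervals needed
x = linspace(a, b, n + 1);         % mesh points
xx = linspace(a, b, 5001);         % sampling grid for the error (assumed resolution)
err = max(abs(interp1(x, f(x), xx, 'linear') - f(xx)));
fprintf('hmax = %.4f, n = %d, observed error = %.2e\n', hmax, n, err);
```

The observed error is typically somewhat below the bound, since the bound uses the largest value of $|f''|$ over the whole interval.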
Although hat functions and piecewise linear functions are frequently used
in practice, it is desirable in some applications, such as computer graphics, for
the interpolant to be smoother (say, to have one, two, or even more continuous
derivatives) at the mesh points $x_i$. Special piecewise cubic polynomials, which
we consider next, are commonly used for this purpose.
4.4.2 Cubic Spline Interpolation
DEFINITION 4.4 Suppose we have a point set $\Delta = \{x_i\}_{i=0}^{n}$ that subdivides
the interval [a, b]:
$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b.$$
Then $\varphi$ is said to be a cubic spline with respect to $\Delta$ provided
1. $\varphi(x)$ has two continuous derivatives at every $x \in [a, b]$, and
2. $\varphi(x)$ is a cubic polynomial on each subinterval $[x_i, x_{i+1}]$, $0 \le i \le n - 1$.
Such cubic splines are mathematical analogs of the old-fashioned drafts-
man’s spline. The draftsman’s spline was a ﬂexible piece of long, thin wood
used to draw curves. The draftsman’s spline had weights that could be set at
points (x
i
, y
i
) on the paper. The resulting curve that the draftsman’s spline
made satisﬁed, to a high degree of approximation, a diﬀerential equation
whose solution is a cubic spline.
Just as we can represent interpolating polynomials in terms of Lagrange
functions and piecewise linear polynomials in terms of hat functions, we can
represent a cubic spline s(x) as linear combinations of special “B-splines,”
which we now define. For convenience, we assume here a uniform mesh, i.e.,
$x_{j+1} - x_j = h$ for all j, and let
$$s_j(x) = \begin{cases}
0, & x > x_{j+2}, \\[4pt]
\dfrac{1}{6h^3} (x_{j+2} - x)^3, & x_{j+1} \le x \le x_{j+2}, \\[4pt]
\dfrac{1}{6} + \dfrac{1}{2h} (x_{j+1} - x) + \dfrac{1}{2h^2} (x_{j+1} - x)^2 - \dfrac{1}{2h^3} (x_{j+1} - x)^3, & x_j \le x \le x_{j+1}, \\[4pt]
\dfrac{2}{3} - \dfrac{1}{h^2} (x - x_j)^2 - \dfrac{1}{2h^3} (x - x_j)^3, & x_{j-1} \le x \le x_j, \\[4pt]
\dfrac{1}{6h^3} (x - x_{j-2})^3, & x_{j-2} \le x \le x_{j-1}, \\[4pt]
0, & x < x_{j-2}.
\end{cases} \tag{4.12}$$
DEFINITION 4.5 The function $s_j(x)$ defined by (4.12) is called a B-spline
centered at $x = x_j$ with respect to the partition $\Delta$ with a uniform mesh.
We now introduce two extra points $x_{-1} = x_0 - h$ and $x_{n+1} = x_n + h$ and
also consider the B-splines $s_{-1}(x)$ and $s_{n+1}(x)$ centered at $x_{-1}$ and $x_{n+1}$. The
$s_j$'s are depicted graphically in Figure 4.3.
FIGURE 4.3: B-spline basis functions.
It is straightforward to show that $s_j'(x)$ and $s_j''(x)$ are continuous, so each
$s_j(x)$ is indeed a cubic spline. It follows, since linear combinations of continuous
functions are continuous, that any linear combination of the $s_j$, as in
Formula (4.13) in the following theorem, is a cubic spline.
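The B-spline (4.12) is symmetric about $x_j$, so it can be evaluated compactly in terms of the scaled distance $u = |x - x_j|/h$: substituting into (4.12) gives $2/3 - u^2 + u^3/2$ for $u \le 1$ and $(2 - u)^3/6$ for $1 \le u \le 2$. The following sketch uses that form; the mesh width, the center, and the names B and s are illustrative assumptions.

```matlab
% Uniform cubic B-spline (4.12), written in the scaled variable u = |x - x_j|/h.
h = 0.5;  xj = 1;                            % assumed mesh width and center
B = @(u) (u <= 1).*(2/3 - u.^2 + u.^3/2) ...
       + (u > 1 & u <= 2).*((2 - u).^3/6);   % zero for u > 2
s = @(x) B(abs(x - xj)/h);                   % s_j(x) for this mesh

vals = s(xj + h*(-2:2));                     % values at x_{j-2}, ..., x_{j+2}
disp(vals)
```

The values at the five mesh points are 0, 1/6, 2/3, 1/6, 0; these are the numbers that produce the entries 1, 4, 1 (after multiplying by 6) in the linear systems for the spline coefficients.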
THEOREM 4.5
Let s be any cubic spline with respect to a point set $\Delta = \{x_i\}_{i=0}^{n}$. Then there
is a unique set of coefficients $\{c_j\}_{j=-1}^{n+1}$ such that
$$s(x) = \sum_{j=-1}^{n+1} c_j s_j(x). \tag{4.13}$$
(For a complete treatment and proof of this theorem, see our graduate text
[1].)
Cubic splines can be used in a variety of ways to approximate functions
(e.g. interpolation, least squares ﬁts). Here, we will limit the discussion to
interpolation at the points in ∆.
A consequence of Theorem 4.5 is that there are n + 3 unknown coefficients
determining a spline with respect to $\Delta$, whereas, if we require $s(x_j) = y_j$,
$0 \le j \le n$, we only have n + 1 conditions, so we have two "free" conditions.
We now consider this in the context of interpolation, where there are two com-
monly used ways of specifying the two extra conditions: clamped boundary
conditions and “natural” conditions.
DEFINITION 4.6 The clamped boundary spline interpolant $\Phi_c \in S_\Delta$
of a function $f \in C^1[a, b]$ satisfies
$$\text{(c)} \quad \begin{cases} \Phi_c(x_i) = f(x_i), & i = 0, 1, \ldots, n, \\ \Phi_c'(x_0) = f'(x_0), \\ \Phi_c'(x_n) = f'(x_n). \end{cases}$$
DEFINITION 4.7 The natural spline interpolant $\Phi_n \in S_\Delta$ of a function
$f \in C[a, b]$ satisfies
$$\text{(n)} \quad \begin{cases} \Phi_n(x_i) = f(x_i), & i = 0, 1, \ldots, n, \\ \Phi_n''(x_0) = 0, \\ \Phi_n''(x_n) = 0. \end{cases}$$
We now set up the system of equations for computing the coefficients of
the clamped and natural spline interpolants when we express these in terms
of B-splines. Let
$$\Phi_c(x) = \sum_{j=-1}^{n+1} c_j s_j(x).$$
The requirements (c) then lead to the system
$$\begin{cases} \displaystyle\sum_{j=-1}^{n+1} s_j(x_i) c_j = f(x_i), & i = 0, 1, \ldots, n, \\ c_{-1} s_{-1}'(x_0) + c_1 s_1'(x_0) = f'(x_0), \\ c_{n-1} s_{n-1}'(x_n) + c_{n+1} s_{n+1}'(x_n) = f'(x_n), \end{cases} \tag{4.14}$$
since $s_0'(x_0) = s_n'(x_n) = 0$. The above system can be written in matrix form
as
$$\begin{pmatrix} 4 & 2 & & & \\ 1 & 4 & 1 & & 0 \\ & \ddots & \ddots & \ddots & \\ 0 & & 1 & 4 & 1 \\ & & & 2 & 4 \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \\ c_n \end{pmatrix} = \begin{pmatrix} 6f(x_0) + 2hf'(x_0) \\ 6f(x_1) \\ \vdots \\ 6f(x_{n-1}) \\ 6f(x_n) - 2hf'(x_n) \end{pmatrix}, \tag{4.15}$$
where
$$c_{-1} = c_1 - 2hf'(x_0), \quad \text{and} \quad c_{n+1} = c_{n-1} + 2hf'(x_n).$$
The system (4.15) has a unique solution $\{c_j\}_{j=-1}^{n+1}$ because the coefficient matrix A is
strictly diagonally dominant (and hence nonsingular).
Now consider
$$\Phi_n(x) = \sum_{j=-1}^{n+1} d_j s_j(x).$$
Conditions (n) lead to the system
$$\begin{pmatrix} 6 & 0 & 0 & & \\ 1 & 4 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 4 & 1 \\ & & 0 & 0 & 6 \end{pmatrix} \begin{pmatrix} d_0 \\ d_1 \\ \vdots \\ d_{n-1} \\ d_n \end{pmatrix} = 6 \begin{pmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_{n-1}) \\ f(x_n) \end{pmatrix}, \tag{4.16}$$
where
$$d_{-1} = -d_1 + 2d_0, \qquad d_{n+1} = -d_{n-1} + 2d_n.$$
(You will derive this system in Exercise 8 at the end of this chapter.)
Observe that the matrix for the system is not the identity matrix, as with
Lagrange functions for polynomial interpolation or hat functions for piecewise
linear interpolation. However, it is tridiagonal, so, with a tridiagonal system
solver, the coefficients may be found in O(n) time, rather than $O(n^3)$ time
for a general system of equations.
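As an illustration of exploiting the tridiagonal structure, the sketch below assembles the system (4.16) as a sparse matrix and solves it with backslash, then checks that the resulting combination of B-splines interpolates the data. The test function sin on [0, π], the value n = 10, and all variable names are assumptions made for the illustration; the B-spline is evaluated in the scaled variable $u = |x - x_j|/h$, which is equivalent to (4.12) on a uniform mesh.

```matlab
% Natural cubic spline coefficients from the tridiagonal system (4.16).
n = 10;  a = 0;  b = pi;                 % assumed mesh and interval
h = (b - a)/n;
x = linspace(a, b, n + 1)';
y = sin(x);                              % assumed sample data

e = ones(n + 1, 1);
A = spdiags([e 4*e e], -1:1, n + 1, n + 1);   % sparse tridiagonal (1, 4, 1)
A(1, 1:2) = [6 0];                       % first row of (4.16)
A(n + 1, n:n + 1) = [0 6];               % last row of (4.16)

d = A \ (6*y);                           % coefficients d_0, ..., d_n
dall = [-d(2) + 2*d(1); d; -d(n) + 2*d(n + 1)];   % add d_{-1} and d_{n+1}

% B-spline (4.12) in the scaled variable u = |x - x_j|/h
B = @(u) (u <= 1).*(2/3 - u.^2 + u.^3/2) + (u > 1 & u <= 2).*((2 - u).^3/6);
xj = a + h*(-1:n + 1)';                  % centers x_{-1}, ..., x_{n+1}
S = @(xx) arrayfun(@(t) sum(dall .* B(abs(t - xj)/h)), xx);

resid = max(abs(S(x) - y));              % interpolation check at the mesh points
fprintf('max interpolation residual: %.2e\n', resid);
```

Storing the matrix in sparse form lets backslash use a banded solver, so the cost should grow roughly linearly in n rather than cubically.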
Example 4.11
We will use matlab to compute a cubic spline interpolant to the data set as
in Example 4.1, that is, to the data set
i    x_i    y_i
0    0      1
1    1      4
2    2      5
3    3      8
We have n = 3, h = 1. The system of equations for the clamped cubic spline
requires $f'(x_0)$ and $f'(x_n)$, which we don't have for this point data, so we will
just find the coefficients for the natural cubic spline. The system of equations
(4.16) for this example is
$$\begin{pmatrix} 6 & 0 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ 0 & 1 & 4 & 1 \\ 0 & 0 & 0 & 6 \end{pmatrix} \begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{pmatrix} = 6 \begin{pmatrix} 1 \\ 4 \\ 5 \\ 8 \end{pmatrix}.$$
Solving this in matlab:
>> A = [6 0 0 0;
1 4 1 0
0 1 4 1
0 0 0 6]
A =
6 0 0 0
1 4 1 0
0 1 4 1
0 0 0 6
>> b = 6*[1;4;5;8]
b =
6
24
30
48
>> d = A\b
d =
1.0000
4.6667
4.3333
8.0000
>> d_minus_1 = -d(2) + 2*d(1)
d_minus_1 =
-2.6667
>> d_4 = -d(3) + 2*d(4)
d_4 =
11.6667
>>
168 Applied Numerical Methods
Thus, the cubic spline interpolant is given approximately as
$$s(x) \approx -2.6667 s_{-1}(x) + 1.0000 s_0(x) + 4.6667 s_1(x) + 4.3333 s_2(x) + 8.0000 s_3(x) + 11.6667 s_4(x)$$
$$\approx -\tfrac{8}{3} s_{-1}(x) + s_0(x) + \tfrac{14}{3} s_1(x) + \tfrac{13}{3} s_2(x) + 8 s_3(x) + \tfrac{35}{3} s_4(x),$$
where the $s_j(x)$ are given by (4.12). (You will write down the $s_j$ explicitly for
this example in Exercise 10 at the end of this chapter.)
In fact, the matlab function interp1 will compute values of a cubic spline
interpolation. We proceed analogously to Example 4.8:
>> x = [0,1,2,3]
x =
0 1 2 3
>> y = [1,4,5,8]
y =
1 4 5 8
>> xi = linspace(0,3);
>> yi = interp1(x,y,xi,'spline');
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(x,y,'LineStyle','none','Marker','*','MarkerEdgeColor','red','MarkerSize',15)
>> plot(xi,yi)
>>
This dialog produces the following plot.
[Plot: the four data points and the cubic spline computed by interp1; horizontal axis from 0 to 3, vertical axis from 1 to 8.]
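In this case, the plot appears to be virtually identical to the cubic interpolating
polynomial from Example 4.1 (on page 146). (Note that interp1 with the 'spline'
option uses "not-a-knot" end conditions rather than the natural conditions (n);
with only four data points, the not-a-knot spline is a single cubic, namely the
interpolating polynomial itself.)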
We have the following error estimate.
THEOREM 4.6
Let f have four continuous derivatives on the interval [a, b] and let $\Phi_c(x)$ be
the clamped boundary cubic spline interpolant (uniform mesh). Then
$$\max_{x \in [a,b]} |f(x) - \Phi_c(x)| \le \frac{5}{384} h^4 \max_{x \in [a,b]} \left| \frac{d^4 f}{dx^4}(x) \right|.$$
Furthermore, the derivatives of $\Phi_c$ also approximate the first, second, and third derivatives
of f well. See [34] for a proof. A similar result holds for natural boundary
cubic interpolants; see [9]. Also, similar results hold for a nonuniform mesh.
Example 4.12
Consider f(x) = ln x. We wish to determine how small h should be to ensure
that the cubic spline interpolant $\Phi_c(x)$ of f(x) on the interval [2, 4] has error
less than $10^{-4}$. We have
$$\|f - \Phi_c\|_\infty \le \frac{5}{384} h^4 \|D^4 f\|_\infty = \frac{5}{384} h^4 \max_{2 \le x \le 4} \left| \frac{6}{x^4} \right| = \left( \frac{5}{384} \right) \left( \frac{6}{16} \right) h^4 \le 10^{-4}.$$
Thus, $h^4 \le (1/30)(384)(16) 10^{-4}$, $h \le 0.38$, and $n \ge 2/0.38 \approx 5.3$, so n = 6
subintervals suffice. (Recall that we required $n \ge 36$ to achieve the same error
with piecewise linear interpolants.)
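The estimate in Example 4.12 can be checked with matlab's spline function, which produces a clamped spline when the data vector is padded with the two end slopes. The sketch below assumes n = 6 equal subintervals and a 5001-point sampling grid; the variable names are illustrative.

```matlab
% Clamped cubic spline for ln x on [2,4] with n = 6 subintervals.
f = @log;  fp = @(x) 1./x;               % function and its derivative
a = 2;  b = 4;  n = 6;
x = linspace(a, b, n + 1);
% With two extra values, spline uses the first and last entries as end slopes.
pp = spline(x, [fp(a), f(x), fp(b)]);
xx = linspace(a, b, 5001);               % sampling grid (assumed resolution)
err = max(abs(ppval(pp, xx) - f(xx)));
bound = 5/384 * ((b - a)/n)^4 * 6/a^4;   % right-hand side of Theorem 4.6
fprintf('observed error = %.2e, bound = %.2e\n', err, bound);
```

With h = 1/3 the bound evaluates to about 6.0e-5, which is below the target of $10^{-4}$, and the observed error should not exceed it.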
Example 4.13
We return to Runge’s function: We saw in Example 4.6 that the approx-
imations by interpolating polynomials with equally spaced points got worse
as we took more and more points. We saw in Example 4.7 that, if we took
Chebyshev points, the approximations got better as we took more points, but
the graphs of the interpolating polynomials still “wiggled,” without approx-
imating the derivatives well. We’ll now try cubic spline interpolation with
equally spaced points, using matlab’s interp1 routine:
xpts = linspace(-5,5,200);          % plotting grid
z1 = 1./(1+xpts.^2);                % Runge's function
x = linspace(-5,5,5);               % 5 equally spaced points
y = 1./(1+x.^2);
z2 = interp1(x,y,xpts,'spline');
x = linspace(-5,5,9);               % 9 equally spaced points
y = 1./(1+x.^2);
z3 = interp1(x,y,xpts,'spline');
x = linspace(-5,5,17);              % 17 equally spaced points
y = 1./(1+x.^2);
z4 = interp1(x,y,xpts,'spline');
plot(xpts,z1,xpts,z2,xpts,z3,xpts,z4);
This dialog produces the following plot.
[Plot: Runge's function and its cubic spline interpolants at 5, 9, and 17 equally spaced points; horizontal axis from −5 to 5, vertical axis from −0.4 to 1.]
We see that the spline with 5 points (corresponding to a degree 4 interpolating
polynomial) is somewhat similar to the degree 4 interpolating polynomial with
Chebyshev points, except it does not undershoot as much. In contrast, the
spline with 9 points tracks the actual function very closely, while the corre-
sponding interpolating polynomial with Chebyshev points still has signiﬁcant
overshoot and undershoot. In this graph, we cannot see the spline with 17
equally spaced points, since its graph is indistinguishable from the graph of
the function, while the interpolating polynomial of degree 16 with Chebyshev
points still has discernable wiggles in it.
REMARK 4.2 Satisfaction of $\Phi_c'(x_0) = f'(x_0)$, $\Phi_c'(x_n) = f'(x_n)$ may be
difficult to achieve if f(x) is not explicitly known. Approximations of order
$h^4$ can then be used. Examples of such approximations are:
$$f'(x_0) = \frac{1}{12h} \Big[ -25 f(x_0) + 48 f(x_0 + h) - 36 f(x_0 + 2h) + 16 f(x_0 + 3h) - 3 f(x_0 + 4h) \Big] + \langle \text{error} \rangle,$$
where $\langle \text{error} \rangle = \dfrac{h^4}{5} f^{(5)}(\xi)$, $x_0 \le \xi \le x_0 + 4h$, and
$$f'(x_n) = \frac{1}{12h} \Big[ 25 f(x_n) - 48 f(x_n - h) + 36 f(x_n - 2h) - 16 f(x_n - 3h) + 3 f(x_n - 4h) \Big] + \langle \text{error} \rangle,$$
where $\langle \text{error} \rangle = \dfrac{h^4}{5} f^{(5)}(\hat{\xi})$, $x_n - 4h \le \hat{\xi} \le x_n$.
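The one-sided formulas of Remark 4.2 are easy to try out. The following sketch applies them to f(x) = ln x at the end points of [2, 4], where the exact derivatives are 1/2 and 1/4; the step h = 0.1 is an arbitrary illustrative choice.

```matlab
% Fourth-order one-sided difference approximations to f'(x_0) and f'(x_n).
f = @log;  x0 = 2;  xn = 4;  h = 0.1;    % assumed test function and step
d0 = (-25*f(x0) + 48*f(x0 + h) - 36*f(x0 + 2*h) ...
      + 16*f(x0 + 3*h) - 3*f(x0 + 4*h)) / (12*h);
dn = ( 25*f(xn) - 48*f(xn - h) + 36*f(xn - 2*h) ...
      - 16*f(xn - 3*h) + 3*f(xn - 4*h)) / (12*h);
fprintf('f''(2) ~ %.8f (exact 0.5),  f''(4) ~ %.8f (exact 0.25)\n', d0, dn);
```

For this function $f^{(5)}(x) = 24/x^5$, so the error term $(h^4/5) f^{(5)}(\xi)$ is at most about 1.5e-5 at the left end point and smaller at the right end point.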
REMARK 4.3 It can be shown that if u is any function on [a, b] with
two continuous derivatives such that u interpolates f in the manner
$$\begin{cases} u(x_i) = f(x_i), & 0 \le i \le n, \\ u'(x_0) = f'(x_0), \\ u'(x_n) = f'(x_n), \end{cases}$$
then
$$\int_a^b \left( \Phi_c''(x) \right)^2 dx \le \int_a^b \left( u''(x) \right)^2 dx.$$
That is, among all clamped $C^2$-interpolants of f, the clamped spline interpolant
is the smoothest in the sense of minimizing $\int_a^b (u''(x))^2 \, dx$. Such
smoothness properties are useful, e.g. in computer graphics, where we want
the rendered image to look smooth. They also are important in automated
machining, where the manufactured part should have a smooth surface, and
where smooth motions of the manufacturing robot lead to less wear and tear.
In the next section, we consider approximation by polynomials in such a way
that the graph does not necessarily go through the data exactly. This type of
approximation is appropriate, for example, when there is much data, and the
data contains small errors, or when we need to approximate an underlying
function with a low-degree polynomial.
4.5 Approximation Other Than by Interpolation
So far in this chapter, we have looked at approximation of functions by
Taylor polynomials and by polynomials that pass through speciﬁed data points
exactly, that is, by interpolating polynomials. In fact, a Taylor polynomial
of degree n centered at $x_0$ for a function can be thought of as a limit of an
interpolating polynomial with n + 1 equally spaced points for that function
over an interval $[x_0 - \epsilon, x_0 + \epsilon]$ as we let $\epsilon$ tend to 0. Here, we mention
alternatives.
4.5.1 Least Squares Approximation
We have already seen an alternative to polynomial interpolation in Chapter 3:
In Example 3.29 (on page 119), we fit the data set
i    t_i    y_i
0    0      1
1    1      4
2    2      5
3    3      8
with a polynomial $p_2$ of degree 2 in such a way that
$$\big( p_2(0) - 1 \big)^2 + \big( p_2(1) - 4 \big)^2 + \big( p_2(2) - 5 \big)^2 + \big( p_2(3) - 8 \big)^2$$
was minimized. The result was a polynomial of degree 2 that approximated
the data set, but did not fit exactly. Such approximations are appropriate
when we already suspect the form of the underlying function (for example, if
we have reason to believe that the function is indeed a polynomial of degree
2), and if there are errors in the data. This approximation is least squares
approximation, defined by Equations (3.18) (on page 117) and (3.21), which
we repeat here in a somewhat different form: To fit data $\{(t_i, y_i)\}_{i=1}^{m}$, we
assume a function of the form
$$y \approx f(t) = \sum_{j=0}^{n} a_j \varphi_j(t), \tag{4.17}$$
where we find the coefficients $a_j$, $0 \le j \le n$, by solving the minimization
problem
$$\min_{\{a_j\}_{j=0}^{n}} \sum_{i=1}^{m} \big( y_i - f(t_i) \big)^2. \tag{4.18}$$
In Example 3.29, n = 2, m = 4, and $\varphi_j(t) = t^j$, j = 0, 1, 2. We saw in
Section 3.4.2 that, if f is of the form (4.17), the minimization problem (4.18) can
be solved with a QR-decomposition. In some models, the $a_j$ occur nonlinearly
in the expression for f, in which case techniques we introduce in Chapter 8
may be used.
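For the data above, the linear least squares problem (4.18) with $\varphi_j(t) = t^j$ can be set up and solved in a few lines. This is a minimal sketch with illustrative variable names; it solves the overdetermined system once with backslash and once with an explicit economy-size QR factorization.

```matlab
% Least squares quadratic fit to the four data points.
t = [0; 1; 2; 3];  y = [1; 4; 5; 8];
A = [ones(size(t)), t, t.^2];      % columns: phi_0 = 1, phi_1 = t, phi_2 = t^2
a = A \ y;                         % backslash gives the least squares solution
[Q, R] = qr(A, 0);                 % economy-size QR decomposition
a_qr = R \ (Q' * y);               % same solution via R a = Q' y
disp([a, a_qr])
```

Both approaches give approximately $a_0 = 1.2$, $a_1 = 2.2$, $a_2 = 0$, so the least squares quadratic for this data set is in fact a line.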
4.5.2 Minimax Approximation
In minimax approximation, also known as $\ell_\infty$-approximation, instead of
minimizing the function in (4.18), we do the following minimization:
$$\min_{\{a_j\}_{j=0}^{n}} \max_{1 \le i \le m} |y_i - f(t_i)|. \tag{4.19}$$
(In other words, we minimize the maximum deviation from the data.) We discuss
this problem for various cases in our graduate text [1]. Using Lemaréchal's
technique, the problem can be posed as the following constrained optimization
problem:
$$\begin{aligned} \min_{\{a_j\}_{j=0}^{n}} \quad & v \\ \text{subject to} \quad & v \ge y_i - f(t_i), && 1 \le i \le m, \\ & v \ge -\big( y_i - f(t_i) \big), && 1 \le i \le m. \end{aligned} \tag{4.20}$$
Example 4.14
If we fit a quadratic to the data from Example 3.29, that is, if we fit the data
i    t_i    y_i
0    0      1
1    1      4
2    2      5
3    3      8
we have $f(t_i) = a_0 + a_1 t_i + a_2 t_i^2$, and the optimization problem (4.20) becomes
$$\begin{aligned} \min_{a_0, a_1, a_2} \quad & v \\ \text{subject to} \quad & v \ge 1 - (a_0), \\ & v \ge 4 - (a_0 + a_1 + a_2), \\ & v \ge 5 - (a_0 + 2a_1 + 4a_2), \\ & v \ge 8 - (a_0 + 3a_1 + 9a_2), \\ & v \ge -1 + a_0, \\ & v \ge -4 + a_0 + a_1 + a_2, \\ & v \ge -5 + a_0 + 2a_1 + 4a_2, \\ & v \ge -8 + a_0 + 3a_1 + 9a_2. \end{aligned}$$
Identifying v with $a_3$, we recognize this optimization problem as the linear
programming problem min v subject to
$$\begin{pmatrix} -1 & 0 & 0 & -1 \\ -1 & -1 & -1 & -1 \\ -1 & -2 & -4 & -1 \\ -1 & -3 & -9 & -1 \\ 1 & 0 & 0 & -1 \\ 1 & 1 & 1 & -1 \\ 1 & 2 & 4 & -1 \\ 1 & 3 & 9 & -1 \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ v \end{pmatrix} \le \begin{pmatrix} -1 \\ -4 \\ -5 \\ -8 \\ 1 \\ 4 \\ 5 \\ 8 \end{pmatrix}.$$
If we have the matlab optimization toolbox, we may use linprog to solve
this problem, as follows:
>> M = [-1 0 0 -1
-1 -1 -1 -1
-1 -2 -4 -1
-1 -3 -9 -1
1 0 0 -1
1 1 1 -1
1 2 4 -1
1 3 9 -1]
M =
-1 0 0 -1
-1 -1 -1 -1
-1 -2 -4 -1
-1 -3 -9 -1
1 0 0 -1
1 1 1 -1
1 2 4 -1
1 3 9 -1
>> b = [-1;-4;-5;-8;1;4;5;8]
b =
-1
-4
-5
-8
1
4
5
8
>> f = [0;0;0;1]
f =
0
0
0
1
>> a = linprog(f,M,b)
Optimization terminated.
a =
1.5000
2.0000
0.0000
0.5000
>> tt = linspace(0,3);
>> yy = a(1) + a(2)*tt + a(3)*tt.^2;
>> ti = [0 1 2 3]
ti =
0 1 2 3
>> yi = [1 4 5 8]
yi =
1 4 5 8
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(ti,yi,'LineStyle','none','Marker','*','MarkerEdgeColor','red','MarkerSize',15)
>> plot(tt,yy)
This dialog results in the following plot:
[Plot: the four data points and the minimax fit; horizontal axis from 0 to 3, vertical axis from 1 to 8.]
Just as in the least squares ﬁt, the minimax “quadratic” is also a line for this
particular example. However, the minimax line seems to ﬁt the data better
than the least squares ﬁt, for this particular case.
We observe in Example 4.14 that the deviations of the ﬁt from the actual
data alternate in sign but all have the same absolute value. This is a general
property of minimax approximations, that is desirable if the data do not have
errors and if we want the maximum error in the approximation to be small.
However, minimax ﬁts are sensitive to large errors in single data points (what
statisticians call outliers). The next type of ﬁt is not so sensitive to this kind
of data error.
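The equal-magnitude, alternating-sign property can be read off directly from the coefficients reported in Example 4.14. The sketch below also computes, for comparison, the largest deviation of the ordinary least squares quadratic obtained with polyfit; the variable names are illustrative.

```matlab
% Deviations of the data from the minimax fit of Example 4.14.
ti = [0 1 2 3];  yi = [1 4 5 8];
a = [1.5 2.0 0.0];                            % minimax coefficients reported above
r = yi - (a(1) + a(2)*ti + a(3)*ti.^2);       % deviations y_i - f(t_i)
disp(r)

% Largest deviation of the least squares quadratic, for comparison.
p = polyfit(ti, yi, 2);                       % coefficients, highest degree first
r_ls = yi - polyval(p, ti);
fprintf('max deviation: minimax %.2f, least squares %.2f\n', ...
        max(abs(r)), max(abs(r_ls)));
```

The minimax deviations are −0.5, 0.5, −0.5, 0.5, while the least squares fit has a larger maximum deviation (about 0.6) in exchange for a smaller sum of squares.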
4.5.3 Sum of Absolute Values Approximation
In this type of approximation, also known as $\ell_1$-approximation, instead of
minimizing the function in (4.18), we do the following minimization:
$$\min_{\{a_j\}_{j=0}^{n}} \sum_{i=1}^{m} |y_i - f(t_i)|. \tag{4.21}$$
As in minimax optimization, we may use Lemaréchal's technique to pose the
problem as the following constrained optimization problem:
$$\begin{aligned} \min_{\{a_j\}_{j=0}^{n}} \quad & \sum_{i=1}^{m} v_i \\ \text{subject to} \quad & v_i \ge y_i - f(t_i), && 1 \le i \le m, \\ & v_i \ge -\big( y_i - f(t_i) \big), && 1 \le i \le m. \end{aligned} \tag{4.22}$$
Example 4.15
We will fit a quadratic to the data from Example 3.29, just as we did in the
minimax example (Example 4.14, starting on page 173). The optimization
problem (4.22) becomes
$$\begin{aligned} \min_{a_0, a_1, a_2} \quad & v_1 + v_2 + v_3 + v_4 \\ \text{subject to} \quad & v_1 \ge 1 - (a_0), \\ & v_2 \ge 4 - (a_0 + a_1 + a_2), \\ & v_3 \ge 5 - (a_0 + 2a_1 + 4a_2), \\ & v_4 \ge 8 - (a_0 + 3a_1 + 9a_2), \\ & v_1 \ge -1 + a_0, \\ & v_2 \ge -4 + a_0 + a_1 + a_2, \\ & v_3 \ge -5 + a_0 + 2a_1 + 4a_2, \\ & v_4 \ge -8 + a_0 + 3a_1 + 9a_2. \end{aligned}$$
Identifying $v_1$, $v_2$, $v_3$, and $v_4$ with $a_3$, $a_4$, $a_5$, and $a_6$, we recognize this
optimization problem as the linear programming problem min $v_1 + v_2 + v_3 + v_4$
subject to
$$\begin{pmatrix} -1 & 0 & 0 & -1 & 0 & 0 & 0 \\ -1 & -1 & -1 & 0 & -1 & 0 & 0 \\ -1 & -2 & -4 & 0 & 0 & -1 & 0 \\ -1 & -3 & -9 & 0 & 0 & 0 & -1 \\ 1 & 0 & 0 & -1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & -1 & 0 & 0 \\ 1 & 2 & 4 & 0 & 0 & -1 & 0 \\ 1 & 3 & 9 & 0 & 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ v_1 \\ v_2 \\ v_3 \\ v_4 \end{pmatrix} \le \begin{pmatrix} -1 \\ -4 \\ -5 \\ -8 \\ 1 \\ 4 \\ 5 \\ 8 \end{pmatrix}.$$
If we have the matlab optimization toolbox, we may use linprog to solve
this problem, analogously to what we did in Example 4.14. We
obtain the following fit:
a =
1.0000
2.3523
-0.0063
0.0000
0.6540
0.6793
0.0000
which has the following graph:
[Plot: the four data points and the $\ell_1$ fit; horizontal axis from 0 to 3, vertical axis from 1 to 8.]
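The linprog call for the $\ell_1$ problem of Example 4.15 can be sketched as follows. This assumes the Optimization Toolbox is installed; the variable names V, M, b, and c are illustrative, and the constraint matrix is assembled in blocks rather than typed out.

```matlab
% l_1 fit of a quadratic as a linear program (requires the Optimization Toolbox).
ti = [0; 1; 2; 3];  yi = [1; 4; 5; 8];
V = [ones(4, 1), ti, ti.^2];        % rows (1, t_i, t_i^2)
M = [-V, -eye(4);                   % v_i >=   y_i - f(t_i)
      V, -eye(4)];                  % v_i >= -(y_i - f(t_i))
b = [-yi; yi];
c = [0; 0; 0; 1; 1; 1; 1];          % objective: v_1 + v_2 + v_3 + v_4
a = linprog(c, M, b);               % unknowns (a_0, a_1, a_2, v_1, ..., v_4)
total = sum(a(4:7));                % minimized sum of absolute deviations
fprintf('sum of absolute deviations: %.4f\n', total);
```

The minimized sum of absolute deviations is 4/3 for this data. The minimizing coefficients do not appear to be unique (for instance, the line through the first and last data points attains the same sum), so the coefficients returned may differ from those displayed above depending on the algorithm linprog uses.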
Least absolute value (that is, $\ell_1$) fits are a type of fit statisticians call robust.
This means that, if we add a large amount of error to just one point of many,
it will not affect the approximating function much (or at least not as much as it
would if we were doing, say, an $\ell_\infty$ fit).
4.5.4 Weighted Fits
There are endless variations on least-squares, $\ell_\infty$, and $\ell_1$ fits. In particular
applications or models, we may have a large number of data points, but we
may judge some data points to be more important than others. We express
this importance with weights
$$\{w_i\}_{i=1}^{m}, \quad w_i > 0, \quad 1 \le i \le m,$$
associated with each data point. The corresponding weighted least squares
problem would then become
$$\min \sum_{i=1}^{m} w_i \big( y_i - f(t_i) \big)^2,$$
while the weighted minimax problem would become
$$\min \left\{ \max_{1 \le i \le m} w_i |y_i - f(t_i)| \right\},$$
and the weighted $\ell_1$ problem would become
$$\min \sum_{i=1}^{m} w_i |y_i - f(t_i)|.$$
Example 4.16
Let us continue with the data
i    t_i    y_i
0    0      1
1    1      4
2    2      5
3    3      8
Suppose we have decided to do a least squares fit, but we have either determined
that the data for i = 1 and i = 2 was scaled five times larger than
that for i = 0 and i = 3, or we have determined that the data corresponding
to i = 0 and i = 3 is five times as important. If we desire to fit a quadratic
polynomial $f(t) = a_0 + a_1 t + a_2 t^2$, we would formulate the corresponding
weighted least squares problem as
$$\min_{a_0, a_1, a_2} \; 5(1 - a_0)^2 + (4 - a_0 - a_1 - a_2)^2 + (5 - a_0 - 2a_1 - 4a_2)^2 + 5(8 - a_0 - 3a_1 - 9a_2)^2.$$
Unfortunately, an off-the-shelf QR-decomposition will not work directly$^{2}$. An
easy option, provided ill-conditioning is not judged to be a problem, is to
form the normal equations directly. For this particular example, the normal
equations are
$$\begin{pmatrix} 12 & 18 & 50 \\ 18 & 50 & 144 \\ 50 & 144 & 422 \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix} = \begin{pmatrix} 54 \\ 134 \\ 384 \end{pmatrix}.$$
Solving this system with matlab gives$^{3}$:
>> AtA = [12 18 50; 18 50 144; 50 144 422]
AtA =
12 18 50
18 50 144
50 144 422
>> Atb = [54;134;384]
Atb =
54
134
384
>> a = AtA\Atb
a =
1.0435
2.3043
0
>> yy = a(1) + a(2)*tt + a(3)*tt.^2;
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(ti,yi,'LineStyle','none','Marker','*','MarkerEdgeColor','red','MarkerSize',15)
>> plot(tt,yy)
>>
The corresponding plot is:
[Plot: the four data points and the weighted least squares fit; horizontal axis from 0 to 3, vertical axis from 1 to 8.]
[Footnote 2] However, we may design special QR-decomposition routines for weighted least squares
problems. These routines would be based on a weighted definition of orthogonality.
[Footnote 3] Within the environment of our previous examples.
We see that the fit approximates the highly weighted points much more closely
than the unweighted fit (which you see on page 120).
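One common workaround, not described in the text, is to scale each row of the overdetermined system by $\sqrt{w_i}$, after which an ordinary least squares solver applies. The sketch below does this for the data of Example 4.16 and also forms the normal equations for comparison; the variable names are illustrative.

```matlab
% Weighted least squares by row scaling with sqrt(w_i).
t = [0; 1; 2; 3];  y = [1; 4; 5; 8];
w = [5; 1; 1; 5];                          % weights from Example 4.16
A = [ones(size(t)), t, t.^2];
D = diag(sqrt(w));
a = (D*A) \ (D*y);                         % ordinary least squares on the scaled system

AtA = A' * diag(w) * A;                    % normal-equations matrix, for comparison
Atb = A' * diag(w) * y;
disp(a')
```

The scaled system has the same minimizer because $\sum_i w_i (y_i - f(t_i))^2$ equals the squared 2-norm of the scaled residual. The result should agree with the normal-equations solution, approximately (1.0435, 2.3043, 0).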
4.6 Approximation Other Than by Polynomials
So far, we have examined approximation of data and functions by functions
of the form
$$\varphi(x) = \sum_{j=0}^{n} a_j \varphi_j(x), \tag{4.23}$$
where we have chosen $\varphi_j(x) = x^j$ for polynomial interpolation by forming
the Vandermonde system, as well as for $\ell_1$ and minimax fits. We also chose,
as alternatives, $\varphi_j$ to be the Lagrange function corresponding to the sample
point $x_j$, for the Lagrange form of the interpolating polynomial, and
$\varphi_j(x) = \prod_{i=0}^{j-1} (x - x_i)$ for the Newton form of the interpolating
polynomial. For splines, we chose $\varphi_j$ to be the j-th B-spline associated with
the point set $\{x_j\}_{j=0}^{n}$.
Often, non-polynomial $\varphi_j$ are used, and sometimes, even more general forms
than (4.23) are used. For example, rational approximation, that is, approximation
by functions of the form
$$\varphi(x) = \frac{\sum_{j=0}^{n_1} a_j x^j}{\sum_{j=0}^{n_2} b_j x^j}$$
can be effective in various contexts. Some facts about rational approximation
appear in our graduate text [1] and elsewhere.
Another very common type of approximation is of the form (4.23), where we
choose $\varphi_j(x)$ to be cos(jx) or sin(jx) (or $e^{ijx}$, where i denotes the imaginary
unit here). Approximation by such trigonometric polynomials is ubiquitous
throughout signal processing and elsewhere, and also leads to the branch of
mathematics termed Fourier analysis. In fact, a special associated algorithm,
the Fast Fourier Transform, or FFT, is the basis of digital transmission of
audio and video signals.
We may also approximate by sums of exponentials. The approximation
may be linear, that is, of the form (4.23), or nonlinear.
Example 4.17
Let us approximate the data
i    t_i    y_i
0    0      1
1    1      4
2    2      5
3    3      8
in the least squares sense
1. by $\varphi(t) = a_0 + a_1 e^{t}$, and
2. by $\varphi(t) = a_0 e^{a_1 t}$.
In the first case, the approximation is of the form (4.23), with $\varphi_0(t) = 1$ and
$\varphi_1(t) = e^{t}$, and we may use the same techniques as with polynomial least
squares. The overdetermined system as in (3.19) (on page 117) is
$$\begin{pmatrix} 1 & 1 \\ 1 & e \\ 1 & e^2 \\ 1 & e^3 \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = \begin{pmatrix} 1 \\ 4 \\ 5 \\ 8 \end{pmatrix},$$
and, using a computation similar to that in Example 3.29 (on page 119), we
obtain the following result and plot:
a = 2.0842 0.3098
[Plot: the four data points and the fit $a_0 + a_1 e^{t}$; horizontal axis from 0 to 3, vertical axis from 1 to 9.]
We see that this particular form does not seem to ﬁt the data well.
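The linear exponential fit can be reproduced with the same backslash computation used for polynomial least squares. This is a minimal sketch with illustrative variable names.

```matlab
% Least squares fit of a_0 + a_1*exp(t) to the four data points.
ti = [0; 1; 2; 3];  yi = [1; 4; 5; 8];
A = [ones(size(ti)), exp(ti)];     % columns: phi_0 = 1, phi_1 = exp(t)
a = A \ yi;                        % least squares solution of the overdetermined system
r = yi - A*a;                      % residuals at the data points
disp(a')
```

The coefficients agree with the values 2.0842 and 0.3098 quoted above, and the residuals make the poor fit visible: they are of the order of 1, large compared with the spread of the data.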
For the nonlinear exponential form $\varphi(t) = a_0 e^{a_1 t}$, we need to minimize the
function
$$f(a_0, a_1) = (1 - a_0)^2 + (4 - a_0 e^{a_1})^2 + (5 - a_0 e^{2a_1})^2 + (8 - a_0 e^{3a_1})^2.$$
Since this function is nonlinear, we cannot use just a single linear computation
such as a QR-decomposition. However, we may use techniques from
nonlinear optimization, such as setting the gradient equal to zero and solving
the resulting nonlinear system using techniques from Chapter 8. In fact,
however, special techniques have been developed for solving nonlinear least
squares problems, such as those embodied in the routine lsqcurvefit from
matlab's optimization toolbox. To use this routine, we need to program $\varphi(t)$,
which we do in the following matlab "m" file:
function [y] = exponential_fit(a,t)
y = a(1) * exp(a(2)*t);
Assuming we are continuing the dialog from the previous examples, we use exponential_fit.m in the following matlab dialog to compute and plot the fit:
>> x0=rand(2,1)
x0 =
0.3529
0.8132
>> a = lsqcurvefit('exponential_fit',x0,ti,yi)
Optimization terminated: relative function value
changing by less than OPTIONS.TolFun.
a =
1.9311
0.4781
>> yy = exponential_fit(a,tt);
>> hold
Current plot held
>> plot(ti,yi,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(tt,yy)
>>
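[Plot: the data points (red asterisks) together with the nonlinear exponential fit; horizontal axis $t$ from 0 to 3, vertical axis from 1 to 9.]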
Caution: Routines such as lsqcurvefit use heuristics, and may not return mathematically correct fits; sometimes the fits that they return are not near the actual best fits. One way of gathering evidence that the fit is correct is to try different starting points. Also, the routine lsqcurvefit has various options, including an option to supply partial derivatives of the fitting function; trying different options may either give one more confidence in the fit or provide evidence that the fit is not good. Yet another possibility is to use global optimization software, in particular software with automatic verification, such as the interval-arithmetic-based software we describe in [1, Section 9.6,3] or in [26]. In fact, we used our GlobSol package [21] to verify that the fit we have displayed is correct. To see if lsqcurvefit was trustworthy in this case, we also tried lsqcurvefit with various starting points, and found that each time it gave the same fit⁴ that we have displayed.
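The idea of trying several starting points can be sketched as follows. This is an illustration under stated assumptions, not the computation we actually performed: it uses fminsearch (a derivative-free simplex routine available in base matlab) as a stand-in for lsqcurvefit, and the starting points and tolerances are arbitrary choices made for the sketch.

```matlab
% Data from Example 4.17
ti = [0; 1; 2; 3];
yi = [1; 4; 5; 8];
% Sum of squared residuals for phi(t) = a(1)*exp(a(2)*t)
f = @(a) sum((yi - a(1)*exp(a(2)*ti)).^2);
% A few fixed starting points, one per column (arbitrary choices)
starts = [1 0.5; 2 0.2; 0.5 1; 3 0.4]';
opts = optimset('TolX', 1e-10, 'TolFun', 1e-12, ...
                'MaxFunEvals', 5000, 'MaxIter', 5000);
nstart = size(starts, 2);
best = zeros(2, nstart);
fval = zeros(1, nstart);
for k = 1:nstart
    [best(:,k), fval(k)] = fminsearch(f, starts(:,k), opts);
end
% Largest disagreement between the computed minimizers; a small value
% is evidence (not proof) that they all found the same fit
spread = max(max(abs(best - best(:,1)*ones(1, nstart))));
```

Agreement among the columns of best supports the fit but does not verify it; only a verified method, such as the interval-based software mentioned above, can do that.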
In general, if we are conjecturing an underlying model to the data, we choose a form that we think corresponds to the underlying processes that produced the data. For example, sometimes the coefficients $a_j$ are coefficients in a differential equation.
4.7 Interval (Rigorous) Bounds on the Errors
Whether we are considering Taylor polynomial approximation, interpolation, least squares, or minimax approximation, the error term in the approximation can be put in the general form
$$f(x) = p(x) + K(x)M(f; x) \quad \text{for } x \in [a, b], \tag{4.24}$$
where $K$ becomes small as we increase the number of points, and $M$ depends on a derivative of $f$. We list $K$ and $M$ for various approximations⁵ in Table 4.1. In such cases, $p$ and $K$ can be evaluated explicitly, while $M(f; x)$ can be estimated using interval arithmetic. We illustrated how to do this for $f(x) = e^{x}$, using a degree-5 Taylor polynomial, in Example 1.22 on page 29. We elaborate here: In addition to bounding particular values of the function, a maximum error of approximation and rigorous bounds valid for all of $[a, b]$ can be inferred. In particular, the polynomial part $p(x)$ is evaluated at a point (but using outwardly rounded interval arithmetic to maintain mathematical rigor), and the error part is evaluated with interval arithmetic.
Example 4.18
Consider approximating $\sin(x)$, $x \in [-0.1, 0.1]$, by a degree-5
1. Taylor polynomial about zero,
2. interpolating polynomial at the points $x_k = -0.1 + 0.04k$, $0 \le k \le 5$.
For the Taylor polynomial, we observe that the fifth degree Taylor polynomial is the same as the sixth degree Taylor polynomial, and we have
$$\sin(x) \in x - \frac{1}{6}x^3 + \frac{1}{120}x^5 - \frac{1}{5040}x^7 \sin(\xi) \quad \text{for some } \xi \in [-0.1, 0.1]. \tag{4.25}$$
⁴ approximately
⁵ The error of approximation of smooth functions by Chebyshev polynomials can be much less than for nonsmooth (merely $C^0$) functions, as is indicated in Remark ?? combined with Theorem ??; however, bounds on the error may be more complicated to find in this case.
TABLE 4.1: Error factors K and M in polynomial approximations $f(x) = p(x) + K(x)M(f; x)$.

| Type of approximation | $K$ | $M(f)$ |
| degree $n$ Taylor polynomial | $\frac{1}{(n+1)!}(x - x_0)^{n+1}$ | $f^{(n+1)}(\xi(x))$, $\xi \in [a, b]$ unknown |
| polynomial interpolation at $n + 1$ points | $\frac{1}{(n+1)!}\prod_{i=0}^{n}(x - x_i)$ | $f^{(n+1)}(\xi(x))$, $\xi \in [a, b]$ unknown |
| $\lvert f(x) - p(x)\rvert \le K(x)M(f, x)$ (Bounds on the error only⁶): | | |
| piecewise linear interpolation | $\frac{h^2}{8}$ | $\max_{x \in [a,b]} \lvert f''(x)\rvert$ |
| interpolation with clamped cubic splines | $\frac{5}{384}h^4$ | $\max_{x \in [a,b]} \lvert f''''(x)\rvert$ |

⁶ The actual equation 4.24 can be given, but it is more complicated, involving conditional branches.
We can replace $\sin(\xi)$ by an appropriate interval to get a pointwise estimate; for example,
$$\begin{aligned} \sin(0.05) &\in 0.05 - \frac{0.05^3}{6} + \frac{0.05^5}{120} - \frac{0.05^7}{5040}[0, 0.05] \\ &\subseteq [0.049979169270821, 0.04997916927084], \end{aligned}$$
where the above bounds are mathematically rigorous. Here, $K$ was evaluated at the point $x$, but $\sin(\xi)$ was replaced by $\sin([0, 0.05])$. Similarly,
$$\begin{aligned} \sin(-0.01) &\in (-0.01) - \frac{(-0.01)^3}{6} + \frac{(-0.01)^5}{120} - \frac{(-0.01)^7}{5040}[-0.01, 0] \\ &\subseteq [-0.00999983333417, -0.00999983333416]. \end{aligned}$$
Thus, since we know $\sin(x)$ is monotonic for $x \in [-0.01, 0.05]$, $[-0.00999983333417, 0.04997916927084]$ represents a fairly sharp bound on the range $\{\sin(x) \mid x \in [-0.01, 0.05]\}$. Alternately, it may be more convenient in some contexts to evaluate $K$ and $M$ over the entire interval, although this leads to a less sharp result. Using that technique, we would have
$$\begin{aligned} \sin(0.05) &\in 0.05 - \frac{0.05^3}{6} + \frac{0.05^5}{120} - \frac{[-0.1, 0.1]^7}{5040}[-0.1, 0.1] \\ &\subseteq 0.05 - \frac{0.05^3}{6} + \frac{0.05^5}{120} - [-0.19841269841270 \times 10^{-11}, 0.19841269841270 \times 10^{-11}] \\ &\subseteq [0.04997916926884, 0.04997916927282], \end{aligned}$$
and
$$\begin{aligned} \sin(-0.01) &\in (-0.01) - \frac{(-0.01)^3}{6} + \frac{(-0.01)^5}{120} - \frac{[-0.1, 0.1]^7}{5040}[-0.1, 0.1] \\ &\subseteq (-0.01) - \frac{(-0.01)^3}{6} + \frac{(-0.01)^5}{120} - [-0.19841269841270 \times 10^{-11}, 0.19841269841270 \times 10^{-11}] \\ &\subseteq [-0.00999983333616, -0.00999983333218], \end{aligned}$$
thus obtaining (somewhat less sharp) bounds
$$[-0.00999983333616, 0.04997916927282]$$
on the range $\{\sin(x) \mid x \in [-0.01, 0.05]\}$.
In general, substituting intervals into the polynomial approximation itself does not give sharp bounds on the range. For example,
$$\begin{aligned} \sin([-0.01, 0.05]) &\subseteq [-0.01, 0.05] - \frac{[-0.01, 0.05]^3}{6} + \frac{[-0.01, 0.05]^5}{120} - [-0.19841269841270 \times 10^{-11}, 0.19841269841270 \times 10^{-11}] \\ &\subseteq [-0.01002083333616, 0.05000016927282]. \end{aligned}$$
Nonetheless, in some contexts in which there is no alternative, this technique gives usable bounds.
Computing bounds based on the interpolating polynomial is similar to com-
puting bounds based on the Taylor polynomial, and is left as Exercise 4.
4.8 Applications
To better understand the population dynamics of American green tree frogs
(Hyla cinerea), scientists used a capture-mark-recapture method to follow a
population at an urban study site in Lafayette, LA, during their breeding
seasons. The following data are the weekly frog population estimates from
week 2 (June 24, 2004) of the 2004 dataset [2].
Week 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18
Estimate 143 415 140 177 150 125 133 151 123 1429 487 523 228 416 341 523
Now, suppose we are looking for a least squares fit to the data, which means we want to find a function of the form
$$y \approx f(t) = \sum_{j=0}^{n} c_j \varphi_j(t),$$
where the coefficients $c_j$, $0 \le j \le n$, can be found by solving the minimization problem
$$\min_{\{c_j\}_{j=0}^{n}} \sum_{i=1}^{m} (y_i - f(t_i))^2.$$
Here, $(t_i, y_i)$ are the given data, so $m = 16$. We have decided to use the hat functions (on page 161) $\varphi_j(t)$, $j = 1, 2, \ldots, 13$ for $t \in [1, 19]$ as the basis functions; that is, we will use hat functions centered at the equally spaced points 1, 2.5, 4, . . . , 17.5, 19 (spacing $h = 1.5$). We saw in Section 3.4.2 that this minimization problem can be solved with a QR-decomposition. The following matlab "m" files show how to find all the coefficients $c_j$, $j = 1, 2, \ldots, 13$, and plot both the actual data and the least squares fit in the same xy plane.
function p = hat_function_value (j, xi, n, h, z);
p=0;
if(j==1 & xi(1) <= z & z <=xi(2))
p=(xi(2)-z)/h;
elseif (j==n+1 & xi(n) <= z & z <= xi(n+1))
p=(z-xi(n))/h;
elseif( 2 <= j & j <= n & xi(j-1) <= z & z <= xi(j))
p=(z-xi(j-1))/h;
elseif (2 <= j & j <= n & xi(j) <= z & z <= xi(j+1))
p=(xi(j+1)-z)/h;
end
return
The above function is used in the following matlab script.
clc,clear,close all
t = [2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18];
y = [143 415 140 177 150 125 133 151 123 1429 487 523 228 416 341 523];
h=1.5;
xi=1:h:19;
m = max(size(t)); % m data items
n = length(xi)-1; % a basis of n+1 functions
A=zeros(m,n+1);
for i=1:m % i means ith observation
for j=1:n+1 % j means jth basis function
A(i,j)= hat_function_value (j,xi,n,h,t(i)); % call the hat function
end
end
[Q,R]=qr(A); % perform the QR decomposition of matrix A.
e=Q'*y'; % form the right-hand side Q'*b
f=e(1:(n+1)); % use the first n+1 rows of Rx=Q'b.
r=inv(R(1:(n+1),1:n+1)); % invert the square triangular block of R
c=r*f % c is the coefficient vector of the hat function basis.
for i=1:length(t)
xx(i)=sum(c'.*(A(i,1:(n+1)))); % value of the least squares fit at t(i)
end
axis([0,20,100,1450]) % plot both the actual data and the least squares
hold % approximation
plot(t,y,'*',t,xx,'-')
legend('Actual Data','Least Squares Fit')
Running the above matlab script gives the following $c_j$ in the command window:
>> c
c =
1.0e+003 *
-0.6726
0.5508
0.1434
0.1783
0.1242
0.1255
0.2258
1.5561
-0.7023
0.6612
0.3106
0.3562
0.8566
The script also gives the following plot:
[Plot: actual data and least squares fit; horizontal axis from 0 to 20, vertical axis from 200 to 1400. Legend: Actual Data, Least Squares Fit.]
Notice that the population estimate of week 11 is much larger than the other weekly estimates. This suggests that the week 11 data point is an outlier, that is, it is somehow exceptional or in error. In the next fitting experiment, we remove $(t_{10}, y_{10}) = (11, 1429)$ from our dataset, and adjust the matlab codes accordingly. In particular, we replace the second and third lines of the script by
t = [2 3 4 5 6 7 8 9 10 12 14 15 16 17 18];
y = [143 415 140 177 150 125 133 151 123 487 523 228 416 341 523];
and we replace the “axis” command (after examination of the data and ex-
perimentation) by
axis([0,20,100,700])
The resulting plot (seen below) shows that closer approximation can be ob-
tained by omitting this outlier.
[Plot: actual data and least squares fit with the outlier removed; horizontal axis from 0 to 20, vertical axis from 100 to 700. Legend: Actual Data, Least Squares Fit.]
The corresponding set of coeﬃcients is
c =
1.0e+003 *
-0.6728
0.5509
0.1432
0.1797
0.1192
0.1518
0.1256
0.0800
1.3010
0.1340
0.4160
0.3035
0.9620
Note: Here, we have explicitly used the QR decomposition for illustration.
Actually, matlab has a function lscov that will compute the least squares
solution to a system Ax = b, probably more eﬃciently than the explicit mat-
lab statements we have just exhibited. A corresponding matlab script for
the data with the outlier removed is as follows.
clc,clear,close all
t = [2 3 4 5 6 7 8 9 10 12 14 15 16 17 18];
y = [143 415 140 177 150 125 133 151 123 487 523 228 416 341 523];
h=1.5;
xi=1:h:19;
m = max(size(t)); % m data items
n = length(xi)-1; % a basis of n+1 functions
A=zeros(m,n+1);
for i=1:m % i means ith observation
for j=1:n+1 % j means jth basis function
A(i,j)= hat_function_value (j,xi,n,h,t(i)); % call the hat function
end
end
c = lscov(A,y') % Compute the least squares solution
for i=1:length(t)
xx(i)=sum(c'.*(A(i,1:(n+1))));
end
axis([0,20,100,700]) % plot both the actual data and the least squares
hold % approximation
plot(t,y,'*',t,xx,'-')
legend('Actual Data','Least Squares Fit')
4.9 Exercises
1. For f(t) = sin(t),
(a) compute the coeﬃcients of the degree 3 Taylor polynomial approx-
imation at t = 0;
(b) compute the coeﬃcients of the degree 3 polynomial that interpo-
lates f at t = −1, t = −1/3, t = 1/3, and t = 1.
(c) Rewrite each of the degree-3 polynomials in 1a and 1b in terms of the basis $\varphi_0 \equiv 1$, $\varphi_1 \equiv t$, $\varphi_2 \equiv t^2$, and $\varphi_3 \equiv t^3$, then compare coefficients.
(d) Estimate the maximum error
$$\max_{t \in [-1, 1]} |f(t) - p(t)|$$
for each of the approximations in 1a and 1b.
2. Repeat the computations for Example 4.18 on page 182, except use the
interpolating polynomial at the six Chebyshev points, rather than at six
equally spaced points.
3. Fill in the details of the computations in Example 4.4 (starting on
page 152). In particular, solve the lower triangular system, and also
arrange the computations so you see that, by solving it, you do the
same computations as computing the divided diﬀerences. Also, rear-
range the terms of the polynomial to rewrite it in power form, and show
that you get the same coeﬃcients in power form as you do by solving
the Vandermonde system.
4. Complete the computations in Example 4.18 on page 182. That is,
approximate sin(x), x ∈ [−0.1, 0.1] by an interpolating polynomial of
degree 5, and use interval arithmetic to obtain an interval bounding the
exact solution of sin(0.05). Do it
(a) with equally spaced points, and
(b) with Chebyshev points.
5. Redo the matlab computations in Example 4.5 (on page 154), but use interval arithmetic (say, with intlab). Doing so, you will obtain mathematically rigorous bounds on the truncation error, from the upper bounds of the intervals produced.
6. Actually do Example 4.7 (on page 157).
7. (A signiﬁcant programming project) Consider the piecewise linear in-
terpolant in Example 4.9 (on page 161).
(a) Write a matlab function [y] = hat(t,ti,i) that accepts a set of abscissas ti (an array with m points), the specified index i, and the specified point of evaluation t, and returns the value of the hat function centered at ti(i).
(b) Use matlab and your hat function routine from part (a) of this
problem to evaluate the piecewise linear interpolant from Exam-
ple 4.9 and produce a graph as in Example 4.9.
(c) Observe that your graph is the same graph as the one that appears
in the text for the example.
8. Show that conditions (n) in Deﬁnition 4.7 of the natural spline inter-
polant (on page 165) lead to the system (4.16) (on page 166).
9. Write down the details stating that (4.14) (on page 166) can be written
in matrix form as (4.15) (on page 166).
10. (Involves significant algebraic manipulation) In Example 4.11,
(a) Write down $s_{-1}$, $s_0$, $s_1$, $s_2$, and $s_3$ explicitly in terms of branching.
(b) Simplify the expansion of s(x) in Example 4.11 to write s(x) as a
set of cubic polynomials in power form. (You will have a diﬀerent
cubic polynomial for each of the three subintervals [0, 1], [1, 2], and
[2, 3].)
(c) Write a matlab routine that evaluates your polynomial. (You will
need to use branching statements such as if — elseif — else —
end.)
(d) Use your matlab routine to plot s(x).
(e) Compare your graph with that given in the text to make sure it is
the same.
11. Let $s_1(x) = 1 + c(x + 1)^3$, $-1 \le x \le 0$, where $c$ is a real number. Determine $s_2(x)$ on $0 \le x \le 1$ so that
$$s(x) = \begin{cases} s_1(x), & -1 \le x \le 0, \\ s_2(x), & 0 \le x \le 1, \end{cases}$$
is a natural cubic spline on $[-1, 1]$ with nodal points at $-1$, $0$, $1$, i.e., $s''(-1) = s''(1) = 0$. How must $c$ be chosen if one wants $s(1) = -1$?
12. Complete the computations in Example 4.15 (on page 175), analogously to what was done in Example 4.14 (on page 173), to verify that we obtain the graph that is displayed in the text. Show all your computations.
13. Fill in the details of the computations in Example 4.17 (on page 180).
Exhibit all of your work.
Chapter 5
Eigenvalue-Eigenvector Computation
In Chapter 3, we introduced the concept of an eigenvalue-eigenvector pair of an $n$ by $n$ matrix $A$, that is, a scalar $\lambda$ and a vector $v$ such that
$$Av = \lambda v,$$
and we referred to eigenvalues and eigenvectors when talking about convergence of iterative methods. In fact, eigenvalues and eigenvectors are fundamental in various applications. In particular, physical systems exhibit characteristic modes of vibration (that is, resonance frequencies) that are described by eigenvalues and eigenvectors. For example, structures such as buildings and bridges can be modeled by differential equations, and the resonant, or characteristic¹, frequencies of models of such structures are routinely computed in earthquake-prone zones such as California, prior to construction.
We discuss some basic methods for computing eigenvalues and eigenvectors
in this chapter. First, we introduce necessary facts.
5.1 Facts About Eigenvalues and Eigenvectors
THEOREM 5.1
$\lambda$ is an eigenvalue of $A$ if and only if the determinant $\det(A - \lambda I) = 0$.
The determinant defines the characteristic polynomial
$$\det(A - \lambda I) = (-1)^n \left( \lambda^n + \alpha_{n-1}\lambda^{n-1} + \alpha_{n-2}\lambda^{n-2} + \cdots + \alpha_1 \lambda + \alpha_0 \right).$$
Thus, the fundamental theorem of algebra tells us that $A$ has exactly $n$ eigenvalues, the roots of the above polynomial, in the complex plane, counting multiplicities. The set of eigenvalues of $A$ is called the spectrum of $A$. Recall the following.
¹ The prefix “eigen” is a German-language prefix, meaning, roughly, “characteristic.”
(a) The spectral radius of $A$ is defined by
$$\rho(A) = \max_{\lambda \text{ an eigenvalue of } A} |\lambda|.$$
(b) $\|A\|_2 = \sqrt{\rho(A^H A)}$. If $A^H = A$ (that is, if $A$ is Hermitian), then $\|A\|_2 = \rho(A)$.
Also, it can be shown that $|\lambda| \le \|A\|$ for any induced matrix norm $\|\cdot\|$ and any eigenvalue $\lambda$.
Example 5.1
Let
$$A = \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix}.$$
Then the characteristic polynomial of $A$ is
$$|A - \lambda I| = \begin{vmatrix} 1 - \lambda & 2 \\ -1 & 4 - \lambda \end{vmatrix} = (1 - \lambda)(4 - \lambda) - (2)(-1) = \lambda^2 - 5\lambda + 6,$$
so the eigenvalues of $A$ are $\lambda_2 = 2$ and $\lambda_1 = 3$. In this case, we can compute the eigenvectors $v = (v_1, v_2)^T$ corresponding to $\lambda_2 = 2$ as follows:
$$(A - 2I)v = \begin{pmatrix} 1 - 2 & 2 \\ -1 & 4 - 2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} -1 & 2 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = 0$$
for any vector $v$ with $v_1 = 2v_2$. Thus there is a space of eigenvectors of the form
$$v^{(1)} = t \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \quad t \text{ a number},$$
corresponding to $\lambda_2 = 2$. Similarly, the space of eigenvectors corresponding to $\lambda_1 = 3$ is
$$v^{(2)} = t \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad t \text{ a number}.$$
For this example, the matrix of eigenvectors (normalized somehow, say, so the second component of each is equal to 1),
$$P = (v^{(1)}, v^{(2)}) = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix},$$
is non-singular.
We now use matlab to compute induced norms of $A$, to illustrate that they are greater than or equal to $\max\{\lambda_1, \lambda_2\} = 3$. In this computation, we also illustrate the relationship between the matrix $P$ of eigenvectors of $A$ and $A$.
>> A = [1 2
-1 4]
A =
1 2
-1 4
>> norm(A,1)
ans = 6
>> norm(A,2)
ans = 4.4966
>> max(sqrt(eig(A'*A)))
ans = 4.4966
>> norm(A,inf)
ans = 5
>> P = [2 1
1 1]
P =
2 1
1 1
>> inv(P)*A*P
ans =
2 0
0 3
>>
Example 5.1 is special in several ways:
1. In general, the eigenvalues and eigenvectors are complex, even if the
matrix A is real.
2. In general, an n by n matrix A does not have n linearly independent
eigenvectors, although certain classes of matrices, such as symmetric
ones, do.
3. In general, we do not explicitly form the characteristic equation to com-
pute eigenvalues and eigenvectors, but we use iterative methods like the
basic ones we explain later in this chapter.
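For a matrix as small as the one in Example 5.1, the route through the characteristic polynomial can still be followed numerically, which makes the contrast in item 3 concrete. The sketch below is for illustration only; the variable names are our own, and for larger matrices computing roots of the characteristic polynomial is generally not a reliable substitute for eig.

```matlab
A = [1 2; -1 4];
% Coefficients of det(lambda*I - A), highest power first
p = poly(A);
% Eigenvalues as roots of the characteristic polynomial
lam_roots = sort(roots(p));
% Eigenvalues from the iterative methods underlying eig
lam_eig = sort(eig(A));
% Difference between the two sets of computed eigenvalues
difference = norm(lam_roots - lam_eig);
```

Here p reproduces the coefficients of $\lambda^2 - 5\lambda + 6$, and both computations return the eigenvalues 2 and 3 up to rounding.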
DEFINITION 5.1 A square matrix A is called defective if it has an
eigenvalue of multiplicity k having fewer than k linearly independent eigen-
vectors.
For example, if
$$A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix},$$
then $\lambda_1 = \lambda_2 = 1$, but $x = t \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ is the only eigenvector, so $A$ is defective.
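One way to check this example numerically is to count the linearly independent eigenvectors for the eigenvalue 1 as the dimension of the null space of $A - I$. The following is a minimal sketch with variable names of our own choosing; note that rank decisions made in floating point can be delicate for matrices that are only nearly defective.

```matlab
A = [1 1; 0 1];
lambda = 1;                 % eigenvalue of multiplicity k = 2
k = 2;
n = size(A, 1);
% Number of linearly independent eigenvectors for lambda:
% the dimension of the null space of A - lambda*I
num_eigvecs = n - rank(A - lambda*eye(n));
% Definition 5.1: defective if fewer than k independent eigenvectors
is_defective = num_eigvecs < k;
```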
THEOREM 5.2
Let $A$ and $P$ be $n \times n$ matrices, with $P$ nonsingular. Then $\lambda$ is an eigenvalue of $A$ with eigenvector $x$ if and only if $\lambda$ is an eigenvalue of $P^{-1}AP$ with eigenvector $P^{-1}x$. ($P^{-1}AP$ is called a similarity transformation of $A$, and $A$ and $P^{-1}AP$ are called similar.)
THEOREM 5.3
Let $\{x_i\}_{i=1}^{n}$ be eigenvectors of $A$ corresponding to distinct eigenvalues $\{\lambda_i\}_{i=1}^{n}$. Then the vectors $\{x_i\}_{i=1}^{n}$ are linearly independent.
If $A$ has $n$ different eigenvalues, then the $n$ eigenvectors are linearly independent and thus form a basis for $\mathbb{C}^n$. (This means that the matrix $P$ formed from these eigenvectors is non-singular.) Note that $n$ different eigenvalues is sufficient but not necessary for $\{x_i\}_{i=1}^{n}$ to form a basis. Consider $A = I$ with eigenvectors $\{e_i\}_{i=1}^{n}$.
We now consider some results for the special case when matrix A is Hermi-
tian. Recall that, if A
H
= A, then A is called Hermitian. (A real symmetric
matrix is a special kind of Hermitian matrix.)
THEOREM 5.4
Let $A$ be Hermitian (or real symmetric). The eigenvalues of $A$ are real, and there is an orthonormal system of eigenvectors $w_1, w_2, \ldots, w_n$ of $A$ with $Aw_j = \lambda_j w_j$ and $(w_j, w_k) = w_k^H w_j = \delta_{jk}$.
The orthonormal system is linearly independent and spans $\mathbb{C}^n$, and thus forms a basis for $\mathbb{C}^n$. Thus, any vector $x \in \mathbb{C}^n$ can be expressed as
$$x = \sum_{j=1}^{n} a_j w_j, \quad \text{where } a_j = (x, w_j) \text{ and } \|x\|_2^2 = \sum_{j=1}^{n} |a_j|^2.$$
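The statements of Theorem 5.4 can be observed numerically on a small example. In the sketch below, the symmetric matrix and the vector x are arbitrary choices made for illustration, and we rely on the fact that eig returns orthonormal eigenvectors when its argument is exactly symmetric.

```matlab
A = [2 1 0; 1 3 1; 0 1 4];      % a real symmetric (hence Hermitian) matrix
[W, D] = eig(A);                % columns of W: eigenvectors w_j; diag(D): lambda_j
lambda = diag(D);
% Orthonormality: W'*W should be the identity up to rounding
orth_err = norm(W'*W - eye(3));
% Expansion coefficients a_j = (x, w_j) = w_j^H x
x = [1; -2; 3];
a = W'*x;
% x is recovered as the sum of a_j*w_j
recon_err = norm(W*a - x);
% The squared 2-norm of x equals the sum of the |a_j|^2
norm_err = abs(norm(x)^2 - sum(abs(a).^2));
```

All three error quantities should be at the level of rounding error, and lambda should contain only real numbers.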
The following fact can be used to obtain initial guesses for eigenvalues, to
be used in iterative methods for ﬁnding eigenvalues and eigenvectors.
THEOREM 5.5
(Gerschgorin's Circle Theorem) Let $A$ be any $n \times n$ complex matrix. Then every eigenvalue of $A$ lies in the union of the discs
$$\bigcup_{j=1}^{n} K_{\rho_j}(a_{jj}), \quad \text{where } K_{\rho_j}(a_{jj}) = \{z \in \mathbb{C} : |z - a_{jj}| \le \rho_j\} \text{ for } j = 1, 2, \ldots, n,$$
and where the centers $a_{jj}$ are diagonal elements of $A$ and the radii $\rho_j$ can be taken as:
$$\rho_j = \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}|, \quad j = 1, 2, \ldots, n \tag{5.1}$$
(absolute sum of the elements of each row excluding the diagonal elements),
$$\rho_j = \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{kj}|, \quad j = 1, 2, \ldots, n \tag{5.2}$$
(absolute column sums excluding diagonal elements), or
$$\rho_j = \rho = \Bigg( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2 \Bigg)^{1/2} \tag{5.3}$$
for $j = 1, 2, \ldots, n$.
Example 5.2
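$$A = \begin{pmatrix} 2 & 1 & \frac{1}{2} \\ -1 & -3i & 1 \\ 3 & -2 & -6 \end{pmatrix}$$
Using absolute row sums,
$$a_{11} = 2, \ \rho_1 = 3/2, \qquad a_{22} = -3i, \ \rho_2 = 2, \qquad a_{33} = -6, \ \rho_3 = 5.$$
The eigenvalues are in the union of these discs. For example, $\rho(A) \le 11$. Also, $A$ is nonsingular: none of the three discs contains the origin, so no eigenvalue $\lambda$ can equal zero. (See Figure 5.1.)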
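[Figure: the three Gerschgorin discs in the complex plane (axes Re and Im), with centers 2, −3i, and −6 and radii 3/2, 2, and 5, respectively.]
FIGURE 5.1: Illustration of Gerschgorin discs for Example 5.2.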
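The centers and row radii of Example 5.2 can be computed directly from the matrix, and the conclusion of Gerschgorin's theorem can be checked against eigenvalues computed by eig. The following is a minimal sketch; the variable names and the small tolerance added to the radii (to allow for rounding) are our own choices.

```matlab
A = [ 2,  1,   0.5;
     -1, -3i,  1;
      3, -2,  -6 ];
centers = diag(A);
% Row radii as in (5.1): absolute row sums excluding the diagonal entry
radii = sum(abs(A), 2) - abs(centers);
lambda = eig(A);
% Check that every computed eigenvalue lies in at least one disc
in_union = false(size(lambda));
for m = 1:numel(lambda)
    in_union(m) = any(abs(lambda(m) - centers) <= radii + 1e-12);
end
% A bound on the spectral radius obtained from the discs
rho_bound = max(abs(centers) + radii);
```

The column radii of (5.2) could be obtained in the same way with sum(abs(A), 1).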
Symmetric matrices (or more generally, Hermitian matrices) are special
from the point of view of eigenvalues and eigenvectors. We have a special
version of Gerschgorin’s theorem for such matrices:
THEOREM 5.6
(A Gerschgorin Circle Theorem for Hermitian matrices) If $A$ is Hermitian or real symmetric, then a one-to-one correspondence can be set up² between each disc $K_\rho(a_{jj})$ and each $\lambda_j$ from the spectrum $\lambda_1, \lambda_2, \ldots, \lambda_n$ of $A$, where
$$\rho = \max_j \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}| \quad \text{or} \quad \rho = \Bigg( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2 \Bigg)^{1/2}.$$
(Recall that $K_\rho(a_{jj}) = \{z \in \mathbb{C} : |z - a_{jj}| \le \rho\}$.)
We now turn to some basic methods for computing eigenvalues and eigen-
vectors.
5.2 The Power Method
In this section, we describe a simple iterative method for computing the eigenvector corresponding to the largest (in modulus) eigenvalue of a matrix $A$. We assume first that $A$ is nondefective, i.e., $A$ has a complete set of eigenvectors, and $A$ has a unique simple³ dominant eigenvalue. We will discuss more general cases later.
Specifically, suppose that the $n \times n$ matrix $A$ has a complete set of eigenvectors corresponding to eigenvalues $\{\lambda_j\}_{j=1}^{n}$, and the eigenvalues satisfy
$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|. \tag{5.4}$$
Since the eigenvectors $\{x_j\}$ are linearly independent, they form a basis for $\mathbb{C}^n$. That is, any vector $q^{(0)} \in \mathbb{C}^n$ can be written
$$q^{(0)} = \sum_{j=1}^{n} c_j x_j, \tag{5.5}$$
for some coefficients $c_j$. Starting with initial guess $q^{(0)}$, we define the sequence $\{q^{(\nu)}\}_{\nu \ge 1}$ by
$$q^{(\nu+1)} = \frac{1}{\sigma_{\nu+1}} A q^{(\nu)}, \quad \nu = 0, 1, 2, \ldots \tag{5.6}$$
where the sequence $\{\sigma_\nu\}_{\nu \ge 1}$ consists of scale factors chosen to avoid overflow and underflow errors. From (5.5) and (5.6), we have
$$q^{(\nu)} = \left( \prod_{i=1}^{\nu} \sigma_i^{-1} \right) \sum_{j=1}^{n} \lambda_j^{\nu} c_j x_j = \lambda_1^{\nu} \left( \prod_{i=1}^{\nu} \sigma_i^{-1} \right) \left( c_1 x_1 + \sum_{j=2}^{n} \left( \frac{\lambda_j}{\lambda_1} \right)^{\nu} c_j x_j \right). \tag{5.7}$$
Since by (5.4), $|\lambda_j / \lambda_1| < 1$ for $j \ge 2$, we have $\lim_{\nu \to \infty} (\lambda_j / \lambda_1)^{\nu} = 0$ for $j \ge 2$, and if $c_1 \ne 0$,
$$\lim_{\nu \to \infty} q^{(\nu)} = \lim_{\nu \to \infty} \left( \lambda_1^{\nu} \prod_{i=1}^{\nu} \frac{1}{\sigma_i} \right) c_1 x_1. \tag{5.8}$$
The scale factors $\sigma_i$ are usually chosen so that $\|q^{(\nu)}\|_\infty = 1$ or $\|q^{(\nu)}\|_2 = 1$ for $\nu = 1, 2, 3, \ldots$, i.e., the vector $q^{(\nu)}$ is normalized to have unit norm; thus $\sigma_{\nu+1} = \|Aq^{(\nu)}\|_\infty$ or $\|Aq^{(\nu)}\|_2$, since $q^{(\nu+1)} = Aq^{(\nu)} / \sigma_{\nu+1}$. With either normalization, the limit in (5.8) exists; in fact,
$$\lim_{\nu \to \infty} q^{(\nu)} = \frac{x_1}{\|x_1\|}, \tag{5.9}$$
i.e., the sequence $q^{(\nu)}$ converges if $c_1 \ne 0$ to an eigenvector of unit length corresponding to the dominant eigenvalue of $A$.
If $q^{(0)}$ is chosen randomly, the probability that $c_1 \ne 0$ is close to one, but not one. However, even if the exact $q^{(0)}$ happens to have been chosen with $c_1 = 0$, rounding errors on the computer may still result in a component in direction $x_1$.
² This does not necessarily mean that there is only one eigenvalue in each disk; consider the matrix $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.
³ A simple eigenvalue is an eigenvalue corresponding to a root of multiplicity 1 of the characteristic equation.
Example 5.3
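We illustrate the method with the matrix $A$ from Example 5.1 and the following matlab dialog. We use $\sigma_{\nu+1} = \|Aq^{(\nu)}\|_\infty$, that is, each new iterate $Aq^{(\nu)}$ is divided by its component of largest absolute value: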
>> A = [1 2
-1 4]
A =
1 2
-1 4
>> q = rand(2,1)
q =
0.4057
0.9355
>> q = q/max(abs(q))
q =
0.4337
1.0000
>> q = A*q
q =
2.4337
3.5663
>> q = q/max(abs(q))
q =
0.6824
1.0000
.
.
.
q =
0.9784
1.0000
>> q = A*q
q =
2.9784
3.0216
>> q = q/max(abs(q))
q =
0.9857
1.0000
>> q = A*q
q =
2.9857
3.0143
>> q = q/max(abs(q))
q =
0.9905
1.0000
>> q = A*q
q =
2.9905
3.0095
>> max(abs(q))
ans =
3.0095
>>
One sees linear convergence to the eigenvalue $\lambda_1 = 3$ and corresponding eigenvector $v = (1, 1)^T$.
Consider again Eq. (5.7). Since by assumption $\lambda_2$ is the eigenvalue of second largest absolute magnitude, we see that, for $\nu$ sufficiently large,
$$\frac{q^{(\nu)} - \left( \lambda_1^{\nu} \prod_{i=1}^{\nu} \sigma_i^{-1} \right) c_1 x_1}{(\lambda_2 / \lambda_1)^{\nu}} \to k \quad \text{as } \nu \to \infty,$$
where $k$ is a constant vector. Hence,
$$q^{(\nu)} - \frac{x_1}{\|x_1\|} = O\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{\nu} \right). \tag{5.10}$$
That is, the rate of convergence of the sequence $q^{(\nu)}$ to the exact eigenvector is governed by the ratio $|\lambda_2 / \lambda_1|$. In practice, this ratio may be too close to 1, yielding a slow convergence rate. For instance, if $|\lambda_2 / \lambda_1| = 0.95$, then $|\lambda_2 / \lambda_1|^{\nu} \le 0.1$ only for $\nu \ge 45$, that is, it takes over 44 iterations to reduce the error in (5.10) by a factor of 10.
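The iteration count for a given ratio follows from solving $|\lambda_2 / \lambda_1|^{\nu} \le 0.1$ for $\nu$. A short sketch of that arithmetic, with variable names of our own choosing, is:

```matlab
ratio = 0.95;     % |lambda_2/lambda_1|
factor = 0.1;     % desired reduction of the error in (5.10)
% Smallest integer nu with ratio^nu <= factor
nu = ceil(log(factor)/log(ratio));
% Reduction actually achieved after nu and after nu-1 iterations
achieved = ratio^nu;
previous = ratio^(nu - 1);
```

Changing ratio to, say, 2/3 (the value for the matrix of Example 5.1) shows how much faster the error is reduced when the dominant eigenvalue is well separated.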
Example 5.4
In Example 5.3, the $\sigma_\nu$ are approximations to the dominant eigenvalue. According to our analysis, the convergence rate should be linear, with convergence factor $\lambda_2 / \lambda_1 = 2/3$. We test this with the following matlab dialog, where we begin the power method iteration with the last iterate $q^{(\nu)}$ we computed in Example 5.3.
>> q = q/max(abs(q))
q =
0.9937
1.0000
>> current = norm(q-[1;1])
current =
0.0063
>> old = current;
>> q = A*q;
>> q = q/max(abs(q))
q =
0.9958
1.0000
>> current = norm(q-[1;1])
current = 0.0042
>> current/old
ans = 0.6653
>> old = current;
>> q = A*q;
>> q = q/max(abs(q))
q =
0.9972
1.0000
>> current = norm(q-[1;1])
current = 0.0028
>> current/old
ans = 0.6657
>> old = current;
>> q = A*q;
>> q = q/max(abs(q))
q =
0.9981
1.0000
>> current = norm(q-[1;1])
current = 0.0019
>> current/old
ans = 0.6660
>>
On the other hand, the method is very simple, requiring $n^2$ multiplications to compute $Aq^{(\nu)}$ at each iteration. Also, if $A$ is sparse, the work is reduced, and only the nonzero elements of $A$ need to be stored.
REMARK 5.1 The matlab computations in the previous examples may
be put into the following matlab function:
function [lambda,v,n_iter,success] ...
= power_method (A, start_vector, max_iter, tol)
%
% [lambda,v,n_iter,success] = power_method (A, start_vector, max_iter, tol)
% returns an approximation to the dominant eigenvalue lambda
% and corresponding eigenvector v, starting with the column vector
% start_vector. The iteration stops when either max_iter is reached or the
% infinity norm of the difference between successive estimates for the
% eigenvector becomes less than tol.
%
% On return, n_iter is the number of iterations performed, and success is
% set to '1' ("true") if the difference between successive estimates has
% become less than tol, and is set to '0' ("false") otherwise.
q = start_vector;
if (norm(q,inf)==0)
   disp('Error in power_method.m: start_vector is the zero vector');
   n_iter=0;
   success=0;
   lambda = 0;
   v = start_vector;
   return
end
q = q/max(abs(q));
success=1;
for i=1:max_iter
   old_q = q;
   q = A*q;
   nu = max(abs(q)); % scale factor; approximates |lambda|
   q = q/nu;
   diff = norm(q-old_q,inf);
   if (diff < tol)
      lambda = nu;
      v = q;
      n_iter = i;
      return
   end
end
disp(sprintf('Tolerance tol = %12.4e was not met in power_method',tol));
disp(sprintf('within max_iter = %7.0f iterations.', max_iter));
disp(sprintf('Current difference diff: %12.4e',diff))
success=0 % (no semicolon: the value is echoed to the command window)
lambda = nu;
v = q;
n_iter = max_iter;
For example, this function (stored, say, in the user's current matlab directory as power_method.m) may be used as follows:
>> A = [1 2
-1 4]
A =
1 2
-1 4
>> format long
>> [lambda, v, n_iter, success] = power_method(A, rand(2,1), 100, 1e-15)
lambda =
3.000000000000003
v =
-0.999999999999998
-1.000000000000000
n_iter =
88
success =
1
>>
If the dominant eigenvalue of a real matrix $A$ is not real, then it is necessarily not unique, since its complex conjugate is also an eigenvalue of $A$ and has the same modulus. Furthermore, we would need to begin with a non-real starting vector to have a chance of converging to any eigenvalue.
If the dominant eigenvalue is unique but not simple, the power method will still converge. Suppose that $\lambda_1$ has multiplicity $r$ and has $r$ linearly independent eigenvectors. Then,
$$q^{(0)} = \sum_{j=1}^{r} c_j x_j + \sum_{j=r+1}^{n} c_j x_j.$$
The sequence $q^{(\nu)}$ will converge to the direction
$$\sum_{j=1}^{r} c_j x_j.$$
However, if the dominant eigenvalue is not unique, e.g.
$$A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad \lambda_1 = 1, \quad \lambda_2 = -\frac{1}{2} + \frac{\sqrt{3}}{2} i, \quad \lambda_3 = -\frac{1}{2} - \frac{\sqrt{3}}{2} i,$$
then the power method will fail to converge. This severely limits the applicability of the power method. Once a dominant eigenvalue and eigenvector have been found, a deflation technique may be applied to define a smaller matrix whose eigenvalues are the remaining eigenvalues of $A$. The power method is then applied to the smaller matrix. If all eigenvalues of $A$ are simple with different magnitudes, this procedure can be used to find all the eigenvalues.
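The failure for the $3 \times 3$ matrix above can be seen directly: its eigenvalues all have modulus 1, and since the matrix is a cyclic permutation, repeated multiplication merely cycles the components of the starting vector. The sketch below uses a starting vector chosen arbitrarily for illustration.

```matlab
A = [0 0 1; 1 0 0; 0 1 0];
lambda = eig(A);
moduli = abs(lambda);      % no single eigenvalue dominates in modulus
% Apply A three times to a starting vector
q0 = [1; 2; 3];
q1 = A*q0;
q2 = A*q1;
q3 = A*q2;                 % A^3 = I, so the iterates repeat with period 3
```

Because every iterate already has infinity norm 3, normalization only rescales the vectors, and the normalized iterates cycle with period 3 without converging.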
Example 5.5
Let
$$A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}.$$
The eigenvalues of $A$ are $i$ and $-i$, where $i$ is the imaginary unit, and corresponding eigenvectors are $(-i, 1)^T$ and $(i, 1)^T$. In the following matlab dialog, we illustrate that the simple power method does not converge, and that the matlab function eig can compute the eigenvalues and eigenvectors of $A$.
>> A = [0 1
-1 0]
A =
0 1
-1 0
>> start_vector = rand(2,1) + i*rand(2,1)
start_vector =
0.5252 + 0.6721i
0.2026 + 0.8381i
>> [lambda, v, n_iter, success] = power_method(A, start_vector, 1000, 1e-5)
Tolerance tol = 1.0000e-005 was not met in power_method
within max_iter = 1000 iterations.
Current difference diff: 1.9443e+000
success =
0
lambda =
1
v =
0.6090 + 0.7795i
0.2350 + 0.9720i
n_iter =
1000
success =
0
>> [P,Lambda] = eig(A)
P =
0.7071 0.7071
0 + 0.7071i 0 - 0.7071i
Lambda =
0 + 1.0000i 0
0 0 - 1.0000i
>> inv(P)*A*P
ans =
0 + 1.0000i 0
0 0 - 1.0000i
>>
The power method may sometimes be used with deﬂation to compute all
eigenvalues and eigenvectors of a matrix. See our text [1] for an explanation
of deﬂation, and for more details and additional convergence analysis of the
power method.
5.3 Other Methods for Eigenvalues and Eigenvectors
The simple power method is usually not used alone in modern software for eigenvalues and eigenvectors. Instead, sophisticated implementations of various methods are combined. Such methods include the inverse power method with deflation, the QR method with origin shifts, the Jacobi method, etc. The resulting software, such as that underneath the matlab functions eig and eigs or in the publicly available (free, open-source) LAPACK package [5], does not fail often. For more details of some of these methods, see our graduate-level text [1]. We outline several of them here.
5.3.1 The Inverse Power Method
The inverse power method has a faster rate of convergence than the power
method, and can be used to compute any eigenvalue, not just the dominant
one. Let $A$ have eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ corresponding to linearly
independent eigenvectors $x_1, x_2, \ldots, x_n$. (Here, the eigenvalues are not necessarily
ordered.) Then, the matrix $(A - \lambda I)^{-1}$ has eigenvalues $(\lambda_1 - \lambda)^{-1}, (\lambda_2 - \lambda)^{-1}, \ldots, (\lambda_n - \lambda)^{-1}$,
corresponding to eigenvectors $x_1, x_2, \ldots, x_n$. It can be shown
(see [1] and the references therein) that, if
$$|(\lambda_1 - \lambda)^{-1}| > |(\lambda_2 - \lambda)^{-1}| \ge |(\lambda_j - \lambda)^{-1}|, \quad 3 \le j \le n,$$
then
$$q^{(\nu)} = \frac{1}{(\lambda_1 - \lambda)^{\nu}} \left\{ c_1 x_1 + O\left( \left| \frac{\lambda_1 - \lambda}{\lambda_2 - \lambda} \right|^{\nu} \right) \right\}. \tag{5.11}$$
Thus, the iterates $q^{(\nu)}$ converge in direction to $x_1$. The error estimate for the
inverse power method analogous to (5.10) is
$$\left\| q^{(\nu)} - \frac{x_1}{\|x_1\|} \right\| = O\left( \left| \frac{\lambda - \lambda_1}{\lambda - \lambda_2} \right|^{\nu} \right), \tag{5.12}$$
where $q^{(\nu)}$ is normalized so that $\|q^{(\nu)}\| = 1$.
It can also be shown that, if $q$ is approximately an eigenvector of $A$, the
Rayleigh quotient
$$\frac{q^T A q}{q^T q}$$
is an approximation to the eigenvalue corresponding to $q$. Thus, in the inverse
power method, we can adjust $\lambda$ on each iteration by setting it to the Rayleigh
quotient.
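The idea of resetting the shift to the Rayleigh quotient can be sketched as follows. This is only an illustration under stated assumptions: the starting vector, the initial shift 1.5, the tolerance, and the variable names (shift, q, rq) are not from the text, and the matrix is the one from Example 5.1.

```matlab
% Sketch: inverse power iteration in which the shift is reset to the
% Rayleigh quotient on every pass. All numerical choices are assumptions.
A = [1 2; -1 4];            % matrix of Example 5.1 (eigenvalues 2 and 3)
q = [1; 0];                 % assumed starting vector
shift = 1.5;                % assumed initial shift, closer to 2 than to 3
for k = 1:20
    M = A - shift*eye(2);
    if rcond(M) < eps       % shift is numerically an eigenvalue: stop
        break
    end
    q = M \ q;              % one inverse power step (solve, do not invert)
    q = q / norm(q);
    rq = (q'*A*q)/(q'*q);   % Rayleigh quotient of the new iterate
    converged = abs(rq - shift) < 1e-10;
    shift = rq;             % adjust the shift
    if converged
        break
    end
end
fprintf('iterations: %d, eigenvalue estimate: %.12f\n', k, shift);
```

With these choices the estimate settles near the eigenvalue 2. The linear solve with the backslash operator is used here instead of forming the inverse explicitly, which is a design choice of this sketch rather than of the function listed in the text.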
Example 5.6
Let
$$A = \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix}$$
be as in Example 5.1. The eigenvalues of $A$ are $\lambda_1 = 3$ and $\lambda_2 = 2$, so the
eigenvalues of $(A - \mu I)^{-1}$ are $1/(2-\mu)$ and $1/(3-\mu)$. Suppose we have already
found $\lambda_1 = 3$ using the power method. Then, if we choose an initial $\lambda$ less
than $\lambda_2$, we will have $|\lambda - \lambda_2| < |\lambda - \lambda_1|$, and the inverse power method will
converge to $v^{(2)}$ and $\lambda_2$. We use the following matlab function:
function [lambda_1, v_1, success, n_iter] = inverse_power_method...
    (lambda, q0, A, tol, maxitr)
%
% [lambda_1, v_1, success, n_iter] = ...
%     inverse_power_method(lambda, q0, A, tol, maxitr)
% computes an eigenvalue and eigenvector of the matrix A, according to the
% inverse power method as described in Section 5.3 of the text.
% On entry:
%   lambda is the shift for inv(A - lambda I)
%   q0 is the initial guess for the eigenvector.
%   tol is the tolerance on successive eigenvalue approximations.
%   maxitr is the maximum number of iterations allowed.
n = size(A,1);
a = inv(A-lambda*eye(n));
q_nu = q0;
alam = 2*norm(q_nu,inf);   % (Initialize the approximate eigenvalue)
check = 1;
success = false;
for k=1:maxitr
    alam2 = alam;
    q_nu = a*q_nu;
    q_nu = q_nu/norm(q_nu,inf);
    alam = (q_nu'*a*q_nu)/(q_nu'*q_nu);  % (Update the approx. eigenvalue)
    check = abs(alam-alam2);  % (stop if successive eigenvalue
    if (check < tol)          %  approximations are close)
        success = true;
        break
    end
end
n_iter = k;
lambda_1 = lambda+(1/alam);  % (the eigenvalue of the original matrix)
v_1 = q_nu;
disp(sprintf(' %9.0f %15.4f %15.5f ',k,lambda,lambda_1));
With inverse_power_method.m, we have the following dialog:
>> A = [1 2
-1 4]
A =
1 2
-1 4
>> lambda = -1;
>> q0 = rand(2,1)
q0 =
0.4057
0.9355
>> format long
>> [lambda_1, v_1, success, n_iter] =...
inverse_power_method(lambda, q0, A, 1e-16, 1000)
118 -1.0000 2.00000
lambda_1 =
1.999999999999997
v_1 =
-1.000000000000000
-0.499999999999999
success =
1
n_iter =
118
>>
5.3.2 The QR Method
The QR method is an iterative method for reducing a matrix to triangular
form using orthogonal similarity transformations. The eigenvalues of the tri-
angular matrix then appear on the diagonal, and, by Theorem 5.2, must also
be the eigenvalues of the original matrix. The eigenvectors may be found by
transforming back the eigenvectors of the ﬁnal triangular matrix.
Computing eigenvalues and eigenvectors by the QR method involves two
steps:
1. reducing the matrix to an almost triangular form (called the Hessenberg
form), and
2. iteration by QR decompositions.
We apply origin shifts in the QR method similarly to how we applied such
shifts (i.e. replacing A by A −λI) in the inverse power method.
We give a detailed explanation of the QR method in [1].
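The two steps can be sketched with the built-in functions hess and qr. This is a bare illustration without origin shifts, not the algorithm used inside eig; the test matrix and the fixed iteration count are assumptions made for the sketch.

```matlab
% Sketch: Hessenberg reduction followed by unshifted QR iteration.
A = [2 -1 0; -1 2 -1; 0 -1 2];   % assumed test matrix (symmetric)
[P, H] = hess(A);                % step 1: A = P*H*P', with H Hessenberg
T = H;
for k = 1:200                    % step 2: iteration by QR decompositions
    [Q, R] = qr(T);
    T = R*Q;                     % R*Q = Q'*T*Q, an orthogonal similarity
end
approx = sort(diag(T));          % diagonal of the (nearly) triangular T
exact  = sort(eig(A));
disp([approx exact]);
```

Because each step is a similarity transformation, the diagonal of T approaches the eigenvalues of A. Without shifts the convergence is only linear, which is one reason practical implementations add origin shifts.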
5.3.3 Jacobi Diagonalization (Jacobi Method)
The Jacobi method for computing eigenvalues of a symmetric matrix is one
of the oldest numerical methods for the eigenvalue problem. It was replaced
by the QR algorithm as the method of choice in the 1960s. However, it is
making a comeback due to its adaptability to parallel computers [19, 40]. We
give a brief description of the Jacobi method in this section. Let $A^{(1)} = A$ be
an $n \times n$ symmetric matrix. The procedure consists of
$$A^{(k+1)} = N_k^H A^{(k)} N_k \tag{5.13}$$
where the $N_k$ are unitary matrices that eliminate the off-diagonal element of
largest modulus. It can be shown that if $a_{pq}^{(k)}$ is the off-diagonal element of
largest modulus, one transformation increases the sum of the squares of the
diagonal elements by $2a_{pq}^2$ and at the same time decreases the sum of the
squares of the off-diagonal elements by the same amount. Thus, $A^{(k)}$ tends
to a diagonal matrix as $k \to \infty$. Since $A^{(1)}$ is symmetric, $A^{(2)} = N_1^H A^{(1)} N_1$
is symmetric, so $A^{(3)}, A^{(4)}, \ldots$ are symmetric. Also, since $A^{(k+1)}$ is similar
to $A^{(k)}$, $A^{(k+1)}$ has the same eigenvalues as $A^{(k)}$, and hence has the same
eigenvalues as $A$.
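We now consider how to find $N_k$ such that the largest off-diagonal element
of $A^{(k)}$ is eliminated. Let
$$A^{(k)} = \begin{pmatrix} a_{11}^{(k)} & \cdots & a_{1n}^{(k)} \\ \vdots & & \vdots \\ a_{n1}^{(k)} & \cdots & a_{nn}^{(k)} \end{pmatrix},$$
and suppose that $|a_{pq}^{(k)}| \ge |a_{ij}^{(k)}|$ for $1 \le i, j \le n$, $i \ne j$. Let
$$N_k = \begin{pmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & \cos(\alpha_k) & \cdots & -\sin(\alpha_k) & & \\
 & & \vdots & \ddots & \vdots & & \\
 & & \sin(\alpha_k) & \cdots & \cos(\alpha_k) & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{pmatrix},$$
where the row containing $\cos(\alpha_k)$ and $-\sin(\alpha_k)$ is the $p$-th row, the row containing
$\sin(\alpha_k)$ and $\cos(\alpha_k)$ is the $q$-th row, these entries lie in the $p$-th and $q$-th columns,
respectively, and all other entries are those of the identity matrix.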
$N_k$ is a Givens transformation (also called a plane rotator or Jacobi rotator).
When $A^{(k+1)}$ is constructed, only rows $p$ and $q$ and columns $p$ and $q$ of $A^{(k+1)}$
are different from those of $A^{(k)}$. The choice for $\alpha_k$ is such that $a_{pq}^{(k+1)} = 0$.
That is, since
$$a_{pq}^{(k+1)} = \left(-a_{pp}^{(k)} + a_{qq}^{(k)}\right) \cos\alpha_k \sin\alpha_k + a_{pq}^{(k)} \left(\cos^2\alpha_k - \sin^2\alpha_k\right) = 0,$$
$\cos\alpha_k$ and $\sin\alpha_k$ are chosen so that
$$\cos^2\alpha_k = \frac{1}{2} + \frac{a_{pp}^{(k)} - a_{qq}^{(k)}}{2r}, \quad \sin^2\alpha_k = \frac{1}{2} - \frac{a_{pp}^{(k)} - a_{qq}^{(k)}}{2r}, \quad \text{and} \quad \sin\alpha_k \cos\alpha_k = \frac{a_{pq}^{(k)}}{r},$$
where $r^2 = \left(a_{pp}^{(k)} - a_{qq}^{(k)}\right)^2 + 4\left(a_{pq}^{(k)}\right)^2$.
In summary, the Jacobi computational algorithm consists of the following
steps, where the third step provides stability with respect to rounding errors:
(1) At step $k$, find $a_{pq}^{(k)}$ such that $p \ne q$ and $|a_{pq}^{(k)}| \ge |a_{ij}^{(k)}|$ for $1 \le i, j \le n$,
$i \ne j$.
(2) Set $r = \sqrt{\left(a_{pp}^{(k)} - a_{qq}^{(k)}\right)^2 + 4\left(a_{pq}^{(k)}\right)^2}$ and $t = 0.5 + \left(a_{pp}^{(k)} - a_{qq}^{(k)}\right)/(2r)$.
(3) Set chk $= a_{pp}^{(k)} - a_{qq}^{(k)}$.
IF chk $\ge 0$ THEN
set $c = \sqrt{t}$ and $s = a_{pq}^{(k)}/(rc)$,
ELSE
set $s = \sqrt{1-t}$ and $c = a_{pq}^{(k)}/(rs)$.
END IF
(4) Set
$$N_{i,j} = \begin{cases}
c & \text{if } i = p,\ j = p, \\
-s & \text{if } i = p,\ j = q, \\
s & \text{if } i = q,\ j = p, \\
c & \text{if } i = q,\ j = q, \\
1 & \text{if } i = j,\ i \ne p,\ i \ne q, \\
0 & \text{otherwise.}
\end{cases}$$
(5) Set $A^{(k+1)} = N^T A^{(k)} N$.
(6) Go to Step (1) until $|a_{pq}^{(k)}| < \varepsilon$.
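Steps (1) through (6) can be sketched in matlab as follows. This is a minimal illustration, assuming the matrix of Example 5.7 as input; the tolerance tol_offdiag, the iteration cap, and the variable names are choices made for the sketch and do not come from the text.

```matlab
% Sketch of the Jacobi steps (1)-(6); tolerances and names are assumptions.
A = [5 1 0; 1 5 2; 0 2 5];
n = size(A,1);
tol_offdiag = 1e-12;
for k = 1:100
    % Step (1): locate the off-diagonal element of largest modulus
    B = abs(A - diag(diag(A)));
    [amax, idx] = max(B(:));
    [p, q] = ind2sub([n n], idx);
    if amax < tol_offdiag            % Step (6): stopping test
        break
    end
    % Step (2)
    r = sqrt((A(p,p) - A(q,q))^2 + 4*A(p,q)^2);
    t = 0.5 + (A(p,p) - A(q,q))/(2*r);
    % Step (3): pick the branch that avoids dividing by a small number
    if A(p,p) - A(q,q) >= 0
        c = sqrt(t);
        s = A(p,q)/(r*c);
    else
        s = sqrt(1 - t);
        c = A(p,q)/(r*s);
    end
    % Step (4): plane rotation in the (p,q) plane
    N = eye(n);
    N(p,p) = c;  N(p,q) = -s;
    N(q,p) = s;  N(q,q) = c;
    % Step (5): similarity transformation
    A = N'*A*N;
end
jacobi_eigs = sort(diag(A));
disp(jacobi_eigs');
```

The diagonal of the final matrix approximates the eigenvalues, which for this input are 5 and 5 plus or minus the square root of 5. The sketch forms the full matrix N for clarity; an efficient code would update only rows and columns p and q.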
Example 5.7
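$$A = A^{(1)} = \begin{pmatrix} 5 & 1 & 0 \\ 1 & 5 & 2 \\ 0 & 2 & 5 \end{pmatrix}, \quad
N_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ 0 & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}.$$
Then,
$$A^{(2)} = N_1^H A^{(1)} N_1 = \begin{pmatrix} 5 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & 3 & 0 \\ \frac{1}{\sqrt{2}} & 0 & 7 \end{pmatrix}.$$
Notice that the sum of the squares of the diagonal elements of $A^{(2)}$ is 8 more
than the sum of the squares of the diagonal elements of $A^{(1)}$, and the sum of
the squares of the off-diagonal elements of $A^{(2)}$ is 8 less than the sum of the
squares of the off-diagonal elements of $A^{(1)}$.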
5.4 Applications
In our example in Section 3.7 (page 140), we used the dominant eigenvalue
of the matrix
$$A = \begin{pmatrix} 0.14 & 0 & 0.6 \\ 0.56 & 0.24 & 0 \\ 0 & 0.56 & 0.9 \end{pmatrix}$$
to estimate the long run behavior of the system defined in (3.42). To compute
the eigenvalues of $A$, we used the eig command in matlab as follows:
>> A=[0.14 0 0.6; 0.56 0.24 0; 0 0.56 0.9]
A =
0.1400 0 0.6000
0.5600 0.2400 0
0 0.5600 0.9000
>> [v,lambda]=eig(A)
v =
-0.1989 + 0.5421i -0.1989 - 0.5421i 0.4959
0.6989 0.6989 0.3160
-0.3728 - 0.1977i -0.3728 + 0.1977i 0.8089
lambda =
0.0806 + 0.4344i 0 0
0 0.0806 - 0.4344i 0
0 0 1.1188
We thus ﬁnd the dominant eigenvalue to be 1.1188. Now, let us compute the
dominant eigenvalue with an iterative method we have learned in this chapter.
Since what we need in this example is the dominant eigenvalue, we can ﬁrst
try the simple iterative method, the power method. Since A is nonnegative,
primitive and irreducible, A has a positive, simple, and strictly dominant
eigenvalue (see [4]). Thus, we know from Section 5.2 that the power method
will converge. Using the function power_method.m given in Remark 5.1, we
have the following dialog:
>> A
A =
0.1400 0 0.6000
0.5600 0.2400 0
0 0.5600 0.9000
>> format long
>> [lambda,v,n_iter,success]=power_method(A, rand(3,1),100,1e-10)
lambda =
1.11876441203147
v =
0.61301779326244
0.39065073581277
1.00000000000000
n_iter =
25
success =
1
The result shows that the dominant eigenvalue of A is approximately
1.11876441203147, which agrees with our previous result from Section 3.7.
(Based on the tolerance, we can only expect the first 10 digits of this to be correct.)
5.5 Exercises
1. If
$$A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix},$$
then
(a) Use the Gerschgorin theorem to bound the eigenvalues of A.
(b) Compute the eigenvalues and eigenvectors of A directly from the
definition, and compare with the results you obtained by using
Gerschgorin's circle theorem.
2. Consider
$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
(a) Compute the eigenvalues of A directly from the deﬁnition.
(b) Apply the Gerschgorin circle theorem to the matrix A.
(c) Why is A not a counterexample to the Gerschgorin theorem for
Hermitian matrices (Theorem 5.6)?
3. Let
$$A = \begin{pmatrix} 0 & \frac{1}{4} & \frac{1}{5} \\ -\frac{1}{4} & 0 & \frac{1}{3} \\ -\frac{1}{5} & -\frac{1}{3} & 0 \end{pmatrix}.$$
Show that the spectral radius $\rho(A) < 1$.
4. Let A be a strictly diagonally dominant matrix. Can zero be in the
spectrum of A?
5. Apply several iterations of the power method to the matrix from Prob-
lem 1 on page 208, using several diﬀerent starting vectors. Compare the
results to the results you obtained from Problem 1 on page 208.
6. Let $A$ be a real symmetric $n \times n$ matrix with dominant eigenvalues
$\lambda_1 = 1$ and $\lambda_2 = -1$. What would happen if we applied the power
method to $A$?
7. Apply several iterations of the inverse power method to the matrix from
Problem 1 on page 208, using several diﬀerent starting vectors, and
using the centers of the Gerschgorin circles as estimates for λ. Compare
the results to the results you obtained from Problem 5 on page 209.
8. Let $A$ be a $(2n+1) \times (2n+1)$ symmetric matrix with elements $a_{ij} = (1.5)^i$
if $i = j$ and $a_{ij} = (0.5)^{i+j-1}$ if $i \ne j$. Let the eigenvalues of $A$ be $\lambda_i$,
$i = 1, 2, \ldots, 2n+1$, ordered such that $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_{2n} \le \lambda_{2n+1}$.
We wish to compute the eigenvector $x_{n+1}$ associated with the middle
eigenvalue $\lambda_{n+1}$ using the inverse power method $q_r = (\lambda I - A)^{-1} q_{r-1}$ for
$r = 1, 2, \ldots$ Considering Gerschgorin's Theorem for symmetric matrices,
choose a value for $\lambda$ that would ensure rapid convergence. Explain how
you chose this value.
9. Apply one or more iterations of the Jacobi method to compute the
eigenvalues and eigenvectors of the matrix in Problem 2.
10. Apply one or more iterations of the inverse power method to compute
an eigenvalue of the matrix
$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
Hint: Since the eigenvalues are not real, you must choose a complex
initial guess for the eigenvalue. One possibility might be to choose this
randomly.
Chapter 6
Numerical Diﬀerentiation and
Integration
In this chapter, we study the fundamental problem of approximating integrals
and derivatives.
6.1 Numerical Diﬀerentiation
There are two common ways to develop approximations to derivatives, us-
ing Taylor’s formula or Lagrange interpolation. We derive the formulas with
Taylor’s formula here, while we also consider derivations using Lagrange in-
terpolation in [1].
6.1.1 Derivation of Formulas
Consider applying Taylor's formula for approximating derivatives. Suppose
that $f$ has two continuous derivatives, and we wish to approximate $f'$ at some
point $x_0$. By Taylor's formula,
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(\xi(x))$$
for some $\xi$ between $x$ and $x_0$. Thus, letting $x = x_0 + h$,
$$f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi).$$
Hence,
$$f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} + O(h) \quad \text{(forward-difference formula)}. \tag{6.1}$$
To obtain a better approximation, suppose that $f$ has three continuous derivatives,
and consider
$$\begin{aligned}
f(x_0 + h) &= f(x_0) + f'(x_0)h + f''(x_0)\frac{h^2}{2} + f'''(\xi_1)\frac{h^3}{6}, \\
f(x_0 - h) &= f(x_0) - f'(x_0)h + f''(x_0)\frac{h^2}{2} - f'''(\xi_2)\frac{h^3}{6}.
\end{aligned} \tag{6.2}$$
Subtracting the above two expressions and dividing by $2h$ gives
$$f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} + O(h^2) \quad \text{(central-difference formula)}. \tag{6.3}$$
Similarly, we can go out one more term in (6.2) (assuming $f$ has four continuous
derivatives). Adding the two resulting expressions and dividing by $h^2$
then gives
$$f''(x_0) = \frac{1}{h^2}\left[ f(x_0 - h) - 2f(x_0) + f(x_0 + h) \right] - \frac{h^2}{24}\left[ f^{(4)}(\xi_1) + f^{(4)}(\xi_2) \right].$$
Hence, using the Intermediate Value Theorem,
$$f''(x_0) = \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{h^2} - \frac{h^2}{12} f^{(4)}(\xi). \tag{6.4}$$
Example 6.1
$f(x) = x \ln x$. Estimate $f''(2)$ using (6.4) with $h = 0.1$. Doing so, we obtain
$$f''(2) \approx \frac{f(2.1) - 2f(2) + f(1.9)}{(0.1)^2} = 0.50021.$$
(Notice that $f''(2) = 1/2$, so the approximation is accurate.)
6.1.2 Error Analysis
One difficulty with numerical differentiation is that rounding error can be
large if $h$ is too small. In the computer, $f(x_0 + h) = \tilde{f}(x_0 + h) + e(x_0 + h)$
and $f(x_0) = \tilde{f}(x_0) + e(x_0)$, where $e(x_0 + h)$ and $e(x_0)$ are roundoff errors that
depend on the number of digits used by the computer.
Consider the forward-difference formula
$$f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi(x)).$$
We will assume that $|e(x)| \le \epsilon |f(x)|$ for some relative error $\epsilon$, that $|f(x)| \le M_0$
for some constant $M_0$, and that $|f''(x)| \le M_2$ for some constant $M_2$, for all
values of $x$ near $x_0$ that are being considered. Then, these assumed bounds
and repeated application of the triangle inequality give
$$\left| f'(x_0) - \frac{\tilde{f}(x_0 + h) - \tilde{f}(x_0)}{h} \right| \le \left| \frac{e(x_0 + h) - e(x_0)}{h} \right| + \frac{h}{2} M_2 \le \frac{2\epsilon M_0}{h} + \frac{h M_2}{2} = E(h), \tag{6.5}$$
where $\epsilon$ is any number such that $|e(x)| \le \epsilon |f(x)|$ for all $x$ under consideration.
That is, the error is bounded by a curve such as that in Figure 6.1. Thus, if
the value of $h$ is too small, the error can be large.
[FIGURE 6.1: Illustration of the total error (roundoff plus truncation)
bound in forward difference quotient approximation to $f'$. Horizontal axis: $h$;
vertical axis: error; labeled curves: $hM_2/2$ and $2\epsilon M_0/h$.]
If we use calculus to minimize the expression on the right of (6.5) with
respect to $h$, we obtain
$$h_{\mathrm{opt}} = \frac{2\sqrt{\epsilon M_0}}{\sqrt{M_2}},$$
with a minimal bound on the error of
$$E(h_{\mathrm{opt}}) = 2\sqrt{M_0 M_2}\,\sqrt{\epsilon}.$$
Although the right member of (6.5) is merely a bound, we see that $h_{\mathrm{opt}}$ gives
a good estimate for the optimal step in the divided difference, and $E(h_{\mathrm{opt}})$
gives a good estimate for the minimum achievable error. In particular, the
minimum achievable error is $O(\sqrt{\epsilon})$ and the optimal $h$ is also $O(\sqrt{\epsilon})$, both in
the estimates and in the numerical experiments in Example 6.2.
Example 6.2
We will use matlab to observe the behavior of the total error when we
approximate $\ln'(3)$ by evaluating $(\ln(3 + h) - \ln(3))/h$ using IEEE double
precision arithmetic. We use the following matlab functions:
function difference_table(f,fprime,x)
%
% Issuing the command
% difference_table('f','fprime',x)
% causes a table of difference quotients to be formed using the
% difference_quotient function.
for i=1:30
    h=4^(-i);
    value = difference_quotient(f,x,h);
    err = value - feval(fprime,x);  % (named err to avoid shadowing error)
    fprintf('%3d %12.2e %20.16f %12.2e\n', i, h, value, err);
end
clear value err

function [dq] = difference_quotient(f,x,h)
%
% difference_quotient (f,x,h) returns the forward difference quotient of
% f at x with stepsize h, as in formula (6.1) of the text.
fxph = feval(f,x+h);
fx = feval(f,x);
dq = (fxph-fx)/h;   % (named dq to avoid shadowing the built-in diff)

function [y] = logprime(x)
y = 1/x;
(These functions are available with more detailed in-line documentation at
http://interval.louisiana.edu/Classical-and-Modern-NA/#Chapter_6.)
With these functions, we have the following matlab dialog.
>> difference_table(’log’,’logprime’,3)
1 2.50e-001 0.3201708306941464 -1.32e-002
2 6.25e-002 0.3299085952437721 -3.42e-003
3 1.56e-002 0.3324682801346626 -8.65e-004
4 3.91e-003 0.3331165076408524 -2.17e-004
5 9.77e-004 0.3332790916319937 -5.42e-005
6 2.44e-004 0.3333197707015643 -1.36e-005
7 6.10e-005 0.3333299425394216 -3.39e-006
8 1.53e-005 0.3333324856357649 -8.48e-007
9 3.81e-006 0.3333331214380451 -2.12e-007
10 9.54e-007 0.3333332804031670 -5.29e-008
11 2.38e-007 0.3333333209156990 -1.24e-008
12 5.96e-008 0.3333333320915699 -1.24e-009
13 1.49e-008 0.3333333432674408 9.93e-009
14 3.73e-009 0.3333333730697632 3.97e-008
15 9.31e-010 0.3333334922790527 1.59e-007
16 2.33e-010 0.3333339691162109 6.36e-007
17 5.82e-011 0.3333358764648438 2.54e-006
18 1.46e-011 0.3333435058593750 1.02e-005
19 3.64e-012 0.3333740234375000 4.07e-005
20 9.09e-013 0.3334960937500000 1.63e-004
21 2.27e-013 0.3339843750000000 6.51e-004
22 5.68e-014 0.3359375000000000 2.60e-003
23 1.42e-014 0.3437500000000000 1.04e-002
24 3.55e-015 0.3750000000000000 4.17e-002
25 8.88e-016 0.5000000000000000 1.67e-001
26 2.22e-016 0.0000000000000000 -3.33e-001
27 5.55e-017 0.0000000000000000 -3.33e-001
28 1.39e-017 0.0000000000000000 -3.33e-001
29 3.47e-018 0.0000000000000000 -3.33e-001
30 8.67e-019 0.0000000000000000 -3.33e-001
>>
We see the minimum error occurs with $h \approx 5.96 \times 10^{-8}$, and the minimum
absolute error is about $1.24 \times 10^{-9}$.
To analyze this example, notice that $f''(x) = -\frac{1}{x^2}$ and $M_2 = \max |f''(\xi)| \approx \frac{1}{9}$,
and $M_0 \approx \ln(3) \approx 1$. Suppose that the error is $\frac{2\epsilon}{h} M_0 + \frac{h}{2} M_2$, so
$$e(h) = \frac{2\epsilon}{h} M_0 + \frac{h}{2} M_2.$$
The minimum error occurs at $e'(h) = 0$, which gives $h_{\mathrm{opt}} \approx \sqrt{36\epsilon}$. In matlab,
if we assume ln is evaluated with maximal accuracy, $\epsilon$ is the IEEE double
precision machine epsilon, namely $\epsilon \approx 2.22 \times 10^{-16}$. Thus, $h_{\mathrm{opt}} \approx 8.9 \times 10^{-8}$,
close to what we observed. The minimum error is predicted to be about
$$2\sqrt{1 \cdot \frac{1}{9}}\,\sqrt{2.22 \times 10^{-16}} \approx 10^{-8},$$
somewhat larger but within a factor of 10 of what was observed.
With higher-order formulas, we can obtain a smaller total error bound, at
the expense of additional complication. In particular, if the roundoff error is
$O(1/h)$ and the truncation error is $O(h^n)$, then the optimal $h$ is $O(\epsilon^{1/(n+1)})$
and the minimum achievable error bound is $O(\epsilon^{n/(n+1)})$.
6.2 Automatic (Computational) Diﬀerentiation
Numerical differentiation has been used extensively in the past, e.g. for computing
the derivative $f'$ for use in Newton's method (or the multidimensional
analog, as described in §8.1 on page 291). Another example of the
use of such derivative formulas is in the construction of methods for the solution
of boundary value problems in differential equations, as was illustrated
in Example 3.18, on page 93, while we were studying Cholesky factorizations.
However, as we have just seen (in §6.1.2 above), roundoff error limits the accuracy
of finite-difference approximations to derivatives. Moreover, it may be
difficult in practice to determine a step size $h$ for which near-optimal accuracy
can be attained. This can cause significant problems, for example, in
multivariate floating point Newton methods.
For complicated functions, algebraic computation of the derivatives by hand
is also impractical. One possible alternative (but not for Example 3.18) is to
compute the derivatives
with symbolic manipulation systems such as Mathematica, Maple, or Reduce.
These systems have facilities for output of the derivatives as statements in
common compiled programming languages. However, such systems are often
not able to adequately simplify the expressions for the derivatives, resulting in
expressions for derivatives that can be many times as long as the expressions
for the function itself. This “expression swell” not only can result in inefficient
evaluation, but also can cause roundoff error to be a problem, even though
there is no truncation error.
A third alternative is automatic differentiation, also called “computational
differentiation.” In this scheme, there is no truncation (method) error and the
expression for the function is not symbolically manipulated, yet the user only
need supply the expression for the function itself. The technique, increasingly
used during the two decades prior to composition of this book, is based upon
defining an arithmetic on composite objects, the components of which represent
function and derivative values. The rules of this arithmetic are based on
the elementary rules of differentiation learned in calculus, in particular, on
the chain rule.
6.2.1 The Forward Mode
In the “forward mode” of automatic differentiation, the derivative or derivatives
are computed at the same time as the function. For example, if the function
and the first $k$ derivatives are desired, then the arithmetic will operate
on objects of the form
$$u_\nabla = \langle u, u', u'', \ldots, u^{(k)} \rangle. \tag{6.6}$$
Addition of such objects comes from the calculus rule “the derivative of a sum
is the sum of the derivatives,” that is,
$$u_\nabla + v_\nabla = \langle u + v, u' + v', u'' + v'', \ldots, u^{(k)} + v^{(k)} \rangle. \tag{6.7}$$
In other words, the $j$-th component of $u_\nabla + v_\nabla$ is the $j$-th component of $u_\nabla$
plus the $j$-th component of $v_\nabla$, for $1 \le j \le k$. Subtraction is defined similarly,
while products $u_\nabla v_\nabla$ are defined such that the first component of $u_\nabla v_\nabla$ is
the first component of $u_\nabla$ times the first component of $v_\nabla$, etc., as follows:
$$u_\nabla v_\nabla = \left\langle uv,\ u'v + uv',\ u''v + 2u'v' + uv'',\ \ldots,\ \sum_{j=0}^{k} \binom{k}{j} u^{(k-j)} v^{(j)} \right\rangle. \tag{6.8}$$
Rules for applying functions such as “exp,” “sin,” and “cos” to such objects
are similarly defined. For example,
$$\sin(u_\nabla) = \langle \sin(u),\ u'\cos(u),\ -\sin(u)(u')^2 + \cos(u)u'',\ \ldots \rangle. \tag{6.9}$$
The differentiation object corresponding to a particular value $a$ of the independent
variable $x$ is of the form
$$x_\nabla = \langle a, 1, 0, \ldots, 0 \rangle.$$
Example 6.3
Suppose the context requires us to have values of the function, of the first
derivative, and of the second derivative for the function
$$f(x) = x\sin(x) - 1,$$
where we want function and derivative values at $x = \pi/4$. What steps would
the computer do to complete the automatic differentiation?
The computer would first resolve $f$ into a sequence of operations (sometimes
called a code list, tape, or decomposition into elementary operations).
If we associate the independent variable $x$ with the variable $v_1$ and the $i$-th
intermediate result with $v_{i+1}$, a sequence of operations for $f$ can be
$$\begin{aligned}
v_{\nabla 2} &\leftarrow \sin(v_{\nabla 1}) \\
v_{\nabla 3} &\leftarrow v_{\nabla 1} v_{\nabla 2} \\
v_{\nabla 4} &\leftarrow v_{\nabla 3} - 1
\end{aligned} \tag{6.10}$$
(We say “a sequence of operations for $f$ can be,” rather than “the sequence of operations
for $f$ is,” because, in general, decompositions for a particular expression are not unique.)
We now illustrate with 4-digit decimal arithmetic, with rounding to nearest.
We first set
$$v_{\nabla 1} \leftarrow \langle \pi/4, 1, 0 \rangle \approx \langle 0.7854, 1, 0 \rangle.$$
Second, we use (6.9) to obtain
$$\begin{aligned}
v_{\nabla 2} &\leftarrow \sin(\langle 0.7854, 1, 0 \rangle) \\
&\text{i.e. } \langle \sin(0.7854),\ 1 \times \cos(0.7854),\ -\sin(0.7854) \times (1^2) + \cos(0.7854) \times 0 \rangle \\
&\approx \langle 0.7071, 0.7071, -0.7071 \rangle.
\end{aligned}$$
Third, we use (6.8) to obtain
$$\begin{aligned}
v_{\nabla 3} &\leftarrow \langle 0.7854, 1, 0 \rangle \langle 0.7071, 0.7071, -0.7071 \rangle \\
&\text{i.e. } \langle 0.7854 \times 0.7071,\ 1 \times 0.7071 + 0.7854 \times 0.7071, \\
&\qquad 0 \times 0.7071 + 2 \times 1 \times 0.7071 + 0.7854 \times (-0.7071) \rangle \\
&\approx \langle 0.5554, 1.263, 0.8589 \rangle.
\end{aligned}$$
Finally, the second derivative object corresponding to the constant 1 is $\langle 1, 0, 0 \rangle$,
so we apply formula (6.7) to obtain
$$\begin{aligned}
v_{\nabla 4} &\leftarrow \langle 0.5554, 1.263, 0.8589 \rangle - \langle 1, 0, 0 \rangle \\
&\approx \langle -0.4446, 1.263, 0.8589 \rangle.
\end{aligned}$$
Comparing, we have
$$\begin{aligned}
f(\pi/4) &= (\pi/4)\sin(\pi/4) - 1 \approx -0.4446, \\
f'(x) &= x\cos(x) + \sin(x) \text{ so } f'(\pi/4) \approx 1.262, \\
f''(x) &= -x\sin(x) + 2\cos(x) \text{ so } f''(\pi/4) \approx 0.8589,
\end{aligned}$$
where the above values were computed to 16 digits, then rounded to four
digits. This illustrates the validity of automatic differentiation.
(The discrepancy between the values 1.263 and 1.262 for $f'(\pi/4)$ is due to the fact that
rounding to four digits was done after each operation in the automatic differentiation. If the
expression for $f'$ were first symbolically derived, then evaluated with four digit rounding
(rather than exactly, then rounding), then a similar error would occur.)
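The forward-mode arithmetic of Example 6.3 can be imitated with ordinary row vectors holding the triple (value, first derivative, second derivative). This is a bare sketch in full double precision rather than four-digit arithmetic; the helper names ad_mul and ad_sin are hypothetical and are not part of matlab or intlab.

```matlab
% Sketch: forward mode for f(x) = x*sin(x) - 1 with objects <u, u', u''>
% stored as 1-by-3 row vectors. Helper names are hypothetical.
ad_mul = @(u, v) [u(1)*v(1), ...
                  u(2)*v(1) + u(1)*v(2), ...
                  u(3)*v(1) + 2*u(2)*v(2) + u(1)*v(3)];   % product rule (6.8)
ad_sin = @(u) [sin(u(1)), ...
               u(2)*cos(u(1)), ...
               -sin(u(1))*u(2)^2 + cos(u(1))*u(3)];       % sine rule (6.9)
a  = pi/4;
v1 = [a, 1, 0];            % independent variable x at x = a
v2 = ad_sin(v1);           % v2 <- sin(v1)
v3 = ad_mul(v1, v2);       % v3 <- v1*v2
v4 = v3 - [1, 0, 0];       % v4 <- v3 - 1, using (6.7) with the constant 1
fprintf('f = %.4f, f'' = %.4f, f'''' = %.4f\n', v4(1), v4(2), v4(3));
```

The three components of v4 can be compared with the closed-form expressions for $f$, $f'$ and $f''$ given in the example. An operator-overloading implementation would hide the helper calls behind the usual * and sin syntax.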
6.2.2 The Reverse Mode
The reverse mode of automatic differentiation, when used to compute the
gradient of a function $f$ of $n$ variables, can be more efficient than the forward
mode. In particular, when the forward mode (or for that matter, when finite
differences or when symbolic derivatives) is used, the number of operations
required to compute the gradient is proportional to $n$ times the number of
operations to compute the function. In contrast, when the reverse mode is
used, it can be proven that the number of operations required to compute
the gradient $\nabla f$ (which has $n$ components) is bounded by 5 times the number
of operations required to evaluate $f$ itself, regardless of $n$. (However, a
quantity of numbers proportional to the number of operations required to
evaluate $f$ needs to be stored when the reverse mode is used.)
So, how does the reverse mode work? We can think of the reverse mode
as forming a system of equations relating the derivatives of the intermediate
variables in the computation through the chain rule, then solving the system
of equations for the derivative of the independent variable. Suppose we have
a code list such as (6.10), giving the sequence of instructions for evaluating a
function $f$. For example, one such operation could be
$$v_p = v_q + v_r,$$
where $v_p$ is the value to be computed, while $v_q$ and $v_r$ have previously been
computed. Then, computing $f'$ is equivalent to computing $v'_M$, where $v_M$
corresponds to the value of $f$. (That is, $v_M$ is the dependent variable, generally
the result of the last operation in the computation of the expression for $f$.)
We form a sparse linear system with an equation for each operation in the
code list, whose variables are $v'_k$, $1 \le k \le M$. For example, the equation
corresponding to an addition $v_p = v_q + v_r$ would be
$$v'_q + v'_r - v'_p = 0,$$
while the equation corresponding to a product $v_p = v_q v_r$ would be
$$v_r v'_q + v_q v'_r - v'_p = 0,$$
where the values of the intermediate quantities $v_q$ and $v_r$ have been previously
computed and stored from an evaluation of $f$. Likewise, if the operation were
$v_p = \sin(v_q)$, then the equation would be
$$\cos(v_q) v'_q - v'_p = 0,$$
while if the operation were addition of a constant, $v_p = v_q + c$, then the
equation would be
$$v'_q - v'_p = 0.$$
If there is a single independent variable and the derivative is with respect
to this variable, then the first equation would be
$$v'_1 = 1.$$
We illustrate with the $f$ for Example 6.3. If the code list is as in (6.10),
then the system of equations will be
$$\begin{pmatrix}
1 & 0 & 0 & 0 \\
\cos(v_1) & -1 & 0 & 0 \\
v_2 & v_1 & -1 & 0 \\
0 & 0 & 1 & -1
\end{pmatrix}
\begin{pmatrix} v'_1 \\ v'_2 \\ v'_3 \\ v'_4 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}. \tag{6.11}$$
If $v_1 = x = \pi/4$ as in Example 6.3, then this system, filled using four-digit
arithmetic, is
$$\begin{pmatrix}
1 & 0 & 0 & 0 \\
0.7071 & -1 & 0 & 0 \\
0.7071 & 0.7854 & -1 & 0 \\
0 & 0 & 1 & -1
\end{pmatrix}
\begin{pmatrix} v'_1 \\ v'_2 \\ v'_3 \\ v'_4 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}. \tag{6.12}$$
The reverse mode consists simply of solving this system with forward substitution.
This system has solution
$$\begin{pmatrix} v'_1 \\ v'_2 \\ v'_3 \\ v'_4 \end{pmatrix}
\approx
\begin{pmatrix} 1.0000 \\ 0.7071 \\ 1.2625 \\ 1.2625 \end{pmatrix}.$$
Thus $f'(\pi/4) = v'_4 \approx 1.2625$, which corresponds to what we obtained with
the forward mode.
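The system (6.11) is small enough to set up and solve directly. The sketch below does so in full double precision; the names L, rhs and dv are illustrative.

```matlab
% Sketch: build and solve system (6.11) for f(x) = x*sin(x) - 1 at x = pi/4.
v1 = pi/4;
v2 = sin(v1);                    % intermediate value stored while evaluating f
L = [1,        0,   0,  0;
     cos(v1), -1,   0,  0;
     v2,       v1, -1,  0;
     0,        0,   1, -1];
rhs = [1; 0; 0; 0];
dv = L \ rhs;                    % L is lower triangular: forward substitution
fprintf('f''(pi/4) is approximately %.4f\n', dv(4));
```

The last component of dv is the derivative of $f$, and the other components are the derivatives of the intermediate variables in the code list.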
Example 6.4
Suppose $f(x_1, x_2) = x_1^2 - x_2^2$. Compute
$$\nabla f(x_1, x_2) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right)^T$$
at $(x_1, x_2) = (1, 2)$ using the reverse mode.
Solution: A code list for this function can be
$$\begin{aligned}
v_1 &= x_1 \\
v_2 &= x_2 \\
v_3 &= v_1^2 \\
v_4 &= v_2^2 \\
v_5 &= v_3 - v_4
\end{aligned}$$
The reverse mode system of equations for computing $\partial f/\partial x_i$ is thus
$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
2v_1 & 0 & -1 & 0 & 0 \\
0 & 2v_2 & 0 & -1 & 0 \\
0 & 0 & 1 & -1 & -1
\end{pmatrix}
\begin{pmatrix} v'_1 \\ v'_2 \\ v'_3 \\ v'_4 \\ v'_5 \end{pmatrix}
= e_i, \tag{6.13}$$
where $e_i$ is the vector whose $i$-th component is 1 and all of whose other
components are 0. When $x_1 = 1$ and $x_2 = 2$, we have
$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
2 & 0 & -1 & 0 & 0 \\
0 & 4 & 0 & -1 & 0 \\
0 & 0 & 1 & -1 & -1
\end{pmatrix}
\begin{pmatrix} v'_1 \\ v'_2 \\ v'_3 \\ v'_4 \\ v'_5 \end{pmatrix}
= e_i.$$
Now, $\partial f/\partial x_1$ can be computed by ignoring the row and column corresponding
to $v'_2$, while $\partial f/\partial x_2$ can be computed by ignoring the row and column corresponding
to $v'_1$. We thus obtain $\partial f/\partial x_1 = 2$ and $\partial f/\partial x_2 = -4$ (Exercise 10).
In fact, a directional derivative can be computed in the reverse mode with
the same amount of work it takes to compute a single partial derivative. For
example, the directional derivative of $f(x_1, x_2)$ at $(x_1, x_2) = (1, 2)$ in the
direction of $u = (1/\sqrt{2}, 1/\sqrt{2})^T$ can be obtained by solving the linear system
$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
2 & 0 & -1 & 0 & 0 \\
0 & 4 & 0 & -1 & 0 \\
0 & 0 & 1 & -1 & -1
\end{pmatrix}
\begin{pmatrix} v'_1 \\ v'_2 \\ v'_3 \\ v'_4 \\ v'_5 \end{pmatrix}
=
\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \\ 0 \\ 0 \end{pmatrix} \tag{6.14}$$
for $v'_5$.
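The systems of Example 6.4 and (6.14) can be checked numerically. In this sketch the matrix is filled from the stored values $x_1 = 1$ and $x_2 = 2$, and only the right-hand side changes between the two partial derivatives and the directional derivative; the variable names are illustrative.

```matlab
% Sketch: partial and directional derivatives of f(x1,x2) = x1^2 - x2^2
% at (1, 2), obtained from the linear systems (6.13) and (6.14).
x1 = 1;  x2 = 2;
M = [1,    0,     0,  0,  0;
     0,    1,     0,  0,  0;
     2*x1, 0,    -1,  0,  0;
     0,    2*x2,  0, -1,  0;
     0,    0,     1, -1, -1];
E = eye(5);
d1 = M \ E(:,1);   df_dx1 = d1(5);     % right-hand side e_1
d2 = M \ E(:,2);   df_dx2 = d2(5);     % right-hand side e_2
u  = [1; 1]/sqrt(2);                   % direction vector
du = M \ [u; 0; 0; 0];                 % right-hand side of (6.14)
dir_deriv = du(5);
fprintf('df/dx1 = %g, df/dx2 = %g, directional derivative = %.4f\n', ...
        df_dx1, df_dx2, dir_deriv);
```

The directional derivative agrees with the inner product of the gradient $(2, -4)^T$ with $u$, which is $-\sqrt{2}$.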
6.2.3 Implementation of Automatic Diﬀerentiation
Automatic diﬀerentiation can be incorporated directly into the program-
ming language compiler, or the technology of operator overloading (available
in object-oriented languages) can be used. A number of packages are available
to do automatic diﬀerentiation. The best packages (such as ADOLC, for diﬀer-
entiating “C” programs and ADIFOR, for diﬀerentiating Fortran programs) can
accept the deﬁnition of the function f in the form of a fairly generally written
computer program. Some of them (such as ADIFOR) produce a new program
that will evaluate both the function and derivatives, while others (such as
ADOLC) produce a code list or “tape” from the original program, then operate
on the code list to produce the derivatives. The monograph [17] contains a
comprehensive overview of theory and implementation of both the forward
and backward modes.
Within matlab, intlab has a special gradient data type that provides the
forward mode of automatic diﬀerentiation through operator overloading. Here
is how we might do Example 6.3 (but with the ﬁrst derivative only), using
intlab.
>> x = gradientinit(pi/4)
gradient value x.x =
0.7854
gradient derivative(s) x.dx =
1
>> x*sin(x) - 1
gradient value ans.x =
-0.4446
gradient derivative(s) ans.dx =
1.2625
>>
Above, notice that we initialize a variable to be of type “gradient” (provided
by intlab) with the constructor gradientinit. The “gradient” type has
two or more components, corresponding to a value and the partial derivatives.
(There will be more than two components if the argument to gradientinit
is a vector with more than one component.) The value ans.x contains the
function value, while ans.dx contains the derivative value.
6.3 Numerical Integration
The problem throughout the remainder of this chapter is determining accurate
methods for approximating the integral $\int_a^b f(x)\,dx$. Approximating
integrals is called numerical integration or quadrature.
6.3.1 Introduction
Our goal is to approximate an integral
$$J(f) = \int_a^b f(x)\,dx \tag{6.15}$$
with quadrature formulas of the form
$$Q(f) = (b - a) \sum_{j=0}^{m} \alpha_j f(x_j), \tag{6.16}$$
where the $\alpha_0, \alpha_1, \ldots, \alpha_m$ are called weights and the $x_0, x_1, \ldots, x_m$ are the
sample or nodal points. We have
$$J(f) = Q(f) + E(f), \tag{6.17}$$
where $E(f)$ is the error in the quadrature formula.
To simplify derivation and use of the formulas, we derive the formulas over
the simple interval [a, b] = [−1, 1], then use a change of variables to apply
these formulas over arbitrary intervals [a, b]. Furthermore, the basic formulas
we so derive may not work well over intervals [a, b] that are wide in relation
to how fast the integrand f is varying. In such instances, we divide [a, b] into
subintervals, and apply the basic formula over each subinterval, eﬀectively
computing the integral as a sum of integrals.
As with numerical diﬀerentiation, we present the essentials and numerous
examples here, while we present alternative methods and derivations, as well
as a more complete analysis in [1].
6.3.2 Newton-Cotes Formulas
In the approximation
$$J(f) = \int_a^b f(x)\,dx \approx Q(f) = (b - a) \sum_{j=0}^{m} \alpha_j f(x_j), \tag{6.18}$$
the Newton–Cotes formulas are derived by setting the sample points $x_j$
beforehand to be equally spaced, then determining the weights to make the
formula exact for as high a degree polynomial as possible.
DEFINITION 6.1 The ($m+1$ point) open Newton–Cotes formulas have
points $x_j = x_0 + jh$, $j = 0, 1, 2, \ldots, m$, where $h = (b - a)/(m + 2)$ and
$x_0 = a + h$. The ($m + 1$ point) closed Newton–Cotes formulas have points
$x_j = x_0 + jh$, $j = 0, 1, \ldots, m$, where $h = (b - a)/m$ and $x_0 = a$. That is, the
sample points in the open formulas do not include the end points, whereas the
sample points in the closed formulas do.
Example 6.5
Suppose that $a = -1$, $b = 1$, and $m = 2$. Then, the open points are
$$x_0 = -0.5, \quad x_1 = 0, \quad\text{and}\quad x_2 = 0.5,$$
while the closed points are
$$x_0 = -1, \quad x_1 = 0, \quad\text{and}\quad x_2 = 1.$$
We now derive both the closed and open Newton–Cotes formulas with three
points. We obtain three equations for the three unknowns $w_i$ by matching
the first three powers of $x$. That is, we plug $f(x) \equiv 1$, $f(x) \equiv x$, and
$f(x) \equiv x^2$ into (6.18) and solve the resulting system for the $\alpha_i$. For
notational convenience, we set $w_i = (b-a)\alpha_i$, and solve for the $w_i$
rather than the $\alpha_i$.
Numerical Diﬀerentiation and Integration 223
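For the open formula, we obtain
$$\begin{aligned}
\int_{-1}^{1} 1\,dx &= 2 = w_0 + w_1 + w_2,\\
\int_{-1}^{1} x\,dx &= 0 = -0.5\,w_0 + 0.5\,w_2,\\
\int_{-1}^{1} x^2\,dx &= \frac{2}{3} = 0.25\,w_0 + 0.25\,w_2.
\end{aligned}$$
This is the system of equations
$$\begin{pmatrix} 1 & 1 & 1\\ -0.5 & 0 & 0.5\\ 0.25 & 0 & 0.25 \end{pmatrix}
\begin{pmatrix} w_0\\ w_1\\ w_2 \end{pmatrix} =
\begin{pmatrix} 2\\ 0\\ 2/3 \end{pmatrix},$$
whose solution is $w_0 = w_2 = 4/3$ and $w_1 = -2/3$. Hence, the open quadrature
formula is:
$$\int_{-1}^{1} f(x)\,dx \approx \frac{4}{3} f\left(-\frac{1}{2}\right) - \frac{2}{3} f(0) + \frac{4}{3} f\left(\frac{1}{2}\right).$$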
Using the same technique for the closed formula, we get
$$\int_{-1}^{1} f(x)\,dx \approx \frac{1}{3} f(-1) + \frac{4}{3} f(0) + \frac{1}{3} f(1).$$
This closed formula is called Simpson’s Rule. (You will do the computations
to derive Simpson’s rule in Exercise 1 on page 249.)
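The moment-matching systems above are small enough to solve numerically. The following matlab lines are a minimal sketch, not part of the book's software: the variable names (`moments`, `V_open`, `w_closed`, and so on) are our own, and `bsxfun` is used only so the power table is built the same way in older and newer matlab versions.

```matlab
% Three-point Newton-Cotes weights on [-1,1] by matching 1, x, and x^2.
k = (0:2)';                               % powers of x, one per equation
moments = (1 - (-1).^(k+1)) ./ (k+1);     % integrals of x^k over [-1,1]: [2; 0; 2/3]

x_open   = [-0.5, 0, 0.5];                % open sample points (Example 6.5)
x_closed = [-1,   0, 1  ];                % closed sample points (Example 6.5)

% Row i of each matrix holds x_j^(i-1), so V*w = moments is the fitting system.
V_open   = bsxfun(@power, x_open,   k);
V_closed = bsxfun(@power, x_closed, k);

w_open   = V_open   \ moments;            % weights of the open formula
w_closed = V_closed \ moments;            % weights of Simpson's rule
disp([w_open, w_closed])
```

The first column of the displayed matrix should reproduce the open weights 4/3, -2/3, 4/3, and the second the Simpson weights 1/3, 4/3, 1/3.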
6.3.3 Gaussian Quadrature
We have just seen that Newton–Cotes formulas can be derived by
(a) choosing the sample (or nodal) points $x_i$, $0 \le i \le m$, equidistant on
[a, b], and
(b) choosing the weights $w_i$, $0 \le i \le m$, so that numerical quadrature is
exact for the highest degree polynomial possible.
In Gaussian quadrature, the points and weights $x_i$ and $w_i$, $0 \le i \le m$, are
both chosen so that the quadrature formula is exact for the highest degree
polynomial possible. This results in the degree of precision for (m+1)-point
Gaussian quadrature being $2m+1$. Consider the following example.
Example 6.6
Take $J(f) = \int_{-1}^{1} f(x)\,dx$ and $m = 1$. We want to find $w_0$, $w_1$, $x_0$, and $x_1$
such that $Q(g) = J(g)$ for the highest degree polynomial possible. Letting
$g(x) = 1$, $g(x) = x$, $g(x) = x^2$, and $g(x) = x^3$, we obtain the following
nonlinear system:
$$\begin{aligned}
\int_{-1}^{1} 1\,dx &= 2 = w_0 + w_1,\\
\int_{-1}^{1} x\,dx &= 0 = w_0 x_0 + w_1 x_1,\\
\int_{-1}^{1} x^2\,dx &= \frac{2}{3} = w_0 x_0^2 + w_1 x_1^2,\\
\int_{-1}^{1} x^3\,dx &= 0 = w_0 x_0^3 + w_1 x_1^3.
\end{aligned}$$
Solving, we obtain $w_0 = w_1 = 1$, $x_0 = -1/\sqrt{3}$, $x_1 = 1/\sqrt{3}$, which are the
2-point Gaussian weights and points. The formula therefore is
$$\int_{-1}^{1} f(x)\,dx \approx f\left(-\frac{1}{\sqrt{3}}\right) + f\left(\frac{1}{\sqrt{3}}\right).$$
This formula, known as the 2-point Gauss–Legendre quadrature rule, is exact$^5$
by design when $f$ is a polynomial of degree 3 or less.
In Example 6.6, we had four unknowns $w_0$, $w_1$, $x_0$, and $x_1$, and we had
four conditions (fitting $1$, $x$, $x^2$, and $x^3$ to the formula exactly, leading to four
equations). Even though the equations are nonlinear in $x_0$ and $x_1$, the same
number of equations as unknowns allowed us to specify the unknowns. In
general, if we have $m+1$ points $\{x_i\}_{i=0}^{m}$, we will have $2(m+1)$ unknowns,
and will be able to fit $2m+2$ powers of $x$ exactly. That is, we can design the
formula to be exact for polynomials of degree $2m+1$ or less.
How might we determine the weights and sample points in Gaussian quadrature?
We now answer this question. Suppose we want to design a formula
with $m+1$ points $\{x_i\}_{i=0}^{m}$ such that
$$\int_{-1}^{1} f(x)\,dx \approx \sum_{i=0}^{m} w_i f(x_i) = 2\sum_{i=0}^{m} \alpha_i f(x_i) \tag{6.19}$$
is exact if $f = f_{2m+1}$ is a polynomial of degree $2m+1$ or less. Let $p_{m+1}$ be a
polynomial of degree $m+1$ with the following properties:
1. The roots of $p_{m+1}$ are $x_0$ through $x_m$, that is, $p_{m+1}(x_i) = 0$,
$0 \le i \le m$.
2. $\int_{-1}^{1} p_{m+1}(x)\,q_m(x)\,dx = 0$ whenever $q_m$ is a polynomial of degree
$m$ or less.
(6.20)
$^5$ that is, has no approximation error
Then, by long division of polynomials, we may write
$$f_{2m+1}(x) = p_{m+1}(x)\,q_m(x) + r_m(x),$$
where $q_m$ is the quotient polynomial, of degree $m$ or less, and $r_m$ is the
remainder polynomial, also of degree $m$ or less. Plugging this into (6.19)
gives
$$\begin{aligned}
\int_{-1}^{1} f_{2m+1}(x)\,dx &= \int_{-1}^{1} p_{m+1}(x)\,q_m(x)\,dx + \int_{-1}^{1} r_m(x)\,dx\\
&= \int_{-1}^{1} r_m(x)\,dx\\
&= \sum_{i=0}^{m} w_i\,p_{m+1}(x_i)\,q_m(x_i) + \sum_{i=0}^{m} w_i\,r_m(x_i)\\
&= \sum_{i=0}^{m} w_i\,r_m(x_i).
\end{aligned}$$
Thus, if the $x_i$ are chosen according to (6.20), all we need to do is choose the
weights $w_i$ (or $\alpha_i$) to make the formula exact for polynomials of degree $m$ or
less, since then,
$$\int_{-1}^{1} r_m(x)\,dx = \sum_{i=0}^{m} w_i\,r_m(x_i).$$
We can compute such $x_i$ with a special technique we illustrate in the following
example.
Example 6.7
Suppose we want to derive $x_0$ and $x_1$ in 2-point Gaussian quadrature, as
in Example 6.6. Matching the integral with powers of $x$, we get the four
equations in Example 6.6 (with $w_i = 2\alpha_i$). Let’s assume the polynomial with
roots $x_0$ and $x_1$ is of the form $p_2(x) = x^2 + c_1 x + c_0$. Then, if we take $c_0$
times the first equation plus $c_1$ times the second equation plus 1 times the
third equation, we obtain
$$2c_0 + \frac{2}{3} = 2\alpha_0\,(c_0 + c_1 x_0 + x_0^2) + 2\alpha_1\,(c_0 + c_1 x_1 + x_1^2) = 0,$$
since each expression in parentheses is $p_2$ evaluated at one of its roots.
Similarly, taking $c_0$ times the second equation plus $c_1$ times the third equation
plus 1 times the fourth equation gives
$$\frac{2}{3} c_1 = 2\alpha_0 x_0\,(c_0 + c_1 x_0 + x_0^2) + 2\alpha_1 x_1\,(c_0 + c_1 x_1 + x_1^2) = 0.$$
We thus get the following system of two linear equations in the two unknowns
$c_0$ and $c_1$:
$$\begin{pmatrix} 2 & 0\\ 0 & 2/3 \end{pmatrix}
\begin{pmatrix} c_0\\ c_1 \end{pmatrix} =
\begin{pmatrix} -2/3\\ 0 \end{pmatrix},$$
giving $c_0 = -1/3$, $c_1 = 0$, and
$$p_2(x) = x^2 - \frac{1}{3},$$
with roots $x_0 = -1/\sqrt{3}$ and $x_1 = 1/\sqrt{3}$. We then plug $x_0$ and $x_1$ into the
first two equations to obtain
$$2\begin{pmatrix} 1 & 1\\ -1/\sqrt{3} & 1/\sqrt{3} \end{pmatrix}
\begin{pmatrix} \alpha_0\\ \alpha_1 \end{pmatrix} =
\begin{pmatrix} 2\\ 0 \end{pmatrix},$$
giving $\alpha_0 = \alpha_1 = 1/2$.
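We can diagram the linear combinations of the fitting equations we take to
obtain the coefficients of $p_{m+1}$ as follows. The last two columns give the
multiplier of each equation in the first and in the second combination:
$$\begin{array}{lcc}
\int_{-1}^{1} 1\,dx = 2 = 2\alpha_0 + 2\alpha_1 & c_0 & \\[4pt]
\int_{-1}^{1} x\,dx = 0 = 2\alpha_0 x_0 + 2\alpha_1 x_1 & c_1 & c_0\\[4pt]
\int_{-1}^{1} x^2\,dx = \frac{2}{3} = 2\alpha_0 x_0^2 + 2\alpha_1 x_1^2 & 1 & c_1\\[4pt]
\int_{-1}^{1} x^3\,dx = 0 = 2\alpha_0 x_0^3 + 2\alpha_1 x_1^3 & & 1
\end{array}$$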
In fact, the technique illustrated in Example 6.7 works for general $m$, to
reduce finding the $x_i$ in (m+1)-point Gaussian quadrature to finding the
zeros of a degree $m+1$ polynomial.
A more sophisticated way of computing the $x_i$ is through the theory of
orthogonal polynomials. We explain orthogonal polynomials in the book [1]
for our second course in numerical analysis. In particular, the polynomials
$p_{m+1}$ constructed as in Example 6.7 are, to within a constant, the Legendre
polynomials of degree $m+1$. Similarly, (m+1)-point Gaussian quadrature
rules to compute $\int_{-1}^{1} f(x)\,dx$ exactly when $f$ is a polynomial of degree $2m+1$
or less are termed Gauss–Legendre quadrature rules.
Sample points and weights for Gauss–Legendre quadrature for various m
appear in Table 6.1.
TABLE 6.1: Weights and sample points: Gauss–Legendre quadrature
1 point (m = 0): $\alpha_0 = 1$, $z_0 = 0$ (midpoint rule)
2 point (m = 1): $\alpha_0 = \alpha_1 = 1/2$, $z_0 = -\frac{1}{\sqrt{3}}$, $z_1 = \frac{1}{\sqrt{3}}$
3 point (m = 2): $\alpha_0 = \frac{5}{18}$, $\alpha_1 = \frac{8}{18}$, $\alpha_2 = \frac{5}{18}$,
    $z_0 = -\sqrt{\frac{3}{5}}$, $z_1 = 0$, $z_2 = \sqrt{\frac{3}{5}}$
4 point (m = 3): $\alpha_0 = \frac{1}{4} - \frac{1}{6\sqrt{4.8}}$, $\alpha_1 = \frac{1}{4} + \frac{1}{6\sqrt{4.8}}$, $\alpha_2 = \alpha_1$, $\alpha_3 = \alpha_0$,
    $z_0 = -\sqrt{\frac{3+\sqrt{4.8}}{7}}$, $z_1 = -\sqrt{\frac{3-\sqrt{4.8}}{7}}$, $z_2 = -z_1$, $z_3 = -z_0$.
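As a quick sanity check on the tabulated values, the sketch below tests numerically that each (m+1)-point rule, in the form $Q(f) = 2\sum_i \alpha_i f(z_i)$, reproduces the integrals of $1, x, \ldots, x^{2m+1}$ over [-1, 1]. The cell arrays `alphas` and `nodes` and the vector `max_err` are hypothetical names chosen for this illustration.

```matlab
% Check the degree of precision 2m+1 of the Gauss-Legendre rules in Table 6.1.
s = sqrt(4.8);
alphas = {1, [1 1]/2, [5 8 5]/18, ...
          [1/4 - 1/(6*s), 1/4 + 1/(6*s), 1/4 + 1/(6*s), 1/4 - 1/(6*s)]};
nodes  = {0, [-1 1]/sqrt(3), [-1 0 1]*sqrt(3/5), ...
          [-sqrt((3+s)/7), -sqrt((3-s)/7), sqrt((3-s)/7), sqrt((3+s)/7)]};

max_err = zeros(1, 4);                 % worst monomial error for each rule
for m = 0:3
    alpha = alphas{m+1};
    z     = nodes{m+1};
    for k = 0:(2*m + 1)
        exact = (1 - (-1)^(k+1)) / (k+1);   % integral of x^k over [-1,1]
        Q     = 2 * sum(alpha .* z.^k);     % the rule applied to x^k
        max_err(m+1) = max(max_err(m+1), abs(exact - Q));
    end
end
disp(max_err)
```

All four entries of `max_err` should be at the level of roundoff error.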
6.3.4 More General Integrals
The techniques for deriving formulas in the previous sections apply for more
general integrals of the form
$$J(f) = \int_a^b \rho(x) f(x)\,dx, \tag{6.21}$$
where $\rho$ is not necessarily equal to 1 and where $a$, $b$, or both might be infinite.
Example 6.8
Suppose we want a quadrature rule of the form
$$\int_0^\infty e^{-x} f(x)\,dx \approx w_0 f(x_0) + w_1 f(x_1)$$
that is exact when $f$ is a polynomial of degree 3 or less. Matching powers of
$x$ as in Example 6.7 gives
$$\begin{aligned}
\int_0^\infty 1\,e^{-x}\,dx &= 1 = w_0 + w_1,\\
\int_0^\infty x\,e^{-x}\,dx &= 1 = w_0 x_0 + w_1 x_1,\\
\int_0^\infty x^2 e^{-x}\,dx &= 2 = w_0 x_0^2 + w_1 x_1^2,\\
\int_0^\infty x^3 e^{-x}\,dx &= 6 = w_0 x_0^3 + w_1 x_1^3.
\end{aligned}$$
Taking the same linear combinations of these equations as in Example 6.7 gives
$$\begin{aligned}
c_0 + c_1 + 2 &= 0,\\
c_0 + 2c_1 + 6 &= 0,
\end{aligned}$$
whence $c_1 = -4$, $c_0 = 2$, and
$$p_2(x) = x^2 - 4x + 2,$$
so $x_0 = 2 - \sqrt{2}$ and $x_1 = 2 + \sqrt{2}$. Plugging these into the equations matching
$f(x) \equiv 1$ and $f(x) \equiv x$ gives
$$\begin{aligned}
1 &= w_0 + w_1,\\
1 &= (2 - \sqrt{2})\,w_0 + (2 + \sqrt{2})\,w_1,
\end{aligned}$$
whence $w_0 = (\sqrt{2} + 1)/(2\sqrt{2})$, $w_1 = (\sqrt{2} - 1)/(2\sqrt{2})$, and
$$\int_0^\infty e^{-x} f(x)\,dx \approx \frac{\sqrt{2}+1}{2\sqrt{2}}\, f(2 - \sqrt{2}) + \frac{\sqrt{2}-1}{2\sqrt{2}}\, f(2 + \sqrt{2}).$$
This is known as the 2-point Gauss–Laguerre quadrature formula. In general,
Gaussian formulas that approximate integrals with $a = 0$, $b = \infty$, and
$\rho(x) = e^{-x}$ with $m+1$ points are known as Gauss–Laguerre formulas. The
corresponding polynomials $p_{m+1}$ are known as Laguerre polynomials.
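The steps of the moment technique (moments, then the coefficients of $p_2$, then its roots, then the weights) translate directly into a few lines of matlab. The following is a sketch under our own naming (`mu`, `A`, `c`, `x`, `w` are not from the book's software); it reproduces the 2-point Gauss–Laguerre rule, and replacing `mu` by the moments of another weight function should give the corresponding 2-point rule, provided the resulting roots are real and distinct.

```matlab
% Two-point Gaussian rule from the first four moments of the weight function.
mu = [1, 1, 2, 6];                  % integrals of x^k e^{-x} over [0,inf), k = 0..3

% c0*(eq. 1) + c1*(eq. 2) + (eq. 3) = 0 and c0*(eq. 2) + c1*(eq. 3) + (eq. 4) = 0:
A = [mu(1), mu(2);
     mu(2), mu(3)];
c = A \ (-[mu(3); mu(4)]);          % c = [c0; c1]

x = sort(roots([1, c(2), c(1)]));   % zeros of p2(x) = x^2 + c1*x + c0

% Weights from the equations matching f = 1 and f = x.
w = [1, 1; x.'] \ [mu(1); mu(2)];
disp([x, w])
```

Each row of the displayed matrix pairs a sample point with its weight: $2 \mp \sqrt{2}$ with $(\sqrt{2} \pm 1)/(2\sqrt{2})$.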
Example 6.9
Suppose $a = -\infty$, $b = \infty$, $\rho(x) = e^{-x^2}$, and we wish to derive a 2-point
Gauss formula. That is, we seek an approximation of the form
$$\int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx \approx w_0 f(x_0) + w_1 f(x_1).$$
Proceeding as in Example 6.8, we have
$$\begin{aligned}
\int_{-\infty}^{\infty} 1\,e^{-x^2}\,dx &= \sqrt{\pi} = w_0 + w_1,\\
\int_{-\infty}^{\infty} x\,e^{-x^2}\,dx &= 0 = w_0 x_0 + w_1 x_1,\\
\int_{-\infty}^{\infty} x^2 e^{-x^2}\,dx &= \frac{\sqrt{\pi}}{2} = w_0 x_0^2 + w_1 x_1^2,\\
\int_{-\infty}^{\infty} x^3 e^{-x^2}\,dx &= 0 = w_0 x_0^3 + w_1 x_1^3.
\end{aligned}$$
Continuing as in Example 6.8, we obtain $c_0 = -1/2$, $c_1 = 0$, and from them
obtain $x_0 = -1/\sqrt{2}$, $x_1 = 1/\sqrt{2}$, then $w_0 = w_1 = \sqrt{\pi}/2$. (You will fill in the
details in Exercise 4.) The formula is thus
$$\int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx \approx \frac{\sqrt{\pi}}{2}\, f\left(-\frac{1}{\sqrt{2}}\right) + \frac{\sqrt{\pi}}{2}\, f\left(\frac{1}{\sqrt{2}}\right).$$
This is known as the 2-point Gauss–Hermite quadrature formula, and Gaussian
formulas that approximate integrals with $a = -\infty$, $b = \infty$, and
$\rho(x) = e^{-x^2}$ with $m+1$ points are known as Gauss–Hermite formulas. The
corresponding orthogonal polynomials $p_{m+1}$ are known as Hermite polynomials.
Gauss–Laguerre and Gauss–Hermite formulas are useful for integrating
smooth functions over semi-inﬁnite and inﬁnite intervals, respectively. Occa-
sionally, an application requires a derivation of a special formula with diﬀerent
a, b, and ρ. An example of this is in Exercise 5.
6.3.5 Error Terms
So far, we have seen error terms for approximation of a function and deriva-
tive, as indicated in Table 6.2.
All of these approximations are of the form
$$\{\text{Exact value}\} = \{\text{Approximate value}\} + \{\text{Error term}\}$$
or
$$\bigl|\{\text{Exact value}\} - \{\text{Approximate value}\}\bigr| \le \{\text{Error bound}\}.$$
TABLE 6.2: Error terms seen so far
Item approximated | Approximation method | Formula for error | Reference
function $f$ | Taylor polynomial | $\dfrac{f^{(n+1)}(\xi(x))\,(x-x_0)^{n+1}}{(n+1)!}$ | Taylor’s Theorem, on page 3 and page 145
$f$ | polynomial interpolation | $\dfrac{f^{(n+1)}(\xi(x)) \prod_{j=0}^{n} (x-x_j)}{(n+1)!}$ | Formula (4.7) on page 154
$f$ | piecewise linear polynomial interpolation | $\dfrac{1}{8} h^2 \max_{x\in[a,b]} |f''(x)|$ | Theorem 4.4 on page 162
$f$ | clamped boundary cubic spline interpolation | $\dfrac{5}{384} h^4 \max_{x\in[a,b]} \left|\dfrac{d^4 f}{dx^4}(x)\right|$ | Theorem 4.6 on page 168
$f'$ | forward difference quotient | $-\dfrac{h}{2} f''(\xi)$ | Above (6.1) on page 211
$f''$ | central difference quotient | $-\dfrac{h^2}{12} f^{(4)}(\xi)$ | Formula (6.4) on page 212
We now explain how to compute error bounds when
$$\{\text{Exact value}\} = J(f) = \int_a^b \rho(x) f(x)\,dx$$
and
$$\{\text{Approximate value}\} = Q(f) = \left\{\begin{array}{l}
\text{1. a Newton–Cotes quadrature rule,}\\
\text{2. a Gaussian quadrature rule, or}\\
\text{3. a special quadrature rule.}
\end{array}\right\}$$
In fact, we can use a general approach given in [18, Chapter 16].
THEOREM 6.1
For the usual Newton–Cotes, Gauss, Gauss–Laguerre, and Gauss–Hermite
formulas with positive weights, we have
$$J(f) = Q(f) + \frac{E_\mu}{\mu!}\, f^{(\mu)}(\xi)$$
for some $\xi \in [a, b]$, where $J(f) = Q(f)$ when $f$ is a polynomial of degree $\mu - 1$
or less, but $J(x^\mu) \ne Q(x^\mu)$, and where
$$E_\mu = J(x^\mu) - Q(x^\mu).$$
(For the Newton–Cotes formulas with $m$ odd, $\mu = m+1$, for the Newton–Cotes
formulas with $m$ even, $\mu = m+2$, and for Gaussian formulas, $\mu = 2m+2$.)
PROOF The proof depends on the influence function $G(s)$ for the formula
being of constant sign. See [18, Chapter 16].
Example 6.10
Suppose we want to find the error term in Simpson’s rule (derived in
Example 6.5, starting on page 222). We designed the formula to be exact for
$f(x) = 1$, $x$, and $x^2$, so $\mu \ge 3$. In fact
$$\int_{-1}^{1} x^3\,dx = 0 = \frac{1}{3}(-1)^3 + \frac{4}{3}(0)^3 + \frac{1}{3}(1)^3,$$
so $\mu \ge 4$. We compute
$$E_4 = \int_{-1}^{1} x^4\,dx - \left(\frac{1}{3}(-1)^4 + \frac{4}{3}(0)^4 + \frac{1}{3}(1)^4\right) = \frac{2}{5} - \frac{2}{3} = -\frac{4}{15} \ne 0.$$
Thus, the multiplying factor for the error term is $-4/(15 \cdot 4!) = -1/90$, and
$$\int_{-1}^{1} f(x)\,dx = \frac{1}{3} f(-1) + \frac{4}{3} f(0) + \frac{1}{3} f(1) - \frac{1}{90} f^{(4)}(\xi)$$
for some $\xi \in [-1, 1]$.
Example 6.11
Let’s compute the error in the 2-point Gauss–Legendre formula, given in
Example 6.6 (starting on page 223). The formula was designed to be exact
when $f(x) = x^k$ for $k \le 3$, so we try $\mu = 4$, that is, $f(x) = x^4$:
$$J(f) - Q(f) = \int_{-1}^{1} x^4\,dx - \left(f\left(-\frac{1}{\sqrt{3}}\right) + f\left(\frac{1}{\sqrt{3}}\right)\right) = \frac{2}{5} - \frac{2}{9} = \frac{8}{45} \ne 0.$$
Thus, $\mu = 4$, $E_4 = 8/45$, the multiplying factor is $E_4/4! = 1/135$, and
$$\int_{-1}^{1} f(x)\,dx = f\left(-\frac{1}{\sqrt{3}}\right) + f\left(\frac{1}{\sqrt{3}}\right) + \frac{1}{135} f^{(4)}(\xi)$$
for some $\xi \in [-1, 1]$.
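The computations in Examples 6.10 and 6.11 can be mechanized: raise the power $k$ until the rule no longer integrates $x^k$ exactly, then divide the discrepancy by $k!$. The sketch below does this for Simpson's rule and the 2-point Gauss–Legendre rule on [-1, 1]. The names `rule_w`, `rule_x`, `mu`, and `K` are ours, and the threshold `1e-12` for deciding that a monomial is integrated "exactly" is an assumption suited to these small examples.

```matlab
% Find mu and the error constant K = E_mu/mu! of Theorem 6.1 for rules on [-1,1].
J = @(k) (1 - (-1)^(k+1)) / (k+1);        % exact integral of x^k over [-1,1]

rule_w = {[1 4 1]/3, [1 1]};              % Simpson, 2-point Gauss-Legendre
rule_x = {[-1 0 1],  [-1 1]/sqrt(3)};

mu = zeros(1, 2);
K  = zeros(1, 2);
for r = 1:2
    w = rule_w{r};
    x = rule_x{r};
    k = 0;
    while abs(J(k) - sum(w .* x.^k)) < 1e-12   % rule still exact for x^k
        k = k + 1;
    end
    mu(r) = k;
    K(r)  = (J(k) - sum(w .* x.^k)) / factorial(k);
end
disp([mu; K])
```

For both rules this should report $\mu = 4$, with constants $-1/90$ and $1/135$, matching the two examples.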
We summarize these error terms for quadrature formulas in Table 6.3.
TABLE 6.3: Some quadrature formula error terms
Formula | Formula for error | Reference
Trapezoidal rule (2-point closed Newton–Cotes) | $-\dfrac{2}{3} f''(\xi)$ | Exercise 2 on page 249
Simpson’s Rule (3-point closed Newton–Cotes) | $-\dfrac{1}{90} f^{(4)}(\xi)$ | Example 6.10 on page 231
1-point Gauss–Legendre (midpoint rule) | $\dfrac{1}{3} f^{(2)}(\xi)$ | Table 6.1 on page 227
2-point Gauss–Legendre | $\dfrac{1}{135} f^{(4)}(\xi)$ | Example 6.11 on page 231
2-point Gauss–Laguerre | $\dfrac{1}{6} f^{(4)}(\xi)$ | Exercise 6 on page 250
2-point Gauss–Hermite | $\dfrac{\sqrt{\pi}}{48} f^{(4)}(\xi)$ | Exercise 7 on page 250
Higher-order Newton–Cotes and Gaussian formulas, along with their error
terms, are available in published tables, on the web, and in software. When
computing error terms for specially derived formulas, care must be taken to
assure the conditions under which Theorem 6.1 is true are satisﬁed, or else
use other methods, such as those in our second text [1]. In particular, one
should study [18, Chapter 16] and determine if the inﬂuence function G(s) for
the formula is of constant sign.
6.3.6 Changes of Variables
To use a quadrature rule with error term over an interval [a, b] other than
[−1, 1], that is, to evaluate an integral of the form
$$\int_{t=a}^{t=b} f(t)\,dt,$$
we may use the change of variables
$$t(x) = a + \frac{b-a}{2}(x+1), \qquad dt = t'(x)\,dx = \frac{b-a}{2}\,dx. \tag{6.22}$$
We thus have
$$\int_a^b f(t)\,dt = \frac{b-a}{2} \int_{-1}^{1} f\left(a + \frac{b-a}{2}(x+1)\right) dx. \tag{6.23}$$
We use this change of variables both in the formula and the error term.
Example 6.12
Suppose we want to apply Simpson’s rule over a small interval
$[x_i, x_{i+1}] = [x_i, x_i + h]$. Simpson’s rule over [−1, 1] is
$$\int_{-1}^{1} f(x)\,dx = \frac{1}{3} f(-1) + \frac{4}{3} f(0) + \frac{1}{3} f(1) - \frac{1}{90} \left.\frac{d^4 f(x)}{dx^4}\right|_{x=\xi}.$$
Using the change of variables (6.22) to change from $x$ to $t$, we have $a = x_i$,
$b = x_i + h$, and $(b-a)/2 = h/2$. Furthermore, since $x_0 = -1$, $x_1 = 0$, and
$x_2 = 1$ in Simpson’s rule, we have
$$\begin{aligned}
t_0 &= x_i + \frac{h}{2}(-1 + 1) = x_i,\\
t_1 &= x_i + \frac{h}{2}(0 + 1) = x_i + \frac{h}{2},\\
t_2 &= x_i + \frac{h}{2}(1 + 1) = x_i + h.
\end{aligned}$$
Also, since $t''(x) = 0$, repeated application of the chain rule gives
$$\frac{d^4 f(t(x))}{dx^4} = \frac{d^4 f(t)}{dt^4} \left(\frac{dt}{dx}\right)^4 = \left(\frac{h}{2}\right)^4 f^{(4)}(t).$$
Simpson’s rule with error term over the interval $[x_i, x_i + h]$ thus becomes
$$\begin{aligned}
\int_{x_i}^{x_i+h} f(t)\,dt &= \frac{h}{2} \int_{x=-1}^{1} f(t(x))\,dx\\
&= \frac{h}{2} \left[\frac{1}{3} f(x_i) + \frac{4}{3} f\left(x_i + \frac{h}{2}\right) + \frac{1}{3} f(x_i + h) - \frac{1}{90} \left(\frac{h}{2}\right)^4 \left.\frac{d^4 f(t)}{dt^4}\right|_{t=\zeta}\right]\\
&= \frac{h}{6} \left[f(x_i) + 4 f\left(x_i + \frac{h}{2}\right) + f(x_i + h)\right] - \frac{h^5}{2880} \frac{d^4 f(\zeta)}{dt^4}
\end{aligned} \tag{6.24}$$
for some $\zeta \in [x_i, x_i + h]$.
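Formula (6.24) can be checked on an integrand whose fourth derivative is constant, since then the error term is known exactly. The lines below are an illustrative sketch: `simpson_panel` is a hypothetical helper name, and the choice $f(t) = t^4$ (so that $f^{(4)} \equiv 24$) together with the interval $[0.3, 0.8]$ is ours.

```matlab
% Simpson's rule over [xi, xi+h], as in (6.24), tested on f(t) = t^4.
simpson_panel = @(f, xi, h) (h/6) * (f(xi) + 4*f(xi + h/2) + f(xi + h));

f  = @(t) t.^4;                        % fourth derivative is the constant 24
xi = 0.3;
h  = 0.5;

Q         = simpson_panel(f, xi, h);
exact     = ((xi + h)^5 - xi^5) / 5;   % antiderivative t^5/5
err       = exact - Q;                 % actual error
predicted = -(h^5/2880) * 24;          % error term of (6.24) with f^(4) = 24
disp([err, predicted])
```

The two displayed numbers should agree to roundoff, which is one way of confirming the factor $h^5/2880$.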
With a change of variables, we may use the formulas we have seen, or higher-order
formulas derived with the techniques we have seen, to approximate
$\int_a^b f(t)\,dt$ for arbitrary $a$ and $b$. However, the error, given in Theorem 6.1 for
many formulas, is proportional to both the $\mu$-th derivative of $f$ and $(b-a)^{\mu+1}$.
As in polynomial interpolation, we may not be able to decrease the error
by increasing the order $\mu$ of the formula. In such instances, the composite
formulas of the next section may be appropriate.
6.3.7 Composite Quadrature
Using Theorem 6.1 and the change of variables we have presented, for many
quadrature rules $Q(f)$ that are exact when $f(x) = x^{\mu-1}$ but not exact when
$f(x) = x^\mu$, we have
$$\int_a^b f(t)\,dt = Q(f) + K \left(\frac{b-a}{2}\right)^{\mu+1} f^{(\mu)}(\zeta), \tag{6.25}$$
for some $\zeta \in [a, b]$ and constant $K = E_\mu/\mu!$ that depends on the quadrature
formula but not on $f$, provided $f$ has $\mu$ continuous derivatives. However, the
error can easily increase if we apply, say, a sequence of Newton–Cotes formulas
with increasing numbers of points (and hence increasing $\mu$), even though the
constant $K$ is smaller if we use a larger number of points. One reason for this
is that errors in evaluation of $f$ (due either to roundoff error or measurement
error, if the function values are obtained from measuring a physical process)
can be viewed as making the higher-and-higher order derivatives ever larger.
Observing (6.25), we see that we may subdivide the interval [a, b] into $N$
subintervals, $[x_i, x_{i+1}] = [x_i, x_i + h]$, each of length $h = (b-a)/N$, $x_0 = a$,
$x_N = b$, and, with our change of variables,
$$\int_{x_i}^{x_i+h} f(t)\,dt = Q_i(f) + K \left(\frac{h}{2}\right)^{\mu+1} f^{(\mu)}(\zeta_i),$$
where $Q_i(f)$ is the quadrature rule applied to $[x_i, x_i + h]$ and $\zeta_i$ is some
unknown number in $[x_i, x_i + h]$. Thus,
$$\begin{aligned}
\int_a^b f(t)\,dt &= \sum_{i=0}^{N-1} \int_{x_i}^{x_i+h} f(t)\,dt\\
&= \sum_{i=0}^{N-1} Q_i(f) + \sum_{i=0}^{N-1} K \left(\frac{h}{2}\right)^{\mu+1} f^{(\mu)}(\zeta_i)\\
&= \sum_{i=0}^{N-1} Q_i(f) + K \left(\frac{h}{2}\right)^{\mu+1} \sum_{i=0}^{N-1} f^{(\mu)}(\zeta_i)\\
&= \sum_{i=0}^{N-1} Q_i(f) + K \left(\frac{h}{2}\right)^{\mu+1} N f^{(\mu)}(\zeta(N))\\
&= \sum_{i=0}^{N-1} Q_i(f) + K h^{\mu}\, \frac{b-a}{2^{\mu+1}}\, f^{(\mu)}(\zeta(N))\\
&= \sum_{i=0}^{N-1} Q_i(f) + K \left(\frac{1}{N}\right)^{\mu} \left(\frac{b-a}{2}\right)^{\mu+1} f^{(\mu)}(\zeta(N)),
\end{aligned}$$
for some $\zeta(N) \in [a, b]$, where
$$\sum_{i=0}^{N-1} f^{(\mu)}(\zeta_i) = N f^{(\mu)}(\zeta(N)) = \frac{b-a}{h}\, f^{(\mu)}(\zeta(N))$$
by the intermediate value theorem. The error is thus proportional to $(1/N)^\mu$,
and we can decrease the error by increasing the number of subintervals. The
computation
$$Q_{C,N} = \sum_{i=0}^{N-1} Q_i(f) \tag{6.26}$$
is called the composite quadrature rule with $N$ sub-panels corresponding to $Q$.
In principle, we can compute and bound $f^{(\mu)}(x)$ using automatic differentiation,
to determine the $N$ that is needed to achieve a particular error bound.
Until recently$^6$, however, heuristic$^7$ estimates for the error have been obtained
by assuming $\zeta(N)$ does not depend on $N$, so the error in the composite rule
is
$$E_{C,N}(f) = K \left(\frac{1}{N}\right)^{\mu} \left(\frac{b-a}{2}\right)^{\mu+1} f^{(\mu)}(\zeta(N)) \approx \tilde{K} \left(\frac{1}{N}\right)^{\mu}. \tag{6.27}$$
$^6$ and, indeed, in many instances when f comes from measured data or from a computation
to which automatic differentiation cannot be easily applied
$^7$ that is, rule-of-thumb
We then have
$$E_{C,2N}(f) \approx \left(\frac{1}{2}\right)^{\mu} E_{C,N}(f),$$
so
$$\begin{aligned}
\int_a^b f(t)\,dt &= Q_{C,N}(f) + E_{C,N}(f)\\
&= Q_{C,2N}(f) + E_{C,2N}(f)\\
&\approx Q_{C,2N}(f) + \left(\frac{1}{2}\right)^{\mu} E_{C,N}(f).
\end{aligned}$$
We may solve this approximation for $E_{C,N}(f)$ to obtain
$$E_{C,N}(f) \approx \frac{1}{1 - \left(\frac{1}{2}\right)^{\mu}} \bigl(Q_{C,2N}(f) - Q_{C,N}(f)\bigr),$$
and
$$\begin{aligned}
\int_a^b f(t)\,dt &\approx Q_{C,N}(f) + \frac{1}{1 - \left(\frac{1}{2}\right)^{\mu}} \bigl(Q_{C,2N}(f) - Q_{C,N}(f)\bigr)\\
&= \frac{Q_{C,2N} - \left(\frac{1}{2}\right)^{\mu} Q_{C,N}}{1 - \left(\frac{1}{2}\right)^{\mu}}.
\end{aligned} \tag{6.28}$$
This technique gives both an approximation for the error and a higher-order
approximation to the exact value. The technique is used in various contexts in
numerical analysis and software for computing integrals and other quantities.
Used iteratively, it is called Richardson extrapolation. Richardson extrapolation
used with the Trapezoidal rule (the 2-point closed Newton–Cotes rule,
Exercise 2) is called Romberg integration. For details, see our text [1] for a
second course in numerical analysis.
Example 6.13
Suppose $Q(f)$ is Simpson’s rule (as in Example 6.12 on page 233) to compute
$$\int_{-\pi}^{\pi} \frac{\sin(t)}{t}\,dt,$$
where $\lim_{t\to 0} \sin(t)/t = 1$. We have $f(t) = \sin(t)/t$, $a = -\pi$, $b = \pi$, $\mu = 4$,
and $K = -1/90$. For $N = 1$, $h = 2\pi$, and
$$Q_{C,1} = \frac{\pi}{3}\,[f(-\pi) + 4f(0) + f(\pi)] = \frac{\pi}{3}\,[0 + 4 \cdot 1 + 0] = \frac{4\pi}{3}.$$
For $N = 2$, $h = \pi$, and
$$\begin{aligned}
Q_{C,2} &= \frac{\pi}{6} \left[f(-\pi) + 4f\left(-\frac{\pi}{2}\right) + f(0)\right] + \frac{\pi}{6} \left[f(0) + 4f\left(\frac{\pi}{2}\right) + f(\pi)\right]\\
&= \frac{\pi}{6} \left[f(-\pi) + 4f\left(-\frac{\pi}{2}\right) + 2f(0) + 4f\left(\frac{\pi}{2}\right) + f(\pi)\right]\\
&= \frac{\pi}{6} \left[0 + 4\left(\frac{2}{\pi}\right) + 2(1) + 4\left(\frac{2}{\pi}\right) + 0\right] = \frac{8}{3} + \frac{\pi}{3}.
\end{aligned}$$
Thus,
$$E_{C,1}(f) \approx \frac{1}{1 - \left(\frac{1}{2}\right)^4} \left(\frac{8}{3} + \frac{\pi}{3} - \frac{4\pi}{3}\right) = \frac{16}{15} \left(\frac{8}{3} - \pi\right) \approx -0.5066,$$
and
$$\int_{-\pi}^{\pi} \frac{\sin(t)}{t}\,dt = Q_{C,1} + E_{C,1} \approx \frac{4\pi}{3} + \frac{16}{15} \left(\frac{8}{3} - \pi\right) \approx 3.682.$$
The error approximation we can use is
$$E_{C,2} \approx \frac{1}{16} E_{C,1} \approx -0.03167.$$
Thus, we would expect the first two digits of our answer 3.682 to be correct.
In fact, the function
$$\mathrm{Si}(x) = \int_0^x \frac{\sin(t)}{t}\,dt = \frac{1}{2} \int_{-x}^{x} \frac{\sin(t)}{t}\,dt$$
is called the sine integral, and is available as the matlab function sinint.
Thus, the integral in this example is $2\,\mathrm{Si}(\pi)$, and we have the following matlab
dialog.
>> exact = 2*sinint(pi)
exact = 3.703874103964933
>> approx = 8/3 + pi/3
approx = 3.713864217863264
>> true_error = exact-approx
true_error = -0.009990113898332
>>
Assuming matlab’s routine sinint gives a result that has all or most of
its digits correct, we see that the actual error in our computed value for the
integral is well within our heuristic estimate for the error.
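The hand computation in this example can be reproduced with a few lines of matlab. The following is a minimal sketch, not the book's `composite_Newton_Cotes.m`: the handles `f` and `simpson_composite` and the other variable names are our own, and the integrand is written so that it returns its limiting value 1 at $t = 0$.

```matlab
% Composite Simpson's rule with N = 1 and N = 2 panels, then the heuristic
% error estimate and the extrapolated value from (6.28) with mu = 4.
f = @(t) (t == 0) + (t ~= 0) .* sin(t) ./ (t + (t == 0));   % sin(t)/t, equal to 1 at t = 0

% N panels of width (b-a)/N: interior panel end points count twice, midpoints four times.
simpson_composite = @(f, a, b, N) ((b - a)/(6*N)) * ( f(a) + f(b) ...
    + 2*sum(f(a + (1:N-1)*(b - a)/N)) ...
    + 4*sum(f(a + ((0:N-1) + 0.5)*(b - a)/N)) );

a = -pi;  b = pi;  mu = 4;
Q1 = simpson_composite(f, a, b, 1);        % 4*pi/3
Q2 = simpson_composite(f, a, b, 2);        % 8/3 + pi/3
E1 = (Q2 - Q1) / (1 - (1/2)^mu);           % heuristic estimate of E_{C,1}
E2 = E1 / 2^mu;                            % heuristic estimate of E_{C,2}
extrapolated = Q1 + E1;                    % right-hand side of (6.28)
fprintf('%.6f %.6f %.4f %.5f %.4f\n', Q1, Q2, E1, E2, extrapolated);
```

Calling `simpson_composite` with larger `N` gives the kind of sequence of approximations that a doubling strategy would generate.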
Example 6.14
We supply a routine composite_Newton_Cotes.m on the web site
http://interval.louisiana.edu/Classical-and-Modern-NA/#Chapter_6. This
routine implements composite Newton–Cotes formulas of various orders, and
doubles $N$ until the heuristic estimate for the error is within a specified
tolerance. On the other hand, with the change of variables $\pi u = t$, $\pi\,du = dt$, the
integral in Example 6.13 can be written as
$$\int_{-\pi}^{\pi} \frac{\sin(t)}{t}\,dt = \pi \int_{-1}^{1} \frac{\sin(\pi u)}{\pi u}\,du = \pi \int_{-1}^{1} \mathrm{sinc}(u)\,du,$$
where
$$\mathrm{sinc}(u) = \begin{cases} 1, & u = 0,\\[4pt] \dfrac{\sin(\pi u)}{\pi u}, & u \ne 0 \end{cases}$$
is known as the sinc function. matlab has a routine sinc to evaluate the sinc
function. We may use sinc with our routine composite_Newton_Cotes.m to
use a composite Simpson’s rule to compute the integral from Example 6.13:
>> [value, success] = composite_Newton_Cotes('sinc',-1,1,3,1e-14)
2 1.182160
4 1.179154
8 1.178990
16 1.178980
32 1.178980
64 1.178980
128 1.178980
256 1.178980
512 1.178980
1024 1.178980
2048 1.178980
4096 1.178980
value = 1.178979744472170
success = 1
>> integral = pi*value
integral = 3.703874103964942
>>
We see that, with 4096 subintervals, we obtain an approximation to the
error of less than $10^{-14}$, and the first 14 digits of the value returned agree
with the first 14 digits of the value matlab returns for 2 times the sine integral.
6.3.8 Adaptive Quadrature
If the function varies more rapidly in one part of the interval of integration
than in other parts, and it is not known beforehand where the rapid variation
is, then a single rule or a composite rule in which the subintervals all have
the same length is not the most efficient. Also, in general, in routines within
larger numerical software libraries or packages, a user typically supplies a
function $f$, an interval of integration [a, b], and an error tolerance $\epsilon$, without
supplying any additional information about the function’s smoothness.$^8$ In
such cases, the quadrature routine itself should detect which portions of the
interval of integration (or domain of integration in the multidimensional case)
need to have a small interval length, and which portions need to have a larger
interval length, to achieve the specified tolerance $\epsilon$. In such instances, adaptive
quadrature is appropriate.
$^8$ A function is “smooth” if it has many continuous derivatives. Generally the “degree of
smoothness” refers to the number of continuous derivatives available. Even if a function has,
in theory, many continuous derivatives, we might consider it not to be smooth numerically
if it changes curvature rapidly at certain points. An example of this is the function
$f(x) = \sqrt{x^2 + \epsilon}$: as $\epsilon$ gets small, the graph of this function becomes
indistinguishable from that of $f(x) = |x|$.
Adaptive quadrature can be considered to be a type of branch and bound
method$^9$. In particular, the following general procedure can be used to compute
$\int_a^b f(x)\,dx$.
$^9$ We explain another type of branch and bound method, of common use in optimization, in
[1, Section 9.6.3].
ALGORITHM 6.1
(Adaptive quadrature)
INPUT:
1. the interval of integration [a, b] and the function $f$;
2. an absolute error tolerance $\epsilon$, and a minimum interval length $\delta$.
OUTPUT: Either “tolerance has not been met” or “tolerance has been met”
and an approximation sum to the integral
1. (Initialization)
(a) Input an absolute error tolerance $\epsilon$, and a minimum interval length
$\delta$.
(b) Input the interval of integration [a, b] and the function $f$.
(c) sum ← 0.
(d) $L \leftarrow \{[a, b]\}$, where $L$ is a list of subintervals that need to be
considered.
2. DO WHILE $L \ne \emptyset$.
(a) Remove the first interval from $L$ and place it in the current interval
$[\underline{c}, \overline{c}]$.
(b) Apply a quadrature formula over the current interval $[\underline{c}, \overline{c}]$ to obtain
an approximation $I_c$.
(c) (bound): Use an error formula for the rule to obtain a bound $E_c$
for the error, or else obtain $E_c$ as a heuristic estimate for the error.
This can be done by either using an error formula or by comparing
with a different quadrature rule of the same or different order.
(d) IF $E_c < \epsilon\,(\overline{c} - \underline{c})$, THEN
sum ← sum + $I_c$.
ELSE
IF $(\overline{c} - \underline{c}) < \delta$ THEN
RETURN with a message that the tolerance $\epsilon$ could not be
met with the given minimum step size $\delta$.
ELSE
(branch): form two new intervals $[\underline{c}, (\underline{c} + \overline{c})/2]$ and
$[(\underline{c} + \overline{c})/2, \overline{c}]$, and store each into the list $L$.
END IF
END IF
END DO
3. RETURN with a message that the tolerance has been met, and return
sum as the approximation to the integral.
END ALGORITHM 6.1.
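The list-based procedure of Algorithm 6.1 can be sketched in matlab as follows. This is an illustration under several assumptions of ours, not a routine from the book: the quadrature formula is Simpson's rule, $E_c$ is the heuristic estimate $|S_2 - S_1|/15$ obtained by comparing Simpson's rule over the current interval ($S_1$) with Simpson's rule over its two halves ($S_2$), the value added to the sum is $S_2$, and the integrand $1/(1 + 25x^2)$ on [-1, 1] and all variable names are chosen only for the demonstration.

```matlab
% Adaptive quadrature with an explicit list of intervals (Algorithm 6.1).
f     = @(x) 1 ./ (1 + 25*x.^2);      % test integrand, peaked near x = 0
a     = -1;   b = 1;
tol   = 1e-8;                         % absolute error tolerance (epsilon)
delta = 1e-8;                         % minimum interval length

simpson = @(c, d) (d - c)/6 * (f(c) + 4*f((c + d)/2) + f(d));

L     = [a, b];                       % each row is an interval still to be considered
total = 0;                            % running approximation ("sum" in the algorithm)
met   = true;                         % becomes false if the tolerance cannot be met
while ~isempty(L)
    c = L(1, 1);  d = L(1, 2);        % remove the first interval from the list
    L(1, :) = [];
    mid = (c + d)/2;
    S1  = simpson(c, d);                      % one panel
    S2  = simpson(c, mid) + simpson(mid, d);  % two half-width panels
    E   = abs(S2 - S1)/15;            % (bound) heuristic error estimate
    if E < tol*(d - c)
        total = total + S2;
    elseif (d - c) < delta
        met = false;                  % tolerance not met with minimum length delta
        break
    else
        L = [L; c, mid; mid, d];      % (branch) bisect and store both halves
    end
end
exact = (2/5)*atan(5);
fprintf('met = %d, total = %.12f, error = %.2e\n', met, total, total - exact);
```

Because accepted subintervals are narrower near the peak at $x = 0$ than near the end points, the list ends up refining the interval non-uniformly, which is the point of the adaptive process.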
An early good example implementation of an adaptive quadrature routine
is given in the classic text [15] of Forsythe, Malcolm, and Moler.$^{10}$ This
routine, quanc8, is based on an 8-panel Newton–Cotes quadrature formula
and a heuristic estimate for the error. The heuristic estimate is obtained by
comparing the approximation with the 8-panel rule over the entire subinterval
$I_c$ and the approximation with the composite rule obtained by applying the
8-panel rule over the two halves of $I_c$; see [15, pp. 94–105] for details. The
routine itself$^{11}$ can be found in NETLIB, presently at
http://www.netlib.org/fmm/quanc8.f.
An extremely elegant implementation, using recursion, is the pair of
matlab functions quadtx and quadgui described in [25, Section 6.3].
In recursion, the adaptive process is arranged so the loop of Step 2 of
Algorithm 6.1 is absent, and, instead, the quadrature routine is called again. A
recursive version of Algorithm 6.1 is as follows.
$^{10}$ This text doubles as an elementary numerical analysis text and as a “user guide” for the
routines it explains. It distinguished itself from other texts of the time by featuring routines
that were simple enough to be used to explain the elementary concepts, yet sophisticated
enough to be used to solve practical problems.
$^{11}$ In Fortran 66, but written carefully and clearly.
ALGORITHM 6.2
(Recursive version of adaptive quadrature)
INPUT:
1. the interval of integration [a, b] and the function $f$;
2. an absolute error tolerance $\epsilon$, and a minimum interval length $\delta$.
OUTPUT: Either “failure” (tolerance has not been met) or “success”
(tolerance has been met) and an approximation $I$ to the integral
1. Apply a quadrature formula over the current interval [a, b] to obtain an
approximation $I$.
2. (bound): Use an error formula for the rule to obtain an approximation
$I$ for the integral and a bound $E$ for the error, or else obtain $E$ as a
heuristic estimate for the error. This can be done by either using an
error formula or by comparing with a different quadrature rule of the
same or different order.
3. IF $E < \epsilon$, THEN
RETURN “success” and $I$.
ELSE
IF $(b - a) < \delta$ THEN
RETURN “failure”.
ELSE (branch)
(a) Invoke this algorithm with function $f$, interval of integration
$[a, (a + b)/2]$, error tolerance $\epsilon/2$, minimum interval length $\delta$,
and output success$_1$ and $I_1$.
(b) Invoke this algorithm with function $f$, interval of integration
$[(a + b)/2, b]$, error tolerance $\epsilon/2$, minimum interval length $\delta$,
and output success$_2$ and $I_2$.
(c) IF both success$_1$ and success$_2$, THEN
RETURN “success” and $I = I_1 + I_2$,
ELSE
RETURN “failure”.
END IF
END IF
END IF
END ALGORITHM 6.2.
Usually, adaptive algorithms that are implemented recursively are simpler,
easier to program, and easier for humans to understand than adaptive algo-
rithms that use lists and loops. However, functions that are invoked recur-
sively usually involve more overhead and are less eﬃcient.
An illustration of the behavior of an adaptive quadrature algorithm is [25,
Figure 6.3].
The matlab routine quad does an adaptive quadrature based on Simpson’s
rule.
Example 6.15
We can use the matlab routine quad, the matlab function sinc, and the
change of variables from Example 6.14 to compute the sine integral from
Example 6.13 to a speciﬁed (heuristically determined) accuracy:
>> format long
>> [I,n_function_values] = quad('sinc',-1,1,1e-14)
I = 1.178979744472168
n_function_values = 1177
>> I = pi*I
I = 3.703874103964933
>>
We see that the ﬁrst 14 digits agree with the result we obtained in Exam-
ple 6.14, but with only about 1/4 the number of function evaluations.
6.3.9 Multiple Integrals, Singular Integrals, and Infinite Intervals
We describe some special considerations in numerical integration in this
section.
6.3.9.1 Multiple Integrals
Consider
$$\int_a^b \int_c^d f(x, y)\,dy\,dx \quad\text{or}\quad \int_a^b \int_c^d \int_r^s f(x, y, z)\,dx\,dy\,dz.$$
How can we approximate these integrals? One way is with a product formula,
in which we apply a one-dimensional quadrature rule in each variable.
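A product formula can be sketched directly with `ndgrid`. In the lines below, which are our own illustration rather than one of the book's routines, the 3-point Gauss–Legendre rule (sample points $\pm\sqrt{3/5}$ and 0, weights $w_i = 2\alpha_i = 5/9, 8/9, 5/9$) is mapped to [0, 1] with the change of variables (6.22) and applied in each of three variables; the integrand $e^{-(x+y+z)}$ over the unit cube is chosen because its integral factors as $(1 - e^{-1})^3$.

```matlab
% Product of three 1-D Gauss-Legendre rules over the unit cube [0,1]^3.
z  = [-1, 0, 1] * sqrt(3/5);          % 3-point Gauss-Legendre sample points on [-1,1]
w  = [5, 8, 5] / 9;                   % corresponding weights on [-1,1]
x  = (z + 1)/2;                       % sample points mapped to [0,1]
wx = w/2;                             % weights scaled by (b-a)/2 = 1/2

[X, Y, Z]    = ndgrid(x, x, x);       % all 27 sample points
[WX, WY, WZ] = ndgrid(wx, wx, wx);    % matching products of 1-D weights

F  = exp(-(X + Y + Z));               % integrand at the sample points
I3 = sum(WX(:) .* WY(:) .* WZ(:) .* F(:));

exact = (1 - exp(-1))^3;
fprintf('I3 = %.10f, error = %.2e\n', I3, I3 - exact);
```

The number of function values grows as the number of 1-D points raised to the dimension (here $3^3 = 27$), which is why product formulas become expensive in many variables.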
Using recursion, it is not hard to write a system of matlab “m” files that
computes
$$\int_{a_n}^{b_n} \int_{a_{n-1}}^{b_{n-1}} \cdots \int_{a_1}^{b_1} f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n$$
for general $n$ using an already-programmed quadrature routine for one-dimensional
integrals. For instance, we may modify the routine
composite_Newton_Cotes.m (which we used in Example 6.14 on page 237).
We create a function multiquad, which calls our modification
multiquad_composite_Newton_Cotes, which in turn calls the integration function
multiquad_func, which in turn calls our top routine multiquad. That
is, we have:
multiquad → multiquad_composite_Newton_Cotes → multiquad_func
→ multiquad.
The routines can be as follows.
function [value, n_eval, success] = multiquad...
(n, a, b, current_arg, n_eval, f, tol, m)
[value, success, n_eval] = multiquad_composite_Newton_Cotes...
(f, a(n), b(n), m, tol, a(1:n), b(1:n), n, current_arg, n_eval);
% (Calls multiquad_func)
if (~success)
return
end
function [value, n_eval,success] = multiquad_func...
(x, a, b, n, current_arg, n_eval, f, tol, m)
current_arg(n)=x;
if (n == 1)
value = feval(f,current_arg);
n_eval = n_eval + 1;
success = 1;
else
[value, n_eval, success] = multiquad...
(n-1, a(1:n-1), b(1:n-1), current_arg, n_eval, f, tol, m);
if (~success)
return
end
end
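Example 6.16
Consider the illustrative example
$$I_3 = \int_0^1 \int_0^1 \int_0^1 e^{-(x_1 + x_2 + x_3)}\,dx_1\,dx_2\,dx_3.$$
In this case, we know that
$$I_3 = \left(\int_0^1 e^{-x}\,dx\right)^3 = \left(1 - e^{-1}\right)^3 \approx 0.252580457827647,$$
so we may check any results we obtain. We program the integrand as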
function [y] = multiquad_example(x)
y = exp(-(x(1)+x(2)+x(3)));
We now use our recursive routine multiquad:
>> format long
>>[value, n_eval, success] = multiquad(3,a,b,current_arg, 0, ’multiquad_example’, 1e-8,3)
value = 0.252580458119098
n_eval = 6180168
success = 1
244 Applied Numerical Methods
>>[value, n_eval, success] = multiquad(3,a,b,current_arg, 0, ’multiquad_example’, 1e-8,5)
value = 0.252580457923039
n_eval = 27000
success = 1
>> [value, n_eval, success] = multiquad(3,a,b,current_arg, 0, 'multiquad_example', 1e-12,5)
value = 0.252580457827670
n_eval = 3459640
success = 1
>> [value, n_eval, success] = multiquad(3,a,b,current_arg, 0, 'multiquad_example', 1e-12,7)
value = 0.252580457827654
n_eval = 83888
success = 1
>>
We see that decreasing the tolerance has a much larger effect on the number
of function evaluations with this multidimensional quadrature than if we were
computing a single integral. We also see the effect of using higher-order formulas.
Programming multidimensional quadrature with recursion is perhaps the
easiest way to understand quadrature as iterated integrals with the product
rule concept. However, this is usually not the most eﬃcient way of program-
ming multidimensional quadrature: Programming the recursion explicitly in
a single routine leads to a more complicated program, but to a program that
completes more quickly. Other increases in eﬃciency may be obtained with
product rules based on adaptive routines. A third possibility would be to
think of the product rule as a single rule, to be applied in an adaptive routine
that subdivides the n-dimensional region directly, analogously to the process
we explain for optimization in Chapter ??. For triple integrals, matlab has
the function triplequad.
Example 6.17
We use triplequad to approximate our integral from Example 6.16:
>> value = triplequad(@(x,y,z) exp(-(x+y+z)),0,1,0,1,0,1,1e-12)
value = 0.252580457827648
>>
The response is still slow, but possibly somewhat faster than our recursive
implementation with composite Newton–Cotes. See matlab’s help facility
for additional control over the accuracy of the computation, etc. Also, see
matlab’s help facility for anonymous functions for an explanation of the
syntax “@(x,y,z) exp(-(x+y+z)).”
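The product-rule idea behind Examples 6.16 and 6.17 can also be sketched without recursion. The following is a minimal illustration, not the routine used above: it assumes composite Simpson's rule with N = 20 subintervals in each variable (our choice), builds the one-dimensional weight vector w, and applies it along each of the three coordinate directions. The names N, w, F, v, and approx are hypothetical.

```matlab
% Product Simpson rule for the integral of exp(-(x1+x2+x3)) over [0,1]^3.
N = 20;                          % subintervals per variable (must be even)
x = linspace(0, 1, N+1);         % nodes, the same in every variable
w = ones(1, N+1);                % Simpson weights 1,4,2,4,...,2,4,1
w(2:2:N) = 4;
w(3:2:N-1) = 2;
w = w/(3*N);                     % scale by h/3 with h = 1/N

[X1, X2, X3] = ndgrid(x, x, x);  % all (N+1)^3 product nodes
F = exp(-(X1 + X2 + X3));        % integrand at the product nodes

% Apply the one-dimensional rule in x1, then in x2 and x3.
v = w*reshape(F, N+1, []);       % sum over x1; result is 1-by-(N+1)^2
v = reshape(v, N+1, N+1);        % remaining indices are (x2, x3)
approx = w*v*w';                 % sum over x2 and x3

exact = (1 - exp(-1))^3;
fprintf('approx = %.12f, error = %.2e\n', approx, abs(approx - exact));
```

The sketch uses (N+1)^3 integrand values, so the work grows like the n-th power of the number of nodes per variable in n dimensions.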
6.3.9.2 Singularities and Inﬁnite Intervals
Consider $\int_a^b f(x)\,dx$. Suppose that f is Riemann integrable but has a
singularity somewhere on [a, b]. (Alternatively, for example, f may be continuous
but f′ may have a singularity on [a, b], which results in low accuracy of the
numerical quadrature methods used unless a large number of intervals is taken.)
Example 6.18
An illustrative example is
\[
\int_0^1 \frac{1}{\sqrt{x}}\, dx = 2.
\]
We get an error if we apply a standard quadrature routine that tries to eval-
uate the integrand at the end points, and we get low accuracy or ineﬃcient
computation, even if we don’t attempt to evaluate at the end point, if we
don’t take account of the singularity. For example:
>> [value, success] = composite_Newton_Cotes(@(x) 1/sqrt(x),0,1,3,1e-4)
2 Inf
4 Inf
value = Inf
success = 1
>>
matlab’s adaptive routine quad does somewhat better:
>> [value,n_function_values] = quad(@(x) 1./sqrt(x),0,1,1e-4)
value = 2.001366443256776
n_function_values = 134
>> [value,n_function_values] = quad(@(x) 1./sqrt(x),0,1,1e-12)
value = 1.999999999792113
n_function_values = 2426
>>
To have a chance of evaluating a singular integral accurately and efficiently,
an adaptive routine should be able to detect that it cannot evaluate the
integrand exactly at an end point.
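One standard way to take account of an end-point singularity is a change of variables. The following is a minimal sketch and is not claimed to be one of the specific techniques of [1, Section 6.3.5]: for the integrand of Example 6.18, the substitution x = u^2 (our choice) gives dx = 2u du, so the transformed integrand 2u f(u^2) is identically 2 on (0, 1]. The composite midpoint rule, which never evaluates at an end point, is applied to both forms. The names f, g, N, direct, and transformed are hypothetical.

```matlab
f = @(x) 1./sqrt(x);             % integrand of Example 6.18, singular at x = 0
g = @(u) 2*u.*f(u.^2);           % integrand after substituting x = u^2

N = 100;                         % number of subintervals (assumed)
h = 1/N;
mid = h*((1:N) - 0.5);           % midpoints; the end points are never used

direct = h*sum(f(mid));          % midpoint rule on the singular integrand
transformed = h*sum(g(mid));     % midpoint rule after the substitution

fprintf('direct      = %.8f (error %.2e)\n', direct, abs(direct - 2));
fprintf('transformed = %.8f (error %.2e)\n', transformed, abs(transformed - 2));
```

The direct sum converges only slowly as N grows, while the transformed integrand is smooth and is integrated essentially exactly.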
Example 6.19
Suppose we want to evaluate the illustrative integral
\[
\int_0^{\infty} e^{-x}\, dx = 1.
\]
We cannot use quad directly, since the limits must be ﬁnite:
>> [value,n_function_values] = quad(@(x) exp(-x),0,Inf,1e-12)
value = NaN
n_function_values = 13
>>
However, the matlab function quadgk supports infinite limits:
>> [value, error_bound] = quadgk(@(x) exp(-x),0,Inf)
value = 1
error_bound = 1.644999922012122e-011
>>
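Infinite limits can also be removed by hand with a change of variables. The sketch below is an illustration under our own assumptions, not a description of how quadgk works internally: the substitution x = t/(1 − t) maps 0 ≤ t < 1 onto 0 ≤ x < ∞, with dx = dt/(1 − t)^2, and the composite midpoint rule with N = 1000 subintervals (our choice) is applied to the transformed integrand. The names f, g, N, and approx are hypothetical.

```matlab
f = @(x) exp(-x);                        % integrand of Example 6.19
g = @(t) f(t./(1 - t))./(1 - t).^2;      % integrand after x = t/(1-t)

N = 1000;                                % number of subintervals (assumed)
h = 1/N;
t = h*((1:N) - 0.5);                     % midpoints; t = 1 is never used
approx = h*sum(g(t));                    % composite midpoint rule on [0,1)

fprintf('approx = %.10f, error = %.2e\n', approx, abs(approx - 1));
```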
We outline a few basic techniques for handling singular integrals in our
text for the second course [1, Section 6.3.5]. See matlab’s “help” facility for
the various quadrature routines available within matlab and its toolboxes.
6.3.10 Interval Bounds
Mathematically rigorous bounds on integrals can be computed, for sim-
ple integration, for composite rules, and for adaptive quadrature, if interval
arithmetic is used in the error formulas. As an example, take the two point
Gauss–Legendre quadrature rule:
\[
\int_{-1}^{1} f(x)\, dx = f\left( -\frac{1}{\sqrt{3}} \right) + f\left( \frac{1}{\sqrt{3}} \right) + \frac{1}{135} f^{(4)}(\xi), \tag{6.29}
\]
for some ξ ∈ [−1, 1], where the quadrature formula is obtained from Table 6.1
(on page 227) and where the error term is obtained from Theorem 6.1. Now,
suppose we want to ﬁnd guaranteed error bounds on the integral
\[
\int_{-1}^{1} e^{0.1x}\, dx.
\]
Then, the fourth derivative of $e^{0.1x}$ is $(0.1)^4 e^{0.1x}$, and an interval evaluation
of this over x = [−1, 1] gives
\[
(0.1)^4 e^{0.1x} \in [0.9048, 1.1052] \times 10^{-4} \quad \text{for } x \in [-1, 1],
\]
where the interval enclosure for the range $e^{0.1[-1,1]}$ was obtained using the
matlab toolbox intlab [33]. The application of the quadrature rule thus
gives
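\[
\begin{aligned}
\int_{-1}^{1} e^{0.1x}\, dx &\in e^{-0.1/\sqrt{3}} + e^{0.1/\sqrt{3}} + \frac{1}{135} [0.9048, 1.1052] \times 10^{-4} \\
&\subseteq e^{-0.1/[1.7320,1.7321]} + e^{0.1/[1.7320,1.7321]} + \frac{1}{135} [0.9048, 1.1052] \times 10^{-4} \\
&\subseteq [2.0033, 2.0034],
\end{aligned}
\]
where the remainder interval carries the factor 1/135 from (6.29) and the
computations are done with interval arithmetic, for example within intlab.
This set of computations provides a mathematical proof that the exact
integral lies within [2.0033, 2.0034].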
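The two-point rule (6.29) and its remainder can be cross-checked for this integrand in ordinary floating point. This sketch is only a plausibility check under our own assumptions: it uses double precision rather than intlab's interval arithmetic, so it is not itself rigorous, and the names Q, d4_lo, d4_hi, lo, hi, and exact are hypothetical. It relies on the fact that the fourth derivative (0.1)^4 e^{0.1x} is increasing, so its range over [−1, 1] is attained at the end points.

```matlab
f = @(x) exp(0.1*x);

% Two-point Gauss-Legendre value from (6.29), without the remainder term.
Q = f(-1/sqrt(3)) + f(1/sqrt(3));

% Range of the fourth derivative (0.1)^4*exp(0.1*x) over [-1, 1].
d4_lo = 0.1^4*exp(-0.1);
d4_hi = 0.1^4*exp(0.1);

% Bounds on the integral from (6.29): Q + (1/135)*f''''(xi).
lo = Q + d4_lo/135;
hi = Q + d4_hi/135;

exact = (exp(0.1) - exp(-0.1))/0.1;      % antiderivative evaluated directly
fprintf('[%.8f, %.8f] should contain %.8f\n', lo, hi, exact);
```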
The higher order derivatives required in the quadrature formulas can be
bounded over intervals using a combination of automatic differentiation
(explained in §6.2, starting on page 215, of this book) and interval arithmetic.
The mathematically rigorous error bounds obtained by this technique can
be used in an adaptive quadrature technique, and the resulting routine can
give mathematically rigorous bounds, provided $I_c$ and sum are computed with
interval arithmetic and the error bounds are added to each $I_c$ when it is added
to sum. Such a routine is described in [10], although an updated package was
not widely available at the time this book was written.
6.4 Applications
Consider the following ordinary differential equation for population growth
\[
\frac{dx}{dt} = (a - bx)x = ax - bx^2, \tag{6.30}
\]
where a and b are positive constants. Here, x(t) is the population density at
time t, a is the birth rate and bx(t) is the density-dependent death rate. Thus,
(a − bx(t)) is the density-dependent growth rate of the population. (This type
of equation is also well known as the logistic equation.) The solution of a
logistic equation can be derived by separating variables, and is
\[
x(t) = \frac{a}{Ce^{-at} + b},
\]
where C is an arbitrary constant. It follows that $\lim_{t \to \infty} x(t) = a/b$.
For illustration, let us approximate the differential equation by a difference
equation. The derivative dx/dt can be approximated by a difference quotient,
\[
\frac{dx}{dt} \approx \frac{x(t+h) - x(t)}{h}.
\]
This leads to
\[
\frac{x(t+h) - x(t)}{h} = ax(t) - bx^2(t).
\]
After simplification, we obtain the following difference equation
\[
x(t+h) = (1 + ah)x(t) - bhx^2(t). \tag{6.31}
\]
We will see that equation (6.31) has diﬀerent dynamical behavior depending
on the magnitude of ah. To visualize the dynamics of equation (6.31), we run
the following matlab code:
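clear all
h=0.01;        % step size
b=0.2;
a=25;          % Change to a=200,250,295
x(1)=25;       % initial population x(0)
T=0.5;
t=0:h:T;
K=a/b          % limiting population a/b of (6.30), displayed for reference
for j=1:length(t)-1
x(j+1)=(1+a*h)*x(j)-b*h*x(j)^2;   % one step of (6.31)
end
plot(t,x,'ko-','LineWidth',2)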
[Figure: x(t + h) computed from (6.31), plotted against t for 0 ≤ t ≤ 0.5, in four panels: a = 25, a = 200, a = 250, and a = 295.]
Notice that, in contrast to (6.30), the dynamics of (6.31) change as the
parameter a varies. Consequently, we seek another difference equation to
approximate the differential equation so that it has the same dynamics as (6.30).
Since x(t) is continuous on its domain, x(t + h) is a close approximation to
x(t) for h > 0 small. Thus,
\[
(1 + ah)x(t) - bhx^2(t) \approx (1 + ah)x(t) - bhx(t+h)x(t).
\]
Using this approximation for the right side of (6.31) gives
\[
x(t+h) = (1 + ah)x(t) - bhx(t+h)x(t).
\]
Solving the above equation for x(t + h) gives the difference equation
\[
x(t+h) = \frac{(1 + ah)x(t)}{1 + bhx(t)}. \tag{6.32}
\]
The following figure shows that with different values of a, the population
modeled by (6.32) has the same outcome (all converge to a/b).
[Figure: x(t + h) computed from (6.32), plotted against t for 0 ≤ t ≤ 0.5, in four panels: a = 25, a = 200, a = 250, and a = 295.]
Finally, we point out that the exact difference equation version of logistic
growth can be obtained by separating variables and integrating equation
(6.30), giving
\[
x(t+h) = \frac{a e^{ah} x(t)}{a + b(e^{ah} - 1)x(t)}. \tag{6.33}
\]
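The three difference equations (6.31), (6.32), and (6.33) can be compared side by side. The following is a minimal sketch using the parameter values of the code above with a = 295; the variable names x_euler, x_ns, and x_exact are hypothetical labels for the three recurrences, and the comparison itself is our own illustration.

```matlab
a = 295; b = 0.2; h = 0.01; T = 0.5;
n = round(T/h);                  % number of steps
K = a/b;                         % limiting population of (6.30)

x_euler = zeros(1, n+1);         % iterates of (6.31)
x_ns    = zeros(1, n+1);         % iterates of (6.32)
x_exact = zeros(1, n+1);         % iterates of (6.33)
x_euler(1) = 25; x_ns(1) = 25; x_exact(1) = 25;

E = exp(a*h);
for j = 1:n
    x_euler(j+1) = (1 + a*h)*x_euler(j) - b*h*x_euler(j)^2;        % (6.31)
    x_ns(j+1)    = (1 + a*h)*x_ns(j)/(1 + b*h*x_ns(j));            % (6.32)
    x_exact(j+1) = a*E*x_exact(j)/(a + b*(E - 1)*x_exact(j));      % (6.33)
end

fprintf('a/b = %g\n', K);
fprintf('(6.31): final %g, min %g, max %g\n', x_euler(end), min(x_euler), max(x_euler));
fprintf('(6.32): final %g\n', x_ns(end));
fprintf('(6.33): final %g\n', x_exact(end));
```

With these values, ah = 2.95, and the iterates of (6.31) keep oscillating rather than settling at a/b, while those of (6.32) and (6.33) approach a/b.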
6.5 Exercises
1. Fill in the details in the derivation of Simpson’s rule. (See page 223.)
2. Derive the trapezoidal rule (the 2-point closed Newton–Cotes rule)
\[
\int_{-1}^{1} f(x)\, dx = w_0 f(-1) + w_1 f(1) + E(f),
\]
where E(f) is the error term. (That is, find $w_0$, $w_1$, and E(f).)
3. Use the transformations as in Section 6.3.6 to transform the trapezoidal
rule and corresponding error term to the interval $[x_i, x_i + h]$.
4. Fill in the details of the computations in Example 6.9.
5. Derive a 2-point Gauss formula that integrates
\[
\int_{-\pi}^{\pi} f(x) \sin(x)\, dx
\]
exactly when f is a polynomial of degree 3 or less.
6. Use Theorem 6.1 (on page 231) to derive the error in the 2-point
Gauss–Laguerre quadrature formula. (See Example 6.8 on page 228.)
7. Use Theorem 6.1 (on page 231) to derive the error in the 2-point
Gauss–Hermite quadrature formula. (See Example 6.9 on page 229.)
8. Carry out the details of the computation to derive (6.29).
9. Assume that we have a finite-difference approximation method where
the roundoff error is $O(\epsilon/h)$ and the truncation error is $O(h^n)$. Using
the error bounding technique exemplified in (6.5) on page 212, show that
the optimal h is $O(\epsilon^{1/(n+1)})$ and the minimum achievable error bound
is $O(\epsilon^{n/(n+1)})$.
10. Fill in the details of the computations for Example 6.4.
11. Solve the system (6.14) (on page 220) and compare your result to the
corresponding directional derivative of f computed by taking the gradi-
ent of f and taking the dot product with the direction.
12. Consider quadrature formulas of the form
\[
\int_0^1 f(x) \left[ x \ln(1/x) \right] dx = a_0 f(0) + a_1 f(1).
\]
(a) Find $a_0$ and $a_1$ such that the formula is exact for linear polynomials.
(b) Describe how the above formula, for h > 0, can be used to approximate
\[
\int_0^h g(t)\, t \ln(h/t)\, dt.
\]
13. Suppose that I(h) is an approximation to $\int_a^b f(x)\, dx$, where h is the
width of a uniform subdivision of [a, b]. Suppose that the error satisfies
\[
I(h) - \int_a^b f(x)\, dx = c_1 h + c_2 h^2 + O(h^3),
\]
where $c_1$ and $c_2$ are constants independent of h. Let I(h), I(h/2), and
I(h/3) be calculated for a given value of h. Use the values I(h), I(h/2)
and I(h/3) to find an $O(h^3)$ approximation to $\int_a^b f(x)\, dx$.
14. Compute an accurate approximation to the following integral:
\[
I = \int_{-\infty}^{1} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.
\]
15. Find the nodes $x_i$ and the corresponding weights $A_i$, i = 0, 1, 2, so the
formula
\[
\int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}}\, f(x)\, dx \approx \sum_{i=0}^{2} A_i f(x_i)
\]
is exact when f(x) is any polynomial of degree 5. Compare your solution
with the roots of the Chebyshev polynomial of the first kind $T_3$, given
by $T_3(x) = \cos(3 \cos^{-1}(x))$.
16. Suppose that a particular composite quadrature rule is used to approximate
$\int_0^2 e^{x^2}\, dx$. The following values are obtained for N = 8, 16, and
32 intervals, respectively: 16.50606, 16.45436, and 16.45347. Using only
these values, estimate the power of h to which the error is proportional,
where h = 2/N.
17. A two dimensional Gaussian quadrature formula has the form
\[
\int_{-1}^{1} \int_{-1}^{1} f(x, y)\, dx\, dy = f(\alpha, \alpha) + f(-\alpha, \alpha) + f(\alpha, -\alpha) + f(-\alpha, -\alpha) + E(f).
\]
Find the value of α such that the formula is exact (i.e., E(f) = 0) for
every polynomial f(x, y) of degree less than or equal to 2 in 2 variables,
i.e., $f(x, y) = \sum_{i,j=0}^{2} a_{ij} x^i y^j$.
18. (A significant programming project) Add a function
[a_k, b_k] = abfunc(k, x)
to the system consisting of multiquad,
multiquad_composite_Newton_Cotes and multiquad_func from
Section 6.3.9.1, so the resulting system computes
\[
\int_{a_n}^{b_n} \int_{a_{n-1}(x_n)}^{b_{n-1}(x_n)} \cdots \int_{a_1(x_2, \ldots, x_n)}^{b_1(x_2, \ldots, x_n)} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n .
\]
The function name abfunc should be passed as an argument to
multiquad (and hence also to multiquad_composite_Newton_Cotes and
multiquad_func), and evaluated in multiquad_composite_Newton_Cotes.
Its evaluation
[a_k, b_k] = feval(abfunc, k, current_arg)
returns the appropriate value of the lower bound $a_k$ and upper bound
$b_k$. (Note that $x_i$, k + 1 ≤ i ≤ n, has already been stored in
current_arg(n-i+1), that the value of n in
multiquad_composite_Newton_Cotes is in the array n_array, and the
value of k appropriate for the actual call is n.)
Test your modified routine by computing an approximation to
\[
\int_0^1 \int_0^{1 - x_3} \int_0^{1 - x_3 - x_2} (x_1 + x_2 + x_3)\, dx_1\, dx_2\, dx_3 = \frac{1}{8} = 0.125.
\]
Chapter 7
Initial Value Problems for Ordinary
Diﬀerential Equations
7.1 Introduction
In this chapter, we study solution of initial-value problems (IVP) for systems
of differential equations. We can write such initial value problems as finding
solutions to
\[
\begin{cases}
y'(t) = f(t, y(t)), & a \le t \le b, \\
y(a) = y_0,
\end{cases}
\tag{7.1}
\]
where f is a given function and $y_0$ is given. To introduce the solution
concepts, we first consider y, $y_0$, and f to represent real values and real-valued
functions, then later show that our techniques hold when y, $y_0$, and f are
vectors with n components. In fact, we will see that arbitrary systems of
nonlinear differential equations can be transformed into the form (7.1), and
the form (7.1) is the form in which software for finding approximate solutions
to initial value problems solves such systems.
Generally, the approximate numerical solution that software delivers is in
the form of a table of values $t_k$ of the independent variable and corresponding
values $y_k$ of the dependent variable (that is, of the function that solves the
differential equation). We think of $t_{k+1} = t_k + h_k$, where $h_k$ is the k-th step
length, and $y_k$ is our numerical approximation to $y(t_k)$.
An important consideration when ﬁnding approximate numerical solutions
to an initial value problem is whether such a solution exists and whether it
is unique. Roughly, the solution exists and is unique if f is continuous and
satisﬁes a Lipschitz condition in y (see Deﬁnition 2.2 on page 48). We give
examples and some theory in our second-level text [1].
We now consider a prototypical method; although more eﬃcient methods
are usually used in practice, this method illustrates the basic ideas of numer-
ical solution of initial value problems in practice.
7.2 Euler’s method
The simplest method we consider is Euler’s method. One can view Euler’s
method as a kind of “successive tangent line approximation” method, or
repeated approximation of y by degree-1 Taylor polynomials. In particular,
\[
\begin{aligned}
y(t_{k+1}) &= y(t_k) + h_k y'(t_k) + O(h_k^2) \\
&= y(t_k) + h_k f(t_k, y(t_k)) + O(h_k^2) \\
&= y(t_k) + h_k \Phi(t_k, y(t_k)) + O(h_k^2),
\end{aligned}
\tag{7.2}
\]
where Φ(t, y) is the iteration function for the step, which for Euler’s method
is Φ(t, y) = f(t, y). If we replace $y(t_k)$ by our approximation $y_k$, we obtain
Euler’s method:
\[
y_{k+1} = y_k + h_k f(t_k, y_k). \tag{7.3}
\]
Example 7.1
For illustration, suppose a = 0, b = 1, f(t, y) = t, h = 0.25, and y(0) = 0.
We see immediately that the solution is
\[
y(t) = \int_0^t s\, ds = \frac{t^2}{2}.
\]
Euler’s method gives
\[
\begin{aligned}
y_1 &= y_0 + 0.25\, t_0 = 0 + 0.25(0) = 0, \\
y_2 &= y_1 + 0.25\, t_1 = 0 + 0.25(0.25) = 0.0625, \\
y_3 &= 0.0625 + 0.25(0.5) = 0.1875, \\
y_4 &= 0.1875 + 0.25(0.75) = 0.375.
\end{aligned}
\]
We see that, when f does not depend on the unknown function y, the initial
value problem reduces to finding the values of an indefinite integral, and
Euler’s method reduces to approximating an integral as
\[
\int_{t_k}^{t_k + h_k} f(s)\, ds \approx h_k f(t_k),
\]
that is, we approximate the integral over $[t_k, t_k + h_k]$ by the area of the
rectangle of width $h_k$ and height $f(t_k)$, as in a left Riemann sum. Indeed,
methods for finding approximate solutions to initial value problems resemble
methods for finding integrals in various ways, and we speak of “integrating”
the differential equation.
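The Euler recurrence (7.3) for Example 7.1 can be written as a short loop. This is a minimal sketch; the variable names f, h, t, and y are our own, and the arrays are indexed from 1, so y(k+1) holds the approximation at t(k+1) = k h.

```matlab
f = @(t, y) t;                   % right-hand side of Example 7.1
h = 0.25;                        % constant step size
t = 0:h:1;                       % t_0, ..., t_4
y = zeros(size(t));              % y(1) = y_0 = 0

for k = 1:numel(t) - 1
    y(k+1) = y(k) + h*f(t(k), y(k));     % Euler step (7.3)
end

disp([t; y; t.^2/2]);            % rows: t_k, Euler values, exact t^2/2
```

The last Euler value is the left Riemann sum 0.25(0 + 0.25 + 0.5 + 0.75) for the integral of s over [0, 1].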
Example 7.2
Consider
\[
\begin{cases}
y'(t) = t + y, \\
y(1) = 2.
\end{cases}
\]
(The exact solution is $y(t) = -t - 1 + 4e^{-1}e^{t}$.) Assuming a constant step size
$h_k = h$, Euler’s method has the form
\[
\begin{cases}
y_{k+1} = y_k + h f(t_k, y_k) = y_k + h(t_k + y_k), \\
y_0 = 2, \quad t_0 = 1.
\end{cases}
\]
Applying Euler’s method, we obtain

h = 0.1
k    t_k    y_k        y(t_k)
0    1      2          2
1    1.1    2.3        2.32068
2    1.2    2.64       2.68561
3    1.3    3.024      3.09944
4    1.4    3.4564     3.56730
5    1.5    3.94204    4.09489

h = 0.05
k    t_k    y_k        y(t_k)
0    1      2          2
1    1.05   2.1500     2.15508
2    1.1    2.3100     2.32068

The error for h = 0.1 at t = 1.1 is about 0.02 and the error for h = 0.05 at
t = 1.1 is about 0.01. If h is cut in half, the error is cut in half, suggesting that
the error is proportional to h. This seems to be consistent with the truncation
error in one step being $O(h^2)$, since the total number of steps is proportional
to 1/h. (However, this is not entirely obvious.)
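The halving of the error in Example 7.2 can be checked with a short computation. This is a minimal sketch; the names hs, err, and ratio are our own, and the loop simply applies (7.3) with each step size until t = 1.1.

```matlab
f = @(t, y) t + y;                       % right-hand side of Example 7.2
y_exact = @(t) -t - 1 + 4*exp(t - 1);    % exact solution

hs = [0.1, 0.05];                        % the two step sizes compared
err = zeros(size(hs));
for i = 1:numel(hs)
    h = hs(i);
    n = round(0.1/h);                    % steps needed to reach t = 1.1
    t = 1; y = 2;
    for k = 1:n
        y = y + h*f(t, y);               % Euler step (7.3)
        t = t + h;
    end
    err(i) = abs(y - y_exact(1.1));
    fprintf('h = %.2f: y = %.5f, error = %.5f\n', h, y, err(i));
end
ratio = err(1)/err(2);                   % close to 2 for a first-order method
fprintf('error ratio = %.3f\n', ratio);
```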
See our text [1] (or other sources) for a convergence analysis of Euler’s
method. In fact, the “global” error (that is, when Euler’s method is applied
for N = (b −a)/h steps to ﬁnd an approximation to y(b)) can be shown to be
O(h), as in Example 7.2. Furthermore, it can be shown that, for small h with
ﬂoating point arithmetic, the rounding error is proportional to 1/h. Thus, just
as in our analysis of total error in the forward diﬀerence approximation to the
derivative (see Section 6.1.2, starting on page 212), the minimum achievable
error in Euler’s method, regardless of how small we make the step sizes $h_k$, is
proportional to the square root of the accuracy to which we can evaluate f.
For this reason, as well as for reasons of efficiency (see footnote 1),
higher-order methods are often used.
DEFINITION 7.1 A method of the form
\[
y_{k+1} = y_k + h \Phi(t_k, y_k),
\]
that is, as in (7.2), in which $y_{k+1}$ depends only on $t_k$ and $y_k$ and not on
previous values such as $y_{k-1}$, is termed a single-step method. If
\[
\left| y(t_{k+1}) - y(t_k) - h \Phi(t_k, y(t_k)) \right| = O(h^{\omega + 1}),
\]
where $y(t_k)$ is the exact solution to the initial value problem at $t_k$, we say that
the method has order ω.
Footnote 1: that is, having the computations complete in a practical amount of time.
For example, Euler’s method has order 1.
Do not confuse the order of the method for ﬁnding approximate solutions
to the diﬀerential equation with the order of the diﬀerential equation itself.
Before we examine higher-order methods, we pause to look at how we handle
higher-order diﬀerential equations.
7.3 Higher-Order and Systems of Diﬀerential Equations
Traditionally (see footnote 2), IVPs for higher-order differential equations are
not considered separately from first-order equations. By a change of variables,
higher-order problems can be reduced to a system of the form of (7.1). For example,
consider the IVP for the m-th-order scalar differential equation:
\[
\begin{cases}
y^{(m)}(t) = g(t, y^{(m-1)}(t), y^{(m-2)}(t), \ldots, y''(t), y'(t), y(t)), & a \le t \le b, \\
y(a) = u_0, \; y'(a) = u_1, \; \ldots, \; y^{(m-1)}(a) = u_{m-1}.
\end{cases}
\tag{7.4}
\]
We can reduce this high-order IVP to a first-order system of the form (7.1)
by defining $x : [a, b] \to \mathbb{R}^m$ componentwise by
\[
x(t) = [x_1(t), x_2(t), \ldots, x_m(t)]^T = [y(t), y'(t), y^{(2)}(t), \ldots, y^{(m-1)}(t)]^T.
\]
Then,
\[
\begin{cases}
x_1'(t) = x_2(t), \\
x_2'(t) = x_3(t), \\
x_3'(t) = x_4(t), \\
\quad \vdots \\
x_{m-1}'(t) = x_m(t), \\
x_m'(t) = g(t, x_m, x_{m-1}, \ldots, x_2, x_1),
\end{cases}
\quad \text{and} \quad
x(a) = \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{m-1} \end{bmatrix}.
\tag{7.5}
\]
That is, in this case f(t, x) is defined by:
\[
\begin{cases}
f_i(t, x) = x_{i+1}, & 1 \le i \le m - 1, \\
f_m(t, x) = g(t, x_m, x_{m-1}, \ldots, x_2, x_1).
\end{cases}
\]
Footnote 2: Recently, there has been some discussion concerning efficient methods that do consider
higher-order problems separately.
Example 7.3
Consider
\[
\begin{cases}
y''(t) = y'(t) \cos(y(t)) + e^{-t}, \\
y(0) = 1, \\
y'(0) = 2.
\end{cases}
\]
Let $x_1 = y$ and $x_2 = y'$. Then,
\[
\begin{cases}
x_1'(t) = x_2(t), \\
x_2'(t) = x_2(t) \cos(x_1(t)) + e^{-t}, \\
x_1(0) = 1, \quad x_2(0) = 2,
\end{cases}
\]
which can be represented in vector form as
\[
\begin{cases}
\dfrac{dx}{dt} = f(t, x) = \begin{bmatrix} x_2(t) \\ x_2(t) \cos(x_1(t)) + e^{-t} \end{bmatrix}, \\[2ex]
x(0) = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.
\end{cases}
\]
Now, we may interpret y and f in (7.3) as vectors, and apply a couple of steps
of Euler’s method, with h = 0.1 and the help of matlab. We use the function
function [f] = ode_sys_example(t,x)
f = zeros(2,1);
f(1) = x(2);
f(2) = x(2) * cos(x(1)) + exp(-t);
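(Here, we initialize the array f as a column, since, otherwise, matlab forms a
row vector by default.) We compute approximations with the following
matlab dialog. Note that the dialog starts from t = 1, that is, it takes the
values 1 and 2 as initial values at t = 1 rather than at t = 0; the first component
of x after the third step is then the approximation to y(1.3), and the fourth step
advances it to t = 1.4.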
>> x = [1;2]
x =
1
2
>> t = 1
t = 1
>> h=0.1
h = 0.100000000000000
>> x = x + h*ode_sys_example(t,x)
x =
1.200000000000000
2.144848405290772
>> t = t + h
t = 1.100000000000000
>> x = x + h*ode_sys_example(t,x)
x =
1.414484840529077
2.255855758843984
>> t = t + h
t = 1.200000000000000
>> x = x + h*ode_sys_example(t,x)
x =
1.640070416413476
2.321093379172201
>> t = t + h
t = 1.300000000000000
>> x = x + h*ode_sys_example(t,x)
x =
1.872179754330696
2.332280252695239
>>
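The steps of the dialog in Example 7.3 can be collected into a loop. This is a minimal sketch: the anonymous function f stands in for ode_sys_example, and the starting values t = 1 and x = [1; 2] are copied from the dialog; the loop variable and the print format are our own.

```matlab
% Right-hand side of the first-order system of Example 7.3 (column vector).
f = @(t, x) [x(2); x(2)*cos(x(1)) + exp(-t)];

h = 0.1;                         % step size
t = 1;                           % starting time used in the dialog
x = [1; 2];                      % starting values used in the dialog

for k = 1:4
    x = x + h*f(t, x);           % vector form of the Euler step (7.3)
    t = t + h;
    fprintf('t = %.1f   x1 = %.15f   x2 = %.15f\n', t, x(1), x(2));
end
```

Returning the right-hand side as a column vector keeps x a column throughout, which is the role played by the zeros(2,1) initialization in ode_sys_example.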
Of course, many mathematical models begin as systems of diﬀerential equa-
tions, rather than as a single diﬀerential equation involving higher-order deriva-
tives. The same techniques apply for such systems.
Example 7.4
Partial differential equations can be written as systems of ordinary
differential equations by approximating all but one of the derivatives. When the
resulting system of ordinary differential equations is then solved, this is called
the method of lines. For example, processes of diffusion (say of chemicals or
fluids through media) may be modeled by the differential equation
\[
\frac{\partial u}{\partial t} = D \Delta u, \tag{7.6}
\]
where D is related to the medium in which diffusion is taking place, and where
\[
\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}
\]
is the Laplacian operator. If we are looking at diffusion in a single spatial
dimension, such as the distribution of temperature along a rod (or, say, vertical
diffusion of a fluid that is assumed to be uniform in the horizontal dimensions),
then Equation (7.6), also known as the heat equation, becomes
\[
\frac{\partial u}{\partial t} = D \frac{\partial^2 u}{\partial x^2}, \tag{7.7}
\]
where u = u(x, t). (D may in general depend on x and t, but we will assume
for simplicity here that it is constant.) Proceeding as in Example 3.18 on
page 93, we use
\[
\frac{\partial^2 u}{\partial x^2} \approx \frac{u(x + h, t) - 2u(x, t) + u(x - h, t)}{h^2}. \tag{7.8}
\]
For example, suppose we have a rod that is initially at temperature 0 at both
ends, and, starting at time t = 0, the end at x = 0 is immersed in ice, so
u(0, t) = 0 for t ≥ 0, and the end at x = 1 is heated at the rate u(1, t) = t.
Suppose, as in Example 3.18, we subdivide 0 ≤ x ≤ 1 into 4 subintervals,
having $u_1(t)$ correspond to u(1/4, t), $u_2(t)$ correspond to u(1/2, t), and $u_3(t)$
correspond to u(3/4, t). In this way, with D = 1 and h = 1/4, we replace the
boundary value problem consisting of (7.7) and u(0, t) = 0, u(1, t) = t,
u(x, 0) = 0 by the system of ordinary differential equations
\[
\begin{aligned}
u_1' &= 16u_2 - 32u_1, \\
u_2' &= 16u_3 - 32u_2 + 16u_1, \\
u_3' &= 16t - 32u_3 + 16u_2,
\end{aligned}
\quad \text{with} \quad
\begin{bmatrix} u_1(0) \\ u_2(0) \\ u_3(0) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.
\]
Such discretizations may be solved with software for initial value problems
for ordinary differential equations. However, with such finite difference
discretizations in space, and, indeed, with various other discretizations in space,
the resulting system of ordinary differential equations is usually stiff, in the
sense we describe in Section 7.9 on page 273 to follow. This is especially so if
h is small. Thus, generally, software for stiff systems should be used with the
method of lines.
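The three-equation system of Example 7.4 can be written in matrix form and advanced in time. The following is a minimal sketch for illustration only: it uses Euler's method (7.3) with a small time step dt = 0.001 that we chose ourselves, whereas the text recommends software for stiff systems in practice. The names A, g, dt, nsteps, and u are hypothetical.

```matlab
% u' = A*u + g(t) for the unknowns u_1, u_2, u_3 of Example 7.4.
A = 16*[-2  1  0;
         1 -2  1;
         0  1 -2];               % second differences with 1/h^2 = 16
g = @(t) [0; 0; 16*t];           % boundary value u(1,t) = t enters u_3'

dt = 0.001;                      % time step (assumed; small enough for Euler)
nsteps = 1000;                   % integrate from t = 0 to t = 1
u = [0; 0; 0];                   % initial condition u(x,0) = 0

for k = 1:nsteps
    t = (k - 1)*dt;
    u = u + dt*(A*u + g(t));     % one Euler step for the system
end

fprintf('approximations at t = 1: u(1/4) = %.5f, u(1/2) = %.5f, u(3/4) = %.5f\n', ...
        u(1), u(2), u(3));
```

Refining the spatial grid increases the size of the entries of A like 1/h^2, which forces an explicit method such as this one to take correspondingly smaller time steps.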
We now study various higher-order methods, that is, methods for which the
order ω as in Definition 7.1 is greater than 1.
7.4 Higher-Order Taylor Series Methods
If y(t), the solution of (7.1), is sufficiently smooth (see footnote 3), we see that
\[
y(t_{k+1}) = y(t_k) + h y'(t_k) + \frac{h^2}{2} y''(t_k) + \cdots + \frac{h^p}{p!} y^{(p)}(t_k) + O(h^{p+1}) \tag{7.9}
\]
where, using (7.1), these derivatives can be computed explicitly with the
multivariate chain rule of the usual calculus. Thus, (7.9) leads to the following
numerical scheme:
\[
\begin{cases}
y_0 = y(a), \\
y_{k+1} = y_k + h f(t_k, y_k) + \dfrac{h^2}{2} \dfrac{d}{dt} f(t_k, y_k) + \cdots + \dfrac{h^p}{p!} \dfrac{d^{p-1}}{dt^{p-1}} f(t_k, y_k)
\end{cases}
\tag{7.10}
\]
for k = 0, 1, 2, …, N − 1. This is called a Taylor series method. (Note that
Euler’s method is a Taylor series method of order p = 1.)
Footnote 3: that is, if the solution y(t) has enough continuous derivatives for Taylor’s theorem
(on page 3) to hold.
In the past, these methods were seldom used in practice since they required
evaluations of high-order derivatives. However, with efficient implementations
of automatic differentiation (see footnote 4), these methods are increasingly
solving important real-world problems. For example, very high-order Taylor
methods (of order 30 or higher) are used, with the aid of automatic
differentiation, in the “COSY Infinity” package, which is used world-wide to
model atomic particle accelerator beams. (See, for example, [8].)
By construction, the order of the method (7.10) is ω = p. In weighing the
practicality of this method, one should consider the structure of the problem
itself, along with the ease (or lack thereof) of computing the derivatives. For
example, with n = 1, we must compute
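\[
\frac{d}{dt} f(t, y) = \frac{\partial f}{\partial t} + f \frac{\partial f}{\partial y},
\]
\[
\frac{d^2}{dt^2} f(t, y) = \frac{\partial^2 f}{\partial t^2} + 2f \frac{\partial^2 f}{\partial t\, \partial y} + f^2 \frac{\partial^2 f}{\partial y^2} + \frac{\partial f}{\partial t} \frac{\partial f}{\partial y} + f \left( \frac{\partial f}{\partial y} \right)^2,
\]
etc.
If f is even mildly complicated, then it is impractical to compute these formulas
by hand (see footnote 5); also, observe that, for n > 1, the number of terms can
become large, although many may be zero; thus, an implementation of automatic
differentiation should take advantage of the structure in f.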
Example 7.5
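Consider
\[
\begin{cases}
y'(t) = f(t, y) = t + y, \\
y(1) = 2,
\end{cases}
\]
which has exact solution $y(t) = -t - 1 + 4e^{-1}e^{t}$. The Taylor series method of
order 2 for this example has
\[
f(t, y) = t + y
\]
and
\[
\frac{d}{dt} f(t, y) = \frac{\partial f}{\partial t} + f \frac{\partial f}{\partial y} = 1 + t + y.
\]
Therefore,
\[
\begin{aligned}
y_{k+1} &= y_k + h f(t_k, y_k) + \frac{h^2}{2} \frac{d}{dt} f(t_k, y_k) \\
&= y_k + h(t_k + y_k) + \frac{h^2}{2} (1 + t_k + y_k).
\end{aligned}
\]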
Letting h = 0.1, we obtain the following results:

k    t_k    y_k (Euler)    y_k (T.S. order 2)    y(t_k) (Exact)
0    1      2              2                     2
1    1.1    2.3            2.32                  2.3207
2    1.2    2.64           2.6841                2.6856
3    1.3    3.024          3.0969                3.0994
4    1.4    3.4564         3.5636                3.5673

Footnote 4: These implementations can be very sophisticated.
Footnote 5: but this does not rule out automatic differentiation.
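The order-2 Taylor series step of Example 7.5 can be sketched as follows. This is a minimal illustration; the names f, df, t, and y are our own, and df is the total derivative 1 + t + y worked out by hand in the example rather than by automatic differentiation.

```matlab
f  = @(t, y) t + y;              % right-hand side
df = @(t, y) 1 + t + y;          % d/dt f(t,y) = f_t + f*f_y for this f

h = 0.1;
n = 4;                           % number of steps, to t = 1.4
t = 1 + h*(0:n);
y = zeros(1, n+1);
y(1) = 2;                        % y_0 = y(1)

for k = 1:n
    % Taylor series method (7.10) with p = 2
    y(k+1) = y(k) + h*f(t(k), y(k)) + (h^2/2)*df(t(k), y(k));
end

y_exact = -t - 1 + 4*exp(t - 1);
disp([t; y; y_exact]);           % rows: t_k, order-2 values, exact values
```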
7.5 Runge–Kutta Methods
Runge–Kutta methods are a classic family of higher-order methods that do not
explicitly require derivatives. We now show how Runge–Kutta methods
are derived by deriving a simple one. For simplicity, we derive it for a scalar
differential equation, although Runge–Kutta methods can easily be applied
to systems.
If y(t) is the exact solution of (7.1), then
\[
y(t_{k+1}) - y(t_k) = \int_{t_k}^{t_{k+1}} f(t, y(t))\, dt, \quad 0 \le k \le N - 1. \tag{7.11}
\]
Approximating the integral on the right side by the midpoint rule, we obtain
\[
\int_{t_k}^{t_{k+1}} f(t, y(t))\, dt \approx h f\left( t_k + \frac{h}{2},\; y\left( t_k + \frac{h}{2} \right) \right). \tag{7.12}
\]
Now, by Taylor’s Theorem,
\[
y\left( t_k + \frac{h}{2} \right) \approx y(t_k) + \frac{h}{2} y'(t_k) = y(t_k) + \frac{h}{2} f(t_k, y(t_k)). \tag{7.13}
\]
By (7.11), (7.12), and (7.13), it is seen that y(t) approximately satisfies
\[
\begin{cases}
y(t_{k+1}) \approx y(t_k) + h f\left( t_k + \dfrac{h}{2},\; K_1 \right), & 0 \le k \le N - 1, \\[1ex]
\text{with } K_1 = y(t_k) + \dfrac{h}{2} f(t_k, y(t_k)),
\end{cases}
\tag{7.14}
\]
which suggests the following numerical method, known as the midpoint method,
for solution of (7.1). We seek $y_k$, 0 ≤ k ≤ N, such that
\[
\begin{cases}
y_0 = y(t_0), \\
y_{j+1} = y_j + h f\left( t_j + \dfrac{h}{2},\; K_{1,j} \right), & j = 0, 1, 2, \ldots, N - 1, \\[1ex]
K_{1,j} = y_j + \dfrac{h}{2} f(t_j, y_j).
\end{cases}
\tag{7.15}
\]
We can write (7.15) in the form:
\[
\begin{cases}
y_0 = y(t_0), \\
y_{j+1} = y_j + h \Phi(t_j, y_j, h),
\end{cases}
\tag{7.16}
\]
where
\[
\Phi(t_j, y_j, h) = f\left( t_j + \frac{h}{2},\; y_j + \frac{h}{2} f(t_j, y_j) \right).
\]
It can be shown that, when f(t, y) does not depend on y, a step of the
midpoint method reduces to the midpoint rule, that is, the degree-0
Gauss–Legendre quadrature formula:
\[
y(t_{k+1}) = y(t_k) + \int_{t_k}^{t_k + h} f(s)\, ds = y(t_k) + h f(t_k + h/2) + \frac{h^3}{24} f''(\xi).
\]
(See Table 6.1, Table 6.3, and Section 6.3.6.) Indeed, the midpoint method
has order ω = 2. We present a proof in our second course [1].
In general, Runge–Kutta methods have the form
\[
\begin{cases}
y_0 = y(a), \\
y_{k+1} = y_k + h \Phi(t_k, y_k, h),
\end{cases}
\tag{7.17}
\]
where
\[
\Phi(t, y, h) = \sum_{r=1}^{R} c_r K_r, \qquad
K_1 = f(t, y), \qquad
K_r = f\left( t + a_r h,\; y + h \sum_{s=1}^{r-1} b_{rs} K_s \right)
\]
and
\[
a_r = \sum_{s=1}^{r-1} b_{rs}, \quad r = 2, 3, \ldots, R.
\]
Such a method is called an R-stage Runge–Kutta method. Notice that Euler’s
method is a one-stage Runge–Kutta method and the midpoint method is a
two-stage Runge–Kutta method with $c_1 = 0$, $c_2 = 1$, $a_2 = \frac{1}{2}$, $b_{21} = \frac{1}{2}$, i.e.,
\[
y_{k+1} = y_k + h f\left( t_k + \frac{h}{2},\; y_k + \frac{h}{2} f(t_k, y_k) \right).
\]
The coefficients $a_r$, $b_{rs}$, and $c_r$ can be derived by matching terms in the Taylor
expansion. In general, for a particular number of stages and a particular order,
the coefficients $a_r$, $b_{rs}$, and $c_r$ are not unique, that is, there are in general
various R-stage methods of a given order. We discuss these issues in [1].
The most well-known Runge–Kutta scheme is 4-th order; it has the form:
\[
\begin{aligned}
y_0 &= y(t_0), \\
y_{k+1} &= y_k + \frac{h}{6} \left[ K_1 + 2K_2 + 2K_3 + K_4 \right], \\
K_1 &= f(t_k, y_k), \\
K_2 &= f\left( t_k + \frac{h}{2},\; y_k + \frac{h}{2} K_1 \right), \\
K_3 &= f\left( t_k + \frac{h}{2},\; y_k + \frac{h}{2} K_2 \right), \\
K_4 &= f(t_k + h,\; y_k + h K_3),
\end{aligned}
\tag{7.18}
\]
i.e.,
\[
\Phi(t_k, y_k, h) = \frac{1}{6} \left[ K_1 + 2K_2 + 2K_3 + K_4 \right].
\]
Notice that in single-step methods $y_{k+1} = y_k + h \Phi(t_k, y_k, h)$, the quantity
$h \Phi(t_k, y_k, h)$ is an approximation to the “rise” in y in going from $t_k$ to
$t_k + h$. In the fourth-order Runge–Kutta method, $\Phi(t_k, y_k, h)$ is a weighted
average of approximate “slopes” $K_1$, $K_2$, $K_3$, $K_4$ evaluated at $t_k$, $t_k + h/2$,
$t_k + h/2$ and $t_k + h$, respectively.
Example 7.6
Consider y′(t) = t + y, y(1) = 2, with h = 0.1. We obtain

k    t_k    Euler    Runge–Kutta order 2    Runge–Kutta    y(t_k) (exact)
                     (Modified Euler)       order 4
0    1      2        2                      2              2
1    1.1    2.30     2.32                   2.32068        2.32068
2    1.2    2.64     2.6841                 2.68561        2.68561
3    1.3    3.024    3.09693                3.09943        3.09944

Higher-order Runge–Kutta methods are sometimes used, such as in the
adaptive step control schemes we describe in Section 7.7.
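The midpoint method (7.15) and the fourth-order scheme (7.18) can be applied side by side to the problem of Example 7.6. This is a minimal sketch; the variable names y_mid, y_rk4, and K1j are our own, and three steps with h = 0.1 are taken to reach t = 1.3.

```matlab
f = @(t, y) t + y;               % right-hand side of Example 7.6
h = 0.1;
n = 3;                           % three steps, to t = 1.3
t = 1;
y_mid = 2;                       % midpoint method iterate
y_rk4 = 2;                       % fourth-order Runge-Kutta iterate

for k = 1:n
    % Midpoint method (7.15)
    K1j = y_mid + (h/2)*f(t, y_mid);
    y_mid = y_mid + h*f(t + h/2, K1j);

    % Fourth-order Runge-Kutta method (7.18)
    K1 = f(t, y_rk4);
    K2 = f(t + h/2, y_rk4 + (h/2)*K1);
    K3 = f(t + h/2, y_rk4 + (h/2)*K2);
    K4 = f(t + h, y_rk4 + h*K3);
    y_rk4 = y_rk4 + (h/6)*(K1 + 2*K2 + 2*K3 + K4);

    t = t + h;
    fprintf('t = %.1f   midpoint = %.5f   RK4 = %.5f\n', t, y_mid, y_rk4);
end

y_exact = -t - 1 + 4*exp(t - 1);
fprintf('exact y(%.1f) = %.5f\n', t, y_exact);
```

Each step costs two evaluations of f for the midpoint method and four for the fourth-order method.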
7.6 Stability
In a method for integrating an IVP, it is important to know how small
errors that have accumulated in the value $y_k \approx y(t_k)$ propagate to subsequent
approximations $y_\ell \approx y(t_\ell)$, ℓ > k.
DEFINITION 7.2 Assume we take a constant step size h = (b − a)/N
to compute $y_k$, 1 ≤ k ≤ N, with $y_N \approx y(b)$, in a single-step method for
integrating an initial value problem. Let $z_j$, j ≥ k, denote the values the
method produces when it is continued from a perturbed value $z_k$ in place of $y_k$.
We say the method is numerically stable if there is a constant c independent
of h such that
\[
|y_N - z_N| \le c\, |y_k - z_k| \quad \text{for all } k \le N. \tag{7.19}
\]
Under certain continuity and Lipschitz conditions (see [1]), Runge–Kutta
methods are stable in the sense that (7.19) is satisfied. This implies that an
error $|z_k - y_k|$ will not be magnified by more than a constant c at final time
$t_N$, i.e., “small errors” have “small effect.”
The above deﬁnition of stability is not satisfactory if the constant c is very
large. Consider, for example, Euler’s method applied to the scalar equation
y
′
= λy, λ = constant. Then Euler’s scheme gives y
j+1
= y
j
(1 +λh), 0 ≤ j ≤
N − 1. An error, say at t = t
k
, will cause us to compute z
j+1
= z
j
(1 + λh)
instead and hence [z
j+1
−y
j+1
[ = [1 +λh|z
j
−y
j
[, k ≤ j ≤ N −1. Thus, the
error will be magniﬁed if [1 + λh[ > 1, will remain the same if [1 + λh[ = 1,
and will be suppressed if [1 +λh[ < 1. Consider the problem
y
′
= −1000y + 1000t
2
+ 2t, 0 ≤ t ≤ 1, y(0) = 0,
whose exact solution is y(t) = t
2
, 0 ≤ t ≤ 1. We ﬁnd for Euler’s method
that [z
j+1
− y
j+1
[ = [1 − 1000h|z
j
− y
j
[, 0 ≤ j ≤ N − 1. The error will be
suppressed if [1 − 1000h[ < 1, i.e., 0 ≤ h ≤ 0.002. Consider the following
table:
h          N         $y_N$
1          1         0
0.1        10        $9 \times 10^{16}$
0.01       $10^2$    overflow
0.001      $10^3$    0.99999900
0.0001     $10^4$    0.99999990
0.00001    $10^5$    0.99999999
For $h > 0.002$, small errors are violently magnified. For example, for $h = 0.01$, the errors are magnified by $\left|1 - \frac{1000}{100}\right| = 9$ at each time step, even though there exists a $c$ as in (7.19).
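The behavior in the table can be explored with a few lines of matlab. The following is a minimal sketch, not the code used to produce the table: the anonymous function f and the loop variables are our own names, and the digits printed will depend on the machine arithmetic.

```matlab
% Euler's method on y' = -1000 y + 1000 t^2 + 2 t, y(0) = 0, for several h.
f = @(t,y) -1000*y + 1000*t^2 + 2*t;
for N = [10 1000 10000]
    h = 1/N;                      % constant step size on [0,1]
    y = 0;                        % initial value y(0) = 0
    for j = 0:N-1
        y = y + h*f(j*h, y);      % one Euler step starting from t_j = j*h
    end
    fprintf('h = %g, N = %d, y_N = %g, |1-1000h| = %g\n', ...
            h, N, y, abs(1 - 1000*h));
end
```

The last printed column is the per-step error magnification factor, which exceeds 1 only for the first value of N.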
This motivates a second concept of stability that will be important when
we discuss stiﬀ systems.
DEFINITION 7.3 A numerical method for solution of (7.1) is called absolutely stable if when applied to the scalar equation $y' = \lambda y$, $t \ge 0$, it yields values $\{y_j\}_{j \ge 0}$ with the property that $y_j \to 0$ as $j \to \infty$. The set of values $\lambda h$ for which a method is absolutely stable is called the set of absolute stability.
Example 7.7
(Absolute stability of Euler’s method and the midpoint method)
1. Euler’s Method applied to $y' = \lambda y$ yields $y_{j+1} = y_j(1 + \lambda h)$, whence
$$y_{j+1} = y_0(1 + \lambda h)^{j+1}.$$
Clearly, assuming that $\lambda$ is real, $y_j \to 0$ as $j \to \infty$ if and only if $|1 + \lambda h| < 1$ or $-2 < \lambda h < 0$. Hence, the interval of absolute stability of Euler’s method is $(-2, 0)$.
Generally, however, when we analyze stability for systems of differential equations, we need to consider the possibility of complex $\lambda h$. (We will see why in Section 7.9.2.) In such a context, we seek a region in the complex plane for which $|1 + \lambda h| < 1$. If $\lambda h = x + yi$ where $i$ is the imaginary unit, we have
$$|1 + \lambda h|^2 = |1 + x + yi|^2 = (1 + x)^2 + y^2 < 1.$$
This describes a circle of radius 1 centered at $-1 + 0i$, as in this figure:
[Figure: the circle $(1 + x)^2 + y^2 = 1$ in the complex plane, with horizontal axis Re(z) from −2 to 0 and vertical axis Im(z) from −1 to 1.] The interior of this circle is the region of stability of Euler’s Method.
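2. The midpoint method applied to $y' = \lambda y$ yields
$$y_{j+1} = y_j + h\lambda\left(y_j + \frac{h}{2}\lambda y_j\right) = y_j\left(1 + \lambda h + \frac{\lambda^2 h^2}{2}\right).$$
Hence, $y_j \to 0$ as $j \to \infty$ if $|1 + \lambda h + \lambda^2 h^2/2| < 1$, which for $\lambda$ real leads to an interval of absolute stability $(-2, 0)$.
When we consider $\lambda$ to be complex, with $\lambda h = x + yi$, we obtain
$$\left|1 + \lambda h + \frac{\lambda^2 h^2}{2}\right|^2 = \left(1 + x + \frac{x^2 - y^2}{2}\right)^2 + (y + xy)^2 < 1.$$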
Using a computer algebra system to ﬁnd the boundary curves of this
region of stability, then plotting them with matlab, we ﬁnd the region
of stability to be the interior of the following oval.
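[Figure: the oval bounding the region of absolute stability of the midpoint method in the complex plane; horizontal axis from −3 to 1, vertical axis from −1.5 to 1.5.]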
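The two stability regions of Example 7.7 can also be drawn numerically, without first finding the boundary curves symbolically. The following matlab lines are a minimal sketch under our own choices of grid, plotting window, and variable names (X, Y, Z, R_euler, R_mid); they are not the commands used for the figures in the text.

```matlab
% Boundaries of the regions of absolute stability in the (lambda h)-plane.
[X,Y] = meshgrid(linspace(-3,1,401), linspace(-2,2,401));
Z = X + 1i*Y;                         % Z plays the role of lambda*h
R_euler = abs(1 + Z);                 % growth factor for Euler's method
R_mid   = abs(1 + Z + Z.^2/2);        % growth factor for the midpoint method
contour(X, Y, R_euler, [1 1], 'b');   % level curve |1 + z| = 1
hold on
contour(X, Y, R_mid, [1 1], 'r');     % level curve |1 + z + z^2/2| = 1
hold off
axis equal
xlabel('Re(z)'); ylabel('Im(z)');
legend('Euler', 'midpoint');
```

Each method is absolutely stable for those values of λh inside its curve, where the growth factor is less than 1.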
In general, explicit methods (such as the Taylor series methods or Runge–Kutta methods we have considered so far) require very small $h$ for accurate approximations of problems with large $|\lambda|$. (Notice that the linear case with $\lambda$ models the nonlinear case $y' = f(t, y)$ with Lipschitz constant $L \approx |\lambda|$.) These methods should not be used for such problems or their system analogs (stiff systems). We will consider later methods suitable for such problems.
7.7 Adaptive Step Controls
It often occurs that the curvature of solutions $y(t)$ to initial value problems varies considerably as $t$ changes. To achieve a given accuracy, software for integrating initial value problems can take large steps $h_k$ in regions of small curvature, but must take smaller steps in regions of large curvature. Not only does the user of such software usually not know beforehand which step size $h$ will give the required accuracy, but choosing a fixed step size $h$ that is small enough for the intervals where the curvature (that is, $y''$) is large will result in many more steps than necessary over intervals in which the curvature is small, and hence lead to inefficiency.
Analogously to composite formulas for numerical integration, we have the global error $y_N - y(t_N)$, where $t_0 = a$ and $t_N = b$, and the local error $y_1 - y(t_1)$ incurred by taking a single step of the method. (The analogous concepts for numerical integration are the error over the entire interval and the error for a single application of the quadrature formula.) As in adaptive quadrature routines, we focus on computing estimates for the error. The most common technique for deriving such estimates is to assume we cannot evaluate an exact range for the error term, but that we know the order $\omega$ of the method as in Definition 7.1 (on page 255). To take a step from $y_k$ to $y_{k+1}$, we assume that $y_k$ is exact (see footnote 6), and we have two methods, one of order $\omega$ and one of order $\omega + 1$, giving $y_{k+1,1,h}$ and $y_{k+1,2,h}$ with
$$y_{k+1,1,h} - y(t_{k+1}) \approx C_1 h^{\omega+1}, \qquad y_{k+1,2,h} - y(t_{k+1}) \approx C_2 h^{\omega+2},$$
whence
$$y_{k+1,1,h} - y_{k+1,2,h} \approx C_1 h^{\omega+1} - C_2 h^{\omega+2} \approx C_1 h^{\omega+1} \approx y_{k+1,1,h} - y(t_{k+1}). \tag{7.20}$$
An error bound $\epsilon$ is specified, $h$ is decreased if the error estimated through (7.20) is too large, and $h$ is increased if the error estimated through (7.20) is sufficiently small.
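To make the mechanism concrete, here is a minimal matlab sketch of step control based on (7.20), using Euler’s method (order 1) and the modified Euler method (order 2) as the pair. The halving and doubling rules, the tolerance tol, the choice to continue from the higher-order value, and all variable names are our own assumptions for illustration; production codes use more refined rules.

```matlab
% Adaptive steps for y' = t + y, y(1) = 2, integrated to t = 1.3.
f = @(t,y) t + y;
t = 1;  y = 2;  tf = 1.3;
h = 0.1;  tol = 1e-4;             % tol plays the role of the error bound
nsteps = 0;
while tf - t > 1e-12
    h = min(h, tf - t);           % do not step past tf
    k1 = f(t, y);
    k2 = f(t + h, y + h*k1);      % both methods share the value k1
    y_low  = y + h*k1;            % Euler step (order 1)
    y_high = y + h*(k1 + k2)/2;   % modified Euler step (order 2)
    est = abs(y_low - y_high);    % estimate of the error in y_low
    if est <= tol
        t = t + h;                % accept the step
        y = y_high;               % continue from the more accurate value
        nsteps = nsteps + 1;
        if est < tol/4
            h = 2*h;              % error comfortably small: try a larger step
        end
    else
        h = h/2;                  % error too large: retry with a smaller step
    end
end
fprintf('y(%g) is approximately %.6f after %d accepted steps\n', t, y, nsteps);
```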
Several schemes have been developed in which combining the same values of $f(t, y)$ in two different ways results in Runge–Kutta methods of orders $\omega$ and $\omega + 1$. A classic example of this is the routine RKF45 in [15, page 129 ff]. The method used there, combining six evaluations of $f$ per step to obtain both a fourth-order and a fifth-order Runge–Kutta method, is called the Runge–Kutta–Fehlberg method. We derive it in detail, as well as present an algorithm, in [1]. A closely related embedded pair of orders four and five, due to Dormand and Prince, is available in matlab with the function ode45.
Example 7.8
A classic illustrative problem in population dynamics is the predator-prey equations. Suppose we have a prey animal (say, rabbits) and a predator (say, foxes). The population $x_1(t)$ of rabbits depends on the number of rabbits present, and also on the number $x_2(t)$ of foxes. Similarly, the number of foxes present depends on how much food they have (that is, on the number of rabbits present) and on the number of foxes present. The classic model is
$$x_1'(t) = \alpha x_1 - \beta x_1 x_2,$$
$$x_2'(t) = -\gamma x_2 + \delta x_1 x_2.$$
To use ode45, we program the following.
function [f] = predator_prey(t,y)
% Right-hand side of the predator-prey system: y(1) = rabbits, y(2) = foxes.
global alpha
global beta
global gamma
global delta
f = zeros(2,1);
f(1) = alpha * y(1) - beta*y(1)*y(2);
f(2) = -gamma * y(2) + delta*y(1)*y(2);
Footnote 6 (to Section 7.7): That is, for the purposes of analysis, we assume $y_k = y(t_k)$.
Suppose we want to see what happens out to time t = 0.5, with 1000 rabbits and 100 foxes initially, and with α = 2, β = 0.1, γ = 2, and δ = 0.1. We then run ode45 with the following matlab dialog.
>> global alpha
>> global beta
>> global gamma
>> global delta
>> alpha = 2;
>> beta = 0.1;
>> gamma = 2;
>> delta = 0.1;
>> [T,Y] = ode45('predator_prey',[0,0.5], [1000,100]);
>> plot(T,Y(:,1),T,Y(:,2))
>> hold
Current plot held
>> plot(T,zeros(size(T,1),1),'LineStyle','none','Marker','+',...
'MarkerEdgeColor','red','Markersize',4)
>> size(T)
ans = 129 1
This results in the following ﬁgure.
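[Figure: the two populations plotted against t for 0 ≤ t ≤ 0.5 (vertical axis from −200 to 1200), with + markers on the horizontal line at level 0 showing the computed time points.]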
This ﬁgure indicates that, with these values of the parameters α, β, γ, and δ
and the chosen initial number of rabbits and foxes, the foxes rapidly increase
and the rabbits die out before time t = 0.1, then the foxes slowly die out. It
shows that a total of 129 steps were taken, and we see on the horizontal line
at level 0 that the steps become larger where the solutions are not varying as
much.
If a Taylor series method is used, an alternative error control is an interval evaluation of the error term. That is, if we are taking a step from $t_k$ to $t_{k+1} = t_k + h_k$, the actual error in the Taylor series method of order $\omega$ (that is, expanding $y' = f$ in a Taylor polynomial of degree $\omega - 1$) is of the form
$$\frac{h_k^{\omega+1}}{(\omega + 1)!}\,\frac{d^{\omega} f(\xi, y(\xi))}{dt^{\omega}} \tag{7.21}$$
for some $\xi \in [t_k, t_k + h_k]$. The actual derivative in (7.21) is a linear combination of products of partial derivatives of $f$ of various orders with respect to $t$ and the components of the vector $y$, but values of it can be obtained effectively with automatic differentiation. If the automatic differentiation uses interval arithmetic with the interval $\boldsymbol{t} = [t_k, t_k + h_k]$ and an interval bound $\boldsymbol{y}$ (obtained and verified in various ways), we obtain mathematically rigorous bounds on the error. This has proven effective in simulations of particle beams in atomic accelerators and other applications [8], but general software based on it is not yet publicly available.
7.8 Multistep, Implicit, and Predictor-Corrector Methods
In multistep methods, values $y_\ell$ with $\ell < k$, in addition to $y_k$, are used to obtain $y_{k+1}$. A common class of multistep methods is the class of Adams–Bashforth methods, in which $f$ is approximated by an interpolating polynomial on $y_{k-n}, \ldots, y_k$, then the interpolating polynomial is integrated to obtain $y_{k+1}$. For instance, to obtain a so-called “3-step method,” in which 3 previous values of the solution are used, we pass an interpolating polynomial through $f_k = f(t_k, y_k)$, $f_{k-1} = f(t_{k-1}, y_{k-1})$, and $f_{k-2} = f(t_{k-2}, y_{k-2})$. The corresponding Lagrange form representation (see (4.4) on page 149) is
$$p_2(t) = \ell_k(t) f_k + \ell_{k-1}(t) f_{k-1} + \ell_{k-2}(t) f_{k-2},$$
where
$$\ell_k(t) = \frac{(t - (t_k - h_k))(t - (t_k - h_k - h_{k-1}))}{h_k(h_k + h_{k-1})},$$
$$\ell_{k-1}(t) = -\frac{(t - t_k)(t - (t_k - h_k - h_{k-1}))}{h_k h_{k-1}}, \quad \text{and}$$
$$\ell_{k-2}(t) = \frac{(t - t_k)(t - (t_k - h_k))}{h_{k-1}(h_k + h_{k-1})}.$$
The next approximation $y_{k+1} \approx y(t_{k+1})$ is then defined by
$$y_{k+1} = y_k + \int_{t_k}^{t_k + h_{k+1}} p_2(t)\,dt = y_k + f_k \int_{t_k}^{t_k + h_{k+1}} \ell_k(t)\,dt + f_{k-1} \int_{t_k}^{t_k + h_{k+1}} \ell_{k-1}(t)\,dt + f_{k-2} \int_{t_k}^{t_k + h_{k+1}} \ell_{k-2}(t)\,dt.$$
Under the simplifying assumption (see footnote 7) that $h_{k+1} = h_k = h_{k-1} = h$, we have
$$\int_{t_k}^{t_k + h} \ell_k(t)\,dt = \frac{23}{12}h, \qquad \int_{t_k}^{t_k + h} \ell_{k-1}(t)\,dt = -\frac{4}{3}h, \quad \text{and} \quad \int_{t_k}^{t_k + h} \ell_{k-2}(t)\,dt = \frac{5}{12}h,$$
so
$$y_{k+1} = y_k + h\left(\frac{23}{12} f_k - \frac{4}{3} f_{k-1} + \frac{5}{12} f_{k-2}\right). \tag{7.22}$$
This is known as the Adams–Bashforth 3-step method. This method has order $\omega = 3$ (and in general, the $s$-step Adams–Bashforth method, involving $s$ previously computed values of $f$, has order $\omega = s$).
Adams–Bashforth methods cannot compute $y_1$ through $y_{s-1}$ on their own, since they do not have the required previously computed values of $f$ for these initial points. Generally, a separate order $s$ or higher method, such as an order $s$ Runge–Kutta method, is used to start the process.
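As an illustration, the following matlab sketch applies the Adams–Bashforth 3-step method with constant step to the problem $y' = t + y$, $y(1) = 2$ of Example 7.6, using the classical fourth-order Runge–Kutta method for the two starting values. The variable names are ours, and the closed form $4e^{t-1} - t - 1$ used for comparison is our own solution of this linear equation rather than something taken from the text.

```matlab
% Adams-Bashforth 3-step method for y' = t + y, y(1) = 2, h = 0.1.
f = @(t,y) t + y;
h = 0.1;  N = 3;
t = 1 + h*(0:N);                  % t(1) = t_0, ..., t(N+1) = t_N
y = zeros(1, N+1);
y(1) = 2;
for k = 1:2                       % starting values from classical RK4
    k1 = f(t(k),       y(k));
    k2 = f(t(k) + h/2, y(k) + h*k1/2);
    k3 = f(t(k) + h/2, y(k) + h*k2/2);
    k4 = f(t(k) + h,   y(k) + h*k3);
    y(k+1) = y(k) + h*(k1 + 2*k2 + 2*k3 + k4)/6;
end
for k = 3:N                       % Adams-Bashforth steps
    y(k+1) = y(k) + h*( 23/12*f(t(k),   y(k))   ...
                       - 4/3 *f(t(k-1), y(k-1)) ...
                       + 5/12*f(t(k-2), y(k-2)) );
end
exact = 4*exp(t - 1) - t - 1;     % closed-form solution, for comparison
disp([t' y' exact']);
```

In a longer integration one would store the three most recent values of f rather than re-evaluating them, so that each step costs only one new evaluation.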
Under certain conditions, namely, when the system is stiff, we may want to use as-yet-unknown information to perform the step from $t_k$ to $t_k + h_{k+1}$. For example, we may pass an interpolating polynomial of degree 2 through $f_{k+1}$ (as yet unknown), $f_k$, and $f_{k-1}$, to obtain
$$q_2(t) = \tilde{\ell}_{k+1}(t) f_{k+1} + \tilde{\ell}_k(t) f_k + \tilde{\ell}_{k-1}(t) f_{k-1},$$
where
$$\tilde{\ell}_{k+1}(t) = \frac{(t - t_k)(t - (t_k - h_k))}{h_{k+1}(h_k + h_{k+1})},$$
$$\tilde{\ell}_k(t) = -\frac{(t - (t_k + h_{k+1}))(t - (t_k - h_k))}{h_k h_{k+1}}, \quad \text{and}$$
$$\tilde{\ell}_{k-1}(t) = \frac{(t - (t_k + h_{k+1}))(t - t_k)}{h_k(h_k + h_{k+1})},$$
and where, as before,
$$y_{k+1} = y_k + \int_{t_k}^{t_k + h_{k+1}} q_2(t)\,dt.$$
We integrate the $\tilde{\ell}_i$ as before, to obtain the coefficients of the formula. Under the simplifying assumption that $h_{k+1} = h_k = h$, we have
$$y_{k+1} = y_k + h\left(\frac{5}{12} f_{k+1} + \frac{2}{3} f_k - \frac{1}{12} f_{k-1}\right). \tag{7.23}$$
This is called the Adams–Moulton implicit method of order 3. (The Adams–Moulton implicit method of order $\omega$ uses $f_{k+1}, f_k, \ldots, f_{k-\omega+2}$.) For vector $y$ and $f$, computing $y_{k+1}$ in an implicit method involves solving a system of in general nonlinear equations in the components of $y_{k+1}$.
Footnote 7: The equal-step assumption is good here for illustration, but is not made in practical software.
Example 7.9
Let $y'(t) = t + y$, $y(1) = 2$, with $h = 0.1$, as in Example 7.6 (on page 263). The order 3 Adams–Moulton method for this example reduces to
$$y_{k+1} = y_k + (0.1)\left(\frac{5}{12}(t_{k+1} + y_{k+1}) + \frac{2}{3}(t_k + y_k) - \frac{1}{12}(t_{k-1} + y_{k-1})\right).$$
For the purposes of illustration, we may solve this equation symbolically for $y_{k+1}$ (although, in general, numerical methods, such as the multivariate Newton method we describe in Chapter ??, are used to solve the nonlinear system for the components of $y_{k+1}$). We obtain
$$y_{k+1} = \frac{1}{115}\left(8t_k - t_{k-1} + 5t_{k+1} + 128y_k - y_{k-1}\right).$$
We have $t_0 = 1$, $y_0 = 2$. If we use the fourth-order Runge–Kutta method as in Example 7.6 to get starting values, we obtain $t_1 = 1.1$, $t_2 = 1.2$, and $y_1 \approx 2.32068$. Applying the order-3 Adams–Moulton method then gives
$$y_2 \approx 2.685626,$$
which compares favorably with the Runge–Kutta method in Example 7.6.
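The arithmetic of this example is easy to check in matlab. The sketch below evaluates the symbolic solution and, as a stand-in for the general numerical approach, also solves the implicit equation by simple fixed-point iteration. Fixed-point iteration is our own choice here (it converges for this problem because the implicit term is multiplied by the small factor 0.1 · 5/12); it is not the Newton method mentioned in the text, and the names g and y are ours.

```matlab
% One order-3 Adams-Moulton step for y' = t + y with h = 0.1.
t0 = 1;    y0 = 2;
t1 = 1.1;  y1 = 2.32068;           % starting value quoted in the example
t2 = 1.2;

% Closed form obtained by solving the implicit equation for y_2.
y2_closed = (8*t1 - t0 + 5*t2 + 128*y1 - y0)/115;

% The same step, solved by fixed-point iteration on the implicit formula.
g = @(y) y1 + 0.1*( 5/12*(t2 + y) + 2/3*(t1 + y1) - 1/12*(t0 + y0) );
y = y1;                            % initial guess for y_2
for it = 1:20
    y = g(y);
end
fprintf('closed form: %.6f   fixed point: %.6f\n', y2_closed, y);
```

Both printed values should agree with the value of $y_2$ given in the example.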
Implicit methods are appropriate for stiﬀ systems, which we discuss in
the next section. Adams–Bashforth methods are used when a high-order
method is needed, but evaluations of f are expensive. (In a high-order Adams–
Bashforth method, only one additional evaluation of f is required per step,
since previous values are recycled. In contrast, in a Taylor series method,
values and many derivatives are required, and, in a Runge–Kutta method,
many function values are required per step.)
Another way that implicit and explicit methods are used is in predictor-corrector methods. In such a method, an explicit formula is used to compute an approximation $\hat{y}_{k+1}$ to $y(t_{k+1})$. The approximation $\hat{y}_{k+1}$ is then used in the right side of an implicit formula (generally of higher order than the explicit formula) to obtain a better approximation $y_{k+1}$ to $y(t_{k+1})$.
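A single predictor-corrector step might be sketched in matlab as follows, with the Adams–Bashforth 3-step coefficients as predictor and the order-3 Adams–Moulton coefficients as corrector. Pairing two formulas of the same order, taking the starting values from the closed-form solution $4e^{t-1} - t - 1$ of the test problem (our own derivation), and correcting only once are all simplifying assumptions made for illustration.

```matlab
% One predictor-corrector step for y' = t + y, y(1) = 2, h = 0.1.
f   = @(t,y) t + y;
yex = @(t) 4*exp(t - 1) - t - 1;   % closed form, used only for starting values
h = 0.1;
t = 1 + h*(0:3);                   % t_0, t_1, t_2, t_3
y = yex(t(1:3));                   % y_0, y_1, y_2
k = 3;                             % index of the most recent value, y_2
fk  = f(t(k),   y(k));
fk1 = f(t(k-1), y(k-1));
fk2 = f(t(k-2), y(k-2));
yhat = y(k) + h*(23/12*fk - 4/3*fk1 + 5/12*fk2);               % predictor
ynew = y(k) + h*(5/12*f(t(k+1), yhat) + 2/3*fk - 1/12*fk1);    % corrector
fprintf('predicted %.6f, corrected %.6f, exact %.6f\n', ...
        yhat, ynew, yex(t(k+1)));
```

The corrector uses the predicted value only inside f, so no equation has to be solved.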
The matlab function ode113 implements predictor-corrector Adams–Bashforth
and Adams–Moulton methods of various orders. In particular, not only is the
step size adjusted, but the software also uses heuristics to change the order.
Example 7.10
We will use ode113 to solve the predator-prey system of Example 7.8 (on
page 267). We have
>> [T,Y] = ode113('predator_prey',[0,.5], [1000,100]);
>> plot(T,Y(:,1),T,Y(:,2))
>> hold
Current plot held
>> plot(T,zeros(size(T,1),1),'LineStyle','none','Marker','+',...
'MarkerEdgeColor','red','Markersize',4)
>> size(T)
ans = 70 1
with ﬁgure:
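[Figure: the two populations computed with ode113, plotted against t for 0 ≤ t ≤ 0.5 (vertical axis from −200 to 1200), with + markers on the horizontal line at level 0 showing the computed time points.]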
We see that only 70 steps are taken, instead of 129, and the steps are farther
apart on the smooth part of the graph. For this illustrative example, the
diﬀerence in performance is not signiﬁcant on a modern laptop computer, but
the diﬀerence can be signiﬁcant for certain larger problems.
We give a theoretical analysis of explicit and implicit multistep methods in
general, as well as of predictor-corrector methods, in [1].
7.9 Stiﬀ Systems
Stiﬀ systems are common both in primary applications and in approximat-
ing partial diﬀerential equations by systems of ordinary diﬀerential equations.
We begin our study of stiﬀ systems with an explanation of a simpliﬁed context.
7.9.1 Stiﬀ Systems and Linear Systems
To understand the basic ideas about stiff systems, we think of approximating the general initial value problem
$$y'(t) = f(t, y(t)), \quad t \ge 0, \quad y(0) = y_0,$$
by the linear problem
$$y'(t) = Ay(t), \quad t \ge 0, \quad y(0) = y_0. \tag{7.24}$$
In particular, the system (7.24) is a model for nonlinear systems $y' = f(t, y)$. The matrix $A$ is a model for the Jacobi matrix $\partial f/\partial y$, i.e., expanding in a Taylor series about fixed $\tilde{y}$,
$$f(t, y) \approx f(t, \tilde{y}) + \frac{\partial f}{\partial y}(t, \tilde{y})(y - \tilde{y}).$$
To further simplify our study, we will assume that the matrix $A$ has simple eigenvalues, that is, that $A$ has $n$ distinct eigenvalues $\lambda_i$, $1 \le i \le n$, and thus has $n$ corresponding linearly independent eigenvectors $v_i$, $1 \le i \le n$. (Further analysis can show that the systems behave similarly without these simplifications, but the basic ideas are clear in the simpler context.) In our simplified context, if we form the $n$ by $n$ matrix $V$ whose $i$-th column is $v_i$, we have
$$AV = V\Lambda, \quad \text{or} \quad V^{-1}AV = \Lambda,$$
where $\Lambda$ is the diagonal matrix such that its $i$-th diagonal entry is the $i$-th eigenvalue $\lambda_i$. (See Example 5.1 on page 192.) If we make the change of dependent variables $z = V^{-1}y$, or $y = Vz$, and we observe $(Vz)' = Vz'$, we have
$$Vz' = A(Vz), \quad \text{or} \quad z' = (V^{-1}AV)z = \Lambda z.$$
Interpreted component-by-component, this last system is simply
$$z_i' = \lambda_i z_i, \quad 1 \le i \le n,$$
which has solution
$$z_i = c_i e^{\lambda_i t}.$$
The vector equation $y = Vz$ can thus be written as
$$y(t) = \sum_{i=1}^{n} c_i e^{\lambda_i t} v_i. \tag{7.25}$$
The $c_i$ can then be found by solving the linear system
$$\sum_{i=1}^{n} c_i v_i = Vc = y_0$$
for the vector $c = (c_1, \ldots, c_n)^T$.
Example 7.11
As an illustrative example, take the equation
$$u'' + u' + u = 0, \quad u(0) = 1, \quad u'(0) = 2.$$
(This equation is a simplified model of a damped mechanical system, such as automobile springs with shock absorbers.) Converting to a system with $y_1(t) = u(t)$, $y_2(t) = y_1'(t)$, we obtain the system of equations
$$y' = \begin{pmatrix} y_1' \\ y_2' \end{pmatrix} = \begin{pmatrix} y_2 \\ -y_1 - y_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -1 & -1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}.$$
Using matlab, we obtain
>> A = [0 1;-1 -1]
A =
0 1
-1 -1
>> [V,Lambda] = eig(A)
V =
0.7071 0.7071
-0.3536 + 0.6124i -0.3536 - 0.6124i
Lambda =
-0.5000 + 0.8660i 0
0 -0.5000 - 0.8660i
>> y0 = [1;2]
y0 =
1
2
>> c = V\y0
c =
0.7071 - 2.0412i
0.7071 + 2.0412i
>>
In fact, it can be verified that the exact eigenvalues of $A$ are the roots of the characteristic equation
$$\lambda^2 + \lambda + 1 = 0$$
for the original second-order linear differential equation, namely, in the order returned by matlab,
$$\lambda_1 = -\frac{1}{2} + \frac{\sqrt{3}}{2}i \quad \text{and} \quad \lambda_2 = -\frac{1}{2} - \frac{\sqrt{3}}{2}i.$$
Thus,
$$z_1(t) = c_1 e^{(-1/2 + \sqrt{3}/2\,i)t} \quad \text{and} \quad z_2(t) = c_2 e^{(-1/2 - \sqrt{3}/2\,i)t},$$
and the solution to the initial value problem is
$$y(t) \approx (0.7071 - 2.0412i)\,e^{(-1/2 + \sqrt{3}/2\,i)t}\,v_1 + (0.7071 + 2.0412i)\,e^{(-1/2 - \sqrt{3}/2\,i)t}\,v_2$$
$$\approx (0.7071 - 2.0412i)\,e^{(-1/2 + \sqrt{3}/2\,i)t} \begin{pmatrix} 0.7071 \\ -0.3536 + 0.6124i \end{pmatrix} + (0.7071 + 2.0412i)\,e^{(-1/2 - \sqrt{3}/2\,i)t} \begin{pmatrix} 0.7071 \\ -0.3536 - 0.6124i \end{pmatrix}.$$
Simplifying using $e^{a+bi} = e^a e^{bi}$ and Euler’s formula
$$e^{bi} = \cos(b) + i\sin(b),$$
we obtain, for the first component $y_1(t) = u(t)$,
$$y_1(t) \approx e^{-t/2}\left(\cos\left(\frac{\sqrt{3}}{2}t\right) + \frac{5\sqrt{3}}{3}\sin\left(\frac{\sqrt{3}}{2}t\right)\right).$$
(In fact, by solving the original equation symbolically as a linear second order equation, we obtain exactly this solution.)
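The agreement between the eigenvector expansion (7.25) and the closed-form solution can be checked numerically. The following matlab lines are a minimal sketch; the sample time t = 0.7 is an arbitrary choice of ours, and the call to expm is included only as an independent check.

```matlab
% Compare the eigen-expansion of y' = A*y with the closed form for u(t).
A  = [0 1; -1 -1];
y0 = [1; 2];
[V, Lambda] = eig(A);
c = V\y0;                                   % coefficients c_i in (7.25)
t = 0.7;                                    % arbitrary sample time
y = real(V*(c.*exp(diag(Lambda)*t)));       % sum of c_i*exp(lambda_i*t)*v_i
w = sqrt(3)/2;
u = exp(-t/2)*(cos(w*t) + 5*sqrt(3)/3*sin(w*t));   % closed form for y_1 = u
y_check = expm(A*t)*y0;                     % matrix exponential, as a check
fprintf('%.6f  %.6f  %.6f\n', y(1), u, y_check(1));
```

The call to real discards the imaginary part left by rounding, since the two complex conjugate terms combine to a real vector.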
The term stiﬀ system originated in the study of mechanical systems with
springs. A spring is “stiﬀ” if its damping constant is large; in such a me-
chanical system, motions of the spring will damp out fast relative to the time
scale on which we are studying the system. In the numerical solution of initial
value problems, “stiﬀness” has come to mean that the solution to the ODE
has some components that vary or die out rapidly in relation to the other
components, or in relation to the time interval over which the integration
proceeds. For example, the scalar equation $y' = -1000y$ might be considered to be moderately stiff when it is integrated for $0 \le t \le 1$, but not stiff if the interval of integration is $0 \le t \le 0.001$.
Example 7.12
Let’s consider the system
$$y' = Ay, \quad t \ge 0, \quad y(0) = (1, 0, -1)^T \tag{7.26}$$
where
$$A = \begin{pmatrix} -21 & 19 & -20 \\ 19 & -21 & 20 \\ 40 & -40 & -40 \end{pmatrix}.$$
The eigenvalues of $A$ are $\lambda_1 = -2$, $\lambda_2 = -40 + 40i$ and $\lambda_3 = -40 - 40i$ and the exact solution of (7.26) is
$$\begin{cases} y_1(t) = \frac{1}{2}e^{-2t} + \frac{1}{2}e^{-40t}(\cos 40t + \sin 40t), \\ y_2(t) = \frac{1}{2}e^{-2t} - \frac{1}{2}e^{-40t}(\cos 40t + \sin 40t), \\ y_3(t) = -e^{-40t}(\cos 40t - \sin 40t). \end{cases} \tag{7.27}$$
This system is stiff over the time interval in which we expect $e^{-2t}$ to die out, since the component $e^{-40t}$ dies out much faster. The graphs of the solution to this initial value problem are in Figure 7.1 (obtained using matlab’s stiff ODE routine ode15s).
FIGURE 7.1: Actual solutions to the stiff ODE system of Example 7.12. [Curves $y_1(t)$, $y_2(t)$, and $y_3(t)$ for 0 ≤ t ≤ 0.3, vertical axis from −1 to 1.]
Notice that for $0 \le t \le 0.1$, $y_i(t)$, $1 \le i \le 3$, vary rapidly but for $t \ge 0.1$, the $y_i$ vary slowly. Hence, a small time step must be used in the interval $[0, 0.1]$ for adequate resolution, whereas for $t \ge 0.1$ large time steps should suffice. Suppose, however, we use Euler’s method starting at $t = 0.2$ with initial conditions taken as the exact values $y_i(0.2)$, $1 \le i \le 3$. We obtain:
For h = 0.04
j     $t_j$    $y_{1j}$    $y_{2j}$    $y_{3j}$
0     0.2      0.335       0.335       -0.00028
5     0.4      0.218       0.223       0.0031
10    0.6      0.186       0.106       -0.0283
15    0.8      -0.519      0.711       0.1436
20    1.0      9.032       -8.91       1.9236
21    1.1      -6.862      6.98        27.55
For h = 0.02
j     $t_j$    $y_{1j}$    $y_{2j}$    $y_{3j}$
0     0.2      0.3353      0.3350      -0.00028
5     0.3      0.2734      0.2732      -0.000065
10    0.4      0.2229      0.2228      -0.0000054
Violent instability occurs for h = 0.04 but the method is stable for h = 0.02.
What happened? Why do we need h so small?
The answer lies in understanding the concept of stability, as in Deﬁnition 7.3
(on page 264).
7.9.2 Stability of Stiﬀ Systems
Earlier (Definition 7.3 on page 264), we defined absolute stability of methods for solving the IVP in terms of the scalar equation $y' = \lambda y$. We now extend the definition to systems.
DEFINITION 7.4 Let $A$ satisfy the stated assumptions and suppose that $\mathrm{Re}\,\lambda_i < 0$, $1 \le i \le n$. A numerical method for solving the linear IVP (7.24) is called absolutely stable for a particular value of the product $\lambda h$ if it yields numerical solutions, $y_j$, $j \ge 0$, in $\mathbb{R}^n$ such that $y_j \to 0$ as $j \to \infty$ for all $y_0$. As in Definition 7.3, we speak of the region of absolute stability as being the set of $\lambda h$ in the complex plane for which the method, applied to a scalar equation $y' = \lambda y$, $y \in \mathbb{R}$, is absolutely stable.
We now show in our simplified context why Definition 7.4 makes sense. In particular, a method for a system is absolutely stable if and only if the method is absolutely stable for the scalar equations $z' = \lambda_i z$, for $1 \le i \le n$. To see this, consider, for example, the $k$-step method
$$\sum_{l=0}^{k} \alpha_l y_{l+j} = h \sum_{l=0}^{k} \beta_l f_{l+j} = h \sum_{l=0}^{k} \beta_l A y_{l+j}.$$
Thus,
$$\sum_{l=0}^{k} (\alpha_l I - h\beta_l A) y_{l+j} = 0, \quad j \ge 0.$$
Let $V^{-1}AV = \Lambda$, where this decomposition is guaranteed if $A$ has $n$ simple eigenvalues and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. We conclude that
$$\sum_{l=0}^{k} (\alpha_l I - h\beta_l \Lambda) V^{-1} y_{l+j} = 0.$$
Setting $z_j = V^{-1} y_j$, we see that
$$\sum_{l=0}^{k} (\alpha_l - h\beta_l \lambda_i)(z_{l+j})_i = 0, \quad 1 \le i \le n,$$
where $(z_{l+j})_i$ is the $i$-th component of $z_{l+j}$. Since $(z_j)_i \to 0$, $1 \le i \le n$, as $j \to \infty$ if and only if $y_j \to 0$ as $j \to \infty$, we see that the method will be absolutely stable for system (7.24) if and only if it is absolutely stable for the scalar equation $z' = \lambda_i z$, $1 \le i \le n$. In this case, it will be absolutely stable provided that the roots $z_{l,i}$ of $p(z, h; i) = \rho(z) - h\lambda_i \sigma(z)$, $1 \le i \le n$, where $\rho(z) = \sum_{l=0}^{k} \alpha_l z^l$ and $\sigma(z) = \sum_{l=0}^{k} \beta_l z^l$, satisfy $|z_{l,i}| < 1$, $1 \le l \le k$, $1 \le i \le n$.
Example 7.13
Recall that, in Example 7.7 (on page 265), we found that the region of absolute stability for Euler’s method (the Adams–Bashforth 1-step method) is the open disk
$$\{\lambda h : |1 + \lambda h| < 1\}, \tag{7.28}$$
as depicted here:
[Figure: the open disk of radius 1 centered at −1 in the λh-plane, with axes Re and Im.]
(Recall that $y_{j+1} = y_j + \lambda h y_j$ for Euler’s method applied to $y' = \lambda y$ gives $y_j \to 0$ if $|1 + \lambda h| < 1$.)
Applying Euler’s method symbolically to
$$y' = -1000y, \quad y(0) = \epsilon$$
gives
$$y_k = (1 - 1000h)^k \epsilon.$$
For small $\epsilon$, the graph of the solution for $0 \le t \le 1$ is indistinguishable from the graph of the constant function $y \equiv 0$, if the y-scale is $0 \le y \le 1$. However, examine the following simple matlab computation, representing steps of Euler’s method for this problem with $h = 0.1$.
>> y = 1e-3;
>> h=0.1;
>> y = y - 1000*h*y
y = -0.0990 % t = 0.1
>> y = y - 1000*h*y
y = 9.8010 % t = 0.2
>> y = y - 1000*h*y
y = -970.2990 % t = 0.3
>> y = y - 1000*h*y
y = 9.6060e+004 % t = 0.4
>> y = y - 1000*h*y
y = -9.5099e+006 % t = 0.5
>> y = y - 1000*h*y
y = 9.4148e+008 % t = 0.6
>>
In fact, to avoid this kind of behavior, we would need $h < 0.002$, and thus need more than 500 steps to go from $t = 0$ to $t = 1$.
In contrast, the implicit Euler method (that is, the Adams–Moulton method of order 1) has iteration equation defined by
$$y_{k+1} = y_k + hf(t_{k+1}, y_{k+1}), \tag{7.29}$$
which, for the test equation $y' = \lambda y$ used to determine stability, becomes
$$y_{k+1} = y_k + \lambda h y_{k+1},$$
which, solving for $y_{k+1}$, becomes
$$y_{k+1} = \frac{1}{1 - \lambda h}\, y_k,$$
with a region of absolute stability defined, for $\lambda h = x + iy$, by
$$\frac{1}{|1 - (x + iy)|} < 1.$$
Namely, the implicit Euler method is stable for the entire region outside of the circle of radius 1 centered at $(x, y) = (1, 0)$, and, in particular, in the entire left half of the complex plane. We perform a simple matlab computation for the implicit Euler method on $y' = -1000y$, $y(0) = 10^{-3}$ with $h = 0.1$, for which $1/(1 - \lambda h) = 1/(1 + 1000h)$:
>> y = 1e-3;
>> h=0.1;
>> y = (1/(1+1000*h))*y
y = 9.9010e-006
>> y = (1/(1+1000*h))*y
y = 9.8030e-008
>> y = (1/(1+1000*h))*y
y = 9.7059e-010
>> y = (1/(1+1000*h))*y
y = 9.6098e-012
>> y = (1/(1+1000*h))*y
y = 9.5147e-014
>>
We see that, although the relative accuracy of the solution is not high, the approximate solution tends to 0, as it should.
Example 7.14
Analyzing the Euler’s method computations for Example 7.12 in the same way, we see that, for the numerical solutions to go to zero (that is, for absolute stability), we must have $|1 + \lambda_i h| < 1$, $1 \le i \le 3$. For $i = 1$ ($\lambda_1 = -2$), this yields $h < 1$. However, $i = 2, 3$ ($\lambda_2 = -40 + 40i$, $\lambda_3 = -40 - 40i$) yields $h < 1/40 = 0.025$, which is violated if $h = 0.04$. We conclude that, although the terms with eigenvalues $\lambda_2$, $\lambda_3$ contribute almost nothing to the solution of (7.26) after $t = 0.1$, they force the selection of small time step $h$ which must satisfy $|1 + \lambda_2 h| < 1$, $|1 + \lambda_3 h| < 1$.
The implicit Euler method applied to Example 7.12 takes the form
$$y_{k+1} = y_k + hAy_{k+1}, \quad \text{that is,} \quad y_{k+1} = (I - hA)^{-1} y_k. \tag{7.30}$$
Without worrying about implementing the computations efficiently in this simple example, we use matlab to iterate (7.30) directly, with $h = 0.04$:
>> A = [-21 19 -20; 19 -21 20; 40 -40 -40]
A =
-21 19 -20
19 -21 20
40 -40 -40
>> h = 0.04;
>> y = [1;0;-1]
y =
1
0
-1
>> I = eye(3);
>> y = (I-h*A)^(-1)*y % t=0.04
y =
0.6883
0.2376
-0.1073
>> y = (I-h*A)^(-1)*y % t=0.08
y =
0.5007
0.3566
0.0474
>> y = (I-h*A)^(-1)*y % t=0.12
y =
0.4129
0.3809
0.0380
>> y = (I-h*A)^(-1)*y % t=0.16
y =
0.3687
0.3663
0.0161
>> y = (I-h*A)^(-1)*y % t=0.20
y =
0.3392
0.3413
0.0049
>> y = (I-h*A)^(-1)*y % t=0.24
y =
0.3144
0.3158
0.0010
>>
We see that the solutions are decaying toward 0, as they should, with no sign of the instability seen for Euler’s method with this $h$.
7.9.3 Methods for Stiﬀ Systems
Generally, implicit methods, with regions of stability that contain the entire negative real axis (or at least a large portion of it), are appropriate for stiff systems. We give further theory and other methods (including Padé methods) in [1].
In matlab, routines for stiﬀ systems include ode15s, ode23s, ode23t, and
ode23tb. All of these matlab functions share the same arguments, and are
used in the same way as the other matlab functions, such as ode45, to
integrate initial value problems. We have already mentioned that we used
ode15s to produce Figure 7.1. Here are the matlab function and command-
window dialog:
function [f] = stiff_example(t,y)
A = [-21 19 -20
19 -21 20
40 -40 -40];
f = A*y;
>> [T,Y] = ode15s('stiff_example',[0,.3], [1,0,-1]);
>> plot(T,Y(:,1),T,Y(:,2),T,Y(:,3))
>> hold
Current plot held
>> plot(T,zeros(size(T,1),1),'LineStyle','none','Marker','+',...
'MarkerEdgeColor','black','Markersize',4)
>> size(T)
ans =
74 1
Thus, 74 steps were taken, and we can see the steps on Figure 7.1. In contrast,
if we replace ode15s in this dialog by ode45 (but leave the other statements
the same), we obtain the following plot:
[Figure: the three solution components computed with ode45 for 0 ≤ t ≤ 0.3 (vertical axis from −1 to 1), with markers on the horizontal line at level 0 showing the computed time points.]
We also obtained size(T) = 89 1, that is, 89 steps were taken, more than the 74 for ode15s. For this example, the difference is not particularly significant, since the system is only moderately stiff over the interval $t \in [0, 0.3]$. In Exercise 12 at the end of this chapter, you will try matlab’s various IVP solvers on a stiffer problem.
7.10 Application to Parameter Estimation in Differential Equations
Many phenomena in the biological and physical sciences have been described
by parameter-dependent systems of diﬀerential equations such as those dis-
cussed previously in this chapter. Furthermore, some of the parameters in
these models cannot be directly measured from observed data. Thus, param-
eter estimation techniques discussed in this section are crucial to use such
diﬀerential equation models as prediction tools.
In this section we focus on the following question: Given the set of data $\{d_j\}_{j=1}^{n}$ at the respective points $t_j \in [0, T]$, $j = 1, \ldots, n$, find the parameter $a \in Q$, where $Q$ is a compact set contained in $C[0, T]$ (the space of continuous functions on $[0, T]$), which minimizes the least-squares index
$$\sum_{j=1}^{n} |y(t_j; a) - d_j|^2$$
subject to
$$\frac{dy}{dt} = f(t, y; a), \quad y(0; a) = y_0,$$
where $y(t; a)$ represents the parameter-dependent solution of the above initial-value problem. We combine two methods discussed in this book to provide a numerical algorithm for solving this problem. In particular, we will use approximation theory together with numerical methods for solving differential equations to present an algorithm for solving the least-squares problem. To this end, divide the interval $[0, T]$ into $m$ equal size intervals and denote the bin points by $t_0, t_1, \ldots, t_m$. Let $\varphi_i$ be a spline function (e.g., linear or cubic spline) centered at $t_i$, $i = 0, \ldots, m$, and define $a_m(t) = \sum_{i=0}^{m} c_i \varphi_i(t)$. Denote by $y_k(a)$ the numerical approximation (using any of the numerical methods discussed in this chapter) of the solution $y(t_k; a)$ of the differential equation, $k = 1, \ldots, N$, with $t_k - t_{k-1} = h = T/N$. Let $y^N(t; a)$ be a piecewise interpolant (e.g., piecewise linear) of $y_k(a)$ at the points $t_k$. Then one can define an approximating problem of the above constrained least-squares problem as follows: Find the parameter $a_m \in Q_m$, where $Q_m$ is the space spanned by the $m + 1$ spline elements $\varphi_0, \ldots, \varphi_m$, which minimizes the least-squares index
$$\sum_{j=1}^{n} |y^N(t_j; a_m) - d_j|^2.$$
Clearly, the above problem is a finite dimensional minimization problem and is equivalent to the problem: Find $(c_0, \ldots, c_m) \in \mathbb{R}^{m+1}$ which minimizes the least-squares index $\sum_{j=1}^{n} |y^N(t_j; c_0, \ldots, c_m) - d_j|^2$. One can apply many optimization routines to solve this problem (e.g., the nonlinear least-squares routine “lsqnonlin,” available in matlab, works well for such a problem).
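The overall structure of such a computation can be seen in a much simplified setting. In the matlab sketch below, the parameter is a single constant $a$ rather than a spline expansion, the model $y' = -ay$, $y(0) = 1$ and the noise-free data are manufactured purely for illustration, and the general-purpose minimizer fminsearch is used in place of lsqnonlin so that no toolbox is assumed. All of these are our own simplifying choices.

```matlab
% Least-squares estimate of a constant parameter a in y' = -a*y, y(0) = 1.
a_true = 1.5;                      % used only to manufacture the data
tj = linspace(0.2, 2, 10);         % observation times t_j
d  = exp(-a_true*tj);              % synthetic, noise-free data d_j

% Numerical solution of the IVP for a given a, evaluated at the t_j.
ymodel = @(a) deval(ode45(@(t,y) -a*y, [0 2], 1), tj);

% Least-squares index as a function of the parameter.
J = @(a) sum((ymodel(a) - d).^2);

a_fit = fminsearch(J, 1);          % start the search from a = 1
fprintf('fitted a = %.4f\n', a_fit);
```

Each evaluation of the index J requires one numerical solution of the initial value problem, which is the dominant cost when the model is expensive to integrate.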
7.11 Application for Solving an SIRS Epidemic Model
The following system of ordinary differential equations (7.31) is an epidemic model which describes the spread of an infectious disease after it starts within a population:
$$\frac{dS}{dt} = -\frac{\beta}{N}SI + \nu R, \qquad \frac{dI}{dt} = \frac{\beta}{N}SI - \gamma I, \qquad \frac{dR}{dt} = \gamma I - \nu R, \tag{7.31}$$
with initial conditions $S(0) > 0$, $I(0) > 0$, $R(0) \ge 0$.
Here, we assume that the total population is constant, with no births or deaths. Thus, the total size of the population, $N$, equals $S(0) + I(0) + R(0)$. Every individual in the population is in one of three distinct classes: the susceptible, $S$, who are not infected but can contract the disease; the infected, $I$, who have the disease and are capable of transmitting it; and the removed, $R$, who have recovered from the disease, are permanently immune, or are isolated until recovered.
The SIRS model (7.31) implies the following: First, the susceptible class moves to the infected class at a rate of $\beta SI/N$, where $\beta$ is the average number of adequate contacts made by an infected individual per unit time [4]; hence $\beta SI/N$ gives the total number of infections caused by the infected class per unit time. Second, individuals recover from the infected class at a rate of $\gamma I$, where $\gamma$ is defined as the recovery rate, and thus $1/\gamma$ is the average time spent in the infectious class [4]. Third, the removed individuals lose immunity at a rate of $\nu R$ and become susceptible again.
The ratio $\beta/\gamma$, called the basic reproduction number [4] and denoted by $R_0$, is the number of secondary infections caused by an infective individual during his/her infectious period ([4], [6]). If more than one secondary infection is produced by one infective individual during his/her infectious period, that is, $R_0 > 1$, the disease becomes endemic; and if $R_0 < 1$, then the disease fades out, and the whole population becomes healthy but susceptible [14].
Now, we use a second-order Runge–Kutta method to find the numerical solutions to the SIRS model (7.31) with two sets of parameters. We choose the initial sizes of the susceptible, infected and removed classes to be 5, 3 and 7, respectively. Hence the total size of the population is 15. We pick $\beta = 2$, $\gamma = 1.5$ and $\nu = 0.2$, so in this case $R_0 = \beta/\gamma > 1$. Using the following matlab program and functions, we obtain a graph of the population sizes of each class over 500 iterations. Graph (a) shows that the infected population stays positive and hence the epidemic continues. Then we set $\beta$ to be 1 without changing the other parameters, so $R_0 < 1$. Running the program again, we get graph (b), which shows that all individuals become susceptible and the infected population becomes zero.
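function [f] = SIRS (t, y)
% Right-hand side of the SIRS model (7.31), with y = [S, I, R].
N=y(1)+y(2)+y(3); % total population
b=2; % contact rate beta
r=1.5; % recovery rate gamma
m=0.2; % rate of loss of immunity nu
f(1)=-b/N*y(1)*y(2)+m*y(3);
f(2)= b/N*y(1)*y(2)-r*y(2);
f(3)= r*y(2)-m*y(3);
return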
function [t,y] = Runge_Kutta_2_for_systems(t0, tf, y0, f, n_steps)
%
% [t,y] = Runge_Kutta_2_for_systems(t0, tf, y0, f, n_steps) performs
% n_steps of the modified Euler method (explained in Section 7.3.4.2 of the
% text), on the system represented by the right-hand-side function f, with
% constant step size h = (tf - t0)/n_steps, and starting with initial
% independent variable value t0 and initial dependent variable values y0.
% The corresponding independent and dependent variable values are returned
% in t(1:n_steps+1) and y(1:n_steps+1,:), respectively.
h = (tf - t0) / n_steps;
t=linspace(t0,tf,n_steps+1);
y(1,:) = y0; % y(1,:) are the initial values at t=t0
for i=1:n_steps
k1=h*feval(f,t(i),y(i,:));
k2=h*feval(f,t(i)+h,y(i,:)+k1);
y(i+1,:)=y(i,:)+(k1+k2)/2;
end % k1, k2, f, and y(i,:) are vectors
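As a quick sanity check on the order of the method, the same two-stage update can be applied to the scalar test problem y' = -y, y(0) = 1, whose exact solution is known. The sketch below is self-contained and does not call the function above; the test problem and the step counts in n_list are choices made for this illustration only.

```matlab
% Modified Euler (second-order Runge-Kutta) applied to y' = -y, y(0) = 1,
% on [0, 1]; the exact solution is exp(-t).
f = @(t, y) -y;
n_list = [50, 100];            % hypothetical step counts for the comparison
err = zeros(size(n_list));
for j = 1:numel(n_list)
    n_steps = n_list(j);
    h = 1 / n_steps;
    t = 0;
    y = 1;
    for i = 1:n_steps
        k1 = h * f(t, y);
        k2 = h * f(t + h, y + k1);
        y = y + (k1 + k2) / 2;
        t = t + h;
    end
    err(j) = abs(y - exp(-1));  % global error at t = 1
end
ratio = err(1) / err(2);        % halving h should divide the error by about 4
fprintf('error ratio = %.3f\n', ratio);
```

A ratio near 4 when h is halved is what one expects from a method with O(h^2) global error.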
% Matlab script run_Runge_Kutta_2_for_systems.m
%
clear
clf
t0 = 0;
tf = 50;
n_steps = 500;
y0(1) = 5;
y0(2) = 3;
y0(3) = 7;
[t,y] = Runge_Kutta_2_for_systems(t0, tf, y0, 'SIRS', n_steps);
set(gca,'fontsize',15,'linewidth',1.5);
plot(t,y(:,1),'g-',t,y(:,2),'b--',t,y(:,3),'r-.','linewidth',1.5)
axis([0,tf,0,20]);
xlabel('Time')
ylabel('Populations')
[Figure: populations of the susceptible, infected, and removed classes versus time, for 0 ≤ t ≤ 50. Panel (a): β = 2. Panel (b): β = 1.]
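The two runs can also be compared without editing the function SIRS by hand. The following is a minimal sketch, assuming the same parameter values and initial data as above; the anonymous function sirs and the variables betas and I_final are hypothetical names introduced for this illustration, and the time stepping repeats the modified Euler update inline so that the block is self-contained.

```matlab
% SIRS right-hand side with the contact rate b passed as a parameter
% (hypothetical variant of the function SIRS; y = [S, I, R] as a row vector).
sirs = @(t, y, b) [ -b/sum(y)*y(1)*y(2) + 0.2*y(3), ...
                     b/sum(y)*y(1)*y(2) - 1.5*y(2), ...
                     1.5*y(2) - 0.2*y(3) ];
t0 = 0; tf = 50; n_steps = 500;
h = (tf - t0) / n_steps;
betas = [2, 1];                 % panel (a), then panel (b)
I_final = zeros(size(betas));
for j = 1:numel(betas)
    f = @(t, y) sirs(t, y, betas(j));
    y = [5, 3, 7];              % initial S, I, R
    t = t0;
    for i = 1:n_steps           % modified Euler steps
        k1 = h * f(t, y);
        k2 = h * f(t + h, y + k1);
        y = y + (k1 + k2) / 2;
        t = t + h;
    end
    I_final(j) = y(2);          % infected population at t = tf
    fprintf('beta = %g, R0 = %.3f, I(tf) = %.4g\n', betas(j), betas(j)/1.5, y(2));
end
```

With these assumptions, the infected population at the final time should remain clearly positive for β = 2 and be negligible for β = 1, in line with the two panels.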
7.12 Exercises
1. Suppose we consider an example of the initial value problem (7.1) (on
page 253), such that a = 0, b = 1, such that y and f are scalar valued,
and such that f(t, y(t)) = f(t), that is, f is a function of the independent
variable t only, and not of the dependent variable. In that case,
\[
y(1) = y(0) + \int_{t=0}^{1} f(t)\,dt.
\]
(a) To what method of approximating the integral does Euler’s method
correspond?
(b) In view of your answer to item 1a, do you think Euler’s method is
appropriate to use in practice for accuracy and eﬃciency?
2. Show that Euler's method fails to approximate the solution
$y(x) = \left(\tfrac{2}{3}x\right)^{3/2}$ of the initial value problem
$y'(x) = y^{1/3}$, $y(0) = 0$. Explain why.
3. Consider Euler's method for approximating the IVP $y'(x) = f(x, y)$,
$0 < x < a$, $y(0) = \alpha$. Let $y_h(x_{i+1}) = y_h(x_i) + h f(x_i, y_h(x_i))$ for
$i = 0, 1, \ldots, N$, where $y_h(0) = \alpha$. It is known that
$y_h(x_i) - y(x_i) = c_1 h + c_2 h^2 + c_3 h^3 + \cdots$, where the $c_m$,
$m = 1, 2, 3, \ldots$, depend on $x_i$ but not on $h$. Suppose that $y_h(a)$,
$y_{h/2}(a)$, $y_{h/3}(a)$ have been calculated using interval widths $h$, $h/2$,
$h/3$, respectively. Find an approximation $\hat{y}(a)$ to $y(a)$ that is
accurate to order $h^3$.
4. Duplicate the table on page 260, but for h = 0.05 and h = 0.01. (You will
probably want to write a short computer program to do this. You also
may need to display more digits than in the original table on page 260.)
By taking ratios of errors, illustrate that the global error in the order
two Taylor series method is $O(h^2)$.
5. Suppose that
\[
y'''(t) = t + 2t\,y'' + 2t^2 y(t), \quad 1 \le t \le 2, \quad y(1) = 1, \; y'(1) = 2, \; y''(1) = 3.
\]
Convert this third-order problem into a first-order system and
compute $y_k$ for k = 1, 2 for Euler's method with step length h = 0.1.
6. Calculate the real part of the region for the absolute stability of the
fourth order Runge–Kutta method (7.18).
7. Consider the Runge–Kutta method
\[
y_{i+1} = y_i + h f\left(t_i + \frac{h}{8},\; y_i + \frac{h}{8} f(t_i, y_i)\right).
\]
Apply this method to $y' = \lambda y$ to find the interval of absolute stability
of the method. (Assume that $\lambda h < 0$.)
8. Find the region of absolute stability for
(a) Trapezoidal method:
\[
y_{j+1} = y_j + \frac{h}{2}\left(f(t_j, y_j) + f(t_{j+1}, y_{j+1})\right), \quad j = 0, 1, \ldots, N-1.
\]
(b) Backward Euler method:
\[
y_{j+1} = y_j + h f(t_{j+1}, y_{j+1}), \quad j = 0, 1, \ldots, N-1.
\]
9. Consider solving the initial value problem $y' = \lambda y$, $y(0) = \alpha$, where
$\lambda < 0$, by the implicit trapezoid method, given by
\[
y_0 = \alpha, \quad y_{i+1} = y_i + \frac{h}{2}\left[f(t_{i+1}, y_{i+1}) + f(t_i, y_i)\right], \quad 0 \le i \le N-1,
\]
$t_i = ih$, $h = T/N$. Prove that any two numerical solutions $y_i$ and $\hat{y}_i$
satisfy
\[
|y_i - \hat{y}_i| \le e^{K} |y_0 - \hat{y}_0|
\]
for $0 \le t_i \le T$, assuming that $\lambda h \le 1$, where $K = 3\lambda T/2$ and $y_0$, $\hat{y}_0$
are respective initial values with $y_0 \ne \hat{y}_0$. (That is, $y_i$ and $\hat{y}_i$ satisfy the
same difference equations except for different initial values.)
10. Consider the initial-value system
\[
\frac{dy}{dt} = (I - Bt)^{-1} y, \quad y(0) = y_0, \quad y(t) \in \mathbb{R}^n, \quad 0 \le t \le 1,
\]
where B is an n × n matrix with $\|B\|_\infty \le 1/2$. Euler's method for
approximating y(t) has the form
\[
y_{i+1} = y_i + h (I - B t_i)^{-1} y_i = \left(I + h (I - B t_i)^{-1}\right) y_i, \quad i = 0, 1, \ldots, N-1,
\]
where $t_i = ih$ and $h = 1/N$. Noting that $\|B t_i\|_\infty \le 1/2$ for all i, prove
that
\[
\|y_{i+1}\|_\infty \le (1 + 2h) \|y_i\|_\infty
\]
for $i = 0, 1, \ldots, N-1$ and
\[
\|y_N\|_\infty \le e^{2} \|y_0\|_\infty
\]
for any value of $N \ge 1$.
11. Consider the following time-dependent logistic model for $t \in [0, 2]$:
\[
\frac{dy}{dt} = a(t)\, y \left(1 - \frac{y}{5}\right), \quad y(0) = 4.
\]
(a) Find parameters $c_i$ to approximate the time-varying coefficient
$a(t) \approx \sum_{i=0}^{2} c_i \varphi_i(t)$. Here, $\varphi_i$ denotes the hat function centered at
$t_i$, with respect to the nodes $[t_0, t_1, t_2] = [0, 1, 2]$. (See page 161.)
Compute those $c_i$ which provide the best least-squares fit for the
(t, a) data set:
{(0.3, 5), (0.6, 5.2), (0.9, 4.8), (1.2, 4.7), (1.5, 5.5), (1.8, 5.2), (2, 4.9)}.
(b) Solve the resulting initial value problem numerically. Somehow
estimate the error in your numerical solution.
12. Try the matlab routines ode45, ode15s, ode23s, ode23t, and ode23tb
on the following initial value problems
(a) $y' = -10^4 y$, $y(0) = 1$, and
(b) $y'' = -10^4 y$, $y(0) = 1$, $y'(0) = 0$.
In each case, integrate from t = 0 to t = 1. Graph the solutions given,
and form a table of the number of steps each of the routines took.
13. Experiment with the predator-prey model in examples 7.8 and 7.10.
(You may use the function predator prey on page 267 and script on
page 268.) In particular, it is known that, for some values of α, β, γ,
and δ, the populations of rabbits and foxes oscillate, instead of dying
out. Use your intuition to ﬁnd such values, and display the results. (For
instance, to decrease the possibility that the rabbits will die out, you
can increase the birth rate α of the rabbits, or decrease the predation
rate β. Similarly, to decrease the chances that the foxes will die out,
you can decrease the resource competition factor γ or increase the fox
growth rate δ.) You may ﬁnd some solutions where the two populations
oscillate, and some solutions where the population of foxes dies out and
the population of rabbits increases exponentially. Print your graphs as
PDF ﬁles, and supply written explanation.
14. The following matlab script and function implement the discretization
described in Example 7.4 for an arbitrary number of subintervals N.
(N is set on the second line of the script.) The script generates a
two-dimensional plot, by selecting some of the time steps (N_t_divisions of
them), and also prints the total number of time steps taken (represented
as size(T)). The script is:
global N
N=8
N_t_divisions = 8;
% one zero initial value per interior node (N-1 of them)
[T,Y] = ode45('example_7p4_func',[0,5], zeros(N-1,1));
stride = floor(size(T,1)/N_t_divisions);
Yplot = zeros(N_t_divisions,N-1);
Xsurf = zeros(N_t_divisions,1);
for i=1:N_t_divisions
% sample every stride-th time step, starting with the first
Yplot(i,:) = Y(1+(i-1)*stride,:);
Xsurf(i) = T(1+(i-1)*stride);
end
Ysurf = (1:N-1)/N;
surf(Ysurf,Xsurf,Yplot)
size(T)
while the function is:
function [ f ] = example_7p4_func( t, u )
% Function for example 7.4 of the manuscript.
global N;
f = zeros(N-1,1);
h = 1/N;
f(1) = N^2*(u(2) - 2*u(1));
for i=2:N-2
f(i) = N^2*(u(i+1)-2*u(i)+u(i-1));
end
f(N-1) = N^2*(t - 2*u(N-1) + u(N-2));
end
Try using N = 8, 50, 100, and 500, using ode45 (if practical) and
ode15s. Make a table of the number of steps taken in each case, and
compare the surface plots obtained.
Chapter 8
Numerical Solution of Systems of
Nonlinear Equations
In this chapter, we study numerical methods for solution of nonlinear systems.
That is, we study numerical methods for finding
$x = (x_1, x_2, \ldots, x_n)^T \in D \subset \mathbb{R}^n$ that solves
\[
F(x) = 0, \tag{8.1}
\]
where $F(x) = (f_1(x), f_2(x), f_3(x), \ldots, f_n(x))^T$, $F : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$.
Example 8.1
The following system of two equations in two unknowns arises from a problem
in phase stability in chemical engineering.¹ Find $x_1$ and $x_2$ such that
\[
\begin{aligned}
f_1(x_1, x_2) &= x_1^2 + x_1 x_2^3 - 9 = 0, \\
f_2(x_1, x_2) &= 3 x_1^2 x_2 - x_2^3 - 4 = 0.
\end{aligned}
\]
This system is interesting because it has four solutions.
8.1 Introduction
A basic tool in computing solutions to nonlinear systems of equations is a
multivariate version of Newton’s method. In turn, central to the concept of a
multivariate Newton method is that of a Jacobian matrix.
¹ This problem was communicated by Alberto Copati in 1999.
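DEFINITION 8.1 The matrix of partial derivatives
\[
F'(x) =
\begin{pmatrix}
\dfrac{\partial f_1}{\partial x_1}(x) & \dfrac{\partial f_1}{\partial x_2}(x) & \cdots & \dfrac{\partial f_1}{\partial x_n}(x) \\
\dfrac{\partial f_2}{\partial x_1}(x) & \dfrac{\partial f_2}{\partial x_2}(x) & \cdots & \dfrac{\partial f_2}{\partial x_n}(x) \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial f_n}{\partial x_1}(x) & \dfrac{\partial f_n}{\partial x_2}(x) & \cdots & \dfrac{\partial f_n}{\partial x_n}(x)
\end{pmatrix}
\tag{8.2}
\]
is called the Jacobian matrix for the function F. The Jacobian matrix of
F is sometimes denoted by J(F)(x). It is also an instance of the Fréchet
derivative, which we define in this context in [1, Chapter 8].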
Example 8.2
The Jacobian matrix for the function $F(x) = (f_1(x_1, x_2), f_2(x_1, x_2))^T$ from
Example 8.1 is
\[
F'(x) =
\begin{pmatrix}
2x_1 + x_2^3 & 3 x_1 x_2^2 \\
6 x_1 x_2 & 3 x_1^2 - 3 x_2^2
\end{pmatrix}.
\]
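A hand-derived Jacobian matrix is easy to get wrong, so it can be worth comparing it against a numerical approximation at one point. The sketch below does this for Example 8.2 with central differences; the test point, the step h, and the variable names are choices made for this illustration only.

```matlab
% Example 8.1 and its Jacobian matrix from Example 8.2.
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
Fp = @(x) [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2];

x0 = [1; 2];          % test point
h  = 1e-6;            % hypothetical step for the central differences
J_fd = zeros(2, 2);
for j = 1:2
    e = zeros(2, 1);
    e(j) = 1;
    J_fd(:, j) = (F(x0 + h*e) - F(x0 - h*e)) / (2*h);   % j-th column
end
J_exact = Fp(x0);
max_diff = max(max(abs(J_fd - J_exact)));
disp(J_exact)
fprintf('largest entrywise difference = %.2e\n', max_diff);
```

A small difference is evidence, not proof, that the derivative formulas were coded correctly.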
Just as for functions of one variable, we can form linear models of functions
F with n components, each component of which is a function of n variables.
In particular, if $x \in \mathbb{R}^n$ and $x^{(0)} \in \mathbb{R}^n$, we have
\[
F(x) = F(x^{(0)}) + F'(x^{(0)})(x - x^{(0)}) + O(\|x - x^{(0)}\|^2), \tag{8.3}
\]
that is, there is a constant c such that, for all x sufficiently close to $x^{(0)}$,
\[
\left\| F(x) - \left( F(x^{(0)}) + F'(x^{(0)})(x - x^{(0)}) \right) \right\| \le c \|x - x^{(0)}\|^2.
\]
In fact, the i-th component of $F(x^{(0)}) + F'(x^{(0)})(x - x^{(0)})$ is the tangent plane
approximation to $f_i(x)$ at $x^{(0)}$.
Example 8.3
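The linear approximation to the function F from Examples 8.1 and 8.2 at the
point $x^{(0)} = (1, 2)$ is
\[
\begin{pmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{pmatrix}
\approx L(x) = F(1, 2) + F'(1, 2)\left( x - \begin{pmatrix} 1 \\ 2 \end{pmatrix} \right)
= \begin{pmatrix} 0 \\ -6 \end{pmatrix}
+ \begin{pmatrix} 10 & 12 \\ 12 & -9 \end{pmatrix}
\left( \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} 1 \\ 2 \end{pmatrix} \right)
= \begin{pmatrix} 10(x_1 - 1) + 12(x_2 - 2) \\ -6 + 12(x_1 - 1) - 9(x_2 - 2) \end{pmatrix}.
\]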
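The second-order error term in (8.3) can be observed numerically for this example. The following sketch evaluates F and the linear model L at two points approaching (1, 2) along a fixed direction; the direction u and the distances in d are hypothetical choices for this illustration.

```matlab
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
x0 = [1; 2];
L  = @(x) [0; -6] + [10, 12; 12, -9] * (x - x0);   % linear model from Example 8.3

u   = [1; 1];                 % hypothetical direction of approach
d   = [1e-1, 1e-2];           % distances along u
err = zeros(size(d));
for j = 1:numel(d)
    x = x0 + d(j)*u;
    err(j) = norm(F(x) - L(x));
end
ratio = err(1) / err(2);      % about 100 if the error behaves like d^2
fprintf('errors: %.3e  %.3e,  ratio = %.1f\n', err(1), err(2), ratio);
```

Reducing the distance by a factor of 10 should reduce the error by a factor of roughly 100, as the O(‖x − x^(0)‖²) term predicts.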
Related to such linear approximations, the following multivariate version of
the mean value theorem can lead to insight.
THEOREM 8.1
(A multivariate mean value theorem) Suppose $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ has
continuous first-order partial derivatives, and suppose that $x \in D$, $\check{x} \in D$, and the
line segment $\{\check{x} + t(x - \check{x}) \mid t \in [0, 1]\}$ is in D. Then
\[
F(x) = F(\check{x}) + A(x - \check{x}), \tag{8.4}
\]
where A is some matrix whose i-th row is of the form
\[
\left( \frac{\partial f_i}{\partial x_1}(c_i), \frac{\partial f_i}{\partial x_2}(c_i), \ldots, \frac{\partial f_i}{\partial x_n}(c_i) \right),
\]
where the $c_i \in \mathbb{R}^n$, $1 \le i \le n$, are (possibly distinct) points on the line between
$\check{x}$ and x.
We can think of the linear approximation (8.3) as a degree-1 multivariate
Taylor polynomial. Higher-order Taylor expansions are of the form
\[
F(x) = F(\check{x}) + F'(\check{x})(x - \check{x}) + \frac{1}{2} F''(\check{x})(x - \check{x})(x - \check{x}) + \cdots, \tag{8.5}
\]
where F' is the Jacobian matrix as in (8.2) and F'', F''', etc. are higher-order
derivative tensors. For example, F''(x) can be viewed as a matrix of matrices,
whose (i, j, k)-th element is
\[
\frac{\partial^2 f_i}{\partial x_j \partial x_k}(x),
\]
and where $F''(\check{x})(x - \check{x})$ can be viewed as a matrix whose (i, j)-th entry is
computed as
\[
\left( F''(\check{x})(x - \check{x}) \right)_{i,j} = \sum_{k=1}^{n} \frac{\partial^2 f_i}{\partial x_j \partial x_k}(\check{x})\,(x_k - \check{x}_k).
\]
Just as in univariate Taylor expansions, if we truncate the expansion in
Equation (8.5) by taking terms only up to and including the k-th Fréchet derivative,
then the resulting multivariate Taylor polynomial $T_k(x)$ satisfies
\[
F(x) = T_k(x) + O(\|x - \check{x}\|^{k+1}).
\]
8.2 Newton’s Method
The multivariate Newton method, for finding solutions of systems of
equations such as in Example 8.1, can be viewed in the same way as the univariate
Newton method, namely, we replace the function by its tangent-line
approximation, then repeatedly find where the approximation is equal to zero. In
the multivariate case, setting the linear approximation to the function to zero
gives
\[
F(x^{(0)}) + F'(x^{(0)})(x - x^{(0)}) = 0,
\]
that is,
\[
F'(x^{(0)})(x - x^{(0)}) = -F(x^{(0)}). \tag{8.6}
\]
Equation (8.6) is a linear system of equations in the unknown vector $v = x - x^{(0)}$.
If $F'(x^{(0)})$ is non-singular, this system of equations can be solved for v, giving the
value $x = x^{(1)} = x^{(0)} + v$. In the case, such as Example 8.1, of two equations
in two unknowns, $x^{(1)}$ would represent the intersection of the tangent planes
to $f_1$ and $f_2$ at $x^{(0)}$ with the $(x_1, x_2)$-plane.
Equation (8.6), combined with practical considerations, leads to the following
algorithm for Newton's method.
ALGORITHM 8.1
(Newton's method)
INPUT:
(a) an initial guess $x^{(0)}$;
(b) a maximum number of iterations M;
(c) a domain stopping tolerance $\epsilon_d$ and a range stopping tolerance $\epsilon_r$.
OUTPUT: either “success” or “failure.” If “success,” then also output the
number of iterations k and the approximation $x^{(k+1)}$ to the solution $x^*$.
1. “success” ← “false”.
2. FOR k = 0 to M.
(a) Evaluate $F'(x^{(k)})$. (That is, evaluate the corresponding $n^2$ partial
derivatives at $x^{(k)}$.)
(b) Solve $F'(x^{(k)}) v^{(k)} = -F(x^{(k)})$ for $v^{(k)}$.
◦ IF $F'(x^{(k)}) v^{(k)} = -F(x^{(k)})$ cannot be solved (such as when
$F'(x^{(k)})$ is numerically singular) THEN EXIT.
(c) $x^{(k+1)} \leftarrow x^{(k)} + v^{(k)}$.
(d) IF ($\|v^{(k)}\| < \epsilon_d$ or $\|F(x^{(k+1)})\| < \epsilon_r$) THEN
i. “success” ← “true”.
ii. EXIT.
END FOR
END ALGORITHM 8.1.
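The steps of Algorithm 8.1 can be sketched in matlab as a short script. This is an illustration under stated assumptions, not a reference implementation: the problem is Example 8.1, the tolerances and M are arbitrary values, and "numerically singular" is interpreted here as rcond falling below machine epsilon, which is only one possible test.

```matlab
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
Fp = @(x) [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2];

x     = [1; 2];      % initial guess x^(0)
M     = 20;          % maximum number of iterations (hypothetical value)
eps_d = 1e-12;       % domain stopping tolerance (hypothetical value)
eps_r = 1e-12;       % range stopping tolerance (hypothetical value)

success = false;                       % step 1
for k = 0:M                            % step 2
    J = Fp(x);                         % (a) evaluate the Jacobian matrix
    if rcond(J) < eps                  % assumed test for numerical singularity
        break
    end
    v = -(J \ F(x));                   % (b) solve F'(x) v = -F(x)
    x = x + v;                         % (c) update the iterate
    if norm(v) < eps_d || norm(F(x)) < eps_r   % (d) stopping tests
        success = true;
        break
    end
end
fprintf('success = %d after k = %d, x = (%.4f, %.4f)\n', success, k, x(1), x(2));
```

Unlike a version that tests only the residual, this sketch stops on either the step length or the residual, as in step 2(d).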
Example 8.4
We will use matlab to do a few iterations of the multivariate Newton method,
with $x^{(0)} = (1, 2)^T$, for Example 8.1. We obtain the following (condensed to
save space, and with comments added):
x =
1
2
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F = 0
-6
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime = 10 12
12 -9
>> v = -Fprime \ F
v = 0.3077
-0.2564
>> x=x+v
x = 1.3077 % (k = 1)
1.7436
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F =-0.3583
-0.3558
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime = 7.9161 11.9266
13.6805 -3.9901
>> v = -Fprime \ F
v = 0.0291
0.0107
>> x=x+v
x = 1.3368 % (k = 2)
1.7543
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F = 0.0045
0.0063
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime = 8.0726 12.3424
14.0711 -3.8714
>> v = -Fprime \ F
v = 1.0e-003 *
-0.4651
-0.0601
>> x=x+v
x = 1.3364 % (k = 3)
1.7542
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F =
1.0e-005 *
0.0500
0.1343
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime = 8.0711 12.3373
14.0657 -3.8745
>> v = -Fprime \ F
v =
1.0e-007 *
-0.9037
0.1863
>> x=x+v
x = 1.3364 % (k = 4)
1.7542
We see that Newton's method converges rapidly, with five displayed digits
unchanging after four iterations. In fact, if we look at the ratios
\[
\|F(x^{(k+1)})\| / \|F(x^{(k)})\|^2,
\]
we see that these ratios are approximately constant, indicating quadratic
convergence.
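These ratios can be tabulated with a few lines of matlab. The sketch below repeats the Newton steps of this example and forms the ratios from the residual norms; only the first few iterations are used, since later residuals are dominated by roundoff. The variable names are chosen for this illustration.

```matlab
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
Fp = @(x) [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2];

x  = [1; 2];
nF = zeros(1, 4);
for k = 1:4
    nF(k) = norm(F(x));           % residual norm at the current iterate
    x = x - Fp(x) \ F(x);         % one Newton step
end
ratios = nF(2:end) ./ nF(1:end-1).^2;
disp(nF)
disp(ratios)
```

For quadratic convergence the ratios should stay bounded and of similar size, while the residual norms themselves drop by many orders of magnitude.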
An interesting aspect of this example is that different starting points lead
to different approximate solutions. For example, with $x^{(0)} = (3, 0.5)^T$, the
method converges to $x^{(5)} \approx (2.9984, 0.1484)^T$.
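The dependence on the starting point can be reproduced with a short loop over initial guesses. This is a minimal sketch; the fixed count of ten Newton steps is an arbitrary choice for illustration, and no safeguards against a singular Jacobian matrix are included.

```matlab
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
Fp = @(x) [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2];

starts = [1, 3; 2, 0.5];          % columns are the two starting points
sols   = zeros(2, 2);
for j = 1:2
    x = starts(:, j);
    for k = 1:10                  % fixed number of Newton steps for illustration
        x = x - Fp(x) \ F(x);
    end
    sols(:, j) = x;
end
disp(sols)                        % each column is the limit for one starting point
```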
The multivariate Newton method is subject to the same pitfalls as the
univariate Newton method, as illustrated in Figure 2.6 (on page 55), where
the analog to a horizontal tangent line is a singular (or ill-conditioned) F'.
The following matlab function implements a simplified version of
Algorithm 8.1. (An electronic copy is available at
http://interval.louisiana.edu/Classical-and-Modern-NA/newton_sys.m.)
function [x_star,success] = newton_sys (x0 ,f , f_prime, eps, maxitr)
%
% [x_star,success] = newton_sys(x0,f,f_prime,eps,maxitr)
% does iterations of Newton’s method for systems,
% using x0 as initial guess, f (a character string giving
% an m-file name) as function, and f_prime (also a character
% string giving an m-file name) as the derivative of f.
% iteration stops successfully if ||f(x)|| < eps, and iteration
% stops unsuccessfully if maxitr iterations have been done
% without stopping successfully or if a zero derivative
% is encountered.
% On return:
% success = 1 if iteration stopped successfully, and
% success = 0 if iteration stopped unsuccessfully.
% x_star is set to the approximate solution to f(x) = 0
% if iteration stopped successfully, and x_star
% is set to x0 otherwise.
success = 0;
x = x0;
for i=1:maxitr;
fval = feval(f,x);
i
x
norm_fval = norm(fval,2)
if norm_fval < eps;
success = 1;
x_star = x;
return;
end;
fpval = feval(f_prime,x);
if fpval == 0; % true only if every entry of the derivative is zero
x_star = x0;
return;
end;
v = fpval \(-fval);
x = x +v;
end;
x_star =x0;
if (~success)
disp('Warning: Maximum number of iterations reached');
end
The following theorem tells us that we can often expect Newton's method
to be locally quadratically convergent.
THEOREM 8.2
Assume that F is defined on a subset of $\mathbb{R}^n$, that the partial derivatives
of each of the n components of F are continuous, that $F(x^*) = 0$ (where
0 is interpreted to be the 0-vector here), that $F'(x^*)$ is non-singular, and
that $x^*$ is in the interior of the domain of definition of F. Then, Newton's
method will converge to $x^*$ for all initial guesses $x^{(0)}$ sufficiently close to
$x^*$. Additionally, if for some constant $\hat{c}$,
\[
\|F'(x) - F'(x^*)\| \le \hat{c}\,\|x - x^*\| \tag{8.7}
\]
for all x in some neighborhood of $x^*$, then there exists a positive constant c
such that
\[
\|x^{(k+1)} - x^*\| \le c\,\|x^{(k)} - x^*\|^2. \tag{8.8}
\]
A proof of a generalization of this theorem can be found in [1, Chapter 8].
A classic theorem on convergence of Newton’s method is the Newton–
Kantorovich Theorem, which we do not give here, but which can also be
found in our second-level text [1, Chapter 8].
The multivariate Newton method is applied in a broad range of contexts.
For example, nonlinear systems of equations, with n very large, arise in the dis-
cretization of nonlinear partial diﬀerential equations, and Newton’s method,
combined with use of banded or sparse matrix structures to solve the resulting
linear systems, is used to solve these nonlinear systems. Newton’s method is
also used in implicit methods for solving stiﬀ initial value problems involving
nonlinear systems of ordinary diﬀerential equations.
Although there may be significant computational cost to evaluating the
Jacobian matrix at each iteration of Newton's method, actually deriving it and
coding it in a programming language is often not necessary today when using
modern packages (such as those that incorporate automatic differentiation).
Divided differences are still sometimes used to approximate the partial
derivatives in the Jacobian matrix, although that technique not only can be more
costly, but it is also unclear in particular problems how to choose the step h,
and the technique can lead to significantly inaccurate values.
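The sensitivity to the step h can be seen on Example 8.1. The sketch below approximates the Jacobian matrix at (1, 2) by forward divided differences for three step sizes and compares each result with the exact matrix from Example 8.2; the step sizes in hs are hypothetical values picked to show a step that is too large, one that is reasonable, and one that is too small.

```matlab
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
Fp = @(x) [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2];

x0  = [1; 2];
F0  = F(x0);
hs  = [1e-2, 1e-8, 1e-14];        % hypothetical step sizes
err = zeros(size(hs));
for m = 1:numel(hs)
    h = hs(m);
    J_fd = zeros(2, 2);
    for j = 1:2
        e = zeros(2, 1);
        e(j) = 1;
        J_fd(:, j) = (F(x0 + h*e) - F0) / h;   % forward divided difference
    end
    err(m) = max(max(abs(J_fd - Fp(x0))));
end
for m = 1:numel(hs)
    fprintf('h = %.0e: largest error in the Jacobian = %.2e\n', hs(m), err(m));
end
```

The large step suffers from truncation error and the tiny step from cancellation, so the middle value gives the smallest error here; the best h in general depends on the problem and on the scale of x.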
For general nonlinear systems, it may be difficult to choose a starting guess
$x^{(0)}$. Furthermore, as is the case for Example 8.1, the system may have more
than one solution, and all solutions may be desired. In such cases, Newton's
method or related iterative methods are embedded in more sophisticated
algorithms. We will discuss these algorithms after we explain general multivariate
fixed point iteration.
8.3 Multidimensional Fixed Point Iteration
Multidimensional fixed-point iteration is a close multidimensional analogue
to the univariate fixed point iteration method discussed in Section 2.2 on
page 47. It is used in various contexts. For example, the iterative methods
for linear systems we discussed in Section 3.5 are examples of fixed point
iteration. In fixed point iteration methods, just as in solutions of nonlinear
systems of equations, we have a function
\[
G(x) = (g_1(x_1, \ldots, x_n), \ldots, g_n(x_1, \ldots, x_n))^T
\]
with n components, each of whose components is a function of n variables.
However, instead of finding a vector $(x_1, \ldots, x_n)^T$ at which
$f_i(x_1, \ldots, x_n) = 0$, $1 \le i \le n$, we seek a vector x such that
\[
x = G(x), \tag{8.9}
\]
that is, $x_i = g_i(x)$, $1 \le i \le n$. Such x with x = G(x) are called fixed points of
G. Note that, if F(x) = G(x) − x, solutions to F(x) = 0 are precisely fixed
points of G. The fixed point equation (8.9) leads to the iteration scheme
\[
x^{(k+1)} = G(x^{(k)}), \quad k \ge 0, \quad x^{(0)} \text{ given in } \mathbb{R}^n. \tag{8.10}
\]
Example 8.5
Just as we saw on page 55 for the univariate Newton method, the multivariate
Newton method can be viewed as a multivariate fixed point iteration, with
\[
G(x) = x - (F'(x))^{-1} F(x).
\]
Example 8.6
In Example 3.18 (on page 93), we saw that we could discretize the boundary
value problem
\[
x''(t) = -\sin(\pi t), \quad x(0) = x(1) = 0,
\]
by replacing the second derivative with a central difference approximation, to
obtain the linear system
\[
\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \frac{1}{16}
\begin{pmatrix} \sin(\pi/4) \\ \sin(\pi/2) \\ \sin(3\pi/4) \end{pmatrix}.
\]
We further saw in Example 3.30 (on page 121) that we could write this system
as
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 0 & \frac{1}{2} & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
+ \frac{1}{32}
\begin{pmatrix} \sin(\pi/4) \\ \sin(\pi/2) \\ \sin(3\pi/4) \end{pmatrix}.
\]
This represents a fixed point iteration, with
\[
G(x) = \begin{pmatrix} 0 & \frac{1}{2} & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{pmatrix} x
+ \frac{1}{32}
\begin{pmatrix} \sin(\pi/4) \\ \sin(\pi/2) \\ \sin(3\pi/4) \end{pmatrix}.
\]
In this case, G(x) happens to be a linear function.² (Note that the function
G(x) here is not to be confused with the matrix G we used in explaining
iterative methods in Chapter 3, where we used G to denote the iteration
matrix for historical purposes.³)
² Although it is not a linear operator, since it has the constant term.
³ See the notation used in [44].
Example 8.7
If, instead of $x'' = -\sin(\pi t)$ as in Example 8.6, the differential equation were
$x'' = -e^{x}$, replacing $x''$ by its central difference approximation as before gives
the system
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 0 & \frac{1}{2} & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
+ \frac{1}{32}
\begin{pmatrix} e^{x_1} \\ e^{x_2} \\ e^{x_3} \end{pmatrix}.
\]
The fixed-point iteration function
\[
G(x) = \begin{pmatrix} 0 & \frac{1}{2} & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{pmatrix} x
+ \frac{1}{32}
\begin{pmatrix} e^{x_1} \\ e^{x_2} \\ e^{x_3} \end{pmatrix}
\]
is now non-linear. We do a small experiment in matlab with fixed point
iteration for this G(x):
>> x = [0;0;0]
x = 0
0
0
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0313 % k = 1
0.0313
0.0313
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0479 % k = 2
0.0635
0.0479
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0645 % k = 3
0.0812
0.0645
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0739 % k = 4
0.0984
0.0739
.
.
.
x = 0.1053 % k = 22
0.1412
0.1053
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1053 % k = 23
0.1413
0.1053
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1054 % k = 24
0.1413
0.1054
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1054 % k = 25
0.1414
0.1054
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1054 % k = 26
0.1414
0.1054
>>
Linear convergence can be inferred from these computations.
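The same experiment can be run in a loop, which also makes the linear convergence measurable. The sketch below is an illustration, with the iteration count of 60 chosen arbitrarily; it records the size of successive changes and forms their ratios, which for linear convergence should settle near a constant below 1.

```matlab
% Fixed point iteration function of Example 8.7.
G = @(x) [ x(2)/2 + exp(x(1))/32;
           (x(1) + x(3))/2 + exp(x(2))/32;
           x(2)/2 + exp(x(3))/32 ];

x = [0; 0; 0];
n_iter = 60;                        % hypothetical number of iterations
d = zeros(n_iter, 1);
for k = 1:n_iter
    x_new = G(x);
    d(k) = norm(x_new - x, inf);    % size of the k-th change
    x = x_new;
end
ratios = d(2:end) ./ d(1:end-1);    % estimated convergence factor per step
disp(x')
fprintf('last ratio of successive changes = %.4f\n', ratios(end));
```

A roughly constant ratio between successive changes, rather than a ratio that itself tends to zero, is the signature of linear convergence.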
When will such a nonlinear multivariate fixed point iteration converge? This
question is answered by a multivariate version of the contraction mapping
theorem.⁴
⁴ We saw the univariate version on page 49.
DEFINITION 8.2 A mapping $G : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is a contraction on a
set $D_0 \subset D$ if there is an $\alpha < 1$ such that $\|G(x) - G(y)\| \le \alpha \|x - y\|$ for all
$x, y \in D_0$. (Here, the norm can be any norm.)
THEOREM 8.3
(Contraction Mapping Theorem) Suppose that $G : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is a
contraction on a closed set⁵ $D_0 \subset D$ and that $G : D_0 \to D_0$, i.e., if $x \in D_0$, then
$G(x) \in D_0$. Then G has a unique fixed point $x^* \in D_0$. Moreover, for any
$x^{(0)} \in D_0$, the iterates $\{x^{(k)}\}$ defined by $x^{(k+1)} = G(x^{(k)})$ converge to $x^*$. We
also have the error estimates
\[
\|x^{(k)} - x^*\| \le \frac{\alpha}{1 - \alpha} \|x^{(k)} - x^{(k-1)}\|, \quad k = 1, 2, \ldots, \tag{8.11}
\]
\[
\|x^{(k)} - x^*\| \le \frac{\alpha^k}{1 - \alpha} \|G(x^{(0)}) - x^{(0)}\|, \tag{8.12}
\]
where α is as in Definition 8.2.
The proof is similar to the proof of Theorem 2.3 on page 49, the univariate
contraction mapping theorem, which we also gave without proof. Interested
readers can consult [1] or other references for the proof.
Showing that G is a contraction can be facilitated by the following theorem.
The theorem requires that the set over which G is defined be convex:
DEFINITION 8.3 A set $D_0$ is said to be convex, provided
$\lambda x + (1 - \lambda) y \in D_0$ whenever $x \in D_0$, $y \in D_0$, and $\lambda \in [0, 1]$.
For example, hyper-rectangles (that is, interval vectors, defined by lower
and upper bounds on each variable) are convex.
THEOREM 8.4
Let $D_0$ be a convex subset of $\mathbb{R}^n$ and G be a mapping of $D_0$ into $\mathbb{R}^n$ whose
components $g_1, g_2, \ldots, g_n$ have continuous and bounded derivatives of first
order on $D_0$. Then the mapping G satisfies the Lipschitz condition
\[
\|G(x) - G(y)\| \le L \|x - y\| \quad \text{for all } x, y \in D_0, \tag{8.13}
\]
where $L = \sup_{w \in D_0} \|G'(w)\|$, where G' is the Jacobian matrix of G. If $L \le \alpha < 1$,
then G is a contraction on $D_0$.
⁵ Recall that closed sets in n-space are simply sets that contain their boundaries, that is,
sets that contain all of their limit points.
Here, $\|\cdot\|$ signifies any vector norm and the corresponding induced matrix
norm, i.e., $\|A\| = \sup_{x \ne 0} \|Ax\| / \|x\|$. Thus, $\|Ax\| \le \|A\| \|x\|$.
The reader can consult our second-level text [1] for a proof of Theorem 8.4.
Example 8.8
We will apply Theorem 8.4 and Theorem 8.3 to Example 8.7. Take
\[
D_0 = \mathbf{x} = ([-1, 1], [-1, 1], [-1, 1])^T.
\]
We have
\[
G'(x) =
\begin{pmatrix}
\frac{1}{32} e^{x_1} & \frac{1}{2} & 0 \\
\frac{1}{2} & \frac{1}{32} e^{x_2} & \frac{1}{2} \\
0 & \frac{1}{2} & \frac{1}{32} e^{x_3}
\end{pmatrix}.
\]
Applying interval arithmetic⁶ over $\mathbf{x}$, we obtain the naive interval evaluation
of G':
\[
G'(\mathbf{x}) \subseteq
\begin{pmatrix}
[0.0114, 0.0850] & 0.5 & 0 \\
0.5 & [0.0114, 0.0850] & 0.5 \\
0 & 0.5 & [0.0114, 0.0850]
\end{pmatrix}.
\]
Storing this matrix Gprime in matlab, the norm function for interval matrices
provided with intlab gives
>> norm(Gprime,2)
intval ans = [ 0.6791, 0.7526]
This means that the largest induced 2-norm of any matrix in $G'(\mathbf{x})$ is at
most 0.7526 < 1, so Theorem 8.4 shows that G is a contraction over $\mathbf{x} = D_0$.
Furthermore, we perform an interval evaluation of G, with a mean value
extension over $\mathbf{x}$ and centered at $x^{(0)} = (0, 0, 0)^T$, as follows:
>> x = [infsup(-1,1);infsup(-1,1);infsup(-1,1)]
intval x = [ -1.0000, 1.0000]
[ -1.0000, 1.0000]
[ -1.0000, 1.0000]
>> G0 = [1/32;1/32;1/32]
G0 = 0.0313
0.0313
0.0313
>> G = G0 + Gprime*x
intval G = [ -0.5537, 0.6162]
[ -0.5537, 0.6162]
[ -0.5537, 0.6162]
⁶ We used intlab to obtain the interval enclosure to the range $(1/32)e^{[-1,1]}$.
Since [−0.5537, 0.6162] ⊂ [−1, 1], this shows that G maps $\mathbf{x} = D_0$ into $D_0$.
Therefore, Theorem 8.3 implies that G has a unique fixed point in $D_0$, and
the iterations defined by $x^{(k+1)} = G(x^{(k)})$ converge to this unique fixed point
from any starting point whose coordinates are between −1 and 1.
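The error estimates (8.11) and (8.12) can be tried out on this example in ordinary floating point arithmetic. The sketch below is an illustration under stated assumptions: instead of the intlab enclosure, it uses the contraction constant α = √2/2 + e/32 ≈ 0.79, which follows from the triangle inequality applied to G'(x) = T + diag(e^{x_i}/32) on the box (the tridiagonal part T has 2-norm cos(π/4), and the diagonal part is at most e/32); it also assumes the hypotheses of Theorem 8.3 hold with this α. The reference fixed point is taken from a long run of the iteration, and the tolerance tol is a hypothetical value.

```matlab
G = @(x) [ x(2)/2 + exp(x(1))/32;
           (x(1) + x(3))/2 + exp(x(2))/32;
           x(2)/2 + exp(x(3))/32 ];
alpha = sqrt(2)/2 + exp(1)/32;      % assumed contraction constant (2-norm)

% Reference fixed point from a long run of the iteration.
x_ref = zeros(3, 1);
for k = 1:300
    x_ref = G(x_ref);
end

% A posteriori estimate (8.11) versus the observed error.
x_old    = zeros(3, 1);
n_iter   = 30;
true_err = zeros(1, n_iter);
bound    = zeros(1, n_iter);
for k = 1:n_iter
    x_new = G(x_old);
    true_err(k) = norm(x_new - x_ref);
    bound(k)    = alpha / (1 - alpha) * norm(x_new - x_old);
    x_old = x_new;
end
bound_holds = all(bound >= true_err);

% A priori estimate (8.12): iterations that guarantee an error below tol.
tol    = 1e-4;                      % hypothetical tolerance
x0     = zeros(3, 1);
k_pred = ceil(log(tol * (1 - alpha) / norm(G(x0) - x0)) / log(alpha));
fprintf('bound (8.11) holds: %d, predicted iterations from (8.12): %d\n', ...
        bound_holds, k_pred);
```

Because α only bounds the true contraction rate from above, the a priori count from (8.12) is typically pessimistic, while (8.11) gives a usable stopping test during the iteration.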
8.4 Multivariate Interval Newton Methods
Multivariate interval Newton methods are similar to univariate interval
Newton methods (as presented in Section 2.4, starting on page 56), in the
sense that they provide rigorous bounds on solutions, in addition to existence
and uniqueness proofs [20, 29]. Because of this, multivariate interval Newton
methods have a good potential for computing mathematically rigorous bounds
on a solution to a nonlinear system of equations, given an approximate solu-
tion (computed, say, by a point Newton method). Interval Newton methods
are also used as parts of more involved algorithms to ﬁnd all solutions to a
nonlinear system, or for global optimization. (See [1, Section 9.6.3].)
Most multivariate interval Newton methods follow a form similar to that of
the multivariate point method seen in steps 2b and 2c of Algorithm 8.1. We
summarize the algorithm and theoretical properties here. For generalizations
and details, see [1, Chapter 8].
Interval Newton methods can now be viewed as follows.
DEFINITION 8.4 Suppose $F : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$, suppose $\mathbf{x} \subseteq D$ is an
interval n-vector, and suppose that $F'(\mathbf{x})$ is an interval extension of
the Jacobian matrix⁷ of F (for example, obtained by evaluating each component
with interval arithmetic). Then a multivariate interval Newton operator for F is
any mapping $N(F, \mathbf{x}, \check{x})$ from the set of ordered pairs $(\mathbf{x}, \check{x})$ of interval n-vectors
$\mathbf{x}$ and point n-vectors $\check{x}$ to the set of interval n-vectors, such that
\[
\tilde{\mathbf{x}} \leftarrow N(F, \mathbf{x}, \check{x}) = \check{x} + \mathbf{v}, \tag{8.14}
\]
where $\mathbf{v} \in \mathbb{IR}^n$ is any box that bounds the solution set to the linear interval
system
\[
F'(\mathbf{x})\,\mathbf{v} = -F(\check{x}). \tag{8.15}
\]
In implementations of interval Newton methods on computers, the vector
$F(\check{x})$ is evaluated using interval arithmetic, even though the value sought is
at a point. This is to take account of roundoff error, so the results will be
mathematically rigorous. Bounding the solution set to (8.15) may be done
by interval Gaussian elimination (see Section 3.3.4 on page 111), the interval
Gauss–Seidel method (see Section 3.5.5 on page 130), variants of these, or
other methods, such as the Krawczyk method. See [1], [29], etc.
⁷ The matrix can be somewhat more general than an interval extension of the Jacobian
matrix; see, for example, [1], or, for even more details, [29].
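To give a feel for how such a computation is organized, here is a sketch of a Krawczyk-type test for the discretized system of Example 8.7, written as F(x) = Ax + e^x/32 = 0. It is an illustration only and is not mathematically rigorous: it uses plain floating point arithmetic without outward rounding, whereas a real verification would use an interval package. The operator form K = x̌ − YF(x̌) + (I − YF'(x))(x − x̌), with Y an approximate inverse of the midpoint Jacobian matrix, is the standard Krawczyk form and is an assumption of this sketch; all variable names are hypothetical.

```matlab
A = [-1, 1/2, 0; 1/2, -1, 1/2; 0, 1/2, -1];
F = @(x) A*x + exp(x)/32;             % discretized system, written as F(x) = 0

% Approximate solution by a few point Newton steps.
xc = zeros(3, 1);
for k = 1:10
    xc = xc - (A + diag(exp(xc)/32)) \ F(xc);
end

r  = 0.1 * ones(3, 1);                % radius of the trial box [xc - r, xc + r]
lo = exp(xc - r)/32;                  % range of the varying diagonal part of F'
hi = exp(xc + r)/32;
M  = A + diag((lo + hi)/2);           % midpoint of the interval Jacobian
R  = diag((hi - lo)/2);               % radius of the interval Jacobian
Y  = inv(M);                          % approximate inverse used as preconditioner

% Krawczyk image: center xc - Y*F(xc), radius at most (|I - Y*M| + |Y|*R)*r.
shift = abs(Y * F(xc));
rad_K = (abs(eye(3) - Y*M) + abs(Y)*R) * r;
verified = all(shift + rad_K < r);    % image strictly inside the trial box
fprintf('Krawczyk test passed: %d, largest image radius %.2e\n', verified, max(rad_K));
```

If the image lies strictly inside the box, the Krawczyk theory gives existence and uniqueness of a solution in the box; with outward rounding added, the same test becomes a proof rather than an indication.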
The following facts (stated as a theorem) form the basis of computations to
prove existence and uniqueness of solutions, as well as to obtain mathemati-
cally rigorous bounds on solutions.
THEOREM 8.5
Suppose F has continuous first-order partial derivatives, and $N(F, \mathbf{x}, \check{x})$ is
the image under an interval Newton method of the box $\mathbf{x}$. Then
1. any solutions $x^* \in \mathbf{x}$ of F(x) = 0 must also lie in $N(F, \mathbf{x}, \check{x})$. A
uniqueness theorem can be stated in general for any interval Newton operator.
2. If, in addition, $N(F, \mathbf{x}, \check{x}) \subset \mathbf{x}$, then there exists an $x \in \mathbf{x}$ such that
F(x) = 0, and that x is unique.
This theorem is related to the contraction mapping theorem through a
multivariate version of the mean value theorem and through the range inclusion
properties of interval arithmetic. For details, see [1, Chapter 8], or, for a more
comprehensive consideration, [29].
Example 8.9
Take the discretization from Example 8.7, but, instead of splitting it as in
the Jacobi method, write the discretized system as
\[
F(x) =
\begin{pmatrix} -1 & \frac{1}{2} & 0 \\ \frac{1}{2} & -1 & \frac{1}{2} \\ 0 & \frac{1}{2} & -1 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
+ \frac{1}{32}
\begin{pmatrix} e^{x_1} \\ e^{x_2} \\ e^{x_3} \end{pmatrix}
= 0.
\]
The Jacobian matrix is then
\[
F'(x) =
\begin{pmatrix}
-1 + \frac{1}{32} e^{x_1} & \frac{1}{2} & 0 \\
\frac{1}{2} & -1 + \frac{1}{32} e^{x_2} & \frac{1}{2} \\
0 & \frac{1}{2} & -1 + \frac{1}{32} e^{x_3}
\end{pmatrix}.
\]
We first use newton_sys (see page 296) to find an approximate solution. The
function and Jacobian matrix are programmed as follows.
function [ F ] = F_nonlinear_BVP( x )
A = [-1, 1/2, 0
1/2, -1, 1/2
0, 1/2, -1];
F = A*x + (1/32)*[exp(x(1));exp(x(2));exp(x(3))];
end
function [ Fp ] = F_prime_nonlinear_BVP( x )
Fp = [ -1 + (1/32)*exp(x(1)), 1/2, 0
1/2, -1 + (1/32)*exp(x(2)), 1/2
0, 1/2, -1 + (1/32)*exp(x(3))]
end
The matlab dialog (abridged) for Newton's method, using starting point
$x^{(0)} = (0, 0, 0)^T$, is then as follows:
>> [x_star, success] = newton_sys([0;0;0], ...
’F_nonlinear_BVP’,’F_prime_nonlinear_BVP’,1e-15,30)
i = 1
x = 0
0
0
norm_fval = 0.0541
Fp =
-0.9688 0.5000 0
0.5000 -0.9688 0.5000
0 0.5000 -0.9688
.
.
.
i = 4
x = 0.1054
0.1414
0.1054
norm_fval = 1.3055e-016
x_star =
0.1054
0.1414
0.1054
success = 1
We now construct a box around x_star, then use intlab's overloading of the
"\" operator$^{8}$ to perform a step of an interval Newton method:
$^{8}$ See the intlab help file for an explanation of what method is used to bound the solution
set, when the backslash operator is used with interval data.
>> xx = midrad(x_star,0.1)
intval xx =
[ 0.0054, 0.2055]
[ 0.0414, 0.2415]
[ 0.0054, 0.2055]
>> xp = midrad(x_star,0)
intval xp =
0.1054
0.1414
0.1054
>> Fstar = F_nonlinear_BVP(xp)
intval Fstar =
1.0e-015 *
[ 0.0416, 0.0556]
[ 0.1040, 0.1180]
[ 0.0416, 0.0556]
>> Fp = F_prime_nonlinear_BVP(xx)
intval Fp =
-0.97__ 0.5000 0.0000
0.5000 -0.96__ 0.5000
0.0000 0.5000 -0.97__
>> v = -Fp\Fstar
intval v =
1.0e-015 *
[ 0.2104, 0.2653]
[ 0.3248, 0.3991]
[ 0.2104, 0.2653]
>> xx_new = x_star + v
intval xx_new =
0.1054
0.1414
0.1054
>> intvalinit(’DisplayInfsup’)
>> format long
>> xx_new
intval xx_new =
[ 0.10544866558254, 0.10544866558255]
[ 0.14144676492828, 0.14144676492829]
[ 0.10544866558254, 0.10544866558255]
This computation shows that there is a unique solution to the nonlinear system
of equations within xx, a box with diameter approximately 0.2 and centered
on x_star, and that the coordinates of this solution lie within the bounds
given by xx_new. Note that this means that we have at least 13 digits correct.
If we wanted to demonstrate uniqueness within a larger box, we could do so:
>> xx = midrad(x_star,2)
intval xx =
[ -1.8946, 2.1055]
[ -1.8586, 2.1415]
[ -1.8946, 2.1055]
>> Fp = F_prime_nonlinear_BVP(xx)
intval Fp =
[ -0.9954, -0.7434] [ 0.5000, 0.5000] [ 0.0000, 0.0000]
[ 0.5000, 0.5000] [ -0.9952, -0.7340] [ 0.5000, 0.5000]
[ 0.0000, 0.0000] [ 0.5000, 0.5000] [ -0.9954, -0.7434]
>> v = -Fp\Fstar
intval v =
1.0e-014 *
[ -0.1409, 0.2184]
[ -0.1983, 0.3136]
[ -0.1409, 0.2184]
>> xx_new = x_star + v
intval xx_new =
[ 0.1054, 0.1055]
[ 0.1414, 0.1415]
[ 0.1054, 0.1055]
>> format long
>> xx_new
intval xx_new =
[ 0.10544866558254, 0.10544866558256]
[ 0.14144676492828, 0.14144676492830]
[ 0.10544866558254, 0.10544866558256]
This shows that the solution is unique within a box of radius 2 centered on
x_star.
Note that we have proved existence and uniqueness, as well as computed
bounds, for the solution to the nonlinear system of equations arising
from the discretization of the boundary value problem. To show existence and
uniqueness of the solution of the original boundary value problem, more work
would need to be done. Indeed, discretizations sometimes have solutions not
present in the original problem. Existence and uniqueness of solutions to the
original problem can sometimes be proven, and bounds on the solutions to the
original problem can sometimes be obtained, using interval Newton methods.
However, the error in the discretization needs to be taken into account, and
the process is more sophisticated than that given here.$^{9}$
8.5 Quasi-Newton (Multivariate Secant) Methods
In the 1960's and 1970's, much effort was put into developing methods
that had the advantages of Newton's method but avoided computing the matrix
of partial derivatives. The aim was to avoid the inaccuracy of finite difference
approximations (with uncertainty in choosing the stepsize h) and the manual
labor and possibility of blunders in deriving partial derivatives by hand.
Automatic differentiation (see Section 6.2 on page 215) and other technology
have since removed many of the concerns with using Newton's method directly,
but replacing the Jacobian matrix $F'$ by approximations is still
advisable in many situations, such as for very large systems, as an alternative
to methods such as the conjugate gradient method.$^{10}$ Such quasi-Newton
methods are found in many widely available software packages, where they are
combined with step controls to try to prevent the erratic divergence behavior
of Newton's method. They can be considered to be generalizations
to multiple equations and variables of the secant method for one variable.
Quasi-Newton methods have the general form of Newton's method, namely,
\[
\left\{
\begin{aligned}
v^{(k)} &= -\left(B^{(k)}\right)^{-1} F(x^{(k)}) \\
x^{(k+1)} &= x^{(k)} + v^{(k)} t^{(k)}
\end{aligned}
\right.
\qquad \text{for } k = 0, 1, 2, \dots,
\tag{8.16}
\]
where $B^{(k)}$ is an $n \times n$ matrix and $t^{(k)}$ is a scalar. Using $B^{(k)} = F'(x^{(k)})$
and $t^{(k)} = 1$, (8.16) gives Newton's method.
$^{9}$ Michael Plum, Mitsuhiro Nakao, and others have been active in developing such algorithms.
$^{10}$ See [1, Section 3.4.10] or numerous other references.
A commonly used quasi-Newton method is Broyden's method. Broyden's
method is designed with the following conditions:
secant condition: $B^{(k+1)} \left( x^{(k+1)} - x^{(k)} \right) = F(x^{(k+1)}) - F(x^{(k)})$. ($B^{(k+1)}$ reproduces
the function change in the direction of the step.)
orthogonality or least change condition: $B^{(k+1)} z = B^{(k)} z$ if
$z^T (x^{(k+1)} - x^{(k)}) = 0$. (The effect of $B^{(k)}$ isn't altered in directions
orthogonal to the step.)
These two conditions imply that $B^{(k+1)}$ is given by
\[
B^{(k+1)} = B^{(k)} + \frac{\left( y^{(k)} - B^{(k)} s^{(k)} \right) (s^{(k)})^T}{(s^{(k)})^T s^{(k)}},
\tag{8.17}
\]
where
\[
y^{(k)} = F(x^{(k+1)}) - F(x^{(k)})
\]
and
\[
s^{(k)} = x^{(k+1)} - x^{(k)}.
\]
Equation (8.17) provides what is known as the Broyden update to the approximate
Jacobian matrix. Note that it is a rank-one update in the sense that
$B^{(k+1)}$ is obtained from $B^{(k)}$ by adding a rank-one matrix. See our second-level
text [1] for details of our derivation, as well as other quasi-Newton updates,
such as symmetric and rank-two updates appropriate for optimization.
Information can also be found in the research and review literature, such as
[12], or texts such as [11]. It is also possible to update the inverses of these
matrices (say, computing $\left(B^{(k)}\right)^{-1} F(x^{(k)})$) in less than $O(n^3)$ operations; see
the aforementioned references.
Interestingly, Broyden's method has a convergence rate that is faster than
linear but does not exhibit the order-2 convergence of Newton's method. This
is analogous to the convergence of the secant method for one variable, which
exhibits a convergence order between 1 and 2.
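The two defining conditions of the Broyden update can be checked numerically. The sketch below applies (8.17) to made-up data (the matrix B and the vectors s, y, and z are arbitrary values chosen only for illustration) and then checks the secant condition, the least change condition for a vector z orthogonal to s, and the rank of the correction.

```matlab
% Numerical check of the Broyden update (8.17) on arbitrary made-up data.
B = [2 1 0; 1 3 1; 0 1 4];      % current approximation B^(k)
s = [1; -2; 0.5];               % step s^(k) = x^(k+1) - x^(k)
y = [0.3; 1.2; -0.7];           % function change y^(k) = F(x^(k+1)) - F(x^(k))

Bnew = B + (y - B*s)*s'/(s'*s); % rank-one Broyden update

% Secant condition: Bnew*s should reproduce y.
secant_err = norm(Bnew*s - y);

% Least change condition: for z orthogonal to s, Bnew*z equals B*z.
z = [2; 1; 0];                  % z'*s = 2 - 2 + 0 = 0
orth_err = norm(Bnew*z - B*z);

% The correction Bnew - B is an outer product, hence has rank one.
r = rank(Bnew - B);
fprintf('secant_err = %g, orth_err = %g, rank = %d\n', secant_err, orth_err, r);
```

Both error norms should be at the level of roundoff, which is what the derivation of (8.17) from the two conditions predicts.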
Example 8.10
We will illustrate Broyden's method in matlab using F from Example 8.9,
but without using $F'$. As before, we start the computation with
$x^{(0)} = (0, 0, 0)^T$, but we also must have an initial matrix $B^{(0)}$. Traditionally, finite
differences are used to approximate the Jacobian matrix at $x^{(0)}$, or $B^{(0)}$
is set to the identity matrix. We'll try the latter, and we'll try $t^{(k)} = 1$.
(Convergence may occur once a sufficient number of steps have been taken
to build up a reasonable approximation to the action of $F'$. Also, $t^{(k)}$ is
often chosen to minimize $\|F\|$ in the direction of $v^{(k)}$.) We use the following
matlab function
function [ xstar, success ] = simple_Broyden( x0, B0, F, eps, maxitr )
% [ xstar, success ] = simple_Broyden( x0, B0, F, eps, maxitr )
% is analogous to newton_sys: It computes up to maxitr iterations of
% Broyden's method for the function F, with starting point x0 and
% starting matrix B0. It stops when either the norm of F is less than
% eps or maxitr iterations have been exceeded.
%
xk = x0;
Fk = feval(F,xk);
norm_Fk = norm(Fk,2);
B = B0;
success=0;
for k=1:maxitr
k                            % display the iteration number
if (norm_Fk < eps)
success = 1;
xstar = xk;
return
end
s = -B\Fk;                   % quasi-Newton step with t = 1
xkp1 = xk+s;
Fkp1 = feval(F,xkp1);
y = Fkp1 - Fk;
B = B + (y - B*s)*s'/(s'*s)  % Broyden update (8.17)
xk = xkp1
Fk = Fkp1;
norm_Fk = norm(Fk,2)
end
success = 0;
xstar = xk;                  % last iterate; the tolerance was not met
end
with the following results (abridged):
>> [xstar,success] = simple_Broyden([0;0;0],eye(3),...
’F_nonlinear_BVP’,1e-10,50)
k = 1, norm_Fk = 0.0716
k = 2, norm_Fk = 0.1114
k = 3, norm_Fk = 0.2657
k = 4, norm_Fk = 7.9479e-004
k = 5, norm_Fk = 1.5246e-004
k = 6, norm_Fk = 1.5021e-005
k = 7, norm_Fk = 1.2445e-007
k = 8,
B = 0.0806 0.5263 -0.9194
0.3942 -1.0081 0.3942
-0.9194 0.5263 0.0806
xk =0.1054
0.1414
0.1054
norm_Fk = 2.3587e-013
k = 9,
xstar =
0.1054
0.1414
0.1054
success = 1
>> F_prime_nonlinear_BVP(xstar)
ans =
-0.9653 0.5000 0
0.5000 -0.9640 0.5000
0 0.5000 -0.9653
This illustrates the fast convergence. It also illustrates the fact that the matrix
$B^{(k)}$ may not converge to $F'$, even though $x^{(k)}$ converges quickly to a solution
of $F(x) = 0$.
8.6 Nonlinear Least Squares
Nonlinear least squares is a special type of optimization problem, involving
nonlinear systems of equations, that is important in applications. The nonlinear
case is similar to the linear case, which we treated in (3.18) (page 117)
and (4.18) (page 172). We have m data points $(t_i, y_i)$, as well as a model
having n parameters $\{x_j\}_{j=1}^{n}$, and, in general, m is much larger than n. In
nonlinear least squares, however, instead of having a model of the form
\[
y \approx f(t) = \sum_{j=1}^{n} x_j \varphi_j(t),
\]
we assume that the parameters $x_j$ occur nonlinearly, so the model is of the
more general form
\[
y \approx f(t) = f(x_1, \dots, x_n; t),
\]
while the optimization problem remains of the form$^{11}$
\[
\min_{\{x_j\}_{j=1}^{n}} \varphi(x), \qquad
\varphi(x) = \frac{1}{2} \sum_{i=1}^{m} \left( y_i - f(x_1, \dots, x_n; t_i) \right)^2.
\tag{8.18}
\]
The corresponding system of equations
\[
f_i(x) = f(x_1, \dots, x_n; t_i) - y_i, \quad 1 \le i \le m
\tag{8.19}
\]
still has more equations than unknowns, but now is, additionally, nonlinear
instead of linear. The "matrix" for this system of equations in this case can
be viewed not as the matrix
\[
\begin{pmatrix}
\varphi_1(t_1) & \varphi_2(t_1) & \cdots & \varphi_n(t_1) \\
\varphi_1(t_2) & \varphi_2(t_2) & \cdots & \varphi_n(t_2) \\
\vdots & \vdots & & \vdots \\
\varphi_1(t_m) & \varphi_2(t_m) & \cdots & \varphi_n(t_m)
\end{pmatrix}
\]
as in the linear case, but as the Jacobian matrix of $F = (f_1(x), \dots, f_m(x))^T$.
The components $f_i$ of F are known as the residuals of the fit.
$^{11}$ Here, we include the factor of 1/2, even though it does not affect the optimal x, for
simplicity, since a factor of 2 is introduced when we differentiate.
Example 8.11
Find a least squares fit of the form
\[
y(t) = x_1 e^{x_2 t}
\]
to the data
t    y
0    1.7
1    2.6
2    7.3
Here n = 2 and m = 3, and the least squares problem is nonlinear, since the
parameter $x_2$ occurs nonlinearly in the expression for y(t). We have
\[
F(x) =
\begin{pmatrix} x_1 - 1.7 \\ x_1 e^{x_2} - 2.6 \\ x_1 e^{2x_2} - 7.3 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
\]
and the Jacobian matrix is
\[
F'(x) =
\begin{pmatrix} 1 & 0 \\ e^{x_2} & x_1 e^{x_2} \\ e^{2x_2} & 2 x_1 e^{2x_2} \end{pmatrix}.
\]
In the Gauss–Newton method, we do an iteration of the form (8.16), except
we use the pseudo-inverse (see page 135) of $F'$ in place of $\left(B^{(k)}\right)^{-1}$. This
is equivalent to replacing $B^{(k)}$ by the matrix $(F')^T F'$ and $F$ by $(F')^T F$,
corresponding to the normal equations.$^{12}$
$^{12}$ The normal equations are introduced in formula (3.22) on page 118.
Example 8.12
We will apply several iterations of the Gauss–Newton method to the nonlinear
least squares problem of Example 8.11. We have programmed F in
function [ F ] = F_nonlinear_LS_example( x )
and have programmed $F'$ in
function [ Fp ] = F_prime_nonlinear_LS_example( x ).
Using the pattern in newton_sys and simple_Broyden, we have programmed
the simple Gauss–Newton method as follows.
function [ xstar, success ] = simple_Gauss_Newton...
( x0, F, Fprime, eps, maxitr )
% [ xstar, success ] = simple_Gauss_Newton( x0, F, Fprime, eps, maxitr )
% is analogous to newton_sys: It computes up to maxitr iterations of
% the Gauss--Newton method for the function F, with starting point x0 and
% Jacobian matrix Fprime. It stops when either the norm of the step is less
% than eps or maxitr iterations have been exceeded.
%
% This is an illustrative function for Chapter 8 of Ackleh and Kearfott,
% "Applied Numerical Methods."
x = x0;
Fx = feval(F,x);        % residual vector, kept separate from the name F
B = feval(Fprime,x);    % Jacobian matrix at x
success=0;
for k=1:maxitr
k
s = -pinv(B)*Fx;        % Gauss--Newton step via the pseudo-inverse
x = x+s;
norm_step = norm(s,2)
if (norm_step < eps)
success = 1;
xstar = x;
return
end
Fx = feval(F,x);
B = feval(Fprime,x);
end
success = 0;
xstar = x;
end
We obtain the following output (abridged):
>> [xstar,success] = simple_Gauss_Newton([0;0], ...
’F_nonlinear_LS_example’,’F_prime_nonlinear_LS_example’, 1e-10,30)
k = 1, norm_step = 3.8667
k = 2, norm_step = 2.8921
k = 3, norm_step = 0.2766
k = 4, norm_step = 0.0483
k = 5, norm_step = 0.0113
k = 6, norm_step = 0.0015
k = 7, norm_step = 1.3465e-004
k = 8, norm_step = 1.2115e-005
k = 9, norm_step = 1.0948e-006
k = 10, norm_step = 9.8967e-008
k = 11, norm_step = 8.9469e-009
k = 12, norm_step = 8.0882e-010
k = 13, norm_step = 7.3119e-011
xstar =
1.2342
0.8832
success = 1
>>
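The listings of F_nonlinear_LS_example and F_prime_nonlinear_LS_example are not shown in the text. The following self-contained sketch assumes they implement F and $F'$ exactly as written in Example 8.11, expresses them as anonymous functions, and repeats the Gauss–Newton iteration from the same starting point. The variable names and the loop structure are choices made for this illustration, not the authors' code.

```matlab
% Self-contained Gauss--Newton sketch for the fit y(t) = x1*exp(x2*t)
% to the data of Example 8.11, using anonymous functions.
t = [0; 1; 2];
y = [1.7; 2.6; 7.3];
F  = @(x) x(1)*exp(x(2)*t) - y;                 % residuals f_i(x)
Fp = @(x) [exp(x(2)*t), x(1)*t.*exp(x(2)*t)];   % 3-by-2 Jacobian matrix

x = [0; 0];                 % starting point used in the text
tol = 1e-10;                % tolerance on the norm of the step
converged = false;
for k = 1:30
    s = -pinv(Fp(x))*F(x);  % Gauss--Newton step via the pseudo-inverse
    x = x + s;
    if norm(s, 2) < tol
        converged = true;
        break
    end
end
fprintf('k = %d, x = (%.4f, %.4f)\n', k, x(1), x(2));
```

If the assumptions hold, the iterates should approach the xstar reported in the output above. Note that the Jacobian matrix is rank deficient at the starting point (its second column is zero when $x_1 = 0$), which is one reason pinv is used rather than a direct solve of the normal equations.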
Analyzing the convergence rate in the following table, we observe the convergence
to be linear:
k     $\|s^{(k+1)}\| / \|s^{(k)}\|$
1     0.7480
2     0.0956
3     0.1746
4     0.2340
5     0.0311
6     0.0898
7     0.0900
8     0.0904
9     0.0904
10    0.0904
11    0.0904
12    0.0904
A further computation illustrates that using the pseudo-inverse (through the
matlab function pinv) is equivalent to computing the step $s^{(k)}$ by solving
the system
\[
(F'(x))^T F'(x) s = -(F'(x))^T F(x)
\]
that corresponds to the normal equations:
>> x = rand(2,1)
x =
0.8936
0.0579
>> Fp = F_prime_nonlinear_LS_example(x)
Fp =
1.0000 0
1.0596 0.9469
1.1228 2.0067
>> F= F_nonlinear_LS_example(x)
F =
-0.8064
-1.6531
-6.2967
>> s = - pinv(Fp)*F
s =
0.1913
2.7578
>> s_tilde = -(Fp’*Fp)\(Fp’*F)
s_tilde =
0.1913
2.7578
These computations of the step are equivalent when $(F'(x))^T F'(x)$ is nonsingular.
In fact, the Gauss–Newton method is in general only linearly convergent,
with a convergence rate that depends on the norm of the residuals at the
solution $x^{(*)}$ to which the iteration is converging: convergence is slower
when the minimum residual norm is larger. For faster convergence, Newton's
method can be applied to the gradient of the function $\varphi$ in (8.18). However,
the Jacobian matrix of the gradient of $\varphi$ contains second-order partial
derivatives. In particular, if $\varphi$ is as in (8.18), and $F = (f_1, \dots, f_m)^T$ is as in (8.19),
that is, if
\[
\varphi(x) = \frac{1}{2} \sum_{i=1}^{m} f_i^2(x),
\]
we have the nonlinear system of equations
\[
\nabla \varphi(x) = \left( \frac{\partial \varphi}{\partial x_1}(x), \dots, \frac{\partial \varphi}{\partial x_n}(x) \right)^T
= (F'(x))^T F(x) = G(x) = 0,
\tag{8.20}
\]
and the Jacobian matrix of this system is
\[
H(x) = G'(x) = (F'(x))^T F'(x) + \sum_{i=1}^{m} f_i(x) H_i(x),
\tag{8.21}
\]
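where
\[
H_i(x) =
\begin{pmatrix}
\dfrac{\partial^2 f_i}{\partial x_1^2}(x) & \dfrac{\partial^2 f_i}{\partial x_1 \partial x_2}(x) & \cdots & \dfrac{\partial^2 f_i}{\partial x_1 \partial x_n}(x) \\
\dfrac{\partial^2 f_i}{\partial x_2 \partial x_1}(x) & \dfrac{\partial^2 f_i}{\partial x_2^2}(x) & \cdots & \dfrac{\partial^2 f_i}{\partial x_2 \partial x_n}(x) \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial^2 f_i}{\partial x_n \partial x_1}(x) & \dfrac{\partial^2 f_i}{\partial x_n \partial x_2}(x) & \cdots & \dfrac{\partial^2 f_i}{\partial x_n^2}(x)
\end{pmatrix}
\tag{8.22}
\]
is known as the Hessian matrix of $f_i$, while $H(x)$ is the Hessian matrix of $\varphi$.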
Example 8.13
We will apply Newton's method to the system G(x) = 0 corresponding to
Example 8.11. We have
\[
H_1(x) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad
H_2(x) = \begin{pmatrix} 0 & e^{x_2} \\ e^{x_2} & x_1 e^{x_2} \end{pmatrix}, \quad
H_3(x) = \begin{pmatrix} 0 & 2 e^{2x_2} \\ 2 e^{2x_2} & 4 x_1 e^{2x_2} \end{pmatrix}.
\]
To apply Newton's method to the nonlinear system of equations G(x) = 0, we
have created gradient_NLS_example and Hessian_NLS_example as follows:$^{13}$
function [ G ] = gradient_NLS_example( x )
F = F_nonlinear_LS_example(x);
Fp = F_prime_nonlinear_LS_example(x);
G = Fp’*F;
end
function [ H ] = Hessian_NLS_example( x )
Hi = zeros(2,2,3);
Hi(:,:,1) = [0 0
0 0];
Hi(:,:,2) = [ 0 exp(x(2))
exp(x(2)) x(1)*exp(x(2))];
Hi(:,:,3) = [ 0 2*exp(2*x(2))
2*exp(2*x(2)) 4*x(1)*exp(2*x(2))];
Fp = F_prime_nonlinear_LS_example(x);
F = F_nonlinear_LS_example(x);
H = Fp’*Fp + F(1)*Hi(:,:,1) + F(2)*Hi(:,:,2) + F(3)*Hi(:,:,3);
end
$^{13}$ This is not the most efficient way of programming this, since there are redundant
calculations. However, this presentation makes the underlying mathematics clearer than the
most efficient way.
In fact, when newton_sys is started at $x^{(0)} = (0, 0)^T$, a point for which the
Gauss–Newton method converged, Newton's method does not converge:
>> [xstar,success] = newton_sys([0;0],...
'gradient_NLS_example','Hessian_NLS_example',1e-10,20)
i = 1
x = 0
0
i = 2
x = 0
-0.6744
i = 3
x = 0
-1.6364
i = 4
x = 0
-3.9796
1.7512
i = 5
x = 0
-36.5880
i = 6
x = 1.0e+015 *
0
-5.0751
i =
7
x =
1.7000
NaN
However, we observe quadratic convergence when we start Newton’s method
suﬃciently close to the solution:
>> [xstar,success] = newton_sys([1.2;0.8],...
’gradient_NLS_example’,’Hessian_NLS_example’,1e-10,20)
i = 1, norm_fval = 17.4291
i = 2, norm_fval = 10.3042
i = 3, norm_fval = 0.6856
i = 4, norm_fval = 0.4781
i = 5, norm_fval = 0.0018
i = 6, norm_fval = 7.7870e-006
i = 7, norm_fval = 4.0175e-013
xstar =
1.2342
0.8832
success =
1
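The formulas (8.20) and (8.21) for this example can be cross-checked without the separate function files. The sketch below, written with anonymous functions whose names are chosen only for this illustration, builds G and H for the data of Example 8.11, compares H with a central finite-difference approximation of the Jacobian matrix of G, and then runs a plain Newton iteration from the starting point (1.2, 0.8) used above. It assumes that newton_sys performs undamped Newton steps; the stopping rule here is a simplification.

```matlab
% Newton's method on G(x) = F'(x)^T F(x) = 0 for the data of Example 8.11,
% with the Hessian H(x) assembled as in (8.21).
t = [0; 1; 2];
y = [1.7; 2.6; 7.3];
F  = @(x) x(1)*exp(x(2)*t) - y;                  % residuals f_i(x)
Fp = @(x) [exp(x(2)*t), x(1)*t.*exp(x(2)*t)];    % Jacobian matrix of F
G  = @(x) Fp(x)'*F(x);                           % gradient of phi, (8.20)
% Hessian matrix of f_i(x) = x1*exp(x2*t_i) - y_i
Hi = @(x,i) [0,                      t(i)*exp(x(2)*t(i)); ...
             t(i)*exp(x(2)*t(i)),    x(1)*t(i)^2*exp(x(2)*t(i))];
Hsum = @(x,r) r(1)*Hi(x,1) + r(2)*Hi(x,2) + r(3)*Hi(x,3);
H  = @(x) Fp(x)'*Fp(x) + Hsum(x, F(x));          % Hessian of phi, (8.21)

% Central finite-difference check of H at the starting point.
xc = [1.2; 0.8];
h  = 1e-6;
Hfd = [(G(xc + [h;0]) - G(xc - [h;0]))/(2*h), ...
       (G(xc + [0;h]) - G(xc - [0;h]))/(2*h)];
fd_err = norm(Hfd - H(xc))/norm(H(xc));

% Plain Newton iteration x <- x - H(x)\G(x).
x = xc;
for k = 1:20
    x = x - H(x)\G(x);
    if norm(G(x), 2) < 1e-10
        break
    end
end
fprintf('fd_err = %g, k = %d, x = (%.4f, %.4f)\n', fd_err, k, x(1), x(2));
```

A small value of fd_err indicates that the analytic Hessian agrees with the differentiated gradient, which is a useful check before trusting an observed convergence rate.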
8.7 Methods for Finding All Solutions
To this point, we have discussed iterative methods for finding approximations
to a single solution to a nonlinear system of equations. In many applications,
finding all solutions to a nonlinear system is required. Salient among
the methods for doing so are homotopy methods and branch and bound methods.
8.7.1 Homotopy Methods
In a homotopy method, one starts with a simple function $g(x)$, $g : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$,
such that every point with $g(x) = 0$ is known, then transforms
the function into the $f(x)$, $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$, for which all points satisfying
$f(x) = 0$ are desired. During the process, one solves various intermediate
systems, using the solution to the previous system as an initial guess for an
iterative method for the next system. An example of such a transformation
is
\[
H(x, t) = (1 - t) g(x) + t f(x),
\tag{8.23}
\]
so $H(x, 0) = g(x)$ and $H(x, 1) = f(x)$. One way of following the curves
$H(x, t) = 0$ from t = 0 to t = 1 is to consider $y = (x, t) \in \mathbb{R}^{n+1}$, and to
differentiate (8.23), obtaining
\[
H'(y) y' = 0,
\tag{8.24}
\]
where $H'(y)$ is the n by n + 1 Jacobian matrix of H. Equation (8.24) along
with some$^{14}$ normalization condition $N(y) = 0$ (representing a parametrization
of the curve $y(t)$) defines a derivative $y'$, so, in principle, methods and
software for finding solutions to initial value problems for ordinary differential
equations can be used to follow the curves of the homotopy. Indeed, this
approach has been used.
Determining an appropriate homotopy H is crucial when ﬁnding all solu-
tions to a system of equations. Particularly interesting is ﬁnding such H for
polynomial systems of equations, where there is an interplay between numer-
ical analysis and algebraic geometry. Signiﬁcant results were obtained during
the 1980’s; for example, see [28]. In such techniques, the homotopy is generally
deﬁned in a space derived from complex n-space, rather than real n-space.
We say more about homotopy methods in our section on software.
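To make the transformation (8.23) concrete, the sketch below follows a single homotopy curve for the system of Example 8.9. It assumes the particular choice $g(x) = f(x) - f(x^{(0)})$, whose only known zero used here is $x^{(0)}$, so that $H(x, t) = f(x) - (1 - t) f(x^{(0)})$; the number of t-steps and of Newton corrections per step are arbitrary. This simple stepping in t is only an illustration of the idea: it does not find all solutions and it is not the differential-equation-based curve following described above.

```matlab
% Following one homotopy curve for the system of Example 8.9 with the
% choice g(x) = f(x) - f(x0), so that H(x,t) = f(x) - (1-t)*f(x0) and the
% known zero of g is x0.  Each t-step is corrected by a few Newton iterations.
A  = [-1 1/2 0; 1/2 -1 1/2; 0 1/2 -1];
f  = @(x) A*x + (1/32)*exp(x);          % F from Example 8.9
fp = @(x) A + (1/32)*diag(exp(x));      % its Jacobian matrix
x0 = [0; 0; 0];
f0 = f(x0);
H  = @(x,t) f(x) - (1 - t)*f0;          % equals (1-t)*g(x) + t*f(x)

x = x0;
nsteps = 10;                            % number of t-steps (arbitrary)
for j = 1:nsteps
    tj = j/nsteps;
    for it = 1:5                        % Newton corrector at fixed t
        x = x - fp(x)\H(x, tj);         % dH/dx = f'(x) for this choice of g
    end
end
residual = norm(f(x), 2);
fprintf('x = (%.4f, %.4f, %.4f), ||f(x)|| = %g\n', x(1), x(2), x(3), residual);
```

At t = 1 the corrector is Newton's method for f itself, so the final point should agree with the approximate solution found in Example 8.9.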
8.7.2 Branch and Bound Methods
Branch and bound methods, which we explain in some detail in [1, Section 9.6.3]
in the context of global optimization, can also be used to solve
systems of nonlinear equations. In this context, the equations $F(x) = 0$ can
be considered as constraints, and the objective function can be
$\sum_{i=1}^{n} f_i^2(x)$, for example.
$^{14}$ possibly implicit
8.8 Software
Much software, both proprietary and public, that contains the basic methods
we have introduced here as part of more sophisticated schemes is available.
In the matlab optimization toolbox, the function fsolve solves a
system of nonlinear equations using a trust region algorithm, in which
heuristics$^{15}$ are used to modify the length and direction of the Newton step
$-(F'(x))^{-1} F(x)$ to make it more likely the iteration will converge for starting
points $x^{(0)}$ far away from a solution.
Example 8.14
Let us solve the problem from Example 8.7 with fsolve. fsolve has an
option to use the Jacobian matrix, or to approximate the Jacobian matrix with
finite differences if it is not available. We will use the Jacobian matrix, but
we need to provide fsolve with a function whose output arguments contain
both F and $F'$:
function [ F, Fp ] = F_and_Fp_nonlinear_BVP( x )
F = F_nonlinear_BVP(x);
Fp = F_prime_nonlinear_BVP(x);
end
We then have
>> options = optimset('Jacobian','on');
>> xstar = fsolve('F_and_Fp_nonlinear_BVP',[0;0;0],options)
Equation solved.
fsolve completed because the vector of function values is near zero
as measured by the default value of the function tolerance, and
the problem appears regular as measured by the gradient.
xstar =
0.1054
0.1414
0.1054
$^{15}$ that is, rules of thumb
A freely available homotopy-method-based package for solving polynomial
systems of equations is POLSYS_PLP [43].
Our GlobSol software [21] implements a branch and bound algorithm for
finding all solutions to small global optimization problems and finding all
solutions to nonlinear systems of equations. GlobSol is freely available,
although it continues to be under development as of the writing of this work.
For problems on which GlobSol is able to finish its computations, it uses a domain
subdivision process and interval arithmetic techniques to find tight, mathematically
rigorous bounds on all solutions within a particular hyper-rectangle. As
an example of this, let us reconsider Example 8.1, for which we previously
found an approximation to one of the solutions using Newton's method.
Example 8.15
The function in Example 8.1 is programmed in GlobSol as a Fortran-90
program. (See [21] for details.) Starting with initial bounds $x_1 \in [-10, 10]$,
$x_2 \in [-10, 10]$, and with GlobSol configured appropriately for solving
nonlinear systems of equations, we obtain the following output from GlobSol
(abridged):
Output from FIND_GLOBAL_MIN on 06/29/2010 at 06:51:15.
Box data file name is: copatiopt.DT1
Initial box: [ -10, 10 ], [ -10, 10 ]
LIST OF BOXES CONTAINING VERIFIED FEASIBLE POINTS:
Box no.: 1
Box coordinates: [ -3.01, -3 ], [ .147, .149 ]
Box no.: 2
Box coordinates: [ -.902, -.901 ], [ -2.09, -2.08 ]
Box no.: 3
Box coordinates: [ 1.33, 1.34 ], [ 1.75, 1.76 ]
Box no.: 4
Box coordinates: [ 2.99, 3 ], [ .148, .149 ]
Number of bisections: 10
Total number of boxes processed in loop: 23
Overall CPU time: .06
The function lsqnonlin in matlab's optimization toolbox computes solutions
to nonlinear least squares problems, using a choice of the Gauss–Newton
algorithm or other techniques, combined with trust regions. An example of
its use is in the following applications section.
8.9 Applications
In this section we give an example that involves parameter estimation using
the nonlinear least-squares routine lsqnonlin in matlab. The following
nonlinear system [3] is a stage-structured discrete-time model describing
the dynamics of a population whose life cycle can be divided into three stages:
juvenile, non-breeder adult and breeder adult. For example, in a green tree
frog population, each individual can be classified as a tadpole, tadpole frog
or a sexually mature frog. In the equations, J(t), N(t) and B(t) denote the
size of the juvenile, non-breeder, and breeder populations at time t,
respectively. The survivorship functions of each stage $s_i$, i = 1, 2, 3, and the birth
rate function b(t) are assumed to be time-dependent functions. Parameters
$\gamma_1, \gamma_2 \in (0, 1]$ represent the fraction of juveniles that become non-breeders and
non-breeders that become breeders, respectively. We have
\[
\begin{cases}
J(t + 1) = b(t) B(t) + (1 - \gamma_1) s_1(J(t)) J(t), \\
N(t + 1) = \gamma_1 s_1(J(t)) J(t) + (1 - \gamma_2) s_2(N(t) + B(t)) N(t), \\
B(t + 1) = \gamma_2 s_2(N(t) + B(t)) N(t) + s_3(N(t) + B(t)) B(t).
\end{cases}
\tag{8.25}
\]
Now, suppose the survivorship functions are of the form
\[
s_1(J(t)) = \frac{a_1}{1 + k_1 J(t)},
\]
\[
s_2(N(t) + B(t)) = \frac{a_2}{1 + k_2 (N(t) + B(t))},
\]
\[
s_3(N(t) + B(t)) = \frac{a_3}{1 + k_3 (N(t) + B(t))},
\]
where $a_i$ and $k_i$ are unknown positive parameters, for i = 1, 2, 3. We also
assume that the breeders give birth periodically. For our specific problem,
we let $b(t) = b_{\max} > 0$ for t = 1, 2, . . . , 26 and b(t) = 0 for t = 27, 28, . . . , 52,
and so forth. Here, we choose one week as the time unit and start counting
the population from the beginning of its breeding season, which lasts for about
26 weeks; during the next 26 weeks, the population does not have new
births.
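The periodic birth rate just described can be written as a single vectorized expression. The short sketch below evaluates it over two years, taking $b_{\max} = 30$ as an example value; the variable names are chosen for this illustration.

```matlab
% Periodic birth rate: b_max during weeks 1..26, 0 during weeks 27..52, repeating.
b_max = 30;                                 % example value of the maximal birth rate
t = 1:104;                                  % two full years, in weeks
% floor((t-1)/26) counts completed half-years; even counts are breeding seasons.
b = b_max*(1 - mod(floor((t-1)/26), 2));
fprintf('b(26) = %g, b(27) = %g, b(53) = %g\n', b(26), b(27), b(53));
```

The factor `1 - mod(floor((t-1)/26), 2)` switches between 1 and 0 every 26 weeks, so no conditional statement is needed inside a simulation loop.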
Next, we want to show how to estimate the seven parameters $b_{\max}$, $a_i$ and
$k_i$ for i = 1, 2, 3 by using a set of data points. Instead of using real data in
our example, we generate the data points as follows: First, we prearrange the
values of the parameters in the nonlinear system (8.25) and compute the solutions.
We use the total number of adults (N + B) in the population and allow the
data values to be close to, but have small random deviations from, the solutions.
We then invoke the lsqnonlin function in matlab to find best estimates for
the parameters, and compare them to the actual ones.
The following matlab code, with comments added, demonstrates the steps
above. In our experiment, we set $\gamma_1 = 1/5$ (it takes about 5 weeks on average
for a tadpole to become an immature frog) and $\gamma_2 = 1/52$ (it takes about a year
on average for a sexually immature frog to become mature). We prearrange
the other parameters to be $b_{\max} = 30$, $a_1 = 0.8$, $k_1 = 0.002$, $a_2 = 0.9$,
$k_2 = 0.001$, $a_3 = 0.7$, $k_3 = 0.004$.
clear all
J(1)=5;
N(1)=15;
B(1)=10;
S(1)=N(1)+B(1); % set up the initial values
gamma_1=1/5;
gamma_2=1/52;
b_max=30; % prearrange the values of the parameters
a1=0.8;
a2=0.9;
a3=0.7;
k1=0.002;
k2=0.001;
k3=0.004;
T=150; % number of iterations
for t=1:T
b(t)=b_max*(1-mod(floor((t-1)/26),2)); % periodic birth rate
s1(t)=a1/(1+k1*J(t));
s2(t)=a2/(1+k2*(N(t)+B(t)));
s3(t)=a3/(1+k3*(N(t)+B(t)));
J(t+1)=b(t)*B(t)+(1-gamma_1)*s1(t)*J(t);
N(t+1)=gamma_1*s1(t)*J(t)+(1-gamma_2)*s2(t)*N(t);
B(t+1)=gamma_2*s2(t)*N(t)+s3(t)*B(t);
S(t+1)=N(t+1)+B(t+1);
end
global Q % shared with lsquare_errors; declared before Q is assigned
Q = S.*(1+0.1*randn(1,T+1)); % add small deviation
x=1:(T+1);
plot(x,S,'r-',x,Q,'linewidth',1.5) % plot the actual values and
%% the generated data
lb=zeros(1,7); % lower bound for the estimates
ub=[50,1,1,1,1,1,1]; % upper bound for the estimates
v=rand(1,7);
P_0=lb.*(1-v)+v.*ub; % initial guess of the parameter vector
[parameters, LS, o]=lsqnonlin(@lsquare_errors, P_0, lb, ub);
parameters % display the estimate of the parameters
LS % shows the sum of the least squares errors
JJ(1)=5; % use the estimate of the parameters to
NN(1)=15; %% compute the corresponding solutions
BB(1)=10;
for t=1:T
bb(t)=parameters(1)*(1-mod(floor((t-1)/26),2));
ss1(t)=parameters(2)/(1+parameters(3)*JJ(t));
ss2(t)=parameters(4)/(1+parameters(5)*(NN(t)+BB(t)));
ss3(t)=parameters(6)/(1+parameters(7)*(NN(t)+BB(t)));
JJ(t+1)=bb(t)*BB(t)+(1-gamma_1)*ss1(t)*JJ(t);
NN(t+1)=gamma_1*ss1(t)*JJ(t)+(1-gamma_2)*ss2(t)*NN(t);
BB(t+1)=gamma_2*ss2(t)*NN(t)+ss3(t)*BB(t);
SS(t+1)=NN(t+1)+BB(t+1);
end
figure
plot(x,S,’r-’,x,SS,’b-.’,’linewidth’,2) % plot both the solutions with
%% the original and the estimated
%% parameters.
function L=lsquare_errors(p) % returns the vector of residuals; lsqnonlin
%% minimizes the sum of their squares
global Q
J(1)=5;
N(1)=15;
B(1)=10;
gamma_1=1/5;
gamma_2=1/52;
T=150;
for t=1:T
b(t)=p(1)*(1-mod(floor((t-1)/26),2));
s1(t)=p(2)/(1+p(3)*J(t));
s2(t)=p(4)/(1+p(5)*(N(t)+B(t)));
s3(t)=p(6)/(1+p(7)*(N(t)+B(t)));
J(t+1)=b(t)*B(t)+(1-gamma_1)*s1(t)*J(t);
N(t+1)=gamma_1*s1(t)*J(t)+(1-gamma_2)*s2(t)*N(t);
B(t+1)=gamma_2*s2(t)*N(t)+s3(t)*B(t);
S(t+1)=N(t+1)+B(t+1);
end
L=zeros(1,T+1); % residuals; L(1)=0 because the initial state is fixed
for i=1:T
L(i+1)=(S(i+1)-Q(i+1));
end
Note that we will get different results on each run because the generated data are
random and not unique. Below are several results displayed in the command
window after we run the code:
Maximum number of function evaluations exceeded;
increase options.MaxFunEvals
parameters =
39.9582 0.8215 0.0020 0.8885 0.0008 0.6380 0.0127
LS =
3.8691e+003
Maximum number of function evaluations exceeded;
increase options.MaxFunEvals
parameters =
41.2113 0.8232 0.0022 0.8989 0.0009 0.5560 0.0092
LS =
3.6080e+003
Maximum number of function evaluations exceeded;
increase options.MaxFunEvals
parameters =
44.2568 0.7712 0.0020 0.8935 0.0008 0.6417 0.0139
LS =
5.0750e+003
The following ﬁgure is based on our ﬁrst result, i.e.,
parameters =
39.9582 0.8215 0.0020 0.8885 0.0008 0.6380 0.0127
[Figure: two panels plotting the population of the adults against t (horizontal axis 0 to 160,
vertical axis 0 to 140). Left panel legend: Actual, Generated. Right panel legend: Actual, Estimated.]
The plot on the left compares the generated data with the adult population
sizes obtained from the original nonlinear system, and the one on the
right shows how well the equations using the estimated parameters fit
the given data set.
8.10 Exercises
1. Use the univariate mean value theorem (which you can ﬁnd stated as
Theorem 1.4 on page 5) to prove the multivariate mean value theorem
(stated as Theorem 8.1 on page 293).
2. Write down the degree 2 Taylor polynomials for $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$,
centered at $\check{x} = (\check{x}_1, \check{x}_2) = (0, 0)$, for F as in Example 8.2. Lumping
terms together in an appropriate way, interpret your values in terms of
the Jacobian matrix and a second-derivative tensor.
3. Let F be as in Example 8.2 (on page 292), and define
\[
G(x) = x - Y F(x), \quad \text{where } Y \approx
\begin{pmatrix} 2.0030 & 1.2399 \\ 0.0262 & 0.0767 \end{pmatrix}.
\]
Do several iterations of fixed point iteration, starting with initial guess
$x^{(0)} = (8.0, -0.9)^T$. What do you observe?
4. The nonlinear system
\[
\begin{aligned}
x_1^2 - 10 x_1 + x_2^2 + 8 &= 0, \\
x_1 x_2^2 + x_1 - 10 x_2 + 8 &= 0
\end{aligned}
\]
can be transformed into the fixed-point problem
\[
\begin{aligned}
x_1 &= g_1(x_1, x_2) = \frac{x_1^2 + x_2^2 + 8}{10}, \\
x_2 &= g_2(x_1, x_2) = \frac{x_1 x_2^2 + x_1 + 8}{10}.
\end{aligned}
\]
Perform 4 iterations of the fixed-point method on this problem, with
initial vector $x^{(0)} = (0.5, 0.5)^T$.
5. Univariate Newton iteration applied to find complex roots
\[
f(x + iy) = u(x, y) + i v(x, y) = 0
\]
is equivalent to multivariate Newton iteration with functions
\[
f_1(x, y) = u(x, y) = 0 \quad \text{and} \quad f_2(x, y) = v(x, y) = 0.
\]
(a) Repeat Exercise 12 on page 67, except doing the iterations on the
corresponding system $u(x, y) = 0$, $v(x, y) = 0$ of two equations in
two unknowns.
(b) Compare the results, number by number, to the results you obtained
in Exercise 12 on page 67.
6. Consider solving the nonlinear system
\[
\begin{aligned}
x_1^2 - 10 x_1 + x_2^2 + 8 &= 0, \\
x_1 x_2^2 + x_1 - 10 x_2 + 8 &= 0.
\end{aligned}
\]
Experiment with Newton's method, with various initial vectors, and
discuss what you observe.
7. Consider finding the minimum of
\[
f(x_1, x_2) = e^{x_1} + e^{x_2} - x_1 x_2 + x_1^2 + x_2^2 - x_1 - x_2 + 4
\]
on $\mathbb{R}^2$. Experiment with Newton's method:
\[
x^{(k+1)} = x^{(k)} - \left( \nabla^2 f(x^{(k)}) \right)^{-1} \nabla f(x^{(k)}).
\]
(That is, try Newton's method to compute zeros of the gradient.) What
can you surmise from your experiments?
8. Let F be as in Exercise 5 on page 324, let $\boldsymbol{x} = ([-0.1, 0.2], [0.8, 1.1])^T$,
and $\check{x} = (0.05, 0.95)^T$.
(a) Apply several iterations of the interval Gauss–Seidel method; interpret
your results.
(b) Apply several iterations of the Krawczyk method; interpret your
results.
(c) Apply several iterations of the interval Newton method you obtain
by using the linear system solution bounder verifylss in intlab.
9. Let F be as in Exercise 5 on page 324. Do several iterations of Broyden's
method, using the same starting points as you did for Exercise 5; observe
not only $x^{(k)}$, but also $B^{(k)}$. What do you observe? Do you observe
superlinear convergence?
10. Consider the nonlinear system of equations from Example 8.1 at the
beginning of this chapter. In Example 8.15 on page 319, we presented
mathematically proven enclosures for the four solutions of this nonlinear
system obtained with the GlobSol software system.
(a) Use the raw Newton’s method (as in newton_sys on page 296) with
different starting guesses, to see if you can find approximations to
each of the four solutions. Describe what you have found.
(b) Proceed as in part (a), but using fsolve from the matlab
optimization toolbox, without using a Jacobian matrix.
(c) Proceed as in part (b), but using fsolve, with the Jacobian matrix.
11. Satisfactory solutions to some (but not all) algebraic systems of
equations can be obtained symbolically, without roundoff error, using
computer algebra systems such as Mathematica or Maple. For example,
to solve the system in Example 8.15 in Mathematica, one possibility
would be to use Solve as follows.
solutions = Solve[{x1^2 + x1*x2^3 - 9 == 0,
3x1^2*x2 - x2^3 - 4 == 0},
{x1, x2}]
If you have access to a computer algebra system, try obtaining solutions
to the system in Example 8.15. Explain what you observe.
12. Proceeding as in Example 8.7 (on page 299), compute approximate
solutions to the boundary value problem $x'' = -e^x$, $x(0) = x(1) = 0$ with
N = 8, 50, 100, and 500. (Hint: Write a routine to do this, rather
than repeating commands in the command window. Also, use matlab’s
sparse matrix structure, for the computation will otherwise be too
lengthy.) Plot your results using matlab’s plot routine, and comment
on the results.
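The structure of such a routine can be sketched as follows, in Python rather than matlab. The discretization is an assumption, since Example 8.7 is not restated here: N interior points with spacing $h = 1/(N+1)$, the central difference $(x_{i-1} - 2x_i + x_{i+1})/h^2$ for $x''$, and Newton’s method on the resulting system. The Jacobian matrix is tridiagonal, so only its three diagonals are stored, which plays the role of matlab’s sparse matrix structure.

```python
import math


def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system without pivoting.

    lower[i] and upper[i] are the sub- and superdiagonal entries of row i
    (lower[0] and upper[-1] are ignored).
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def solve_bvp(N, tol=1e-8, max_iter=25):
    """Newton's method for x'' = -exp(x), x(0) = x(1) = 0 on N interior points."""
    h = 1.0 / (N + 1)
    off = 1.0 / (h * h)
    x = [0.0] * N                       # starting guess: x identically 0
    for _ in range(max_iter):
        F = []
        for i in range(N):
            left = x[i - 1] if i > 0 else 0.0       # boundary value x(0) = 0
            right = x[i + 1] if i < N - 1 else 0.0  # boundary value x(1) = 0
            F.append((left - 2.0 * x[i] + right) * off + math.exp(x[i]))
        if max(abs(v) for v in F) < tol:
            break
        diag = [-2.0 * off + math.exp(xi) for xi in x]
        step = thomas([off] * N, diag, [off] * N, [-v for v in F])
        x = [xi + si for xi, si in zip(x, step)]
    return x


if __name__ == "__main__":
    for N in (8, 50, 100, 500):
        print(f"N = {N:3d}: max of computed solution = {max(solve_bvp(N)):.6f}")
```

The work per Newton step grows only linearly with N, which is the point of the hint about exploiting sparsity.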
13. Consider the data from Example 8.11 (on page 311).
(a) Use the routine fminimax from matlab’s optimization toolbox to
find the minimax fit to the data. Do this by defining the functions
$f_i(t) = y(t_i) - y_i$, $i = 1, 2, 3$, and no constraints.
(b) Use fmincon to find the minimax solution. Do this by reformulating
the problem as a constrained optimization problem as follows:
minimize $v$
subject to:
$$\begin{aligned}
v &\ge y(t_1) - y_1, & v &\ge -(y(t_1) - y_1), \\
v &\ge y(t_2) - y_2, & v &\ge -(y(t_2) - y_2), \\
v &\ge y(t_3) - y_3, & v &\ge -(y(t_3) - y_3).
\end{aligned}$$
(c) Redo Example 8.11 in the following ways.
(i) Use fminunc with objective function
$$f(x) = \sum_{i=1}^{3} (y_i - y(t_i))^2.$$
(ii) Use lsqcurvefit from matlab’s optimization toolbox.
(iii) Use lsqnonlin from matlab’s optimization toolbox.
(d) Compare the solutions from (a) and (b). (They should be the same
to within the stopping tolerances used.) Also compare the three
solutions from (c); the solutions from (c)(i), (c)(ii), and (c)(iii)
should be the same to within the stopping tolerances.
(e) Plot the solutions from (b) and (c) and the points
$\{(t_i, y_i)\}_{i=1}^{3}$ on the same plot, using matlab’s plot routine.
(Examples of the use of plot are Example 4.11, page 167, etc.)
14. Consider the nonlinear systems in problems 6, 7, and from Example 8.1.
(a) If the symbolic math toolbox from matlab is available to you,
reprogram newton_sys (on page 296) to create a function
newton_sys_symbolic that does not need the argument f_prime.
Make the program as general as you can (that is, so that it possibly
handles n variables, with n not specified beforehand). Hint: you
may wish to use the function jacobian, and you may want to look
at the section “Generating Code from Symbolic Expressions” in the
symbolic math toolbox. Try your function on the three examples.
Does it work the way you expected? Are the generated functions
for the Jacobian matrix similar to the ones you would write by
hand?
(b) Use INTLAB’s “gradient” data type to modify newton_sys as in
part (a) of this problem, using automatic differentiation instead of
symbolic differentiation, to create a function
newton_sys_automatic that does not need the argument f_prime.
You can consult the “Gradients: automatic differentiation” section
of the INTLAB demo package for the syntax you can use. Try
newton_sys_automatic on the same three nonlinear systems as in
part (a). Does newton_sys_automatic give the same results as
newton_sys_symbolic and as newton_sys?
(c) Use central differences
$$\frac{\partial f_i}{\partial x_j}(x) \approx \frac{f_i(x_1, \ldots, x_j + h, \ldots, x_n) - f_i(x_1, \ldots, x_j - h, \ldots, x_n)}{2h}$$
in a routine newton_sys_central_difference that does not need
the argument f_prime. Does newton_sys_central_difference
give the same results as newton_sys_symbolic or newton_sys_automatic?
(Try it on the three problems, with different h.)
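The central-difference approximation of the Jacobian matrix in part (c) can be sketched as follows, in Python rather than matlab. Only the Jacobian approximation is shown, not the full Newton routine; the names `jacobian_central` and `f_example` are hypothetical, and the test function is the system of problem 6.

```python
def jacobian_central(f, x, h=1e-5):
    """Approximate the Jacobian matrix of f at x by central differences.

    f maps a list of n floats to a list of m floats; the result is an
    m-by-n list of lists with entry [i][j] approximating d f_i / d x_j.
    """
    n = len(x)
    cols = []
    for j in range(n):
        xp = list(x)
        xm = list(x)
        xp[j] += h                     # x with x_j replaced by x_j + h
        xm[j] -= h                     # x with x_j replaced by x_j - h
        fp, fm = f(xp), f(xm)
        cols.append([(a - b) / (2.0 * h) for a, b in zip(fp, fm)])
    m = len(cols[0])
    # cols[j][i] holds d f_i / d x_j; transpose into rows.
    return [[cols[j][i] for j in range(n)] for i in range(m)]


def f_example(x):
    """The system of problem 6, used here only as a test function."""
    x1, x2 = x
    return [x1 * x1 - 10.0 * x1 + x2 * x2 + 8.0,
            x1 * x2 * x2 + x1 - 10.0 * x2 + 8.0]


if __name__ == "__main__":
    for h in (1e-1, 1e-3, 1e-5, 1e-8, 1e-12):
        print(f"h = {h:.0e}: {jacobian_central(f_example, [0.5, 0.5], h)}")
```

Running the driver with decreasing h illustrates the trade-off between truncation error, which shrinks like $h^2$, and roundoff error, which grows like $1/h$.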
References
[1] Azmy S. Ackleh, Edward J. Allen, Ralph Baker Kearfott, and Padman-
abhan Seshaiyer. Classical and Modern Numerical Analysis: Theory,
Methods, and Practice. Taylor and Francis, Boca Raton, Florida, 2009.
[2] Azmy S. Ackleh, Jacoby Carter, Lauren Cole, Tom Nguyen, Jay Monte,
and Claire Pettit. Measuring and modeling the seasonal changes of an
urban green treefrog (Hyla cinerea) population. Ecological Modelling,
221(2):281 – 289, 2010.
[3] Azmy S. Ackleh and Patrick De Leenheer. Discrete three-stage popula-
tion model: persistence and global stability results. Journal of Biological
Dynamics, 2(4):415–427, October 2008.
[4] Linda J. S. Allen. An Introduction to Mathematical Biology. Pearson
Prentice Hall, New Jersey, 2006.
[5] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and
D. Sorensen. LAPACK’s User’s Guide. Society for Industrial and Ap-
plied Mathematics, Philadelphia, PA, USA, 1992.
[6] R. M. Anderson and R. M. May. Infectious Diseases of Humans, Dy-
namics and Control. Oxford University Press, Oxford, 1991.
[7] N. S. (Asai) Asaithambi. Numerical Analysis Theory and Practice. Har-
court Brace College Publishers, Orlando, Florida, February 1995.
[8] Martin Berz, Kyoko Makino, Khodr Shamseddine, and Weishi Wan.
Modern Map Methods in Particle Beam Physics. Academic Press, San
Diego, 1999.
[9] Garrett Birkhoﬀ and Carl de Boor. Error bounds for spline interpolation.
Journal of Mathematics and Mechanics, 13:827–836, 1964.
[10] George F. Corliss and Louis B. Rall. Adaptive, self-validating numeri-
cal quadrature. SIAM Journal on Scientiﬁc and Statistical Computing,
8(5):831–847, September 1987.
[11] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Un-
constrained Optimization and Nonlinear Equations, volume 16 of Clas-
sics in Applied Mathematics. SIAM, Philadelphia, PA, 1996.
329
330 References
[12] John E. Dennis, Jr. and Robert B. Schnabel. Least change secant up-
dates for quasi-Newton methods. SIAM Review, 21(4):443–459, October
1979.
[13] Iain S. Duﬀ, Albert M. Erisman, C. William Gear, and John K. Reid.
Sparsity structure and Gaussian elimination. SIGNUM Newsl., 23(2):2–
8, 1988.
[14] Leah Edulstein-Kechet. Mathematical Models in Biology. SIAM,
Philadelphia, 2005.
[15] George E. Forsythe, Michael A. Malcolm, and Cleve B. Moler. Com-
puter Methods for Mathematical Computations. Prentice–Hall Profes-
sional Technical Reference, 1977.
[16] Gene H. Golub and Charles F. van Loan. Matrix Computations. Johns
Hopkins University Press, third edition, 1996.
[17] Andreas Griewank. Evaluating Derivatives: Principles and Techniques
of Algorithmic Diﬀerentiation. Number 19 in Frontiers in Appl. Math.
SIAM, Philadelphia, PA, 2000.
[18] Richard Wesley Hamming. Numerical Methods for Scientists and Engi-
neers (second edition). Dover Publications, Inc., New York, NY, USA,
1986 (originally 1973).
[19] Alan Jennings. Matrix Computation for Engineers and Scientists. Wiley,
New York, NY, USA, 1977.
[20] Ralph Baker Kearfott. Rigorous Global Search: Continuous Problems.
Number 13 in Nonconvex optimization and its applications. Kluwer Aca-
demic Publishers, Dordrecht, Netherlands, 1996.
[21] Ralph Baker Kearfott. GlobSol user guide. Optimization Methods and
Software, 24(4-5):687–708, August 2009.
[22] Ralph Baker Kearfott and G. William Walster. On stopping criteria in
veriﬁed nonlinear systems or optimization algorithms. ACM Transac-
tions on Mathematical Software, 26(3):373–389, September 2000.
[23] David Kincaid and Ward Cheney. Numerical Analysis: Mathematics of
Scientiﬁc Computing. Brooks / Cole, Paciﬁc Grove, California, third
edition, 2002.
[24] Peter Lancaster and Kestutis Salkauskas. Curve and Surface Fitting:
An Introduction. Academic Press, London, 1986.
[25] Cleve B. Moler. Numerical Computing with matlab. Society for Indus-
trial and Applied Mathematics, January 2004.
[26] R. E. Moore, R. B. Kearfott, and M. J. Cloud. Introduction to Interval
Analysis. SIAM, Philadelphia, PA, 2009.
References 331
[27] Ramon E. Moore. Methods and Applications of Interval Analysis. SIAM,
Philadelphia, PA, USA, 1979.
[28] Alexander Morgan and Andrew Sommese. A homotopy for solving gen-
eral polynomial systems that respects m-homogenous structures. Appl.
Math. Comput., 24(2):101–113, 1987.
[29] Arnold Neumaier. Interval Methods for Systems of Equations, volume 37
of Encyclopedia of Mathematics and its Applications. Cambridge Uni-
versity Press, Cambridge, UK, 1990.
[30] James. M. Ortega. Numerical Analysis: A Second Course. Academic
Press, New York, NY, USA, 1972.
[31] John Derwent Pryce and George F. Corliss. Interval arithmetic with
containment sets. Computing, 78(3):251–276, 2006.
[32] Siegfried M. Rump. Veriﬁcation methods for dense and sparse systems
of equations. In J¨ urgen Herzberger, editor, Topics in Validated Com-
putations: proceedings of IMACS-GAMM International Workshop on
Validated Computation, Oldenburg, Germany, 30 August–3 September
1993, volume 5 of Studies in Computational Mathematics, pages 63–136,
Amsterdam, The Netherlands, 1994. Elsevier.
[33] Siegfried M. Rump. INTLAB–INTerval LABoratory. In Tibor Csendes,
editor, Developments in Reliable Computing: Papers presented at the In-
ternational Symposium on Scientiﬁc Computing, Computer Arithmetic,
and Validated Numerics, SCAN-98, in Szeged, Hungary, volume 5(3)
of Reliable Computing, pages 77–104, Dordrecht, Netherlands, 1999.
Kluwer Academic Publishers. URL: http://www.ti3.tu-harburg.de/
rump/intlab/.
[34] Martin H. Schultz. Spline Analysis. Prentice–Hall, Englewood Cliﬀs, NJ
USA, 1973.
[35] Gilbert W. Stewart. Introduction to Matrix Computations. Academic
Press, New York, NY, USA, 1973.
[36] Josef Stoer and Roland Bulirsch. Introduction to Numerical Analysis.
Springer, New York, 1980. A third edition, 2002, is available.
[37] Friedrich Stummel. Forward error analysis of Gaussian elimination. I.
error and residual estimates. Numerische Mathematik, 46(3):365–395,
June 1985.
[38] Friedrich Stummel. Forward error analysis of Gaussian elimination. II.
stability theorems. Numerische Mathematik, 46(3):397–415, June 1985.
[39] Richard S. Varga. Matrix Iterative Analysis. Springer, New York, NY,
USA, second edition, 2000.
[40] David S. Watkins. Fundamentals of Matrix Computations. Wiley, New
York, NY, USA, 1991. A second edition is available, 2002.
[41] Burton Wendroﬀ. Theoretical Numerical Analysis. Academic Press,
Englewood Cliﬀs, NJ, USA, 1966.
[42] James Hardy Wilkinson. Rounding Errors in Algebraic Processes.
Prentice–Hall, Englewood Cliﬀs, NJ, USA, 1963.
[43] Steven M. Wise, Andrew J. Sommese, and Layne T. Watson. Algorithm
801: POLSYS PLP: a partitioned linear product homotopy code for
solving polynomial systems of equations. ACM Transactions on Mathe-
matical Software, 26(1):176–200, March 2000.
[44] David M. Young. Iterative Solution of Large Linear Systems. Academic
Press, New York, NY, USA, 1971.
