Fundamentals of Deep Learning

Designing Next Generation Artificial Intelligence Algorithms

Nikhil Buduma

Boston

Fundamentals of Deep Learning
by Nikhil Buduma
Copyright © 2015 Nikhil Buduma. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles ( http://safaribooksonline.com ). For more information, contact our corporate/
institutional sales department: 800-998-9938 or [email protected] .

Editors: Mike Loukides and Shannon Cutt
Production Editor: FILL IN PRODUCTION EDITOR
Copyeditor: FILL IN COPYEDITOR
Proofreader: FILL IN PROOFREADER
Indexer: FILL IN INDEXER
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

November 2015: First Edition

Revision History for the First Edition
2015-06-12: First Early Release
2015-07-23: Second Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491925614 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Fundamentals of Deep Learning, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author(s) have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibil‐
ity for errors or omissions, including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.

978-1-491-92561-4
[FILL IN]

Table of Contents

1. The Neural Network
    Building Intelligent Machines
    The Limits of Traditional Computer Programs
    The Mechanics of Machine Learning
    The Neuron
    Expressing Linear Perceptrons as Neurons
    Feed-forward Neural Networks
    Linear Neurons and their Limitations
    Sigmoid, Tanh, and ReLU Neurons
    Softmax Output Layers
    Looking Forward

2. Training Feed-Forward Neural Networks
    The Cafeteria Problem
    Gradient Descent
    The Delta Rule and Learning Rates
    Gradient Descent with Sigmoidal Neurons
    The Backpropagation Algorithm
    Stochastic and Mini-Batch Gradient Descent
    Test Sets, Validation Sets, and Overfitting
    Preventing Overfitting in Deep Neural Networks
    Summary

3. Implementing Neural Networks in Theano
    What is Theano and why are we using it?
    Installing Theano
    Basic Algebra in Theano
    Theano Graph Structures
    Shared Variables and Side-Effects
    Randomness in Theano
    Computing Derivatives Symbolically
    Expressing a Logistic Regression Network in Theano
    Using Theano to Train a Logistic Regression Network
    Multilayer Models in Theano
    Summary

CHAPTER 1

The Neural Network

Building Intelligent Machines
The brain is the most incredible organ in the human body. It dictates the way we per‐
ceive every sight, sound, smell, taste, and touch. It enables us to store memories,
experience emotions, and even dream. Without it, we would be primitive organ‐
isms, incapable of anything other than the simplest of reflexes. The brain is, inher‐
ently, what makes us intelligent.
The infant brain weighs only a single pound, but somehow it solves problems that even our biggest, most powerful supercomputers find impossible. Within a matter of
days after birth, infants can recognize the faces of their parents, discern discrete
objects from their backgrounds, and even tell apart voices. Within a year, they’ve
already developed an intuition for natural physics, can track objects even when they
become partially or completely blocked, and can associate sounds with specific mean‐
ings. And by early childhood, they have a sophisticated understanding of grammar
and thousands of words in their vocabularies.
For decades, we’ve dreamed of building intelligent machines with brains like ours: robotic assistants to clean our homes, cars that drive themselves, microscopes that automatically detect diseases. But building these artificially intelligent machines requires us to solve some of the most complex computational problems we have ever grappled with, problems that our brains can already solve in a matter of microseconds. To tackle these problems, we’ll have to develop a radically different way of programming a computer, using techniques largely developed over the past decade. This is an extremely active field of artificial intelligence often referred to as deep learning.

The Limits of Traditional Computer Programs
Why exactly are certain problems so difficult for computers to solve? Well, it turns out that traditional computer programs are designed to be very good at two things: 1) performing arithmetic really fast and 2) explicitly following a list of instructions. So if
you want to do some heavy financial number crunching, you’re in luck. Traditional
computer programs can do just the trick. But let’s say we want to do something
slightly more interesting, like write a program to automatically read someone’s hand‐
writing.

Figure 1-1. Image from MNIST handwritten digit dataset
Although every digit in Figure 1-1 is written in a slightly different way, we can easily
recognize every digit in the first row as a zero, every digit in the second row as a one,
etc. Let’s try to write a computer program to crack this task. What rules could we use
to tell a one digit from another?
Well, we can start simple! For example, we might state that we have a zero if our image only has a single closed loop. All the examples in Figure 1-1 seem to fit the bill, but this isn’t really a sufficient condition. What if someone doesn’t perfectly close the loop on their zero? And, as in Figure 1-2, how do you distinguish a messy zero from a six?

Figure 1-2. A zero that’s difficult to distinguish from a six algorithmically
You could potentially establish some sort of cutoff for the distance between the start‐
ing point of the loop and the ending point, but it’s not exactly clear where we should
be drawing the line. But this dilemma is only the beginning of our worries. How do
we distinguish between threes and fives? Or between fours and nines? We can add
more and more rules, or features, through careful observation and months of trial and
error, but it’s quite clear that this isn’t going to be an easy process.
There are many other classes of problems that fall into this same category: object recognition, speech comprehension, automated translation, etc. We don’t know what program to write because we don’t know how our brains do it. And even if we did know how to do it, the program might be horrendously complicated.

The Mechanics of Machine Learning
To tackle these classes of problems, we’ll have to use a very different kind of approach. Many of the things we learn in school growing up have a lot in common with traditional computer programs. We learn how to multiply numbers, solve equations, and take derivatives by internalizing a set of instructions. But the things we learn at an extremely early age, the things we find most natural, are learned by example, not by formula.
For example, when we were two years old, our parents didn’t teach us how to recog‐
nize a dog by measuring the shape of its nose or the contours of its body. We learned
to recognize a dog by being shown multiple examples and being corrected when we
made the wrong guess. In other words, when we were born, our brains provided us
with a model that described how we would be able to see the world. As we grew up, that model would take in our sensory inputs and make a guess about what we were experiencing. If that guess was confirmed by our parents, our model would be reinforced.
If our parents said we were wrong, we’d modify our model to incorporate this new
information. Over our lifetime, our model becomes more and more accurate as we
assimilate billions of examples. Obviously all of this happens subconsciously, without
us even realizing it, but we can use this to our advantage nonetheless.
Deep learning is a subset of a more general field of artificial intelligence called machine learning, which is predicated on this idea of learning from example. In machine learning, instead of teaching a computer a massive list of rules to solve the problem, we give it a model with which it can evaluate examples and a small set of instructions to modify the model when it makes a mistake. We expect that, over time, a well-suited model would be able to solve the problem extremely accurately.
Let’s be a little bit more rigorous about what this means so we can formulate this idea mathematically. Let’s define our model to be a function h(x, θ). The input x is an example expressed in vector form. For example, if x were a greyscale image, the vector’s components would be pixel intensities at each position, as shown in Figure 1-3.

Figure 1-3. The process of vectorizing an image for a machine learning algorithm
The input θ is a vector of the parameters that our model uses. Our machine learning
program tries to perfect the values of these parameters as it is exposed to more and
more examples. We’ll see this in action in more detail in a future section.
To develop a more intuitive understanding of machine learning models, let’s walk through a quick example. Let’s say we wanted to determine how to predict exam performance based on the number of hours of sleep we get and the number of hours we study the previous day. We collect a lot of data, and for each data point x = [x1, x2]^T, we record the number of hours of sleep we got (x1), the number of hours we spent studying (x2), and whether we performed above or below the class average. Our goal, then, might be to learn a model h(x, θ) with parameter vector θ = [θ0, θ1, θ2]^T such that:

$$h(\mathbf{x}, \theta) = \begin{cases} -1 & \text{if } \mathbf{x}^T \cdot \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} + \theta_0 < 0 \\ 1 & \text{if } \mathbf{x}^T \cdot \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} + \theta_0 \ge 0 \end{cases}$$

In other words, we guess that the blueprint for our model h(x, θ) is as described above (geometrically, this particular blueprint describes a linear classifier that divides the coordinate plane into two halves). Then, we want to learn a parameter vector θ such that our model makes the right predictions (-1 if we perform below average, and 1 otherwise) given an input example x. This model is called a linear perceptron. Let’s assume our data is as shown in Figure 1-4.
Then it turns out, by selecting θ = [−24, 3, 4]^T, our machine learning model makes the correct prediction on every data point:

$$h(\mathbf{x}, \theta) = \begin{cases} -1 & \text{if } 3x_1 + 4x_2 - 24 < 0 \\ 1 & \text{if } 3x_1 + 4x_2 - 24 \ge 0 \end{cases}$$

An optimal parameter vector θ positions the classifier so that we make as many cor‐
rect predictions as possible. In most cases, there are many (or even infinitely
many) possible choices for θ that are optimal. Fortunately for us, most of the time
these alternatives are so close to one another that the difference in their performance
is negligible. If this is not the case, we may want to collect more data to narrow our
choice of θ.
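To make this concrete, here is a minimal sketch in Python of how this linear perceptron would make predictions (our own illustration, not from the text; the function name predict and the example inputs are ours, using the θ = [−24, 3, 4]^T selected above):

import numpy as np

def predict(x, theta):
    # theta[0] is the bias; theta[1] and theta[2] weight hours of sleep and study
    return 1 if np.dot(x, theta[1:]) + theta[0] >= 0 else -1

theta = np.array([-24.0, 3.0, 4.0])
print(predict(np.array([6.0, 2.0]), theta))  # 3*6 + 4*2 - 24 = 2  -> predicts  1
print(predict(np.array([4.0, 1.0]), theta))  # 3*4 + 4*1 - 24 = -8 -> predicts -1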


Figure 1-4. Sample data for our exam predictor algorithm and a potential classifier
This is pretty cool, but there are still some significant questions that remain.
First off, how do we even come up with an optimal value for the parameter vec‐
tor θ in the first place? Solving this problem requires a technique commonly known
as optimization. An optimizer aims to maximize the performance of a machine learn‐
ing model by iteratively tweaking its parameters until the error is minimized. We’ll
begin to tackle this question of learning parameter vectors in more detail in the next
chapter, when we describe the process of gradient descent. In later chapters, we’ll try
to find ways to make this process even more efficient.
Second, it’s quite clear that this particular model (the linear perceptron model) is
quite limited in the relationships it can learn. For example, the distributions of data
shown in Figure 1-5 cannot be described well by a linear perceptron.

Figure 1-5. As our data takes on more complex forms, we need more complex models to
describe them
But these situations are only the tip of the iceberg. As we move on to much more complex problems such as object recognition and text analysis, our data not only
becomes extremely high dimensional, but the relationships we want to capture also
become highly nonlinear. To accommodate this complexity, recent research in
machine learning has attempted to build models that highly resemble the structures
utilized by our brains. It’s essentially this body of research, commonly referred to
as deep learning, that has had spectacular success in tackling problems in computer
vision and natural language processing. These algorithms not only far surpass other
kinds of machine learning algorithms, but also rival (or even exceed!) the accuracies
achieved by humans.

The Neuron
The foundational unit of the human brain is the neuron. A tiny piece of the brain, about the size of a grain of rice, contains over 10,000 neurons, each of which forms an
average of 6,000 connections with other neurons. It’s this massive biological network
that enables us to experience the world around us. Our goal in this section will be to
use this natural structure to build machine learning models that solve problems in an
analogous way.
At its core, the neuron is optimized to receive information from other neurons, pro‐
cess this information in a unique way, and send its result to other cells. This process is
summarized in Figure 1-6. The neuron receives its inputs along antennae-like struc‐
tures called dendrites. Each of these incoming connections is dynamically strength‐
ened or weakened based on how often it is used (this is how we learn new concepts!),
and it’s the strength of each connection that determines the contribution of the input
to the neuron’s output. After being weighted by the strength of their respective connections, the inputs are summed together in the cell body. This sum is then transformed into a new signal that’s propagated along the cell’s axon and sent off to other neurons.

Figure 1-6. A functional description of a biological neuron’s structure
We can translate this functional understanding of the neurons in our brain into an
artificial model that we can represent on our computer. Such a model is described
in Figure 1-7. Just as in biological neurons, our artificial neuron takes in some number of inputs, x1, x2, ..., xn, each of which is multiplied by a specific weight, w1, w2, ..., wn. These weighted inputs are, as before, summed together to produce the logit of the neuron, $z = \sum_{i=1}^{n} w_i x_i$. In many cases, the logit also includes a bias, which is a constant (not shown in the figure). The logit is then passed through a function f to produce the output y = f(z). This output can be transmitted to other neurons.

Figure 1-7. Schematic for a neuron in an artificial neural net

In Example 1-1, we show how a neuron might be implemented in Python. A few
quick notes on implementation. Throughout this book, we’ll be constantly using a
couple of libraries to make our lives easier. One of these is NumPy, a fundamental
library for scientific computing. Among other things, NumPy will allow us to quickly
manipulate matrices and vectors with ease. In Example 1-1, NumPy enables us to
painlessly take the dot product of two vectors (inputs and self.weights). Another
library that we will use further down the road is Theano. Theano integrates closely
with NumPy and allows us to define, optimize, and evaluate mathematical expres‐
sions. These two libraries will serve as a foundation for tools we explore in future
chapters, so it’s worth taking some time to gain some familiarity with them.
Example 1-1. Neuron Implementation

import numpy as np

#####################################################
# Assume inputs and weights are 1-dimensional numpy #
# arrays and bias is a number                       #
#####################################################

class Neuron:
    def __init__(self, weights, bias, function):
        self.weights = weights
        self.bias = bias
        self.function = function

    def forward(self, inputs):
        logit = np.dot(inputs, self.weights) + self.bias
        output = self.function(logit)
        return output
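As a quick usage sketch (our own example, not from the book), we can wire this class up as the exam-performance perceptron from Figure 1-8 by passing the thresholding function as the activation:

import numpy as np

# threshold activation: -1 for negative logits, 1 otherwise
step = lambda z: 1 if z >= 0 else -1

neuron = Neuron(weights=np.array([3.0, 4.0]), bias=-24.0, function=step)
print(neuron.forward(np.array([6.0, 2.0])))  # logit = 3*6 + 4*2 - 24 = 2, so prints 1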

Expressing Linear Perceptrons as Neurons
In the previous section, we talked about using machine learning models to capture the relationship between success on exams and time spent studying and sleeping. To tackle this problem, we constructed a linear perceptron classifier that divided the Cartesian coordinate plane into two halves:

$$h(\mathbf{x}, \theta) = \begin{cases} -1 & \text{if } 3x_1 + 4x_2 - 24 < 0 \\ 1 & \text{if } 3x_1 + 4x_2 - 24 \ge 0 \end{cases}$$


As shown in Figure 1-4, this is an optimal choice for θ because it correctly classifies every sample in our dataset. Here, we show that our model h is easily expressed using a neuron. Consider the neuron depicted in Figure 1-8. The neuron has two inputs, a bias, and uses the function:

$$f(z) = \begin{cases} -1 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$$

It’s very easy to show that our linear perceptron and the neuronal model are perfectly equivalent. And in general, it’s quite simple to show that single neurons are strictly more expressive than linear perceptrons. In other words, every linear perceptron can be expressed as a single neuron, but single neurons can also express models that cannot be expressed by any linear perceptron.

Figure 1-8. Expressing our exam performance perceptron as a neuron

Feed-forward Neural Networks
Although single neurons are more powerful than linear perceptrons, they’re not
nearly expressive enough to solve complicated learning problems. There’s a reason
our brain is made of more than one neuron. For example, it is impossible for a single
neuron to differentiate hand-written digits. So to tackle much more complicated
tasks, we’ll have to take our machine learning model even further.
The neurons in the human brain are organized in layers. In fact the human cerebral
cortex (the structure responsible for most of human intelligence) is made of six lay‐
ers. Information flows from one layer to another until sensory input is converted into
conceptual understanding. For example, the bottom-most layer of the visual cortex
receives raw visual data from the eyes. This information is processed by each layer
and passed onto the next until, in the sixth layer, we conclude whether we are looking
at a cat, or a soda can, or an airplane.

Figure 1-9. A simple example of a feed-forward neural network with 3 layers (input, one
hidden, and output) and 3 neurons per layer
Borrowing these concepts, we can construct an artificial neural network. A neural
network comes about when we start hooking up neurons to each other, the input
data, and to the output nodes, which correspond to the network’s answer to a learn‐
ing problem. Figure 1-9 demonstrates a simple example of an artificial neural net‐
work. The bottom layer of the network pulls in the input data. The top layer of
neurons (output nodes) computes our final answer. The middle layer(s) of neurons are called the hidden layers, and we let $w^{(k)}_{i,j}$ be the weight of the connection between the i-th neuron in the k-th layer and the j-th neuron in the (k+1)-st layer. These weights constitute our parameter vector, θ, and just as before, our ability to solve problems with neural networks depends on finding the optimal values to plug into θ.


We note that in this example, connections only traverse from a lower layer to a higher
layer. There are no connections between neurons in the same layer, and there are no
connections that transmit data from a higher layer to a lower layer. These neural net‐
works are called feed-forward networks, and we start by discussing these networks
because they are the simplest to analyze. We present this analysis (specifically, the
process of selecting the optimal values for the weights) in the next chapter. More
complicated connectivities will be addressed in later chapters.
In the final sections, we’ll discuss the major types of layers that are utilized in feed-forward neural networks. But before we proceed, here are a couple of important notes to keep in mind:
1. As we mentioned above, the layers of neurons that lie sandwiched between the first layer of neurons (input layer) and the last layer of neurons (output layer) are called the hidden layers. This is where most of the magic is happening when the neural net tries to solve problems. Whereas (as in the handwritten digit example) we would previously have to spend a lot of time identifying useful features, the hidden layers automate this process for us. Oftentimes, taking a look at the activities of hidden layers can tell you a lot about the features the network has automatically learned to extract from the data.
2. Although, in this example, every layer has the same number of neurons, this is neither necessary nor recommended. More often than not, hidden layers have fewer neurons than the input layer to force the network to learn compressed representations of the original input. For example, while our eyes obtain raw
representations of the original input. For example, while our eyes obtain raw
pixel values from our surroundings, our brain thinks in terms of edges and con‐
tours. This is because the hidden layers of biological neurons in our brain force
us to come up with better representations for everything we perceive.
3. It is not required that every neuron has its output connected to the inputs of all
neurons in the next layer. In fact, selecting which neurons to connect to which
other neurons in the next layer is an art that comes from experience. We’ll dis‐
cuss this issue in more depth as we work through various examples of neural net‐
works.
4. The inputs and outputs are vectorized representations. For example, you might imagine a neural network where the inputs are the individual pixel RGB values in an image represented as a vector (refer to Figure 1-3). The last layer might have 2 neurons that correspond to the answer to our problem: [1, 0] if the image contains a dog, [0, 1] if the image contains a cat, [1, 1] if it contains both, and [0, 0] if it contains neither.


Linear Neurons and their Limitations
Most neuron types are defined by the function f they apply to their logit z. Let’s first consider layers of neurons that use a linear function of the form f(z) = az + b. For example, a neuron that attempts to estimate the cost of a meal in a cafeteria would use a linear neuron where a = 1 and b = 0. In other words, using f(z) = z and weights equal to the price of each item, the linear neuron in Figure 1-10 would take in some ordered triple of servings of burgers, fries, and sodas, and output the price of the combination.

Figure 1-10. An example of a linear neuron
Linear neurons are easy to compute with, but they run into serious limitations. In
fact, it can be shown that any feed-forward neural network consisting of only linear
neurons can be expressed as a network with no hidden layers. This is problematic
because as we discussed before, hidden layers are what enable us to learn important
features from the input data. In other words, in order to learn complex relationships,
we need to use neurons that employ some sort of nonlinearity.
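To see why, consider a one-line derivation (a sketch assuming f(z) = z and no biases) for two consecutive linear layers with weight matrices W^(1) and W^(2):

$$\mathbf{y} = W^{(2)} \left( W^{(1)} \mathbf{x} \right) = \left( W^{(2)} W^{(1)} \right) \mathbf{x}$$

The two layers are exactly equivalent to a single layer with weight matrix W^(2) W^(1), and by induction any stack of linear layers collapses the same way.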

Sigmoid, Tanh, and ReLU Neurons
There are three major types of neurons used in practice that introduce nonlinearities into their computations. The first of these is the sigmoid neuron, which uses the function:


$$f(z) = \frac{1}{1 + e^{-z}}$$

Figure 1-11. The output of a sigmoid neuron as z varies
Intuitively, this means that when the logit is very small, the output of a logistic neuron is very close to 0. When the logit is very large, the output of the logistic neuron is close to 1. In between these two extremes, the output assumes an S-shape, as shown in Figure 1-11.

Figure 1-12. The output of a tanh neuron as z varies

Tanh neurons use a similar kind of S-shaped nonlinearity, but instead of ranging from 0 to 1, the output of tanh neurons ranges from -1 to 1. As one would expect, they use f(z) = tanh(z). The resulting relationship between the output y and the logit z is described by Figure 1-12. When S-shaped nonlinearities are used, the tanh neuron is often preferred over the sigmoid neuron because it is zero-centered.
A different kind of nonlinearity is used by the rectified linear unit (ReLU) neuron. It uses the function f(z) = max(0, z), resulting in a characteristic hockey-stick-shaped response, as shown in Figure 1-13.

Figure 1-13. The output of a ReLU neuron as z varies
The ReLU has recently become the neuron of choice for many tasks (especially in computer vision) for a number of reasons, despite some drawbacks. We’ll discuss these reasons in Chapter 5, as well as strategies to combat the potential pitfalls.
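As a small illustration (our own sketch, not from the text), each of these three nonlinearities is a one-liner in NumPy, and any of them can be passed as the function argument of the Neuron class from Example 1-1:

import numpy as np

def sigmoid(z):
    # squashes the logit into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # zero-centered S-shape in the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # zero for negative logits, identity for positive ones
    return np.maximum(0, z)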

Softmax Output Layers
Oftentimes, we want our output vector to be a probability distribution over a set of mutually exclusive labels. For example, let’s say we want to build a neural network to recognize handwritten digits from the MNIST data set. Each label (0 through 9) is mutually exclusive, but it’s unlikely that we will be able to recognize digits with 100% confidence. Using a probability distribution gives us a better idea of how confident we are in our predictions. As a result, the desired output vector is of the form below, where $\sum_{i=0}^{9} p_i = 1$:

$$\left[ \, p_0 \;\; p_1 \;\; p_2 \;\; p_3 \;\; \ldots \;\; p_9 \, \right]$$
This is achieved by using a special output layer called a softmax layer. Unlike in other kinds of layers, the output of a neuron in a softmax layer depends on the outputs of all of the other neurons in its layer. This is because we require the sum of all the outputs to be equal to 1. Letting $z_i$ be the logit of the i-th softmax neuron, we can achieve this normalization by setting its output to:

$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

A strong prediction would have a single entry in the vector close to 1, while the remaining entries are close to 0. A weak prediction would have multiple possible labels that are more or less equally likely.
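Here is a minimal NumPy sketch of this normalization (our own illustration; subtracting the maximum logit before exponentiating is a standard numerical-stability trick not mentioned in the text, and it leaves the result unchanged):

import numpy as np

def softmax(z):
    # shift logits so the largest is 0, preventing overflow in np.exp
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

z = np.array([1.0, 2.0, 5.0])
print(softmax(z))          # a strong prediction: mass concentrates on the third label
print(np.sum(softmax(z)))  # -> 1.0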

Looking Forward
In this chapter, we’ve built a basic intuition for machine learning and neural net‐
works. We’ve talked about the basic structure of a neuron, how feed-forward neural
networks work, and the importance of nonlinearity in tackling complex learning
problems. In the next chapter we will begin to build the mathematical background
necessary to train a neural network to solve problems. Specifically, we will talk about
finding optimal parameter vectors, best practices while training neural networks, and
major challenges. In future chapters, we will take these foundational ideas to build
more specialized neural architectures.


CHAPTER 2

Training Feed-Forward Neural Networks

The Cafeteria Problem
We’re beginning to understand how we can tackle some interesting problems using deep learning, but one big question still remains: how exactly do we figure out what the parameter vectors (the weights for all of the connections in our neural network) should be? This is accomplished by a process commonly referred to as training. During training, we show the neural net a large number of training examples and iteratively modify the weights to minimize the errors we make on the training examples. After enough examples, we expect that our neural network will be quite effective at solving the task it’s been trained to do.

Figure 2-1. This is the neuron we want to train for the Dining Hall Problem

Let’s continue with the example we mentioned in the previous chapter involving a lin‐
ear neuron. As a brief review, every single day, we purchase a meal from the dining
hall consisting of burgers, fries, and sodas. We buy some number of servings for each
item. We want to be able to predict how much a meal is going to cost us, but the items
don’t have price tags. The only thing the cashier will tell us is the total price of the
meal. We want to train a single linear neuron to solve this problem. How do we do it?
One idea is to be smart about picking our training cases. For one meal we could buy
only a single serving of burgers, for another we could only buy a single serving of
fries, and then for our last meal we could buy a single serving of soda. In general,
choosing smart training cases is a very good idea. There’s lots of research that shows
that by engineering a clever training set, you can make your neural network a lot
more effective. The issue with this approach is that in real situations, it rarely ever
gets you close to the solution. For example, there’s no clear analog of this strategy in
image recognition. It’s just not a practical solution.
Instead, we try to motivate a solution that works well in general. Let’s say we have a
bunch of training examples. Then we can calculate what the neural network will out‐
put on the ith training example using the simple formula in the diagram. We want to
train the neuron so that we pick the optimal weights possible: the weights that minimize the errors we make on the training examples. In this case, let’s say we want to minimize the squared error over all of the training examples that we encounter. More formally, if we know that t^(i) is the true answer for the i-th training example and y^(i) is the value computed by the neural network, we want to minimize the value of the error function E:

$$E = \frac{1}{2} \sum_i \left( t^{(i)} - y^{(i)} \right)^2$$

The squared error is zero when our model makes a perfectly correct prediction on
every training example. Moreover, the closer E is to 0, the better our model is. As a
result, our goal will be to select our parameter vector θ (the values for all the weights
in our model) such that E is as close to 0 as possible.
Now at this point you might be wondering why we need to bother ourselves with
error functions when we can treat this problem as a system of equations. After all, we
have a bunch of unknowns (weights) and we have a set of equations (one for each training example). That would automatically give us an error of 0, assuming that we have a consistent set of training examples.
That’s a smart observation, but the insight unfortunately doesn’t generalize well.
Remember that although we’re using a linear neuron here, linear neurons aren’t used
very much in practice because they’re constrained in what they can learn. And the
moment we start using nonlinear neurons like the sigmoidal, tanh, or ReLU neurons
we talked about at the end of the previous chapter, we can no longer set up a system
of equations! Clearly we need a better strategy to tackle the training process.

Gradient Descent
Let’s visualize how we might minimize the squared error over all of the training
examples by simplifying the problem. Let’s say our linear neuron only has two inputs
(and thus only two weights, w1 and w2). Then we can imagine a 3-dimensional space
where the horizontal dimensions correspond to the weights w1 and w2, and the verti‐
cal dimension corresponds to the value of the error function E. In this space, points
in the horizontal plane correspond to different settings of the weights, and the height
at those points corresponds to the incurred error. If we consider the errors we make
over all possible weights, we get a surface in this 3-dimensional space, in particular, a
quadratic bowl as shown in Figure 2-2.


Figure 2-2. The quadratic error surface for a linear neuron
We can also conveniently visualize this surface as a set of elliptical contours, where the minimum error is at the center of the ellipses. In this setup, we are working in a 2-dimensional plane where the dimensions correspond to the two weights. Contours correspond to settings of w1 and w2 that evaluate to the same value of E. The closer the contours are to each other, the steeper the slope. In fact, it turns out that the direction of steepest descent is always perpendicular to the contours. This direction is expressed as a vector known as the gradient.


Figure 2-3. Visualizing the error surface as a set of contours
Now we can develop a high-level strategy for finding the values of the weights that minimize the error function. Suppose we randomly initialize the weights of our
network so we find ourselves somewhere on the horizontal plane. By evaluating the
gradient at our current position, we can find the direction of steepest descent and we
can take a step in that direction. Then we’ll find ourselves at a new position that’s
closer to the minimum than we were before. We can re-evaluate the direction of
steepest descent by taking the gradient at this new position and taking a step in this
new direction. It’s easy to see that, as shown in Figure 2-3, following this strategy will
eventually get us to the point of minimum error. This algorithm is known as gradient
descent, and we’ll use it to tackle the problem of training individual neurons and the
more general challenge of training entire networks.
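In pseudocode, this strategy is just a loop. The sketch below is our own illustration: compute_gradient stands in for whatever procedure evaluates the gradient of E at the current weights, and the step-size factor is the learning rate discussed in the next section:

def gradient_descent(weights, compute_gradient, learning_rate=0.01, num_steps=1000):
    # repeatedly step in the direction of steepest descent
    for _ in range(num_steps):
        gradient = compute_gradient(weights)
        weights = weights - learning_rate * gradient
    return weights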

The Delta Rule and Learning Rates
Before we derive the exact algorithm for training our cafeteria neuron, we make a
quick note on hyperparameters. In addition to the weight parameters defined in our
neural network, learning algorithms also require a couple of additional parameters to
carry out the training process. One of these so-called hyperparameters is the learning
rate.


In practice, at each step of moving perpendicular to the contour, we need to determine how far we want to walk before recalculating our new direction. This distance needs to depend on the steepness of the surface. Why? The closer we are to the minimum, the shorter we want to step forward. We know we are close to the minimum because the surface is a lot flatter, so we can use the steepness as an indicator of how close we are to the minimum. However, if our error surface is rather mellow, training can potentially take a large amount of time. As a result, we often multiply the gradient by a factor ε, the learning rate. Picking the learning rate is a hard problem. As we just discussed, if we pick a learning rate that’s too small, we risk taking too long during the training process. But if we pick a learning rate that’s too big, we’ll most likely start diverging away from the minimum. In the next chapter, we’ll learn about various optimization techniques that utilize adaptive learning rates to automate the process of selecting learning rates.

Figure 2-4. Convergence is difficult when our learning rate is too large
Now, we are finally ready to derive the delta rule for training our linear neuron. In
order to calculate how to change each weight, we evaluate the gradient, which is
essentially the partial derivative of the error function with respect to each of the
weights. In other words, we want:

$$\Delta w_k = -\epsilon \frac{\partial E}{\partial w_k} = -\epsilon \frac{\partial}{\partial w_k} \left( \frac{1}{2} \sum_i \left( t^{(i)} - y^{(i)} \right)^2 \right) = \sum_i \epsilon \left( t^{(i)} - y^{(i)} \right) \frac{\partial y^{(i)}}{\partial w_k} = \sum_i \epsilon \, x_k^{(i)} \left( t^{(i)} - y^{(i)} \right)$$

Applying this method of changing the weights at every iteration, we are finally able to utilize gradient descent.
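As a hedged sketch, one iteration of this update for our linear cafeteria neuron might look as follows in NumPy (the names are ours: X is a matrix whose rows are training examples, t is the vector of true meal prices):

import numpy as np

def delta_rule_step(weights, X, t, epsilon=0.01):
    y = X.dot(weights)                         # linear neuron outputs, one per example
    return weights + epsilon * X.T.dot(t - y)  # delta_w_k = sum_i eps * x_k_i * (t_i - y_i)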

Gradient Descent with Sigmoidal Neurons
In this section and the next, we will deal with training neurons and neural networks that utilize nonlinearities. We use the sigmoidal neuron as a model, and leave the derivations for other nonlinear neurons as an exercise for the reader. For simplicity, we assume that the neurons do not use a bias term, although our analysis easily extends to this case. We merely need to assume that the bias is a weight on an incoming connection whose input value is always one.
Let’s recall the mechanism by which logistic neurons compute their output value
from their inputs:
$$z = \sum_k w_k x_k, \qquad y = \frac{1}{1 + e^{-z}}$$

The neuron computes the weighted sum of its inputs, the logit z. It then feeds its logit into the logistic function to compute y, its final output. Fortunately for us, these functions have very nice derivatives, which makes learning easy! For learning, we want to compute the gradient of the error function with respect to the weights. To do so, we start by taking the derivative of the logit with respect to the inputs and the weights:

$$\frac{\partial z}{\partial w_k} = x_k, \qquad \frac{\partial z}{\partial x_k} = w_k$$

Also, quite surprisingly, the derivative of the output with respect to the logit is quite simple if you express it in terms of the output:

$$\frac{dy}{dz} = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} = \frac{e^{-z}}{1 + e^{-z}} \cdot \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = y \left( 1 - y \right)$$
We then use the chain rule to get the derivative of the output with respect to each weight:

$$\frac{\partial y}{\partial w_k} = \frac{dy}{dz} \frac{\partial z}{\partial w_k} = x_k \, y \left( 1 - y \right)$$

Putting all of this together, we can now compute the derivative of the error function with respect to each weight:

$$\frac{\partial E}{\partial w_k} = \sum_i \frac{\partial E}{\partial y^{(i)}} \frac{\partial y^{(i)}}{\partial w_k} = -\sum_i x_k^{(i)} y^{(i)} \left( 1 - y^{(i)} \right) \left( t^{(i)} - y^{(i)} \right)$$

Thus, the final rule for modifying the weights becomes:

$$\Delta w_k = \sum_i \epsilon \, x_k^{(i)} y^{(i)} \left( 1 - y^{(i)} \right) \left( t^{(i)} - y^{(i)} \right)$$
As you may notice, the new modification rule is just like the delta rule, except with
extra multiplicative terms included to account for the logistic component of the sig‐
moidal neuron.
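The corresponding training step for a bias-free sigmoidal neuron differs from the linear one only by the y(1 − y) term (again, our own sketch following the rule just derived, with the same X and t conventions as before):

import numpy as np

def sigmoid_step(weights, X, t, epsilon=0.1):
    y = 1.0 / (1.0 + np.exp(-X.dot(weights)))  # forward pass through the sigmoid
    # delta rule with the extra logistic term y * (1 - y)
    return weights + epsilon * X.T.dot(y * (1 - y) * (t - y))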


The Backpropagation Algorithm
Now we’re finally ready to tackle the problem of training multilayer neural networks
(instead of just single neurons). So what’s the idea behind backpropagation? We don’t
know what the hidden units ought to be doing, but what we can do is compute how
fast the error changes as we change a hidden activity. From there, we can figure out
how fast the error changes when we change the weight of an individual connec‐
tion. Essentially we’ll be trying to find the path of steepest descent! The only catch is
that we’re going to be working in an extremely high dimensional space. We start by
calculating the error derivatives with respect to a single training example.
Each hidden unit can affect many output units. Thus, we’ll have to combine many
separate effects on the error in an informative way. Our strategy will be one of
dynamic programming. Once we have the error derivatives for one layer of hidden units, we’ll use them to compute the error derivatives for the activities of the layer below. And once we find the error derivatives for the activities of the hidden units, it’s quite easy to get the error derivatives for the weights leading into a hidden unit. We’ll redefine some notation for ease of discussion and refer to the following diagram:


Figure 2-5. Reference diagram for the derivation of the backpropagation algorithm
The subscript we use will refer to the layer of the neuron. The symbol y will refer to
the activity of a neuron, as usual. Similarly the symbol z will refer to the logit of the
neuron. We start by taking a look at the base case of the dynamic programming prob‐
lem. Specifically, we calculate the error function derivatives at the output layer:

$$E = \frac{1}{2} \sum_{j \in \text{output}} \left( t_j - y_j \right)^2, \qquad \frac{\partial E}{\partial y_j} = -\left( t_j - y_j \right)$$

Now we tackle the inductive step. Let’s presume we have the error derivatives for
layer j. We now aim to calculate the error derivatives for the layer below it, layer i. To
do so, we must accumulate information about how the output of a neuron in
layer i affects the logits of every neuron in layer j. This can be done as follows, using
the fact that the partial derivative of the logit with respect to the incoming output data from the layer beneath is merely the weight of the connection, $w_{ij}$:

$$\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial E}{\partial z_j} \frac{dz_j}{dy_i} = \sum_j w_{ij} \frac{\partial E}{\partial z_j}$$

Furthermore, we observe the following:

$$\frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial y_j} \frac{dy_j}{dz_j} = y_j \left( 1 - y_j \right) \frac{\partial E}{\partial y_j}$$

Combining these two together, we can finally express the error derivatives of layer i in terms of the error derivatives of layer j:

$$\frac{\partial E}{\partial y_i} = \sum_j w_{ij} \, y_j \left( 1 - y_j \right) \frac{\partial E}{\partial y_j}$$

Then, once we’ve gone through the whole dynamic programming routine, having filled up the table appropriately with all of our partial derivatives (of the error function with respect to the hidden unit activities), we can then determine how the error changes with respect to the weights. This tells us how to modify the weights after each training example:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j} = y_i \, y_j \left( 1 - y_j \right) \frac{\partial E}{\partial y_j}$$

Finally, to complete the algorithm, just as before, we merely sum up the partial derivatives over all the training examples in our dataset. This gives us the following modification formula:

$$\Delta w_{ij} = -\sum_{k \in \text{dataset}} \epsilon \, y_i^{(k)} y_j^{(k)} \left( 1 - y_j^{(k)} \right) \frac{\partial E^{(k)}}{\partial y_j^{(k)}}$$

This completes our description of the backpropagation algorithm!
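To make the dynamic-programming flavor concrete, here is a hedged NumPy sketch of one backpropagation step for a network with a single hidden layer of sigmoidal neurons (W1, W2, and the helper names are ours; biases are omitted, as in the derivation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, W2, x, t, epsilon=0.1):
    # forward pass
    y1 = sigmoid(W1.dot(x))    # hidden activities
    y2 = sigmoid(W2.dot(y1))   # output activities
    # base case: derivative of E at the output layer
    dE_dy2 = -(t - y2)
    # convert dE/dy to dE/dz with the logistic derivative, then push one layer down
    dE_dz2 = y2 * (1 - y2) * dE_dy2
    dE_dy1 = W2.T.dot(dE_dz2)
    dE_dz1 = y1 * (1 - y1) * dE_dy1
    # dE/dw_ij = y_i * dE/dz_j, followed by a gradient descent step
    W2 = W2 - epsilon * np.outer(dE_dz2, y1)
    W1 = W1 - epsilon * np.outer(dE_dz1, x)
    return W1, W2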

Stochastic and Mini-Batch Gradient Descent
In the algorithms we’ve described above, we’ve been using a version of gradient
descent known as batch gradient descent. The idea behind batch gradient descent is
that we use our entire dataset to compute the error surface and then follow the gradi‐
ent to take the path of steepest descent. For a simple quadratic error surface, this
works quite well. But in most cases, our error surface may be a lot more complicated.
Let’s consider the scenario in Figure 2-6 for illustration.

Figure 2-6. Batch gradient descent is sensitive to local minima
We only have a single weight, and we use random initialization and batch gradient descent to find its optimal setting. The error surface, however, has a spurious local minimum, and if we get unlucky, we might get stuck in a non-optimal minimum.
Another potential approach is stochastic gradient descent, where at each iteration, our
error surface is estimated only with respect to a single example. This approach is
illustrated by Figure 2-7, where instead of a single static error surface, our error surface is dynamic. As a result, descending on this stochastic surface significantly
improves our ability to avoid local minima.

Figure 2-7. The stochastic error surface fluctuates with respect to the batch error surface,
enabling local-minima avoidance
The major pitfall of stochastic gradient descent, however, is that looking at the error
incurred one example at a time may not be a good enough approximation of the error
surface. This, in turn, could potentially make gradient descent take a significant
amount of time. One way to combat this problem is using mini-batch gradient
descent. In mini-batch gradient descent, at every iteration, we compute the error sur‐
face with respect to some subset of the total dataset (instead of just a single example).
This subset is called a mini-batch, and in addition to the learning rate, mini-batch size
is another hyperparameter. Mini-batches strike a balance between the efficiency of
batch gradient descent and the local-minima avoidance afforded by stochastic gradi‐
ent descent. In the context of backpropagation, our weight update step becomes:

$$\Delta w_{ij} = -\sum_{k \in \text{mini-batch}} \epsilon \, y_i^{(k)} y_j^{(k)} \left( 1 - y_j^{(k)} \right) \frac{\partial E^{(k)}}{\partial y_j^{(k)}}$$

This is identical to what we derived in the previous section, but instead of summing over all the examples in the dataset, we sum over the examples in the current mini-batch.
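A hedged sketch of the mini-batch loop itself (our own illustration; update_step stands in for the weight update derived above):

import numpy as np

def train_epoch(X, T, update_step, batch_size=10):
    indices = np.random.permutation(len(X))  # visit examples in a random order
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        update_step(X[batch], T[batch])      # one gradient step per mini-batch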


Test Sets, Validation Sets, and Overfitting
One of the major issues with artificial neural networks is that the models are quite complicated. For example, let’s consider a neural network that pulls in a 28 by 28 pixel image from the MNIST database, feeds it into two hidden layers of 30 neurons each, and finally reaches a softmax layer of 10 neurons. The total number of parameters in the network is nearly 25,000. This can be quite problematic, and to
understand why, let’s take a look at the example data in Figure 2-8.

Figure 2-8. Two potential models that might describe our dataset - a linear model vs. a
degree 12 polynomial
Using the data, we train two different models: a linear model and a degree 12 polynomial. Which curve should we trust? The line, which misses almost every training example? Or the complicated curve that hits every single point in the dataset? At this point we might trust the linear fit because it seems much less contrived. But just to be sure, let’s add more data to our dataset! The result is shown in Figure 2-9.


Figure 2-9. Evaluating our model on new data indicates that the linear fit is a much bet‐
ter model than the degree 12 polynomial
Now the verdict is clear: the linear model is not only subjectively better, but also quantitatively performs better (measured using the squared error metric). But
this leads to a very interesting point about training and evaluating machine learning
models. By building a very complex model, it’s quite easy to perfectly fit our dataset.
But when we evaluate such a complex model on new data, it performs very poorly. In
other words, the model does not generalize well. This is a phenomenon called overfit‐
ting, and it is one of the biggest challenges that a machine learning engineer must
combat. This becomes an even more significant issue in deep learning, where our
neural networks have large numbers of layers containing many neurons. The number
of connections in these models is astronomical, reaching the millions. As a result,
overfitting is commonplace.
Let’s see how this looks in the context of a neural network. Let’s say we have a neural network with two inputs, a softmax output of size two, and a hidden layer with 3, 6, or 20 neurons. We train these networks using mini-batch gradient descent (batch size 10), and the results, visualized using the ConvnetJS library, are shown in Figure 2-10.

Figure 2-10. A visualization of neural networks with 3, 6, and 20 neurons (in that order)
in their hidden layer.
It’s already quite apparent from these images that as the number of connections in
our network increases, so does our propensity to overfit to the data. We can similarly
see the phenomenon of overfitting as we make our neural networks deep. These
results are shown in Figure 2-11, where we use networks that have 1, 2, or 4 hidden
layers of 3 neurons each.

Figure 2-11. A visualization of neural networks with 1, 2, and 4 hidden layers (in that
order) of 3 neurons each.
This leads to three major observations. First, the machine learning engineer is always
working with a direct trade-off between overfitting and model complexity. If the
model isn’t complex enough, it may not be powerful enough to capture all of the use‐
ful information necessary to solve a problem. However, if our model is very complex
(especially if we have a limited amount of data at our disposal), we run the risk of
overfitting. Deep learning takes the approach of solving very complex problems with
complex models and taking additional countermeasures to prevent overfitting. We’ll
see a lot of these measures in this chapter as well as in later chapters.
Second, it is very misleading to evaluate a model using the data we used to train it.
Using the example in Figure 2-8, this would falsely suggest that the degree 12 polynomial model is preferable to a linear fit. As a result, we almost never train our model on the entire dataset. Instead, as shown in Figure 2-12, we split up our data into a training set and a test set.

Figure 2-12. We often split our data into non-overlapping training and test sets in order
to fairly evaluate our model
This enables us to make a fair evaluation of our model by directly measuring how
well it generalizes on new data it has not yet seen. In the real world, large datasets are
hard to come by, so it might seem like a waste to not use all of the data at our disposal
during the training process. As a result, it may be very tempting to reuse training data for testing or to cut corners while compiling test data. Be forewarned: if the test set isn’t well constructed, we won’t be able to draw any meaningful conclusions about our model.
Third, it’s quite likely that while we’re training our model, there’s a point in time where, instead of learning useful features, we start overfitting to the training set. As a result, we want to be able to stop the training process as soon as we start overfitting, to prevent poor generalization. To do this, we divide our training process into epochs. An epoch is a single iteration over the entire training set. In other words, if we have a training set of size d and we are doing mini-batch gradient descent with batch size b, then an epoch is equivalent to d/b model updates. At the end of each epoch, we want to measure how well our model is generalizing. To do this, we use an additional validation set, which is shown in Figure 2-13. At the end of an epoch, the validation set will tell us how the model does on data it has yet to see. If the accuracy on the training set continues to increase while the accuracy on the validation set stays the same (or decreases), it’s a good sign that it’s time to stop training because we’re overfitting.
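Put together, the epoch loop with this early-stopping check might look like the following sketch (our own illustration; train_epoch and evaluate stand in for the training pass and validation measurement described above):

def train_with_early_stopping(model, train_set, validation_set, max_epochs=100):
    best_accuracy = 0.0
    for epoch in range(max_epochs):
        train_epoch(model, train_set)               # one full pass over the training set
        accuracy = evaluate(model, validation_set)  # generalization check after the epoch
        if accuracy <= best_accuracy:
            break                                   # validation stopped improving: overfitting
        best_accuracy = accuracy
    return model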


Figure 2-13. In deep learning we often include a validation set to prevent overfitting dur‐
ing the training process.
With this in mind, before we jump into describing the various ways to directly com‐
bat overfitting, let’s outline the workflow we use when building and training deep
learning models. The workflow is described in detail in Figure 2-14. It is a tad intri‐
cate, but it’s critical to understand the pipeline in order to ensure that we’re properly
training our neural networks.
First we define our problem rigorously. This involves determining our inputs, the potential outputs, and the vectorized representations of both. For instance, let’s say our goal was to train a deep learning model to identify cancer. Our input would be an RGB image, which can be represented as a vector of pixel values. Our output would be a probability distribution over three mutually exclusive possibilities: 1) normal, 2) benign tumor (a cancer that has yet to metastasize), or 3) malignant tumor (a cancer that has already metastasized to other organs).
After we define our problem, we need to build a neural network architecture to solve it. Our input layer would have to be of appropriate size to accept the raw data from the image, and our output layer would have to be a softmax of size 3. We will also have to define the internal architecture of the network (number of hidden layers, the connectivities, etc.). We’ll further discuss the architecture of image recognition models when we talk about convolutional neural networks in Chapter 4. At this point, we also want to collect a significant amount of data for training our model. This data would probably be in the form of uniformly sized pathological images that have been labeled by a medical expert. We shuffle and divide this data up into separate
training, validation, and test sets.

Figure 2-14. Detailed workflow for building and training deep learning models

Finally, we’re ready to begin gradient descent. We train the model on our training set one epoch at a time. At the end of each epoch, we ensure that our error on the training set and validation set is decreasing. When one of these stops improving, we terminate and make sure we’re happy with the model’s performance on the test data. If we’re unsatisfied, we need to rethink our architecture. If our training set error stopped improving, we probably need to do a better job of capturing the important features in our data. If our validation set error stopped improving, we probably need to take measures to prevent overfitting.
If, however, we are happy with the performance of our model on the training data,
then we can measure its performance on the test data, which the model has never
seen before this point. If it is unsatisfactory, that means that we need more data in our
dataset because the test set seems to consist of example types that weren’t well repre‐
sented in the training set. Otherwise, we are finished!

Preventing Overfitting in Deep Neural Networks
There are several techniques that have been proposed to prevent overfitting during
the training process. In this section, we’ll discuss these techniques in detail.
One method of combatting overfitting is called regularization. Regularization modifies the objective function that we minimize by adding additional terms that penalize large weights. In other words, we change the objective function so that it becomes $E + \lambda f(\theta)$, where f(θ) grows larger as the components of θ grow larger, and λ is the regularization strength (another hyperparameter). The value we choose for λ determines how much we want to protect against overfitting. Setting λ = 0 implies that we do not take any measures against the possibility of overfitting. If λ is too large, then our model will prioritize keeping θ as small as possible over finding the parameter values that perform well on our training set. As a result, choosing λ is a very important task and can require some trial and error.
The most common type of regularization is L2 regularization. It can be implemented by augmenting the error function with the squared magnitude of all weights in the neural network. In other words, for every weight w in the neural network, we add $\frac{1}{2} \lambda w^2$ to the error function. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. This has the appealing property of encouraging the network to use all of its inputs a little rather than using only some of its inputs a lot. Of particular note is that during the gradient descent update, using the L2 regularization ultimately means that every weight is decayed linearly to zero. Expressed succinctly in NumPy, this is equivalent to the line W += -lam * W (writing lam for the regularization strength, since lambda is a reserved word in Python). Because of this phenomenon, L2 regularization is also commonly referred to as weight decay.
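As a hedged sketch, folding L2 regularization into any of the gradient updates above amounts to adding this weight-decay term to the gradient (lam is our stand-in name for λ):

import numpy as np

def l2_regularized_update(W, gradient, epsilon=0.1, lam=0.01):
    # the derivative of (1/2) * lam * W**2 is lam * W, so each weight decays toward zero
    return W - epsilon * (gradient + lam * W)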
We can visualize the effects of L2 regularization using ConvnetJS. Similar to above, we use a neural network with two inputs, a softmax output of size two, and a hidden layer with 20 neurons. We train the networks using mini-batch gradient descent
(batch size 10) and regularization strengths of 0.01, 0.1, and 1. The results can be seen
in Figure 2-15.

Figure 2-15. A visualization of neural networks trained with regularization strengths of
0.01, 0.1, and 1 (in that order).
Another common type of regularization is L1 regularization. Here, we add the term λ|w| for every weight w in the neural network. The L1 regularization has the
intriguing property that it leads the weight vectors to become sparse during optimiza‐
tion (i.e. very close to exactly zero). In other words, neurons with L1 regularization
end up using only a small subset of their most important inputs and become quite
resistant to noise in the inputs. In comparison, weight vectors from L2 regularization
are usually diffuse, small numbers. L1 regularization is very useful when you want to
understand exactly which features are contributing to a decision. If this level of fea‐
ture analysis isn’t necessary, we prefer to use L2 regularization because it empirically
performs better.
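For comparison, a hypothetical L1 gradient step looks like the following sketch (again with names of our own choosing); the constant-magnitude lam * sign(W) term is what drives many weights exactly to zero.

import numpy as np

def l1_regularized_update(W, grad_W, learning_rate, lam):
    # The gradient of lam * |W| is lam * sign(W): a constant-size push
    # toward zero that tends to produce sparse weight vectors.
    return W - learning_rate * (grad_W + lam * np.sign(W))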
Max norm constraints have a similar goal of attempting to restrict θ from becoming
too large, but they do this more directly. Max norm constraints enforce an absolute
upper bound on the magnitude of the incoming weight vector for every neuron and
use projected gradient descent to enforce the constraint. In other words, anytime a
gradient descent step moves the incoming weight vector such that ‖w‖₂ > c, we
project the vector back onto the ball (centered at the origin) with radius c. Typical
values of c are 3 and 4. One of the nice properties is that the parameter vector cannot
grow out of control (even if the learning rates are too high) because the updates to the
weights are always bounded.
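Such a projection step can be sketched in a few lines of NumPy (a hypothetical helper of our own, not code from this text):

import numpy as np

def project_max_norm(w, c=3.0):
    # If the incoming weight vector leaves the radius-c ball centered at
    # the origin, rescale it back onto the surface of that ball.
    norm = np.linalg.norm(w)
    if norm > c:
        w = w * (c / norm)
    return w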
Dropout is a very different kind of method for preventing overfitting that can often be
used in lieu of other techniques. While training, dropout is implemented by only
keeping a neuron active with some probability p (a hyperparameter), or setting it to
zero otherwise. Intuitively, this forces the network to be accurate even in the absence
of certain information. It prevents the network from becoming too dependent on any
one (or any small combination) of neurons. Expressed more mathematically, it pre‐
vents overfitting by providing a way of approximately combining exponentially many
different neural network architectures efficiently. The process of dropout is expressed
pictorially in Figure 2-16.

Figure 2-16. Dropout sets each neuron in the network as inactive with some random
probability during each mini-batch of training.
Dropout is pretty intuitive to understand, but there are some important intricacies to
consider. We illustrate these considerations through Python code. Let’s assume we are
working with a 3-layer ReLU neural network.
Example 2-1. Naïve Dropout Implementation

import numpy as np

# Let network.p = probability of keeping a hidden unit active
# A larger network.p means less dropout (network.p == 1 --> no dropout)
network.p = 0.5

def train_step(network, X):
    # forward pass for a 3-layer neural network
    Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1)
    # first dropout mask
    Dropout1 = np.random.rand(*Layer1.shape) < network.p
    # first drop!
    Layer1 *= Dropout1

    Layer2 = np.maximum(0, np.dot(network.W2, Layer1) + network.b2)
    # second dropout mask
    Dropout2 = np.random.rand(*Layer2.shape) < network.p
    # second drop!
    Layer2 *= Dropout2

    Output = np.dot(network.W3, Layer2) + network.b3
    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(network, X):
    # NOTE: we scale the activations by p
    Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1) * network.p
    Layer2 = np.maximum(0, np.dot(network.W2, Layer1) + network.b2) * network.p
    Output = np.dot(network.W3, Layer2) + network.b3
    return Output

One of the things we’ll realize is that during test-time (the predict function), we
multiply the output of each layer by network.p. Why do we do this? Well, we'd like
the outputs of neurons during test-time to be equivalent to their expected outputs at
training time. For example, if p = 0.5, neurons must halve their outputs at test time
in order to have the same (expected) output they would have during training. This is
easy to see because a neuron's output is set to 0 with probability 1 − p. This means
that if a neuron's output prior to dropout was x, then after dropout, the expected out‐
put would be E[output] = p · x + (1 − p) · 0 = p · x.
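We can sanity-check this expectation with a quick Monte Carlo simulation (a throwaway sketch with made-up numbers):

import numpy as np

# Simulate many dropout trials for a single neuron whose output is x
p, x = 0.5, 2.0
samples = (np.random.rand(100000) < p) * x
print samples.mean()   # close to p * x = 1.0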
The naïve implementation of dropout is undesirable because it requires scaling of
neuron outputs at test-time. Test-time performance is extremely critical to model
evaluation, so it’s always preferable to use inverted dropout, where the scaling occurs
at training time instead of at test time. This has the additional appealing property that
the predict code can remain the same whether or not dropout is used. In other
words, only the train_step code would have to be modified.
Example 2-2. Inverted Dropout Implementation

import numpy as np

# Let network.p = probability of keeping a hidden unit active
# A larger network.p means less dropout (network.p == 1 --> no dropout)
network.p = 0.5

def train_step(network, X):
    # forward pass for a 3-layer neural network
    Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1)
    # first dropout mask, note that we divide by p
    Dropout1 = (np.random.rand(*Layer1.shape) < network.p) / network.p
    # first drop!
    Layer1 *= Dropout1

    Layer2 = np.maximum(0, np.dot(network.W2, Layer1) + network.b2)
    # second dropout mask, note that we divide by p
    Dropout2 = (np.random.rand(*Layer2.shape) < network.p) / network.p
    # second drop!
    Layer2 *= Dropout2

    Output = np.dot(network.W3, Layer2) + network.b3
    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(network, X):
    # no scaling necessary at test time
    Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1)
    Layer2 = np.maximum(0, np.dot(network.W2, Layer1) + network.b2)
    Output = np.dot(network.W3, Layer2) + network.b3
    return Output

Summary
In this chapter, we’ve learned all of the basics involved in training feedforward neural
networks. We’ve talked about gradient descent, the backpropagation algorithm, as
well as various methods we can use to prevent overfitting. In the next chapter, we’ll
put these lessons into practice when we use the Theano library to efficiently imple‐
ment our first neural networks. Then in Chapter 4, we’ll return to the problem of
optimizing objective functions for training neural networks and design algorithms to
significantly improve performance. These improvements will enable us to process
much more data, which means we’ll be able to build more comprehensive models.

CHAPTER 3

Implementing Neural Networks in Theano

What is Theano and why are we using it?
One of the most attractive aspects of the Python programming language is that it’s
very easy to use. There’s very little boilerplate, it’s expressive, and it’s great for rapid
prototyping. That’s why Python is one of the most popular languages among
researchers and why we use Python in this book. The biggest drawback, however, is
its performance. Python is slow, and that’s a problem for the deep learning engineer.
Deep learning models are very computationally intensive and use a lot of data, so if
we write inefficient code, our model could take weeks or even months to train com‐
pletely.
In order to write more efficient Python programs, developers have created a number
of libraries to optimize numerical computations. These libraries include
NumPy, numexpr, and Cython. Although these libraries are effective at solving the
problems they’re built for, they often turn our elegant program into a gargantuan
mess. And for the uninitiated, it’s just as likely that poor integrations with these libra‐
ries will just make things worse.
Theano was built as a solution to this core problem of achieving simplicity and per‐
formance simultaneously. Theano lets us define, optimize, and evaluate mathemat‐
ical expressions, especially ones with multi-dimensional arrays. It often achieves
speeds comparable to hand-crafted C programs for problems involving large
amounts of data, such as training deep neural networks. It can also far surpass tradi‐
tional C programs by many orders of magnitude by utilizing GPU resources. Theano
manages data flow and computation under the hood so that we don’t have to worry
about it when writing code.
Theano allows us to define variables symbolically. We’ll talk about what this entails in
the next section, but intuitively, it means that using Theano feels a lot like being in
algebra class. Theano can also perform symbolic differentiation, which means that
once we define our cost functions, Theano can compute the appropriate gradients on
its own. This makes our life as a deep learning engineer much simpler.

Installing Theano
We’ll walk through the basics of installing Theano in this section. If you run into
trouble or would like to enable bleeding-edge features, you should check out the
detailed installation instructions online at http://deeplearning.net/software/theano/
install.html.
First, we have to make sure that we have certain pre-requisites already installed on
our machine to successfully install Theano:
• Python >= 2.6
• g++, python-dev
• NumPy >= 1.6.2
• SciPy >= 0.11
• A BLAS installation (with Level 3 functionality and the development headers)
• nose (to run Theano's test suite)
• pydot (to make pictures of Theano computation graphs)

We’ll also need a couple of additional libraries if we would like to use the GPU with
our installation:
• NVIDIA CUDA drivers and SDK
• libgpuarray
Once all of the pre-requisites are in order, installing Theano is as simple as running
the following command from the terminal:
$ pip install Theano

Then in the Python or IPython interpreter, we should be able to run the Theano test
suite:
>>> import theano
>>> theano.test()


In order to set up Theano to use a GPU (this is optional), we need to make a couple
of modifications to the .theanorc file. First, we must add a [cuda] section as follows
with the path to the NVIDIA CUDA root directory:
[cuda]
root = /path/to/cuda/root

Then, we need to change the device option in the [global] section to name the GPU
device on our computer and also set the default floating point computations to
float32:
[global]
device = gpu
floatX = float32

If our computer has multiple GPUs, device = gpu selects one of the GPUs (usually
gpu0). If you want to choose one of the GPUs specifically, you can instead spec‐
ify device=gpuX, with X the corresponding GPU index (0, 1, 2, ...). To ensure that
the GPU is working as expected, we follow the instructions at
http://deeplearning.net/software/theano/tutorial/using_gpu.html.
If everything goes smoothly, we’re now ready to get started using Theano! In the rest
of this chapter, we’ll be understanding how to express operations in Theano, how
Theano optimizes these operations under the hood, and then we’ll start implement‐
ing some of our neural network models in Theano.

Basic Algebra in Theano
Let’s start off by trying something very simple. Let’s build a Theano module that will
add two numbers for us. Here’s some code that will do this for us:
>>> import theano.tensor as T
>>> from theano import function
>>> a = T.dscalar('a')
>>> b = T.dscalar('b')
>>> c = a + b
>>> f = function(inputs=[a, b], outputs=c)

Let’s try to understand how this code works! First, let’s take a close look at the first
two lines after the theano imports:
a = T.dscalar('a')
b = T.dscalar('b')

These lines create two symbols (variables) named a and b that represent the quantities
that we want to add. These symbols are of type dscalar, which is equivalent to a 64-bit
float. Other common types that we will be using are dvector (a vector of 64-bit
floats), dmatrix (a matrix of 64-bit floats), and dtensor3 (a 3-dimensional tensor, e.g.
representing a color image, of 64-bit floats).
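To see another one of these types in action, here is a short interactive sketch of our own (assuming the imports from the example above) that adds two dmatrix symbols elementwise:

>>> m = T.dmatrix('m')
>>> n = T.dmatrix('n')
>>> g = function(inputs=[m, n], outputs=m + n)
>>> g([[1, 2], [3, 4]], [[10, 20], [30, 40]])
array([[ 11.,  22.],
       [ 33.,  44.]])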
The next line defines a new symbol named c in terms of a and b:
c = a + b

We can confirm that Theano has the correct representation for c by using the pretty-print functionality:
>>> from theano import pp
>>> print pp(c)
'(a + b)'

Finally, we use these definitions for a, b, and c to build a function:
f = function(inputs=[a, b], outputs=c)

The inputs argument to function is a variable or list of variables that represents the
values we feed into the function. The outputs argument is a variable or list of vari‐
ables that represents what is returned. When this command is executed, Theano ana‐
lyzes the relationships between all of the symbols involved in the function and
compiles some optimized code to perform the calculation. Once it’s finished, we can
refer to f like a normal Python function:
>>> f(7, 3)
array(10.0)

In the next section we’ll talk some more about the optimizations that Theano auto‐
matically finishes for us under the hood by discussing Theano’s symbolic graph struc‐
tures.

Theano Graph Structures
To illustrate the optimizations that Theano performs under the hood, we need to
understand Theano’s graph structures. As we define new symbols and express them
in terms of other symbols via various operations, Theano generates a graph to keep
track of their relationships. For example, let’s take a look at what happens when we
define a symbol to be the sum of the cubes of two other symbols (specifically, two
vectors):
>>> import theano.tensor as T
>>> from theano import function
>>> from theano.printing import pydotprint
>>> a = T.dvector('a')
>>> b = T.dvector('b')
>>> c = a**3 + b**3
>>> f = function(inputs=[a, b], outputs=c)
>>> f([1, 2, 3], [2, 3, 4])
array([ 9., 35., 91.])

To visualize the graph structure generated for c, we can use the following command:
pydotprint(c, outfile="unopt.png", var_with_name_simple=True)

The resulting graph in Figure 3-1 explains how c is generated from inputs a and b.
Let’s try to understand what this graph means. First, to raise each element of a and
each element of b to the power of 3, the scalar exponent 3 must be "broadcast" in order
to match the dimensions of the vectors (this is a NumPy operation). This is achieved
by the DimShuffle operations in the graph. Then the outputs of each DimShuffle are
combined with a and b through an element-wise pow operation. Finally, the output of
each pow is fed into an element-wise add to produce c (the node colored blue).

Figure 3-1. Theano's unoptimized graph structure for c in terms of a and b
Now to see the power of Theano’s under the hood optimizations, we can similarly
print out the graph structure for the compiled function f:
pydotprint(f, outfile="opt.png", var_with_name_simple=True)

The result is shown in Figure 3-2. All of the various operations from the unoptimized
graph are compressed into a single elementwise operation. We also notice that the pow
operation is nowhere to be found. Instead, the more efficient square (sqr) operation
is used. With this reconfigured graph structure, Theano can now generate extremely
efficient code to compute f instead of having to rely on Python's inefficient data
management and computation.

Figure 3-2. Theano’s optimized graph structure for c in terms of a and b
With this understanding of how Theano works, we’ll be better equipped to debug and
profile Theano programs (e.g. identify bottlenecks, missing optimizations, etc.). In
the following sections, we’ll learn more about more specialized Theano variables.

Shared Variables and Side-Effects
So far, we’ve talked about functions that are “stateless.” In other words, they hold no
memory of previous operations. In many cases, however, we require “stateful” com‐
putation. For example, if we’re training a machine learning model, we’d like every
iteration of training to modify the parameter vector, and we’d like these modifica‐
tions, or side-effects, to be remembered. Theano represents internal function state
through a special data type known as a shared variable.
To illustrate this concept, let's imagine we're using Theano to provide some computa‐
tionally intensive service to a customer. Say, for example, that the server computes
a linear perceptron for sentiment classification, and the customer would like to know
whether their feature vector has positive or negative sentiment. In this situation, we
want to not only produce the result for the customer, but we also want to keep track
of the number of times we provided the customer the service (so we know how much
to charge them). Theano allows us to do this as follows:
>>> import numpy as np
>>> import theano.tensor as T
>>> from theano import function
>>> from theano import shared
>>> state = shared(0)
>>> query = T.dvector('query')
>>> W = T.dvector('W') # model parameter vector
>>> result = T.dot(W, query) > 0
>>> sentiment = function(inputs=[query],
...                      outputs=result,
...                      updates=[(state, state + 1)],
...                      givens={W : np.array([1, -2, 3, -0.5, 1])})

The output of the sentiment function should be zero if the sentiment is negative and
one if the sentiment is positive. We pass the parameter vector W as one of the givens
for our Theano function because its expression is fixed with respect to the inputs (it’s
not some arbitrary user-chosen value). This allows Theano to include the expression
in the compiled graph structure, enabling additional optimization. We’ll use this trick
to efficiently pass our training set to the Theano function responsible for training our
model. Moreover, we see that the state increments by 1 every time the function is
called. This is achieved by passing a set of side-effects to our function’s updates.
Specifically, our update instructs the function to replace state with the value
state + 1. The action of the sentiment function can be seen in an interactive
Python shell:
>>> state.get_value()
array(0)
>>> sentiment([1, 6, 0, 9, 0])
array(0, dtype=int8)
>>> state.get_value()
array(1)
>>> sentiment([1, -6, 0, -9, 0])
array(1, dtype=int8)
>>> state.get_value()
array(2)

We note that we can also share the state variable with other functions if we want
other functions to update its value. We can also choose to reset the value of the vari‐
able by calling state.set_value(0). We’ll use the shared variable feature of Theano
extensively as we build our own machine learning models in this book.
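For instance, continuing the session above, this small illustrative snippet (our own) resets state and then updates it from a second function that shares it:

>>> state.set_value(0)
>>> bump_twice = function(inputs=[], outputs=[], updates=[(state, state + 2)])
>>> bump_twice()
[]
>>> state.get_value()
array(2)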

Randomness in Theano
Because Theano requires us to define everything symbolically a priori and then com‐
piles these expressions to produce a function, generating pseudorandom numbers is a
little bit more involved than it is in NumPy, but it isn’t too difficult. The way we can
think about introducing randomness into a Theano graph structure is by declaring a
random variable. Theano will allocate a NumPy RandomStream object (a random
number generator) for each random variable, and utilize it as necessary. We call this
Theano construct (and the sequence of pseudorandom numbers it generates) a ran‐
dom stream. At their core, random streams are shared variables, so they behave simi‐
larly to the state variable from the previous section.
Let’s start off by looking at a simple example. We instantiated shared random number
generator shared_rng with an arbitrary seed value. We can then simulate rolling a
fair die as follows:
>>> import theano.tensor as T
>>> from theano.tensor.shared_randomstreams import RandomStreams
>>> from theano import function
>>> shared_rng = RandomStreams(seed=123)
>>> die = T.floor(shared_rng.uniform() * 6) + 1
>>> roll = function(inputs=[], outputs=die)
>>> roll()
array(1.0)
>>> roll()
array(5.0)
>>> roll()
array(3.0)
>>> roll()
array(2.0)

In this case, die is a new random variable generated from a uniform distribution
extracted from shared_rng. Every time we call roll(), we generate a new number
between 1 and 6 inclusive. But let’s suppose we want our function to select a random
number between 1 and 6 inclusive and fix it, instead of invoking shared_rng every
time it is called. We can do that by passing the special flag no_default_updates to
our function at compile time. As a quick note, we must be sure to instantiate a new
random variable for roll_once() in order to make sure that the roll() and
roll_once() functions do not interfere with each other.
>>> fixed_die = T.floor(shared_rng.uniform() * 6) + 1
>>> roll_once = function(inputs=[], outputs=fixed_die, no_default_updates=True)
>>> roll_once()
array(5.0)
>>> roll_once() # same value as before
array(5.0)
>>> roll() # test potential interference
array(6.0)
>>> roll_once() # interference test passes (should call multiple times)
array(5.0)

These examples may seem simplistic, but generating randomness is a critical part of
many deep learning models and we will use random streams extensively when imple‐
menting dropout layers in more complicated networks.

Computing Derivatives Symbolically
The final feature of Theano that we’ll cover in this chapter is symbolic differentiation.
This is one of the reasons why Theano is such a popular library among machine
learning researchers. We don’t have to manually write out the gradient updates for
our objective functions. Theano does this automatically for us!
Let’s start off with a simple example. Let’s say that we have a func‐
tion f � = x1, x2 = x21 + x22. Our goal is to compute the the gradient of this function:

∇f =


∂x1



x21 + x22 , ∂x x21 + x22 = 2x1, 2x2
2

So if we wanted to compute ∇ f at the point � = 1, 2 , we can just plug it into the
formula above to get 2, 4 . We can do the same computation elegantly in Theano
with the following code snippet:
>>> import theano.tensor as T
>>> from theano import function
>>> x = T.dvector('x')
>>> sum_squares = T.sum(x ** 2)
>>> gradient = T.grad(sum_squares, x)
>>> grad_f = function(inputs=[x], outputs=gradient)
>>> grad_f([1,2])
array([ 2., 4.])

In general, given any scalar function f and any vector or matrix x, Theano can effi‐
ciently compute the derivative ∂f/∂x, even if f has multiple inputs.

Expressing a Logistic Regression Network in Theano
Now that we’ve developed all of the basic concepts of Theano, let’s build a simple neu‐
ral network model to tackle the MNIST dataset. As you may recall, our goal is to
identify handwritten digits from 28 x 28 black and white images. The first network
that we’ll build implements a machine learning model known as logistic regression.
On a high level, logistic regression is a method by which we can calculate the proba‐
bility that an input belongs to one of the target classes. In our case, we’ll compute the
probability that a given input image is a 0, 1, ..., or 9. Our model uses a
matrix W representing the weights of the connections in the network as well as a vec‐
tor b corresponding to the biases to estimate whether an input x belongs to
class i using the softmax expression we talked about earlier:

P(y = i | x) = softmaxᵢ(Wx + b) = e^(Wᵢx + bᵢ) / ∑ⱼ e^(Wⱼx + bⱼ)
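To make the expression concrete, here is a quick NumPy sketch of our own that applies the softmax to a hypothetical vector of scores Wx + b:

import numpy as np

def softmax(z):
    # Subtracting the max improves numerical stability without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
probabilities = softmax(logits)
print probabilities.sum()      # 1.0
print probabilities.argmax()   # 1, the most probable class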

Our goal is to learn the values for W and b that most effectively classify our inputs as
accurately as possible. Pictorially, we can express the logistic regression network as
shown below in Figure 3-3 (bias connections not shown to reduce clutter).

Figure 3-3. Interpreting logistic regression as a primitive neural network
You’ll notice that the network interpretation for logistic regression is rather primitive.
It doesn’t any hidden layers, meaning that it is limited in its ability to learn complex
relationships! We have a output softmax of size 10 because we have 10 possible out‐
comes for each input. Moreover, we have an input layer of size 784, one input neuron
for every pixel in the image! As we’ll see, the model makes decent headway towards
correctly classifying our dataset, but there’s lots of room for improvement. Over the
course of the rest of this chapter and Chapter 5, we’ll try to significantly improve our
accuracy. But first, let’s look at how we can implement the logistic network in Theano
so we can train it on our computer!
We begin by taking a look at how we represent, on a high level, the logistic network in
Theano. The network receives some input, which it multiplies by the weights of the
connections (represented by W ). This result, combined with the bias terms then goes
into a softmax (for which Theano has a built in function that we can use out of the
box). At the end of the softmax, we just look for the bin with the highest probability.

We can express this functional procedure succinctly in Theano with the following
code.
def __init__(self, input, input_dim, output_dim):
    """
    We first initialize the logistic network object with some important
    information.

    PARAM input : theano.tensor.TensorType
        A symbolic variable that we'll use to represent one minibatch of our
        dataset

    PARAM input_dim : int
        This will represent the number of input neurons in our model

    PARAM output_dim : int
        This will represent the number of neurons in the output layer (i.e.
        the number of possible classifications for the input)
    """
    # We initialize the weight matrix W of size (input_dim, output_dim)
    self.W = theano.shared(
        value=np.zeros((input_dim, output_dim)),
        name='W',
        borrow=True
    )
    # We initialize a bias vector for the neurons of the output layer
    self.b = theano.shared(
        value=np.zeros(output_dim),
        name='b',
        borrow=True
    )
    # Symbolic description of how to compute class membership probabilities
    self.output = T.nnet.softmax(T.dot(input, self.W) + self.b)
    # Symbolic description of the final prediction
    self.predicted = T.argmax(self.output, axis=1)

Let’s talk about how this code works in more detail. The initialization takes in four
parameters. The first is a symbolic variable, denoted input, corresponding to a
matrix containing all of the training/validation/test examples. Each row of the matrix
corresponds to an example in our dataset. In our case, we’ll be using a vectorized
form of the MNIST images. We also need to keep track of the dimension of the input,
input_dim, and the number of possible classes in the output, output_dim, so we
know how big our network is. The connections are represented by the matrix self.W,
which has input_dim rows and output_dim columns. Specifically, the component in
the ith row and the jth column of the weight matrix, i.e. self.W[i][j], corresponds to
the weight of the connection between the ith input neuron and the jth output neuron
in the softmax layer. The final two lines of code merely express the logistic network
model in Theano and select the digit with the highest probability. Note that for
matrix inputs, the T.dot function computes the matrix product. Moreover, although
the output of T.dot is a matrix, we can still add the vector self.b to it because The‐
ano broadcasts self.b to the right dimensions (stacks the vector up multiple times to
enable point-wise addition of self.b with every row of T.dot(input, self.W)).
Moreover, both the T.nnet.softmax and T.argmax (with axis=1) functions operate
on each row of their inputs. Consequently, self.predicted is a vector correspond‐
ing to the model’s prediction for each example.
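NumPy follows the same broadcasting rule, so we can sanity-check the behavior outside of Theano with a small sketch (hypothetical numbers of our own):

import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 6.]])   # 3 examples, 2 features each
W = np.array([[1., 0., -1.], [0., 1., 1.]])    # 2 inputs -> 3 outputs
b = np.array([0.5, -0.5, 0.])                  # one bias per output neuron

logits = np.dot(X, W) + b   # b is broadcast across all 3 rows
print logits.shape          # (3, 3)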
Now that we've developed a symbolic expression for the model's prediction
given a dataset of input images, we’ll need to develop a measure of how well our
model performs! In other words, we’re going to develop a symbolic expression for the
objective function. In this example, our objective function is going to have two com‐
ponents. The core component is a measure known as the negative log likelihood. Let’s
denote the prediction made by the machine learning model as y and the ith training
example and its label as x^(i) and y^(i) respectively. Then the negative log likelihood can
be expressed as:

−∑ᵢ log P(y = y^(i) | x^(i), W, b)
In other words, we want to minimize the negated sum of the log probabilities that
our model assigns to the correct answers. The second component of the objective
function is the L2 regularization component we discussed in the previous chapter. By
default we set λ = 0, but we can tweak this parameter while training our model if we
find ourselves overfitting to the training data. The code to express this cost function
is shown below.
def logistic_network_cost(self, y, lambda_l2=0):
    """
    Here we express the cost incurred by an example given the correct
    distribution

    PARAM y : theano.tensor.TensorType
        These are the correct answers, and we compute the cost with
        respect to this ground truth (over the entire minibatch). This
        means that y is of size (minibatch_size, output_dim)

    PARAM lambda_l2 : float
        This is the L2 regularization parameter that we use to penalize large
        values for components of W, thus discouraging potential overfitting
    """
    # Calculate the log probabilities of the softmax output
    log_probabilities = T.log(self.output)
    # We use these log probabilities to compute the negative log likelihood
    negative_log_likelihood = -T.mean(log_probabilities[T.arange(y.shape[0]), y])
    # Compute the L2 regularization component of the cost function
    l2_regularization = lambda_l2 * (self.W ** 2).sum()
    # Return a symbolic description of the cost function
    return negative_log_likelihood + l2_regularization

Let’s take a moment to understand how this Theano snippet functions. The expres‐
sion requires two parameters. The first parameter is the set of true labels, stored in
the symbolic Theano variable y. The optional regularization parameter is
lambda_l2 and is set to 0 by default. In order to grab the correct log probabilities, we
use some Python array magic to pull out the correct entry of the log_probabili
ties matrix with the expression log_probabilities[T.arange(y.shape[0]), y].
We then compute the regularization component by taking the sum of the squares of
all the connections between the input and softmax layers. This is simply expressed in
Theano as (self.W ** 2).sum(). We can then return the sum of these two compo‐
nents as a symbolic description of the objective function we aim to optimize. This
expression will be critical for us as we train our logistic network model.
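The same indexing trick works in plain NumPy, so we can see exactly what it does on a tiny hypothetical minibatch:

import numpy as np

# Hypothetical softmax outputs for 3 examples over 4 classes
probabilities = np.array([[0.7, 0.1, 0.1, 0.1],
                          [0.2, 0.5, 0.2, 0.1],
                          [0.1, 0.1, 0.1, 0.7]])
log_probabilities = np.log(probabilities)
y = np.array([0, 1, 3])   # the correct class for each example

# Pull out the log probability each example assigns to its correct label
correct_log_probs = log_probabilities[np.arange(y.shape[0]), y]
print -np.mean(correct_log_probs)   # roughly 0.47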
The final piece of the logistic regression network is a symbolic expression for evaluat‐
ing how well our model performs on a dataset. Fortunately, this is the simplest part of
the model. The code to accomplish this task is shown below.
def error_rate(self, y):
    """
    Here we return the error rate of the model over a set of given labels
    (perhaps in a minibatch, in the validation set, or the test set)

    PARAM y : theano.tensor.TensorType
        These are the correct answers, and we compute the cost with
        respect to this ground truth (over the entire minibatch). This
        means that y is of size (minibatch_size, output_dim)
    """
    # Make sure y is of the correct dimension
    assert y.ndim == self.predicted.ndim
    # Make sure that y contains values of the correct data type (ints)
    assert y.dtype.startswith('int')
    # Return the error rate on the data
    return T.mean(T.neq(self.predicted, y))

The only parameter is a symbolic variable y that holds the true labels for the dataset.
After asserting that this input is of the correct dimension and holds valid labels (the label
values must be integers!), we can count up the number of times the model makes a
mistake. The error rate is expressed in Theano as T.mean(T.neq(self.predicted,
y)). This completes our Theano representation of a logistic regression network! The
full LogisticNetwork class is included here with extensive comments for reference.
"""
We will use this class to represent a simple logistic regression
classifier. We'll represent this in Theano as a neural network
with no hidden layers. This is our first attempt at building a
neural network model to solve interesting problems. Here, we'll
use this class to crack the MNIST handwritten digit dataset problem,
but this class has been constructed so that it can be reappropriated
to any use!
References:
- textbooks: "Pattern Recognition and Machine Learning", Christopher M. Bishop, section 4.3.2
- websites: http://deeplearning.net/tutorial, Lisa Lab
"""
import numpy as np
import theano.tensor as T
import theano
class LogisticNetwork(object):
"""
The logistic regression class is described by
we will want to learn). The first is a weight
this weight matrix as W. The second is a bias
text if you want to learn more about how this
started!
"""

two parameters (which
matrix. We'll refer to
vector b. Refer to the
network works. Let's get

def __init__(self, input, input_dim, output_dim):
"""
We first initialize the logistic network object with some important
information.
PARAM input : theano.tensor.TensorType
A symbolic variable that we'll use to represent one minibatch of our
dataset
PARAM input_dim : int
This will represent the number of input neurons in our model
PARAM ouptut_dim : int
This will represent the number of neurons in the output layer (i.e.
the number of possible classifications for the input)
"""

64

|

Chapter 3: Implementing Neural Networks in Theano

# We initialize the weight matrix W of size (input_dim, output_dim)
self.W = theano.shared(
value=np.zeros((input_dim, output_dim)),
name='W',
borrow=True
)
# We initialize a bias vector for the neurons of the output layer
self.b = theano.shared(
value=np.zeros(output_dim),
name='b',
borrow='True'
)
# Symbolic description of how to compute class membership probabilities
self.output = T.nnet.softmax(T.dot(input, self.W) + self.b)
# Symbolic description of the final prediction
self.predicted = T.argmax(self.output, axis=1)
def logistic_network_cost(self, y, lambda_l2=0):
"""
Here we express the cost incurred by an example given the correct
distribution
PARAM y : theano.tensor.TensorType
These are the correct answers, and we compute the cost with
respect to this ground truth (over the entire minibatch). This
means that y is of size (minibatch_size, output_dim)
PARAM lambda : float
This is the L2 regularization parameter that we use to penalize large
values for components of W, thus discouraging potential overfitting
"""
# Calculate the log probabilities of the softmax output
log_probabilities = T.log(self.output)
# We use these log probabilities to compute the negative log likelihood
negative_log_likelihood = -T.mean(log_probabilities[T.arange(y.shape[0]), y])
# Compute the L2 regularization component of the cost function
l2_regularization = lambda_l2 * (self.W ** 2).sum()
# Return a symbolic description of the cost function
return negative_log_likelihood + l2_regularization
def error_rate(self, y):
"""
Here we return the error rate of the model over a set of given labels
(perhaps in a minibatch, in the validation set, or the test set)
PARAM y : theano.tensor.TensorType

Expressing a Logistic Regression Network in Theano

|

65

These are the correct answers, and we compute the cost with
respect to this ground truth (over the entire minibatch). This
means that y is of size (minibatch_size, output_dim)
"""
# Make sure y is of the correct dimension
assert y.ndim == self.predicted.ndim
# Make sure that y contains values of the correct data type (ints)
assert y.dtype.startswith('int')
# Return the error rate on the data
return T.mean(T.neq(self.predicted, y))

Using Theano to Train a Logistic Regression Network
Now that we have a working logistic regression network, we’ll use Theano to train it
on the MNIST dataset. As in the previous section, we’ll take the code one snippet at a
time and understand how and why it works conceptually. Then we’ll end the section
by presenting the complete Python script.
After downloading our dataset and any appropriate pre-processing, we’ll have to pre‐
pare our dataset so that Theano can utilize it efficiently. Specifically, we need to
declare the datasets as shared variables. This is especially important if we’re using the
GPU to train our model. If a variable is not shared, it will be copied into the GPU
memory at every use. This is extremely inefficient, especially when we’re dealing with
large datasets. We can declare all our MNIST datasets as shared with the following
code snippet.
def shared_dataset(data_xy):
    """
    We store the data in a shared variable because it allows Theano to copy it
    into GPU memory (if GPU utilization is enabled). By default, if a variable is
    not shared, it is moved to GPU at every use. This results in a big performance
    hit because that means the data will be copied one minibatch at a time. Instead,
    if we use shared variables, we don't have to worry about copying data
    repeatedly.
    """
    data_x, data_y = data_xy
    shared_x = shared(np.asarray(data_x, dtype=config.floatX), borrow=True)
    shared_y = shared(np.asarray(data_y, dtype='int32'), borrow=True)
    return shared_x, shared_y

# We now instantiate the shared datasets
training_set_x, training_set_y = shared_dataset(training_set)
validation_set_x, validation_set_y = shared_dataset(validation_set)
test_set_x, test_set_y = shared_dataset(test_set)

Now we need to set up several symbolic variables so we can build our finalized train‐
ing, validation, and test functions. Using Theano’s symbolic functionality, we can ach‐
ieve all of this boilerplate with just 10 lines of Theano code.
# Lets compute the number of minibatches for training, validation, and testing
n_training_batches = training_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
n_validation_batches = validation_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
n_test_batches = test_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
# Now it's time for us to build the model!
# Let's start off with an index to the minibatch we're using
index = T.lscalar()
# Generate symbolic variables for the input (a minibatch)
x = T.dmatrix('x')
y = T.ivector('y')
# Construct the logistic network model
# Keep in mind MNIST image is of size (28, 28)
# Also the number of output classes is 10 (digits 0, 1, ..., 9)
model = logistic_network.LogisticNetwork(input=x, input_dim=28*28, output_dim=10)
# Obtain a symbolic expression for the objective function
# EXPERIMENT!!! Play around with the L2 regularization parameter!
objective = model.logistic_network_cost(y, lambda_l2=0.0001)
# Obtain a symbolic expression for the error incurred
error = model.error_rate(y)
# Compute symbolic gradients of objective with respect to model parameters
dW, db = T.grad(objective, model.W), T.grad(objective, model.b)

In the first three lines, we start off by determining how many minibatches are in each
dataset and storing the values in n_training_batches, n_validation_batches, and
n_test_batches. In line 4, we declare the symbolic scalar variable index in order to
keep track of which minibatch we are on during the training process. In lines 5 and 6, we
declare symbolic variables to store the current minibatch examples in x and their
associated labels (ground truth) in y. Now we're ready to pull out the symbolic
expressions we defined in the LogisticNetwork class from the previous
section. In line 7, we instantiate a LogisticNetwork object in model. In lines 8 and
9 we pull out the objective and error expressions we defined. And finally, in line 10,
we bring out the full power of Theano by expressing the error derivatives in just a
single line of code, maintained symbolically in the variables dW and db.
With this boilerplate out of the way, we’re ready to declare the functions that we will
need to train, validate, and test our model on a minibatch of data. The code for this
purpose is shown below.
# Compile theano function for training with a minibatch
train_model = function(
    inputs=[index],
    outputs=objective,
    updates=[
        (model.W, model.W - LEARNING_RATE * dW),
        (model.b, model.b - LEARNING_RATE * db)
    ],
    givens={
        x : training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y : training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)

# Compile theano functions for validation and testing
validate_model = function(
    inputs=[index],
    outputs=error,
    givens={
        x : validation_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y : validation_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)

test_model = function(
    inputs=[index],
    outputs=error,
    givens={
        x : test_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y : test_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)

The Theano code for these functions is mostly self-explanatory. All of the functions
take in the index representing a minibatch from the dataset. The output of
train_model is the value of the objective function on that minibatch, while the out‐
puts of validate_model and test_model are measures of the error incurred. More‐
over, train_model performs the required minibatch stochastic gradient descent
updates on the connection weights W and the softmax layer biases b. You may have
noticed that we decided to pass the datasets as givens instead of as inputs. As
described above, this allows us to include the datasets in the graph optimizations,
which turns out to be more performant than passing them in as variable inputs.
The final code snippet is the piece of Python that is responsible for actually perform‐
ing the training process. It is also the most involved because, although we define a
constant N_EPOCHS to pre-determine how long training should occur, we also utilize a
concept called early stopping. As you may recall from the previous chapter, we always
use a validation set in order to ensure that we aren’t overfitting to the training data
during the training process. Early stopping is an algorithmic mechanism by which we
can use the performance on the validation data as an indicator of whether our model
is experiencing useful learning or if it is merely overfitting. The code for this process
is shown below and we will describe it in extensive detail so we understand how it
works.
# Let's set up the early stopping parameters (based on the validation set)

# Must look at this many examples no matter what
patience = 5000
# Wait this much longer if a new best is found
patience_increase = 2
# This is when an improvement is significant
improvement_threshold = 0.995
# We go through this number of minibatches before we check on the validation set
validation_freq = min(n_training_batches, patience / 2)
# We keep track of the best loss on the validation set here
best_loss = np.inf
# We also keep track of the epoch we are in
epoch = 0
# A boolean flag that propagates when patience has been exceeded
exceeded_patience = False

# Now we're ready to start training the model
print "... TRAINING MODEL ..."
start_time = time.clock()
while (epoch < N_EPOCHS) and not exceeded_patience:
    epoch = epoch + 1
    for minibatch_index in xrange(n_training_batches):
        minibatch_objective = train_model(minibatch_index)
        iteration = (epoch - 1) * n_training_batches + minibatch_index
        if (iteration + 1) % validation_freq == 0:
            # Compute loss on validation set
            validation_losses = [validate_model(i) for i in xrange(n_validation_batches)]
            validation_loss = np.mean(validation_losses)
            print 'epoch %i, minibatch %i/%i, validation error: %f %%' % (
                epoch,
                minibatch_index + 1,
                n_training_batches,
                validation_loss * 100
            )
            if validation_loss < best_loss:
                if validation_loss < best_loss * improvement_threshold:
                    patience = max(patience, iteration * patience_increase)
                best_loss = validation_loss
        if patience <= iteration:
            exceeded_patience = True
            break
end_time = time.clock()

# Let's compute how well we do on the test set
test_losses = [test_model(i) for i in xrange(n_test_batches)]
test_loss = np.mean(test_losses)

Early stopping works by terminating training if the validation error ceases to
improve. To determine whether we need to stop training early, we require a few
important parameters. First, we need to decide how patient we will be when waiting
for an improvement in validation error. This is defined by the patience parameter. In
other words, we must wait a minimum of patience iterations (i.e. training steps on
minibatches) before we decide that training isn’t worth it. If we achieve a new best, we
increase the number of iterations we need to wait by multiplying the current iteration
by a predefined patience_increase factor. And finally, to determine if a new best is
significant, it must satisfy a predefined improvement_threshold. We compute the
validation error either once every epoch or every time we complete patience/2 mini‐
batches, whichever is smaller.
To illustrate these concepts, let's walk through two examples. Let's set patience =
5000 and patience_increase = 2. In the first situation, we do not achieve a sig‐
nificant improvement in validation error after completing patience/2 or 2500 itera‐
tions. This means that our value for patience never gets increased. Consequently,
after 5000 iterations, our training algorithm terminates.

In the second example, we’ll set patience and patience_increase to the same values
as before. This time however, our model experiences a significant improvement at
iteration 3000. This means that our patience will be increased to 6000. If we experi‐
ence no significant improvements during iterations 3001-6000, our algorithm will
terminate. On the other hand, if a significant improvement in validation error is
achieved, we will continue to resize patience until eventually, we max out N_EPOCHS
or we finally do exhaust our patience.
After completing the training process, we use the test_model function to evaluate our
error on the test set. We can report this error as the final performance of the model. We've
finally completed writing and training our first model in Theano! Using the parame‐
ters in the code below, we are able to achieve a test error of approximately 7.6%. Feel
free to experiment with the hyperparameters to see if you can force the logistic
regression network to perform any better!
"""
We'll now use the LogisticNetwork object we built in logistic_network.py in
order to tackle the MNIST dataset challenge. We will use minibatch gradient
descent to train this simplistic network model.
References:
- textbooks: "Pattern Recognition and Machine Learning", Christopher M. Bishop, section 4.3.2
- websites: http://deeplearning.net/tutorial, Lisa Lab
"""
__docformat__ = 'restructuredtext en'
import cPickle
import gzip
import os
import time
import urllib
from theano import function, shared, config
import theano.tensor as T
import numpy as np
import logistic_network

# Let's start off by defining some constants
# EXPERIMENT!!! Play around with the learning rate!
LEARNING_RATE = 0.2
N_EPOCHS = 1000
DATASET = 'mnist.pkl.gz'
BATCH_SIZE = 600
# Time to check if we have the data and if we don't, let's download it
print "... LOADING DATA ..."
data_path = os.path.join(
    os.path.split(__file__)[0],
    "..",
    "data",
    DATASET
)
if not os.path.isfile(data_path):
    origin = (
        'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
    )
    print 'Downloading data from %s' % origin
    urllib.urlretrieve(origin, data_path)

# Time to build our models
print "... BUILDING MODEL ..."
# Load the dataset
data_file = gzip.open(data_path, 'rb')
training_set, validation_set, test_set = cPickle.load(data_file)
data_file.close()
# Define a quick function to establish a shared dataset (for efficiency)
def shared_dataset(data_xy):
    """
    We store the data in a shared variable because it allows Theano to copy it
    into GPU memory (if GPU utilization is enabled). By default, if a variable is
    not shared, it is moved to GPU at every use. This results in a big performance
    hit because that means the data will be copied one minibatch at a time. Instead,
    if we use shared variables, we don't have to worry about copying data
    repeatedly.
    """
    data_x, data_y = data_xy
    shared_x = shared(np.asarray(data_x, dtype=config.floatX), borrow=True)
    shared_y = shared(np.asarray(data_y, dtype='int32'), borrow=True)
    return shared_x, shared_y
# We now instantiate the shared datasets
training_set_x , training_set_y = shared_dataset(training_set)
validation_set_x, validation_set_y = shared_dataset(validation_set)
test_set_x, test_set_y = shared_dataset(test_set)
# Lets compute the number of minibatches for training, validation, and testing
n_training_batches = training_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
n_validation_batches = validation_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
n_test_batches = test_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
# Now it's time for us to build the model!
# Let's start off with an index to the minibatch we're using
index = T.lscalar()
# Generate symbolic variables for the input (a minibatch)
x = T.dmatrix('x')
y = T.ivector('y')
# Construct the logistic network model
# Keep in mind MNIST image is of size (28, 28)
# Also the number of output classes is 10 (digits 0, 1, ..., 9)
model = logistic_network.LogisticNetwork(input=x, input_dim=28*28, output_dim=10)
# Obtain a symbolic expression for the objective function
# EXPERIMENT!!! Play around with the L2 regularization parameter!
objective = model.logistic_network_cost(y, lambda_l2=0.0001)

# Obtain a symbolic expression for the error incurred
error = model.error_rate(y)
# Compute symbolic gradients of objective with respect to model parameters
dW, db = T.grad(objective, model.W), T.grad(objective, model.b)
# Compile theano function for training with a minibatch
train_model = function(
    inputs=[index],
    outputs=objective,
    updates=[
        (model.W, model.W - LEARNING_RATE * dW),
        (model.b, model.b - LEARNING_RATE * db)
    ],
    givens={
        x : training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y : training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)

# Compile theano functions for validation and testing
validate_model = function(
    inputs=[index],
    outputs=error,
    givens={
        x : validation_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y : validation_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)

test_model = function(
    inputs=[index],
    outputs=error,
    givens={
        x : test_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y : test_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)
# Let's set up the early stopping parameters (based on the validation set)
# Must look at this many examples no matter what
patience = 5000
# Wait this much longer if a new best is found
patience_increase = 2
# This is when an improvement is significant
improvement_threshold = 0.995
# We go through this number of minibatches before we check on the validation set
validation_freq = min(n_training_batches, patience / 2)

# We keep track of the best loss on the validation set here
best_loss = np.inf
# We also keep track of the epoch we are in
epoch = 0
# A boolean flag that propagates when patience has been exceeded
exceeded_patience = False
# Now we're ready to start training the model
print "... TRAINING MODEL ..."
start_time = time.clock()
while (epoch < N_EPOCHS) and not exceeded_patience:
    epoch = epoch + 1
    for minibatch_index in xrange(n_training_batches):
        minibatch_objective = train_model(minibatch_index)
        iteration = (epoch - 1) * n_training_batches + minibatch_index
        if (iteration + 1) % validation_freq == 0:
            # Compute loss on validation set
            validation_losses = [validate_model(i) for i in xrange(n_validation_batches)]
            validation_loss = np.mean(validation_losses)
            print 'epoch %i, minibatch %i/%i, validation error: %f %%' % (
                epoch,
                minibatch_index + 1,
                n_training_batches,
                validation_loss * 100
            )
            if validation_loss < best_loss:
                if validation_loss < best_loss * improvement_threshold:
                    patience = max(patience, iteration * patience_increase)
                best_loss = validation_loss
        if patience <= iteration:
            exceeded_patience = True
            break
end_time = time.clock()

# Let's compute how well we do on the test set
test_losses = [test_model(i) for i in xrange(n_test_batches)]
test_loss = np.mean(test_losses)

# Print out the results!
print '\n'
print 'Optimization complete with best validation score of %f %%' % (best_loss * 100)
print 'And with a test score of %f %%' % (test_loss * 100)
print '\n'
print 'The code ran for %d epochs and for a total time of %.1f seconds' % (epoch, end_time - start_time)
print '\n'

Multilayer Models in Theano
In the previous section, we built a simple logistic regression network with no hidden
layers. With an accuracy of roughly 92%, the model does quite well, but it is still not nearly as
effective as a human is. In this section we will attempt to create a more powerful
model, a feed forward neural network with a hidden layer, in order to construct a
more accurate classifier. In the process, we’ll understand how to build models with
multiple layers, a characteristic that is fundamental to deep learning models.
In order to accomplish this, we’ll be building on many of the constructs we developed
in previous sections. Specifically, we’ll still be using the same output softmax layer
and the same early stopping strategy for training our feed forward neural network. In
this section we’ll discuss the unique components of this model and develop a concep‐
tual understanding of how the code is built step by step.
We start by looking at the structure of the hidden layers of our neural network. The
hidden layers utilize a tanh nonlinearity, which is generally considered to be more
effective than the sigmoidal nonlinearity because it is zero-centered. As described in
Glorot and Bengio 2010, we also initialize the weights by randomly sampling from a
narrow uniform distribution instead of setting all of the weights initially to zero. This
is because, if every neuron in the network computes the same output, then they will
also all compute the same gradients during backpropagation and undergo the exact
same parameter updates. The Python code responsible for this initialization is shown
below.
"""
We will use this class to represent a tanh hidden layer.
This will be a building block for a simple feedforward neural
network.
References:
- textbooks: "Pattern Recognition and Machine Learning", Christopher M. Bishop, section 4.3.2
- websites: http://deeplearning.net/tutorial, Lisa Lab
"""
import numpy as np
import theano.tensor as T
import theano
class HiddenLayer(object):
"""
The hidden layer class is described by two parameters (which
we will want to learn). The first is an incoming weight matrix.
We'll refer to this weight matrix as W. The second is a bias
vector b. Refer to the text if you want to learn more about how
this layer works. Let's get started!
"""
    def __init__(self, input, input_dim, output_dim, random_gen):
        """
        We first initialize the hidden layer object with some important
        information.

        PARAM input : theano.tensor.TensorType
            A symbolic variable that we'll use to describe incoming data from
            the previous layer

        PARAM input_dim : int
            This will represent the number of neurons in the previous layer

        PARAM output_dim : int
            This will represent the number of neurons in the hidden layer

        PARAM random_gen : numpy.random.RandomState
            A random number generator used to properly initialize the weights.
            For a tanh activation function, the literature suggests that the
            incoming weights should be sampled from the uniform distribution
            [-sqrt(6./(input_dim + output_dim)), sqrt(6./(input_dim + output_dim))]
        """
        # We initialize the weight matrix W of size (input_dim, output_dim)
        self.W = theano.shared(
            value=np.asarray(
                random_gen.uniform(
                    low=-np.sqrt(6. / (input_dim + output_dim)),
                    high=np.sqrt(6. / (input_dim + output_dim)),
                    size=(input_dim, output_dim)
                ),
                dtype=theano.config.floatX
            ),
            name='W',
            borrow=True
        )
        # We initialize a bias vector for the neurons of the hidden layer
        self.b = theano.shared(
            value=np.zeros(output_dim),
            name='b',
            borrow=True
        )
        # Symbolic description of the incoming logits
        logit = T.dot(input, self.W) + self.b
        # Symbolic description of the outputs of the hidden layer neurons
        self.output = T.tanh(logit)

We can now stack these hidden layers to construct a feed-forward neural network. The
initialization takes a random generator random_gen to perform the weight initialization,
as well as a list hidden_layer_sizes that specifies the size of each hidden layer in the
network. We hook up the input of the first hidden layer to a minibatch from our dataset
and the output of the last hidden layer to the input of self.softmax_layer. Every other
hidden layer draws its inputs from the outputs of the layer below it and transmits its
outputs to the inputs of the layer above it. These hidden layers are maintained as
internal state in the list self.hidden_layers. The objective function, as before, has
both a negative log-likelihood component and an L2 regularization component; this time,
however, computing the sum of the squares of the weights of all connections in the
network is slightly more involved. The error function is the same as the one we used for
the logistic network model. A sketch of the Python code for the feed-forward neural
network is shown below.
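Here is a minimal sketch of what such a FeedForwardNetwork class could look like,
consistent with the interface used by the training script later in this section
(hidden_layers, softmax_layer, feed_forward_network_cost, and error_rate). The
SoftmaxLayer building block and its constructor signature are assumptions on our part,
modeled on the logistic network from earlier in the chapter, so treat this as an
illustration rather than the exact contents of feed_forward_network.py.

import theano.tensor as T
from hidden_layer import HiddenLayer      # the class defined above
from softmax_layer import SoftmaxLayer    # assumed softmax output layer

class FeedForwardNetwork(object):
    """
    A feed-forward network built by stacking HiddenLayer objects
    and capping them with a softmax output layer.
    """
    def __init__(self, random_gen, input, input_dim, output_dim,
                 hidden_layer_sizes):
        # Chain the hidden layers: the first reads the minibatch, and
        # every subsequent layer reads the outputs of the layer below it
        self.hidden_layers = []
        current_input, current_dim = input, input_dim
        for size in hidden_layer_sizes:
            layer = HiddenLayer(current_input, current_dim, size, random_gen)
            self.hidden_layers.append(layer)
            current_input, current_dim = layer.output, size

        # The output of the last hidden layer feeds the softmax layer
        # (we assume SoftmaxLayer exposes its class probabilities in .output)
        self.softmax_layer = SoftmaxLayer(current_input, current_dim,
                                          output_dim)

    def feed_forward_network_cost(self, y, lambda_l2=0.0):
        # Negative log likelihood of the correct labels under the model
        negative_log_likelihood = -T.mean(
            T.log(self.softmax_layer.output)[T.arange(y.shape[0]), y]
        )

        # L2 component: sum the squared weights of every connection in
        # the network, hidden layers and softmax layer alike
        l2_sum = T.sum(T.sqr(self.softmax_layer.W))
        for layer in self.hidden_layers:
            l2_sum += T.sum(T.sqr(layer.W))

        return negative_log_likelihood + lambda_l2 * l2_sum

    def error_rate(self, y):
        # Fraction of examples whose most probable class is not the label
        predictions = T.argmax(self.softmax_layer.output, axis=1)
        return T.mean(T.neq(predictions, y))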
"""
We will use this class to represent a tanh hidden layer.
This will be a building block for a simplefeedforward neural
network.
References:
- textbooks: "Pattern Recognition and Machine Learning", Christopher M. Bishop, section 4.3.2
- websites: http://deeplearning.net/tutorial, Lisa Lab
"""
import numpy as np
import theano.tensor as T
import theano
class HiddenLayer(object):
"""
The hidden layer class is described by two parameters (which
we will want to learn). The first is a incoming weight matrix.
We'll refer to this weight matrix as W. The second is a bias
vector b. Refer to the text if you want to learn more about how
this layer works. Let's get started!
"""
def __init__(self, input, input_dim, output_dim, random_gen):
"""
We first initialize the hidden layer object with some important
information.
PARAM input : theano.tensor.TensorType
A symbolic variable that we'll use to describe incoming data from
the previous layer
PARAM input_dim : int
This will represent the number of neurons in the previous layer

Multilayer Models in Theano

|

77

PARAM ouptut_dim : int
This will represent the number of neurons in the hidden layer
PARAM random_gen : numpy.random.RandomState
A random number generator used to properly initialize the weights.
For a tanh activation function, the literature suggests that the
incoming weights should be sampled from the uniform distribution
[-sqrt(6./(input_dim + output_dim)), sqrt(6./(input_dim + output_dim)]
"""
# We initialize the weight matrix W of size (input_dim, output_dim)
self.W = theano.shared(
value=np.asarray(
random_gen.uniform(
low=-np.sqrt(6. / (input_dim + output_dim)),
high=np.sqrt(6. / (input_dim + output_dim)),
size=(input_dim, output_dim)
),
dtype=theano.config.floatX
),
name='W',
borrow=True
)
# We initialize a bias vector for the neurons of the output layer
self.b = theano.shared(
value=np.zeros(output_dim),
name='b',
borrow='True'
)
# Symbolic description of the incoming logits
logit = T.dot(input, self.W) + self.b
# Symbolic description of the outputs of the hidden layer neurons
self.output = T.tanh(logit)

The Python script responsible for instantiating and training the feed-forward network
is quite similar to the one we used to train the logistic network model. Here, we take
a closer look at the differences before presenting the entire script.

# Construct the feed-forward network model
# Keep in mind each MNIST image is of size (28, 28)
# Also, the number of output classes is 10 (digits 0, 1, ..., 9)
model = feed_forward_network.FeedForwardNetwork(
    random_gen=random_gen,
    input=x,
    input_dim=28*28,
    output_dim=10,
    hidden_layer_sizes=[500]
)

# Obtain a symbolic expression for the objective function
# EXPERIMENT!!! Play around with the L2 regularization parameter!
objective = model.feed_forward_network_cost(y, lambda_l2=0.0001)

# Obtain a symbolic expression for the error incurred
error = model.error_rate(y)

# Compute symbolic gradients of the objective with respect to the model
# parameters and express the corresponding descent steps
updates = []
for hidden_layer in model.hidden_layers:
    dW = T.grad(objective, hidden_layer.W)
    db = T.grad(objective, hidden_layer.b)
    updates.append((hidden_layer.W, hidden_layer.W - LEARNING_RATE * dW))
    updates.append((hidden_layer.b, hidden_layer.b - LEARNING_RATE * db))
dW = T.grad(objective, model.softmax_layer.W)
db = T.grad(objective, model.softmax_layer.b)
updates.append((model.softmax_layer.W, model.softmax_layer.W - LEARNING_RATE * dW))
updates.append((model.softmax_layer.b, model.softmax_layer.b - LEARNING_RATE * db))

# Compile a Theano function for training with a minibatch
train_model = function(
    inputs=[index],
    outputs=objective,
    updates=updates,
    givens={
        x: training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y: training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)

As you’ll notice, our network has only a single hidden layer, with 500 tanh neurons, as
declared by the line hidden_layer_sizes=[500]. Moreover, we need to compile updates for
the connections and biases of every layer in our feed-forward network. This incurs some
additional bookkeeping, but all of the gradient updates can be collected into a single
list, updates, and passed to the train_model function; a more compact way to generate
the same list is sketched below. For completeness, we then present the full training
script. The model performs much better than the logistic network, incurring a test
error of only 1.65%!
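This variant is our own shorthand, not code from feed_forward_network.py; it relies
only on the fact that T.grad accepts a list of variables and returns one gradient
expression per variable:

# Gather every learnable parameter in the network into one flat list
params = [p for layer in model.hidden_layers for p in (layer.W, layer.b)]
params += [model.softmax_layer.W, model.softmax_layer.b]

# T.grad differentiates the objective with respect to each parameter at once
gradients = T.grad(objective, params)

# Pair each parameter with its gradient descent step
updates = [(p, p - LEARNING_RATE * g) for p, g in zip(params, gradients)]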
"""
We'll now use the LogisticNetwork object we built in feed_forward_network.py
in order to tackle the MNIST dataset challenge. We will use minibatch gradient
descent to train this simplistic network model.
References:
- textbooks: "Pattern Recognition and Machine Learning", Christopher M. Bishop, section 4.3.2
- websites: http://deeplearning.net/tutorial, Lisa Lab
"""
__docformat__ = 'restructedtext en'

import cPickle
import gzip
import os
import time
import urllib

from theano import function, shared, config
import theano.tensor as T
import numpy as np

import feed_forward_network

# Let's start off by defining some constants
# EXPERIMENT!!! Play around with the learning rate!
LEARNING_RATE = 0.01
N_EPOCHS = 1000
DATASET = 'mnist.pkl.gz'
BATCH_SIZE = 20

# Time to check if we have the data, and if we don't, let's download it
print "... LOADING DATA ..."
data_path = os.path.join(
    os.path.split(__file__)[0],
    "..",
    "data",
    DATASET
)
if not os.path.isfile(data_path):
    origin = (
        'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz'
    )
    print 'Downloading data from %s' % origin
    urllib.urlretrieve(origin, data_path)

# Time to build our model
print "... BUILDING MODEL ..."

# Load the dataset
data_file = gzip.open(data_path, 'rb')
training_set, validation_set, test_set = cPickle.load(data_file)
data_file.close()

# Define a quick function to establish a shared dataset (for efficiency)
def shared_dataset(data_xy):
    """
    We store the data in a shared variable because it allows Theano to copy it
    into GPU memory (if GPU utilization is enabled). By default, if a variable
    is not shared, it is moved to the GPU at every use. This results in a big
    performance hit because the data would be copied over one minibatch at a
    time. Instead, if we use shared variables, we don't have to worry about
    copying data repeatedly.
    """
    data_x, data_y = data_xy
    shared_x = shared(np.asarray(data_x, dtype=config.floatX), borrow=True)
    shared_y = shared(np.asarray(data_y, dtype='int32'), borrow=True)
    return shared_x, shared_y
# We now instantiate the shared datasets
training_set_x, training_set_y = shared_dataset(training_set)
validation_set_x, validation_set_y = shared_dataset(validation_set)
test_set_x, test_set_y = shared_dataset(test_set)

# Let's compute the number of minibatches for training, validation, and testing
n_training_batches = training_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
n_validation_batches = validation_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE
n_test_batches = test_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE

# Now it's time for us to build the model!
# Let's start off with an index to the minibatch we're using
index = T.lscalar()

# Generate symbolic variables for the input (a minibatch)
x = T.matrix('x')  # T.matrix respects config.floatX, unlike T.dmatrix
y = T.ivector('y')

# Create a random number generator for seeding weight initialization
random_gen = np.random.RandomState(1234)

# Construct the feed-forward network model
# Keep in mind each MNIST image is of size (28, 28)
# Also, the number of output classes is 10 (digits 0, 1, ..., 9)
model = feed_forward_network.FeedForwardNetwork(
    random_gen=random_gen,
    input=x,
    input_dim=28*28,
    output_dim=10,
    hidden_layer_sizes=[500]
)
# Obtain a symbolic expression for the objective function
# EXPERIMENT!!! Play around with the L2 regularization parameter!
objective = model.feed_forward_network_cost(y, lambda_l2=0.0001)

# Obtain a symbolic expression for the error incurred
error = model.error_rate(y)

# Compute symbolic gradients of the objective with respect to the model
# parameters and express the corresponding descent steps
updates = []
for hidden_layer in model.hidden_layers:
    dW = T.grad(objective, hidden_layer.W)
    db = T.grad(objective, hidden_layer.b)
    updates.append((hidden_layer.W, hidden_layer.W - LEARNING_RATE * dW))
    updates.append((hidden_layer.b, hidden_layer.b - LEARNING_RATE * db))
dW = T.grad(objective, model.softmax_layer.W)
db = T.grad(objective, model.softmax_layer.b)
updates.append((model.softmax_layer.W, model.softmax_layer.W - LEARNING_RATE * dW))
updates.append((model.softmax_layer.b, model.softmax_layer.b - LEARNING_RATE * db))

# Compile a Theano function for training with a minibatch
train_model = function(
    inputs=[index],
    outputs=objective,
    updates=updates,
    givens={
        x: training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y: training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)
# Compile Theano functions for validation and testing
validate_model = function(
    inputs=[index],
    outputs=error,
    givens={
        x: validation_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y: validation_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)
test_model = function(
    inputs=[index],
    outputs=error,
    givens={
        x: test_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE],
        y: test_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE]
    }
)

# Let's set up the early stopping parameters (based on the validation set)

# Must look at this many minibatch iterations no matter what
patience = 10000

# When a significant new best is found, extend patience to this
# multiple of the current iteration
patience_increase = 2

# An improvement counts as significant only if it beats the best
# validation loss by this factor
improvement_threshold = 0.995
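# As an illustration of the patience arithmetic in the loop below: with
# patience = 10000 and patience_increase = 2, a significant new best found
# at iteration 8000 extends patience to max(10000, 8000 * 2) = 16000.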

# We go through this many minibatches before we check on the validation set
validation_freq = min(n_training_batches, patience / 2)

# We keep track of the best loss on the validation set here
best_loss = np.inf

# We also keep track of the epoch we are in
epoch = 0

# A boolean flag that is set when patience has been exceeded
exceeded_patience = False

# Now we're ready to start training the model
print "... TRAINING MODEL ..."
start_time = time.clock()
while (epoch < N_EPOCHS) and not exceeded_patience:
    epoch = epoch + 1
    for minibatch_index in xrange(n_training_batches):
        minibatch_objective = train_model(minibatch_index)
        iteration = (epoch - 1) * n_training_batches + minibatch_index
        if (iteration + 1) % validation_freq == 0:
            # Compute the loss on the validation set
            validation_losses = [validate_model(i) for i in xrange(n_validation_batches)]
            validation_loss = np.mean(validation_losses)
            print 'epoch %i, minibatch %i/%i, validation error: %f %%' % (
                epoch,
                minibatch_index + 1,
                n_training_batches,
                validation_loss * 100
            )
            if validation_loss < best_loss:
                if validation_loss < best_loss * improvement_threshold:
                    patience = max(patience, iteration * patience_increase)
                best_loss = validation_loss
        if patience <= iteration:
            exceeded_patience = True
            break
end_time = time.clock()

# Let's compute how well we do on the test set
test_losses = [test_model(i) for i in xrange(n_test_batches)]
test_loss = np.mean(test_losses)

# Print out the results!
print '\n'
print 'Optimization complete with best validation score of %f %%' % (best_loss * 100)
print 'And with a test score of %f %%' % (test_loss * 100)
print '\n'

print 'The code ran for %d epochs and for a total time of %.1f seconds' % (
    epoch, end_time - start_time)
print '\n'

Summary
In this chapter, we learned more about using Theano as a library for expressing and
training machine learning models. We discussed many of Theano's internal features,
including shared variables, symbolic differentiation, and graph optimizations. In the
final sections, we used this understanding to train a logistic network model and a
generalized feed-forward neural network with stochastic gradient descent. Although
the logistic network model made many errors on the MNIST dataset, our feed-forward
network performed very well, making on average only 1.65 errors for every 100 digits.
In the next chapter, we'll begin to grapple with many of the problems that arise as we
make our networks deeper. While deep models afford us the power to tackle more
difficult problems, they are also notoriously difficult to train with vanilla
stochastic gradient descent. To tackle these issues, we'll delve into modern
optimization theory to come up with better methods for training deep learning models.

