Artificial Neural Networks

(ANNs)

Background and Motivation

• Computers and the Brain: A Contrast

–

–

–

–

–

–

Arithmetic:

Vision:

1 brain = 1/10 pocket calculator

1 brain = 1000 super computers

Memory of arbitrary details: computer wins

Memory of real-world (related) facts: brain wins

A computer must be programmed explicitly

The brain can learn by experiencing the world

What are Artificial Neural Networks?

A system loosely modeled on the human brain that simulates the

multiple layers of simple processing elements called neurons.

There are various classes of ANN models

depending on:

q Problem type

Prediction, Classification , Clustering

q Design / Structure of the model

q Model building algorithm

A bit of biology . . .

Most important functional unit in human brain – a class of cells called –

NEURONS

Dendrites

Cell Body

Axon

Synapse

Actual Neuron

Schematic

• Dendrites – Receive information • Cell Body – Process information

• Axon – Carries processed information to other neurons

• Synapse – Junction between Axon end and Dendrites of other Neurons

Artificial Neurons

• simulated on hardware or by software

• input connections - the receivers

• node, unit, Processing Unit or Processing Element

simulates neuron body

• output connection and transfer function - the

transmitter (axon)

• activation function employs a threshold or bias

• connection weights act as synaptic junctions

Learning (for a particular model) occurs via changes in

value of the various weights and choice of functions.

An Artificial Neuron: Structure

Dendrites

X1

X2

w1

w2

..

.

Xp

wp

Cell Body

Axon

Direction of flow of Information

I

f

V = f(I)

I = w1X1 + w2X2

+ w3X3 +… + wpXp

• Receives Inputs X 1 X2 … Xp from other neurons or environment

• Inputs fed-in through connections with ‘weights’

• Total Input = Weighted sum of inputs from all sources

• Transfer function and Activation function convert the input to output

• Output goes to other neurons or environment

Behaviour of an artificial neural network

to any particular input depends upon

• structure of each node (threshold /

activation / transfer function)

• structure of the network (architecture)

• weights on each of the connections

.... these must be learned !

Activity

in

an

ANN

Node

Mathematical Model of a Node

Adder _ Function

a0

w0

Incoming

activation

wi

ai

∑

Θ

Outgoing

activation

wn

an

Activation_ Function

Activity in an ANN Node

Adder _ Function

n

a0

I = ∑ xi wi

w0

Incoming

activation

wi

ai

∑

Θ

i =0

Outgoing

activation

wn

an

Activation_ Function

Activity in an ANN Node

Adder _ Function

a0

w0

Incoming

activation

wi

ai

∑

Θ

Outgoing

activation

wn

Activation_ Function

an

1 if I > t,

f (I ) =

0 if I <= t

Transfer Functions

There are various choices for Transfer / Activation functions

1

1

1

0.5

-1

Tanh

f(x) =

(ex – e-x) / (ex + e-x)

0

Logistic

f(x) = ex / (1 + ex)

0

Threshold

0 if x< 0

f(x) =

1 if x >= 1

Artificial Neural Network Architecture

The most common neural network

architectures has three layers:

- 1st Layer: input layer

-2nd Layer: hidden layer

-3rd Layer: output layer

More complex ANN Architectures

Number of hidden layers can be

None

One

More

Feed-forward Artificial Neural Network Architecture

A collection of neurons form a ‘Layer’

X1

X2

X3

X4

Input Layer

Direction of information flow

- Each neuron gets ONLY

one input, directly from outside

Hidden Layer

- Connects Input and Output

layers

Output Layer

- Output of each neuron

directly goes to outside

y1

y2

Feed-forward Artificial Neural Network Architecture

A few restrictions

X1

X2

X3

X4

• Within a layer neurons are NOT

connected to each other.

• Neuron in one layer is connected

to neurons ONLY in the NEXT

layer. (Feed-forward)

• Jumping of layer is NOT

allowed

y1

y2

An ANN model

What do we mean by ‘A particular Model’ ?

Input: X1 X2 X3

For an ANN :

Output: Y

Model: Y = f(X1 X2 X3)

Algebraic form of f(.) is too complicated to write down.

However it is characterized by

• # Input Neurons

• # Hidden Layers

• # Neurons in each Hidden Layer

• # Output Neurons

• The adder, activation and transfer functions

• WEIGHTS for all the connections

‘Fitting’ an ANN model = Specifying values for all those parameters

ANN Model – an Example

Input: X1 X2 X3

Output: Y

X2

X1

0.6

-0.1

X3

-0.2

0.1

0.7

0.5

0.1

Model: Y = f(X1 X2 X3)

Parameters

Example

# Input Neurons

3

# Hidden Layers

1

# Hidden Layer Size

2

# Output Neurons

1

-0.2

Weights

Y

Decided by the structure

of the problem

# Input Nrns = # of X’s

# Output Nrns = # of Y’s

Learnt

Free parameters

Prediction using a particular ANN Model

Input: X1 X2 X3

X1 =1

Output: Y

X2=-1

0.6

-0.1

X3 =2

-0.2

0.1

0.7

0.5

0.2

f (0.2) = 0.55

0.55

0.9

f (0.9) = 0.71

0.71

0.1

Model: Y = f(X1 X2 X3)

0.2 = 0.5 * 1 –0.1*(-1) –0.2 * 2

f(x) = ex / (1 + ex)

f(0.2) = e0.2 / (1 + e0.2) = 0.55

Predicted Y = 0.478

-0.2

Suppose Actual Y = 2

-0.087

f (-0.087) = 0.478

0.478

Then

Prediction Error = (2-0.478) =1.522

Real Estate Appraisal by Neural Networks

Building ANN Model

Input: X1 X2 X3

Output: Y Model: Y = f(X1 X2 X3)

# Input Neurons = # Inputs = 3

# Hidden Layer = ???

# Neurons in Hidden Layer = ???

# Output Neurons = # Outputs = 1

Try 1

Try 2

Architecture is now defined …

No fixed strategy.

By trial and error

How to get the weights ???

Given the Architecture There are 8 weights to decide.

W = (W1, W 2, …, W 8)

Training Data: (Yi , X1i, X2i, …, Xpi ) i= 1,2,…,n

Given a particular choice of W, we will get predicted Y’s

( V1,V2,…,Vn)

They are function of W.

Choose W such that over all prediction error E is minimized

E = Σ (Yi – Vi) 2

Training the Model

Back Propagation

Feed forward

• Start with a random set of weights.

E = Σ (Yi – Vi) 2

• Feed forward the first observation through the net

X1

Network

V1

; Error = (Y1 – V1)

• Adjust the weights so that this error is reduced

( network fits the first observation well )

• Feed forward the second observation.

Adjust weights to fit the second observation well

• Keep repeating till you reach the last observation

• This finishes one CYCLE through the data

• Randomize the order of the input (observations)

•Perform many such training cycles till the

overall prediction error E is small.

Back Propagation

Each weight ‘Shares the Blame’ for prediction

error and tries to ‘adjust’ with other weights.

Back Propagation algorithm decides how to

distribute the blame among all weights and

adjust the weights accordingly.

Small portion of blame leads to small

adjustment.

Large portion of the blame leads to large

adjustment.

E = Σ (Yi – Vi) 2

Weight adjustment during Back Propagation

Weight adjustment formula in Back Propagation

Vi – the prediction for ith observation –

is a function of the network weights vector W = ( W1, W 2,….)

Hence, E, the total prediction error is also a function of W

E( W ) = Σ [ Yi – Vi( W ) ] 2

Gradient Descent Method :

For every individual weight W i, updation formula looks like

Wnew = Wold + α * ( ∂E / ∂W) | Wold

α = Learning Parameter (between 0 and 1)

Another slight variation is also used sometimes

W(t+1) = W (t) + α * ( ∂E / ∂W) | W(t) + β * (W(t) - W(t-1) )

β = Momentum (between 0 and 1)

Geometric interpretation of the Weight adjustment

Consider a very simple network with 2 inputs and 1 output. No hidden layer.

There are only two weights whose values needs to be specified.

E( w1, w2 ) = Σ [ Yi – Vi(w1, w2 ) ] 2

• A pair ( w1, w2 ) is a point on 2-D plane.

w1

w2

• For any such point we can get a value of E.

• Plot E vs ( w1, w2 ) - a 3-D surface - ‘Error Surface’

• Aim is to identify that pair for which E is minimum

• That means – identify the pair for which the height of

the error surface is minimum.

Gradient Descent Algorithm

• Start with a random point ( w1, w2 )

• Move to a ‘better’ point ( w’1, w’2 ) where the height of error surface is lower.

• Keep moving till you reach ( w*1, w*2 ), where the error is minimum.

Crawling the Error Surface

14.0

Error Surface

12.0

Local Minima

10.0

8.0

Error

6.0

4.0

w*

w0

Weight Space

-2.000

-1.000

0.000

1.000

2.000

3.000

4.000

5.000

6.000

6.000

5.000

4.000

3.000

2.000

1.000

0.000

-1.000

W2

-2.000

-3.000

0.0

-3.000

2.0

Global Minima

W1

Training Algorithm

Decide the Network architecture

(# Hidden layers, #Neurons in each Hidden Layer)

Decide the Learning parameter and Momentum

Initialize the Network with random weights

Do till Convergence criterion is not met

For I = 1 to # Training Data points

Feed forward the I-th observation thru the Net

Compute the prediction error on I-th observation

Back propagate the error and adjust weights

E = Σ (Yi – Vi)

2

Next I

Check for Convergence

End Do

Convergence Criterion

When to stop training the Network ?

Ideally – when we reach the global minima of the error surface

How do we know we have reached there ?

We don’t …

Suggestions:

1. Stop if the decrease in total prediction error (since last cycle) is small.

2. Stop if the overall changes in the weights (since last cycle) are small.

Drawback:

Error keeps on decreasing. We get a very good fit to training data.

BUT … The network thus obtained have poor generalizing power on unseen data

The phenomenon is also known as - Over fitting of the Training data

The network is said to Memorize the training data.

- so that when an X in training set is given,

the network faithfully produces the corresponding Y.

-However for X’s which the network didn’t see before, it predicts poorly.

Convergence Criterion

Modified Suggestion:

Partition the training data into Training set and Validation set

Use:

Training set - build the model

Validation set - test the performance of the model on unseen data

Typically as we have more and more training cycles

Error on Training set keeps on decreasing.

Error on Validation set first decreases and then increases.

Error

Validation

Training

Cycle

Stop training when the error on

Validation set starts increasing

Choice of Training Parameters

Learning Parameter and Momentum

- needs to be supplied by user from outside. Should be between 0 and 1

What should be the optimal values of these training parameters ?

- No clear consensus on any fixed strategy.

- However, effects of wrongly specifying them are well studied.

Learning Parameter

Too big – Large leaps in weight space – risk of missing global minima.

Too small –

- Takes long time to converge to global minima

- Once stuck in local minima, difficult to get out of it.

Suggestion

Trial and Error – Try various choices of Learning Parameter and Momentum

See which choice leads to minimum prediction error

Summary

q

Artificial Neural network (ANN)

– A class of models inspired by biological Neurons

q

Used for various modeling problems – Prediction, Classification, Clustering, ..

q

One particular subclass of ANN’s – Feed forward Back propagation networks

v

v

v

v

Organized in layers – Input, hidden, Output

Each layer is a collection of a number of artificial Neurons

Neurons in one layer in connected to neurons in next layer

Connections have weights

q

Fitting an ANN model is to find the values of these weights.

q

Given a training data set – weights are found by Feed forward Back

propagation algorithm, which is a form of Gradient Descent Method – a

popular technique for function minimization.

q

Network architecture as well as the training parameters are decided upon by

trial and error. Try various choices and pick the one that gives lowest

prediction error.

Self-Organizing Maps

Overview

• A Self-Organizing Map (SOM) is a way to

represent higher dimensional data in an usually 2D or 3-D manner, such that similar data is grouped

together.

• It runs unsupervised and performs the grouping on

its own.

• Once the SOM converges, it can classify new data.

• SOMs run in two phases:

– Training phase: the map is built, the network organizes

using a competitive process, it is trained using a large

numbers of inputs.

– Mapping phase: new vectors are quickly given a

location on the converged map, easily classifying or

categorizing the new data.

Example

• Example: Data on poverty levels in different countries.

– SOM does not show poverty levels, rather it shows how similar the

poverty levels of different countries are to each other. (Similar

color = similar poverty level).

Forming the Map

The Basic Process

1)

2)

3)

4)

5)

6)

7)

8)

9)

Each pixel is a node / unit.

Initialize each node’s weights with random values.

Take the next vector from training data and present it to the

SOM.

Every node is examined to find the Best Matching Unit /

Node (BMU).

Adjust BMU’s weights to make it come closer to the input

vector. The rate of adjustment is determined by the ‘learning

rate’. The learning rate decreases with each iteration.

The radius of the neighborhood around the BMU is

calculated. The size of the neighborhood decreases with each

iteration.

Each node in the BMU’s neighborhood has its weights

adjusted to become more like the BMU. Nodes closest to the

BMU are altered more than the nodes further away in the

neighborhood.

Repeat from step 3 to 7 for each vector in the training data.

Repeat from step 3 to 8 for enough iterations for convergence.

Calculating the Best Matching Unit

• Calculating the BMU is done according to the

Euclidean distance between the node’s weights

(W1, W2, … , Wn) and the input vector’s values

(V1, V2, … , Vn).

– This gives a good measurement of how similar the two

sets of data are to each other.

– The node closest to the input vector is chosen as the

BMU (or ‘winning’ node).

Determining the BMU

Neighborhood

•

Size of the neighborhood: uses an exponential decay function that shrinks

on each iteration until eventually the neighborhood is just the BMU itself.

•

Effect of location within the neighborhood: The neighborhood is defined by a

gaussian curve so that nodes that are closer are influenced more than farther

nodes.

Modifying the Nodes’ Weights

• The new weight for a node is the old weight, plus a fraction (L) of

the difference between the old weight and the input vector…

adjusted (by theta) depending on the distance from the BMU.

• The learning rate, L, is also an exponential decay function.

– This ensures that the SOM will converge.

• The lambda represents a constant, and t is the time step

SOM Application: Text Clustering

(ANNs)

Background and Motivation

• Computers and the Brain: A Contrast

–

–

–

–

–

–

Arithmetic:

Vision:

1 brain = 1/10 pocket calculator

1 brain = 1000 super computers

Memory of arbitrary details: computer wins

Memory of real-world (related) facts: brain wins

A computer must be programmed explicitly

The brain can learn by experiencing the world

What are Artificial Neural Networks?

A system loosely modeled on the human brain that simulates the

multiple layers of simple processing elements called neurons.

There are various classes of ANN models

depending on:

q Problem type

Prediction, Classification , Clustering

q Design / Structure of the model

q Model building algorithm

A bit of biology . . .

Most important functional unit in human brain – a class of cells called –

NEURONS

Dendrites

Cell Body

Axon

Synapse

Actual Neuron

Schematic

• Dendrites – Receive information • Cell Body – Process information

• Axon – Carries processed information to other neurons

• Synapse – Junction between Axon end and Dendrites of other Neurons

Artificial Neurons

• simulated on hardware or by software

• input connections - the receivers

• node, unit, Processing Unit or Processing Element

simulates neuron body

• output connection and transfer function - the

transmitter (axon)

• activation function employs a threshold or bias

• connection weights act as synaptic junctions

Learning (for a particular model) occurs via changes in

value of the various weights and choice of functions.

An Artificial Neuron: Structure

Dendrites

X1

X2

w1

w2

..

.

Xp

wp

Cell Body

Axon

Direction of flow of Information

I

f

V = f(I)

I = w1X1 + w2X2

+ w3X3 +… + wpXp

• Receives Inputs X 1 X2 … Xp from other neurons or environment

• Inputs fed-in through connections with ‘weights’

• Total Input = Weighted sum of inputs from all sources

• Transfer function and Activation function convert the input to output

• Output goes to other neurons or environment

Behaviour of an artificial neural network

to any particular input depends upon

• structure of each node (threshold /

activation / transfer function)

• structure of the network (architecture)

• weights on each of the connections

.... these must be learned !

Activity

in

an

ANN

Node

Mathematical Model of a Node

Adder _ Function

a0

w0

Incoming

activation

wi

ai

∑

Θ

Outgoing

activation

wn

an

Activation_ Function

Activity in an ANN Node

Adder _ Function

n

a0

I = ∑ xi wi

w0

Incoming

activation

wi

ai

∑

Θ

i =0

Outgoing

activation

wn

an

Activation_ Function

Activity in an ANN Node

Adder _ Function

a0

w0

Incoming

activation

wi

ai

∑

Θ

Outgoing

activation

wn

Activation_ Function

an

1 if I > t,

f (I ) =

0 if I <= t

Transfer Functions

There are various choices for Transfer / Activation functions

1

1

1

0.5

-1

Tanh

f(x) =

(ex – e-x) / (ex + e-x)

0

Logistic

f(x) = ex / (1 + ex)

0

Threshold

0 if x< 0

f(x) =

1 if x >= 1

Artificial Neural Network Architecture

The most common neural network

architectures has three layers:

- 1st Layer: input layer

-2nd Layer: hidden layer

-3rd Layer: output layer

More complex ANN Architectures

Number of hidden layers can be

None

One

More

Feed-forward Artificial Neural Network Architecture

A collection of neurons form a ‘Layer’

X1

X2

X3

X4

Input Layer

Direction of information flow

- Each neuron gets ONLY

one input, directly from outside

Hidden Layer

- Connects Input and Output

layers

Output Layer

- Output of each neuron

directly goes to outside

y1

y2

Feed-forward Artificial Neural Network Architecture

A few restrictions

X1

X2

X3

X4

• Within a layer neurons are NOT

connected to each other.

• Neuron in one layer is connected

to neurons ONLY in the NEXT

layer. (Feed-forward)

• Jumping of layer is NOT

allowed

y1

y2

An ANN model

What do we mean by ‘A particular Model’ ?

Input: X1 X2 X3

For an ANN :

Output: Y

Model: Y = f(X1 X2 X3)

Algebraic form of f(.) is too complicated to write down.

However it is characterized by

• # Input Neurons

• # Hidden Layers

• # Neurons in each Hidden Layer

• # Output Neurons

• The adder, activation and transfer functions

• WEIGHTS for all the connections

‘Fitting’ an ANN model = Specifying values for all those parameters

ANN Model – an Example

Input: X1 X2 X3

Output: Y

X2

X1

0.6

-0.1

X3

-0.2

0.1

0.7

0.5

0.1

Model: Y = f(X1 X2 X3)

Parameters

Example

# Input Neurons

3

# Hidden Layers

1

# Hidden Layer Size

2

# Output Neurons

1

-0.2

Weights

Y

Decided by the structure

of the problem

# Input Nrns = # of X’s

# Output Nrns = # of Y’s

Learnt

Free parameters

Prediction using a particular ANN Model

Input: X1 X2 X3

X1 =1

Output: Y

X2=-1

0.6

-0.1

X3 =2

-0.2

0.1

0.7

0.5

0.2

f (0.2) = 0.55

0.55

0.9

f (0.9) = 0.71

0.71

0.1

Model: Y = f(X1 X2 X3)

0.2 = 0.5 * 1 –0.1*(-1) –0.2 * 2

f(x) = ex / (1 + ex)

f(0.2) = e0.2 / (1 + e0.2) = 0.55

Predicted Y = 0.478

-0.2

Suppose Actual Y = 2

-0.087

f (-0.087) = 0.478

0.478

Then

Prediction Error = (2-0.478) =1.522

Real Estate Appraisal by Neural Networks

Building ANN Model

Input: X1 X2 X3

Output: Y Model: Y = f(X1 X2 X3)

# Input Neurons = # Inputs = 3

# Hidden Layer = ???

# Neurons in Hidden Layer = ???

# Output Neurons = # Outputs = 1

Try 1

Try 2

Architecture is now defined …

No fixed strategy.

By trial and error

How to get the weights ???

Given the Architecture There are 8 weights to decide.

W = (W1, W 2, …, W 8)

Training Data: (Yi , X1i, X2i, …, Xpi ) i= 1,2,…,n

Given a particular choice of W, we will get predicted Y’s

( V1,V2,…,Vn)

They are function of W.

Choose W such that over all prediction error E is minimized

E = Σ (Yi – Vi) 2

Training the Model

Back Propagation

Feed forward

• Start with a random set of weights.

E = Σ (Yi – Vi) 2

• Feed forward the first observation through the net

X1

Network

V1

; Error = (Y1 – V1)

• Adjust the weights so that this error is reduced

( network fits the first observation well )

• Feed forward the second observation.

Adjust weights to fit the second observation well

• Keep repeating till you reach the last observation

• This finishes one CYCLE through the data

• Randomize the order of the input (observations)

•Perform many such training cycles till the

overall prediction error E is small.

Back Propagation

Each weight ‘Shares the Blame’ for prediction

error and tries to ‘adjust’ with other weights.

Back Propagation algorithm decides how to

distribute the blame among all weights and

adjust the weights accordingly.

Small portion of blame leads to small

adjustment.

Large portion of the blame leads to large

adjustment.

E = Σ (Yi – Vi) 2

Weight adjustment during Back Propagation

Weight adjustment formula in Back Propagation

Vi – the prediction for ith observation –

is a function of the network weights vector W = ( W1, W 2,….)

Hence, E, the total prediction error is also a function of W

E( W ) = Σ [ Yi – Vi( W ) ] 2

Gradient Descent Method :

For every individual weight W i, updation formula looks like

Wnew = Wold + α * ( ∂E / ∂W) | Wold

α = Learning Parameter (between 0 and 1)

Another slight variation is also used sometimes

W(t+1) = W (t) + α * ( ∂E / ∂W) | W(t) + β * (W(t) - W(t-1) )

β = Momentum (between 0 and 1)

Geometric interpretation of the Weight adjustment

Consider a very simple network with 2 inputs and 1 output. No hidden layer.

There are only two weights whose values needs to be specified.

E( w1, w2 ) = Σ [ Yi – Vi(w1, w2 ) ] 2

• A pair ( w1, w2 ) is a point on 2-D plane.

w1

w2

• For any such point we can get a value of E.

• Plot E vs ( w1, w2 ) - a 3-D surface - ‘Error Surface’

• Aim is to identify that pair for which E is minimum

• That means – identify the pair for which the height of

the error surface is minimum.

Gradient Descent Algorithm

• Start with a random point ( w1, w2 )

• Move to a ‘better’ point ( w’1, w’2 ) where the height of error surface is lower.

• Keep moving till you reach ( w*1, w*2 ), where the error is minimum.

Crawling the Error Surface

14.0

Error Surface

12.0

Local Minima

10.0

8.0

Error

6.0

4.0

w*

w0

Weight Space

-2.000

-1.000

0.000

1.000

2.000

3.000

4.000

5.000

6.000

6.000

5.000

4.000

3.000

2.000

1.000

0.000

-1.000

W2

-2.000

-3.000

0.0

-3.000

2.0

Global Minima

W1

Training Algorithm

Decide the Network architecture

(# Hidden layers, #Neurons in each Hidden Layer)

Decide the Learning parameter and Momentum

Initialize the Network with random weights

Do till Convergence criterion is not met

For I = 1 to # Training Data points

Feed forward the I-th observation thru the Net

Compute the prediction error on I-th observation

Back propagate the error and adjust weights

E = Σ (Yi – Vi)

2

Next I

Check for Convergence

End Do

Convergence Criterion

When to stop training the Network ?

Ideally – when we reach the global minima of the error surface

How do we know we have reached there ?

We don’t …

Suggestions:

1. Stop if the decrease in total prediction error (since last cycle) is small.

2. Stop if the overall changes in the weights (since last cycle) are small.

Drawback:

Error keeps on decreasing. We get a very good fit to training data.

BUT … The network thus obtained have poor generalizing power on unseen data

The phenomenon is also known as - Over fitting of the Training data

The network is said to Memorize the training data.

- so that when an X in training set is given,

the network faithfully produces the corresponding Y.

-However for X’s which the network didn’t see before, it predicts poorly.

Convergence Criterion

Modified Suggestion:

Partition the training data into Training set and Validation set

Use:

Training set - build the model

Validation set - test the performance of the model on unseen data

Typically as we have more and more training cycles

Error on Training set keeps on decreasing.

Error on Validation set first decreases and then increases.

Error

Validation

Training

Cycle

Stop training when the error on

Validation set starts increasing

Choice of Training Parameters

Learning Parameter and Momentum

- needs to be supplied by user from outside. Should be between 0 and 1

What should be the optimal values of these training parameters ?

- No clear consensus on any fixed strategy.

- However, effects of wrongly specifying them are well studied.

Learning Parameter

Too big – Large leaps in weight space – risk of missing global minima.

Too small –

- Takes long time to converge to global minima

- Once stuck in local minima, difficult to get out of it.

Suggestion

Trial and Error – Try various choices of Learning Parameter and Momentum

See which choice leads to minimum prediction error

Summary

q

Artificial Neural network (ANN)

– A class of models inspired by biological Neurons

q

Used for various modeling problems – Prediction, Classification, Clustering, ..

q

One particular subclass of ANN’s – Feed forward Back propagation networks

v

v

v

v

Organized in layers – Input, hidden, Output

Each layer is a collection of a number of artificial Neurons

Neurons in one layer in connected to neurons in next layer

Connections have weights

q

Fitting an ANN model is to find the values of these weights.

q

Given a training data set – weights are found by Feed forward Back

propagation algorithm, which is a form of Gradient Descent Method – a

popular technique for function minimization.

q

Network architecture as well as the training parameters are decided upon by

trial and error. Try various choices and pick the one that gives lowest

prediction error.

Self-Organizing Maps

Overview

• A Self-Organizing Map (SOM) is a way to

represent higher dimensional data in an usually 2D or 3-D manner, such that similar data is grouped

together.

• It runs unsupervised and performs the grouping on

its own.

• Once the SOM converges, it can classify new data.

• SOMs run in two phases:

– Training phase: the map is built, the network organizes

using a competitive process, it is trained using a large

numbers of inputs.

– Mapping phase: new vectors are quickly given a

location on the converged map, easily classifying or

categorizing the new data.

Example

• Example: Data on poverty levels in different countries.

– SOM does not show poverty levels, rather it shows how similar the

poverty levels of different countries are to each other. (Similar

color = similar poverty level).

Forming the Map

The Basic Process

1)

2)

3)

4)

5)

6)

7)

8)

9)

Each pixel is a node / unit.

Initialize each node’s weights with random values.

Take the next vector from training data and present it to the

SOM.

Every node is examined to find the Best Matching Unit /

Node (BMU).

Adjust BMU’s weights to make it come closer to the input

vector. The rate of adjustment is determined by the ‘learning

rate’. The learning rate decreases with each iteration.

The radius of the neighborhood around the BMU is

calculated. The size of the neighborhood decreases with each

iteration.

Each node in the BMU’s neighborhood has its weights

adjusted to become more like the BMU. Nodes closest to the

BMU are altered more than the nodes further away in the

neighborhood.

Repeat from step 3 to 7 for each vector in the training data.

Repeat from step 3 to 8 for enough iterations for convergence.

Calculating the Best Matching Unit

• Calculating the BMU is done according to the

Euclidean distance between the node’s weights

(W1, W2, … , Wn) and the input vector’s values

(V1, V2, … , Vn).

– This gives a good measurement of how similar the two

sets of data are to each other.

– The node closest to the input vector is chosen as the

BMU (or ‘winning’ node).

Determining the BMU

Neighborhood

•

Size of the neighborhood: uses an exponential decay function that shrinks

on each iteration until eventually the neighborhood is just the BMU itself.

•

Effect of location within the neighborhood: The neighborhood is defined by a

gaussian curve so that nodes that are closer are influenced more than farther

nodes.

Modifying the Nodes’ Weights

• The new weight for a node is the old weight, plus a fraction (L) of

the difference between the old weight and the input vector…

adjusted (by theta) depending on the distance from the BMU.

• The learning rate, L, is also an exponential decay function.

– This ensures that the SOM will converge.

• The lambda represents a constant, and t is the time step

SOM Application: Text Clustering