
ACTION RECOGNITION BASED ON MULTI-LEVEL
REPRESENTATION OF 3D SHAPE
by
Binu M Nair

Bachelor of Electronics and Communication, April 2007,
Cochin University of Science and Technology

A Thesis Submitted to the Faculty of
Old Dominion University in Partial Fulfillment of the
Requirement for the Degree of
MASTER OF SCIENCE
ELECTRICAL AND COMPUTER ENGINEERING
OLD DOMINION UNIVERSITY
August 2010

Approved by:

Vijayan K. Asari (Director)

Frederic D. McKenzie (Member)

Jiang Li (Member)

ABSTRACT
ACTION RECOGNITION BASED ON MULTI-LEVEL
REPRESENTATION OF 3D SHAPE
Binu M Nair
Old Dominion University, 2010
Director: Dr. Vijayan K. Asari

A novel algorithm is proposed in this thesis for recognizing human actions using a combination of two shape descriptors, one of which is a 3D Euclidean distance transform and the other based on the Radon transform. This combination captures the variations of the space time shape that are necessary for recognizing actions. The space time shapes are created by the concatenation of human body silhouettes across time. Comparisons are made against some common shape descriptors such as the Zernike moments and the Radon transform. The algorithm is also compared with an algorithm which uses the same concept of a space time shape but relies on another shape descriptor based on Poisson's equation. The proposed algorithm uses a 3D Euclidean distance transform to represent the space time shape, and this shape descriptor is less complex than the Poisson's equation based shape descriptor. By taking the gradient of this distance transform, the space time shape can be divided into different levels, with each level representing a coarser version of itself. Then, at each level, specific features such as the R-Transform feature set and the R-Translation vector set are extracted and concatenated to form the action features. The action features extracted from a space time shape of a test sequence are compared with the action features of the space time shapes of the training sequences using the minimum Euclidean distance metric, and they are classified using the nearest neighbor approach. The algorithm is tested on the Weizmann action database, which consists of 90 video sequences in which 10 different actions are performed by 9 different people. Research work is being done to improve the recognition accuracy by extracting features which are more localized and classifying them using a more sophisticated technique.

© Copyright, 2010, by Binu M Nair, All Rights Reserved


ACKNOWLEDGEMENTS
I would like to thank my advisor, Dr. Vijayan K. Asari for all his support and
constant guidance for my thesis as well as for my coursework and for providing me an
excellent opportunity to work in the Vision Lab. It has been a wonderful experience
during the last two years and I feel that I have reached a new level of technical
expertise by just being in the Vision Lab. For that, I have a lot of gratitude to my
advisor.
I wish to thank Dr. Jiang Li for his support not only for my thesis but also for the machine learning course, from which I have gained a lot of information regarding the theoretical concepts of various classifiers and their implementation, which in turn helped me in my thesis work. I also wish to thank Dr. Frederic D. McKenzie
for being my committee member and helping me to finalize my thesis work. I also
wish to thank Dr. Zia-ur Rahman for teaching the image processing course where I
learned the basics and the implementation of the image processing algorithms in C.
This helped me a great deal in my thesis.
I want to thank my parents for their continuous encouragement, especially my father, who supported me not only in financial matters but also in harsh times, which allowed me to focus on my research work. I wish to thank all of my labmates for their support. Finally, I wish to thank my good friend Ann Mary for continuously
pushing me to finish the thesis in time and motivating me to present the thesis with
full confidence.
Once again, thank you all for giving me the encouragement, motivation, and
support.


TABLE OF CONTENTS

List of Tables
List of Figures

CHAPTERS

1 Introduction
2 Literature Survey
   2.1 Action Recognition Algorithms Based on Motion Capture Fields/Trajectories
       2.1.1 Algorithm Based on Motion History Images
       2.1.2 Algorithm Based on Optical Flow
       2.1.3 Algorithm Based on Bag of Words Model
       2.1.4 Algorithm Based on 3D SIFT Descriptors and Bag of Words Model
       2.1.5 Algorithm Based on Trajectories
       2.1.6 Algorithm Based on PCA and HMM
       2.1.7 Algorithm Based on Space Time Shapelets
   2.2 Action Recognition Algorithms Based on Shape Descriptors
       2.2.1 Some Common Shape Descriptors
       2.2.2 Algorithm Based on Poisson's Equation Based Shape Descriptor
       2.2.3 Algorithm Based on R-Transform
       2.2.4 Algorithm Based on Shape Descriptors and Optical Flow
       2.2.5 Algorithm Based on Local Descriptors and Holistic Features
   2.3 Summary
3 Multi-Level Shape Representation
   3.1 Radon Transform Based Shape Descriptor
       3.1.1 Definition of Radon Transform
       3.1.2 Geometrical Interpretation of the Radon Transform
       3.1.3 Computation of the Radon Transform
       3.1.4 Application of the Radon Transform to Binary Images
       3.1.5 Properties of Radon Transform
       3.1.6 R-Transform
   3.2 Distance Transform based on Euclidean Distance
       3.2.1 Computation of the Distance Transform
   3.3 Multi-level Representation of 2D Shapes Using Chamfer Distance Transform and R-Transform
   3.4 Summary
4 Action Recognition Framework
   4.1 Silhouette Extraction and Formation of Space Time Shape
   4.2 Segmentation of Space Time Shape into Different Levels
       4.2.1 Computation of 3D Distance Transform
       4.2.2 Segmentation of a 3D Shape
   4.3 Extraction of Action Features
       4.3.1 R-Transform Feature Set
       4.3.2 R-Translation Vector Set
   4.4 Classification of Action Features
   4.5 Summary
5 Experimental Results
   5.1 Variation of Space Time Shape Length and Overlap
   5.2 Comparison of Proposed Algorithm with Other Methods
   5.3 Summary
6 Conclusions and Future Work

BIBLIOGRAPHY
VITA

LIST OF TABLES

1 Notations Representing each Action.
2 Confusion Matrix Obtained with the Proposed Algorithm.
3 Confusion Matrix Obtained with Zernike Moments, R-Transform and Poisson's Equation Shape Descriptor.

LIST OF FIGURES

1.1 Block Schematic of the Proposed Algorithm.
3.1 Geometric Interpretation of the Radon Transform as Shown in [32].
3.2 Definition of Radon Transform and its Computation for a Pixel.
3.3 Projections of Squares into Radon Space.
3.4 Projections of Human Silhouettes into Radon Space.
3.5 R-transform of a Silhouette.
3.6 Properties of R-Transform.
3.7 8-Level Segmentation of Silhouettes of a Dog and a Human.
4.1 Mean and Median Backgrounds.
4.2 Silhouette Extraction.
4.3 Space-time Shapes of Jumping Jack and Walk Actions.
4.4 Use of Morphological Operations of Dilation and Erosion.
4.5 Sample Frames of the 3D Distance Transformed Space-time Shapes with Various Aspect Ratios.
4.6 Sample Frames of the Normalized Gradient of the Space-time Shape with Various Aspect Ratios.
4.7 8-Level Segmentation of Different Frames of a Space Time Shape of a Jumping Jack Action.
4.8 R-Transform Set for Single Level of a Space Time Shape.
4.9 R-Translation Vector Set for a Space Time Shape.
5.1 Bar Graph Showing the Overall Accuracy Obtained with the Proposed Algorithm for Different Lengths of the Space Time Shape. The Overlap is just Half the Length of the Space Time Shape.
5.2 Accuracy for Different Combinations of (length, overlap) of Space Time Shape.
5.3 Error Rate for Different Combinations of (length, overlap) of Space Time Shape.
5.4 Accuracy for Different Combinations of (length, overlap) of Space Time Shape for Different Shape Descriptors.

CHAPTER 1
INTRODUCTION
In recent years, human action recognition has been a widely researched area in computer vision, as it has many applications relating to security and surveillance. One such application is the human action detection and recognition algorithm implemented in systems using CCTV cameras, where the system can discriminate suspicious actions from normal human behavior and alert authorities accordingly [1]. Another area is object recognition, where evaluating the human interaction with an object and recognizing the human action allows the object involved in the interaction to be identified [2]. Thus arises the need for fast and robust algorithms for human action recognition. Action recognition involves the extraction of features which represent the variations caused by the action with respect to time, and these variations can be extracted using two basic types of approaches. One approach involves the use of 2D or 3D shape descriptors which give an internal representation of the space time shape and thereby facilitate the extraction of suitable action features. The other approach uses motion fields, such as motion history images and optical flow vectors, or the trajectories of specific body parts. The various invariant properties extracted from these motion fields and trajectories are considered action feature vectors. In the algorithm presented in this thesis, a combination of two different types of shape descriptors is used for the representation of a space time shape for feature extraction. (This thesis follows the IEEE journal format.)
An overview of the algorithm is shown as a block diagram in Figure 1.1. The steps of the algorithm are:
• Extraction of the human silhouette from a video sequence by foreground segmentation.
• Formation of the space time shape by concatenating a predefined number of silhouettes and applying the 3D Euclidean distance transform.
• Using the gradient of the distance transform to segment the space time shape
into multiple levels.
• Extracting the R-Transform feature at every level and at every frame of the
space time shape and the R-Translation vector at the coarsest level at every
frame.
• Concatenating the R-Transform Feature set and the R-Translation Vector set
to form the action features.
• Comparing these features using the minimum Euclidean distance metric and
classifying them using the nearest-neighbor approach.
To extract a silhouette of a human body in a video, foreground segmentation must
be performed on every frame of a video sequence by learning the background model.
Since the video sequences in the database used in the testing of the algorithm contain
a static background with uniform lighting, a simple background model based on the
median of the pixels of the frames is sufficient. The median background model is
easier to implement and less complex. This is preferred over the mean background
due to the fact that, unlike the mean, the median statistic of the pixels across the
frames is not affected by sudden changes in the pixel values. More details on the
background segmentation will be given in the coming chapters.
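As a rough illustration of this step, the sketch below (not code from the thesis; the array name `frames`, its (T, H, W) layout, and the threshold value are assumptions) builds a per-pixel median background and thresholds the absolute difference to obtain binary silhouettes.

```python
# Minimal sketch of median-background silhouette extraction, assuming `frames`
# is a NumPy array of grayscale frames shaped (T, H, W). Not the thesis code.
import numpy as np

def extract_silhouettes(frames, threshold=30):
    """Return binary silhouettes by subtracting a per-pixel median background."""
    background = np.median(frames, axis=0)              # median over time is robust to moving pixels
    diff = np.abs(frames.astype(np.float32) - background)
    silhouettes = diff > threshold                       # foreground where the deviation is large
    return silhouettes.astype(np.uint8)
```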
Next, the silhouettes obtained from the video sequence must be concatenated
across a predetermined number of frames so as to form space time shapes. Each
video sequence in the database will contain a certain number of space time shapes
with each one having a certain overlap with the previous one. Once the space time
shape is obtained, a 3-D shape descriptor should be used. The purpose of the 3D shape descriptor in the proposed algorithm is to give an internal representation of the
space time shape by segmenting it into different levels with each level representing
its coarseness. The 3D shape descriptor based on the Euclidean distance transform
is selected for this type of representation. By using the gradient of the distance
transform, the space time shape is divided into multiple levels.
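The sketch below gives a simplified version of this multi-level idea, assuming the space time shape is a binary (T, H, W) array: it bins a normalized 3D Euclidean distance transform into levels, whereas the proposed algorithm derives the levels from the gradient of the transform, as detailed in Chapter 4.

```python
# Simplified sketch of multi-level segmentation of a space time shape. It bins
# the normalized 3D Euclidean distance transform into levels; the proposed
# algorithm instead uses the gradient of this transform (Chapter 4), but the
# thresholding idea is the same. `stack` and `num_levels` are assumed names.
import numpy as np
from scipy.ndimage import distance_transform_edt

def segment_levels(stack, num_levels=8):
    """stack: binary (T, H, W) space time shape; returns an integer level per voxel."""
    dist = distance_transform_edt(stack)               # distance of each interior voxel to the boundary
    norm = dist / (dist.max() + 1e-8)                  # normalize to [0, 1]
    levels = np.ceil(norm * num_levels).astype(int)    # level 1 near the boundary .. num_levels at the core
    return levels                                       # background voxels stay at level 0
```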
Once the multi-level representation of the space time shape is obtained, each level is analyzed separately and suitable features are extracted. In this algorithm, the analysis at each level is done by using another shape descriptor. This shape descriptor is applied to the 2D human silhouette in a frame of the space time shape at every level, and the variations in the properties extracted from this shape descriptor across the frames are further analyzed to give the action features. The shape descriptor known as the R-Transform is used to capture the variations of the 2D shape in a frame [22]. This R-transform is in fact derived from the 2-D Radon transform by integrating it over one of the variables [18]. It can be made scale-invariant by using a suitable scaling factor. The R-transform possesses the other two properties of translation and rotation variance, which again makes it much more suited to bring out the variations in a space time shape. So, a collection of R-Transforms is obtained at every level of the space time shape. These can be called the R-Transform feature set. Another set of features can be extracted from the 2D Radon transform by integrating it over the other variable. This set of features, known as the R-Translation vector set, can be used to determine how much the 2D shape has moved from its initial position within a space time shape.
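To make the two feature sets concrete, the sketch below computes a Radon sinogram of a single silhouette with scikit-image and derives an R-Transform-like profile over the angle and an R-Translation-like profile over the displacement variable. The normalization choices here are assumptions; the precise definitions used in the thesis appear in Chapters 3 and 4.

```python
# Minimal sketch of R-Transform and R-Translation style features from the 2D
# Radon transform of a silhouette. Normalizations are assumptions, not the
# thesis' exact definitions.
import numpy as np
from skimage.transform import radon

def r_features(silhouette, angles=np.arange(180)):
    """silhouette: 2D binary image. Returns (profile over angle, profile over s)."""
    sinogram = radon(silhouette.astype(float), theta=angles, circle=False)  # rows: s, columns: angle
    r_transform = np.sum(sinogram**2, axis=0)            # integrate squared projections over s
    r_transform /= (r_transform.max() + 1e-8)             # crude scale normalization
    r_translation = np.sum(sinogram, axis=1)              # integrate over angle, profile along s
    return r_transform, r_translation
```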
Figure 1.1: Block Schematic of the Proposed Algorithm.

A complete action feature representation is formed by concatenating the R-Transform feature set and the R-Translation vector set, and this concatenated set of features represents the particular action of an individual in the space time shape. These action features extracted from test space time shapes containing the same type of action are compared with the action features extracted from the training set of space time shapes using the Euclidean distance metric and they are categorized using the nearest neighbor classifier [27]. The evaluation of the algorithm is done by finding the recognition rate for each action type for different lengths of the space time shape. Moreover, the comparison of this algorithm is made with methods which use only a single shape descriptor for shape representation. Further research work is being done to localize these features of the space time shapes so as to improve the recognition rate.
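A minimal sketch of this matching and classification step, assuming the action features have already been arranged as fixed-length vectors in `train_feats` and `test_feats` (hypothetical names):

```python
# Minimal sketch of the matching step: each test action feature is assigned the
# label of the training feature with the minimum Euclidean distance (1-NN).
# `train_feats`, `train_labels`, `test_feats` are assumed NumPy arrays/lists.
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    predictions = []
    for f in test_feats:
        dists = np.linalg.norm(train_feats - f, axis=1)   # Euclidean distance to every training feature
        predictions.append(train_labels[int(np.argmin(dists))])
    return predictions
```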
The specific objectives of the proposed algorithm are:
• To extract the silhouettes at every frame of all the video sequences in the database by background segmentation with suitable thresholding and suitable morphological operations.
• To organize the silhouette frames of every video sequence in the database into space time shapes of a fixed length with a predefined overlap.
• To apply the 3D Euclidean distance transform to every space time cube of every video sequence and take its gradient in order to segment it into a predefined number of levels.
• To apply the 2D Radon transform on every silhouette frame of the space time shape at every level.
• To extract the R-Transform feature set as well as the R-Translation vector set and concatenate them to form the action feature set.
• To store these action features in an action feature database and use this database to create test and training sets based on a variant of the leave-one-out procedure.
• To classify the test action features by the nearest neighbor approach and evaluate the algorithm by determining the recognition rate for each action and compare the results with a known algorithm.
The thesis is organized as follows:
Chapter 2 gives a brief overview of the current algorithms that have been proposed over the last few years for action recognition and looks closely at the analytical tools or transforms used. First, the algorithms based on motion fields and trajectories are discussed. Then, the various shape descriptors and the action recognition algorithms based on these are explained briefly. Finally, this Chapter will introduce the shape descriptor on which the proposed algorithm is based.
Chapter 3 describes in detail the shape descriptor based on the Radon transform and its purpose in the action recognition framework. The 2D Radon transform is first explained mathematically along with its geometrical interpretation and illustrations. Then, the actual shape descriptor known as the R-Transform and its properties are explained. Finally, the concept of a multi-level representation of a 2D shape using the R-Transform and the Chamfer distance transform is discussed.
Chapter 4 describes the proposed algorithm, which is based on a combination of two shape descriptors, namely the Euclidean distance transform and the R-Transform. It explains in detail, with suitable illustrations, how the gradient of the distance transform is used in segmenting a space time shape into multiple levels and describes how the action features are extracted from each level. Finally, the classifier used in the algorithm is briefly described.
Chapter 5 gives the experimental results which are obtained by simulation of the algorithm on a known database and compares these results with other existing algorithms. The advantages and disadvantages of using the proposed algorithm are discussed and the results are analyzed to find the reasons behind some of the low and high recognition rates achieved.

Chapter 6 gives a complete summary of the proposed algorithm and states the conclusions drawn from the analysis of the results. It also suggests improvements that can be made to extract more robust features for action recognition. Future work is also discussed in this Chapter, which involves improved preprocessing stages for better feature extraction.


CHAPTER 2
LITERATURE SURVEY
Some of the earlier works in action recognition were based on capturing motion features by computing the optical flow or motion history images at each frame and then using the variation in these motion features for action recognition. Some others required tracking of certain body points such as the limbs, torso, and head across the frames of the video sequence. The trajectories of these points are analyzed and the properties extracted from these trajectories are then used as action features. However, the algorithm presented in this thesis is based on frameworks which capture human silhouette variations across the video sequence using shape descriptors. Both 2D and 3D shape descriptors can be used. In the former, the shape descriptor is applied to each frame of the video sequence and the variations in the silhouette shape description occurring across the video sequence are considered action features. The latter case involves extraction of properties of the 3D shape after applying a suitable 3D shape descriptor. The algorithm presented in this thesis is a combination of both approaches. This Chapter can be divided into two main sections. The first half of the Chapter reviews the work done in the action recognition field which directly computes motion fields. The second half discusses algorithms that capture the motion variations using shape descriptors.

2.1 ACTION RECOGNITION ALGORITHMS BASED ON MOTION CAPTURE FIELDS/TRAJECTORIES

In this section, various action recognition frameworks are discussed, some of which use motion flow fields such as optical flow and motion history images, and others the trajectories of specific silhouette points. The advantages and disadvantages of using these algorithms are also discussed.

2.1.1 Algorithm Based on Motion History Images

Some of the work in human action or movement recognition has been done using
temporal templates which give the motion characteristics at every spatial location of
a frame in an image sequence. The motion characteristics at a pixel of the current frame depend on the motion characteristics of the corresponding pixel in previous frames. In [3], the temporal template extracted from a frame is an image where each pixel holds a vector, with one component obtained from the binary motion energy image (MEI) and the other component from the motion history image (MHI). The motion energy image gives the location of the occurrence of motion in the image while the motion history image has pixel values corresponding to the recency of the motion. The binary motion energy image $E_\tau(x, y, t)$ is formed by accumulating the silhouettes for a specific number of frames and can be defined as

$$E_\tau(x, y, t) = \bigcup_{i=0}^{\tau - 1} D(x, y, t - i) \qquad (2\text{-}1)$$

where $D(x, y, t)$ is a binary image representing the masked human silhouette at time $t$. The motion energy images can be used to incorporate view-invariant feature extraction by including the motion energy images computed at different viewpoints for the same action in the training data. For the motion history image $H_\tau(x, y, t)$, the pixel intensity gives the temporal history of motion at that pixel and it is defined as
$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max\left(0,\; H_\tau(x, y, t - 1) - 1\right) & \text{otherwise} \end{cases} \qquad (2\text{-}2)$$

A pixel value in a motion history image gives the recency of motion at that pixel. In
other words, the brighter the pixel, the more recent the motion is at the corresponding location. The computation of the vector template image involves computing the
motion history image first and then thresholding it to get the motion energy image.
Statistical matching using Hu moments is performed on the vector templates and the Mahalanobis distance measure is used to compare and classify these moment-based features.
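A compact sketch of these temporal templates, assuming `D` is a binary silhouette sequence of shape (T, H, W); the MEI is recovered by thresholding the MHI, as described above:

```python
# Minimal sketch of the MEI/MHI temporal templates of [3]. `D` is an assumed
# binary silhouette sequence (T, H, W); tau is the temporal window length.
import numpy as np

def motion_templates(D, tau):
    T, H, W = D.shape
    mhi = np.zeros((H, W), dtype=np.float32)
    for t in range(T):
        # Eq. (2-2): reset to tau where motion is present, otherwise decay by 1
        mhi = np.where(D[t] == 1, float(tau), np.maximum(0.0, mhi - 1.0))
    mei = (mhi > 0).astype(np.uint8)                   # Eq. (2-1): union of the recent silhouettes
    return mei, mhi                                    # templates at the final frame
```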
Another variation of the motion history image, termed the timed MHI or timed motion history image, is used for silhouette pose estimation and for segmentation of moving parts [4]. The difference between the timed MHI and the previously defined MHI is that the pixel values are stored according to the current timestamp, which is in floating point format. By taking the gradient of the timed MHI, optical flow vectors can be extracted. These vectors provide the local orientation at each pixel. Features such as the radial histogram and the global orientation can also be extracted from this optical flow representation, and these, along with the segmented body parts, can be used in the moment-based statistical matching for motion classification purposes.
A hierarchical MHI representation organized according to the speed of motion can also be derived from the MHI computed at the current frame in order to compute local motion fields [6]. The MHI pyramid is computed from the image pyramid where each frame is subsampled successively to a certain number of levels. The highest level in the representation has the smallest image size or the coarsest image in the pyramid. The idea behind the image pyramid is that each level represents a particular range of the speed of motion. In other words, faster motion displacements in the lower levels of the pyramid are represented by smaller displacements in the higher levels.
Thus, from each pyramid level, a MHI image is computed where the motion fields
are extracted. These motion fields from each level are then resampled to the original
size of the image and combined together to form the final motion field. Many types of features can be extracted from this motion field, such as the polar orientation histograms which are linear invariant in nature. These histograms are the final action
feature vectors and the classification is done by directly comparing the test histogram
with the ones in the database using the Euclidean distance measure.
One drawback with motion history images is that they are not very robust to partial occlusions. The motion history images computed from a partially occluded silhouette frame will differ from those computed from non-occluded silhouettes. This may give rise to distorted motion fields, which may result in inaccurate classification, especially when performing statistical matching with Hu moment features, which are very sensitive to changes in shape.

2.1.2 Algorithm Based on Optical Flow

A motion descriptor based on optical flow measurements has been used to describe the action of an individual at a far-off distance from the camera [7]. Here, a figure-centric spatio-temporal volume is extracted in which the individual in each frame is centered to stabilize the motion and to discard motion due to a shaky camera. Then, the optical flow vector at each frame is computed using the Lucas-Kanade algorithm. The optical flow field $\bar{F}_{xy}$ is separated into two components, namely the horizontal component $\bar{F}_x$ and the vertical component $\bar{F}_y$. Each of these components is then half-wave rectified to get two more components, thereby giving a total of four components, namely $\bar{F}_x^-$, $\bar{F}_x^+$, $\bar{F}_y^-$ and $\bar{F}_y^+$. Since these four components are noisy measurements, they are smoothed using a Gaussian kernel and normalized to give the four components $\bar{F}b_x^-$, $\bar{F}b_x^+$, $\bar{F}b_y^-$, $\bar{F}b_y^+$ at each frame. The set of these four vectors at each frame across the figure-centric spatio-temporal volume is considered as a motion descriptor. In other words, the variation in the four components of the optical flow field across the frames of the video sequence gives rise to action features suitable for recognition. For comparison of these action features, a similarity measure based on normalized correlation is used. Here, to compare two video sequences A and B, a similarity measure between the motion descriptor of A centered at frame i and that of B at frame j is computed by

$$S(i, j) = \sum_{t \in T} \sum_{c=1}^{4} \sum_{(x, y) \in I} a_c^{i+t}(x, y)\, b_c^{j+t}(x, y) \qquad (2\text{-}3)$$

where $T$ and $I$ are the temporal and spatial extents of the motion descriptors $a_c$ and $b_c$, and $c$ indexes the four components. Therefore, from every frame of video sequences A and B, a similarity measure is obtained which can be organized as a matrix known as the motion-to-motion similarity matrix S. Then, for classifying the action present in a frame of a test sequence, the k-nearest neighbor rule is applied to the motion descriptors and the action in the frame is appropriately labeled. Similar to the case of the MHI, this approach is not very robust to partial occlusions, as the optical flow motion fields get distorted to a greater degree, thereby increasing the misclassification rate. On the other hand, this approach is fast and can be implemented in real time, as the action is recognized at every frame of a video sequence even at very low resolutions.
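The rectification, smoothing, and normalization of the flow components can be sketched as follows; the optical flow itself is assumed to be supplied by a separate Lucas-Kanade step, and the joint normalization shown is only one possible choice:

```python
# Minimal sketch of the four half-wave rectified, blurred flow channels of [7].
# Fx, Fy are assumed per-pixel flow components from an external flow estimator.
import numpy as np
from scipy.ndimage import gaussian_filter

def flow_channels(Fx, Fy, sigma=1.5):
    """Return the four blurred, normalized half-wave rectified channels."""
    channels = [np.maximum(Fx, 0), np.maximum(-Fx, 0),   # half-wave rectification of Fx
                np.maximum(Fy, 0), np.maximum(-Fy, 0)]   # half-wave rectification of Fy
    blurred = [gaussian_filter(c, sigma) for c in channels]
    total = np.sqrt(sum(b**2 for b in blurred)) + 1e-8   # simple joint normalization (an assumption)
    return [b / total for b in blurred]
```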

2.1.3 Algorithm Based on Bag of Words Model

Spatio-temporal features can be extracted from the video sequences containing different types of actions and these features can be considered like video words in a
codebook [8]. By interpreting each video sequence in the training set as a set of
video words, a model for each action category can be learned. The features extracted from the video sequence are space time interest points in a space time shape.
They are obtained by finding the gradient or optical flow from the region of interest
in each frame of the video sequence. The regions of interest are actually the local
maxima regions of the response function $R$ computed at every frame and this is given by
$$R = (I \ast g \ast h_{ev})^2 + (I \ast g \ast h_{od})^2 \qquad (2\text{-}4)$$

where $g(x, y, \sigma)$ is the 2D Gaussian filter applied in the $x$-$y$ domain and $(h_{ev}, h_{od})$ are the quadrature pair of Gabor filters applied temporally. From each of the space time regions, which are spatio-temporal cubes centered at the interest points, descriptors such as the brightness gradient or the windowed optical flow field are computed, and all of the computed descriptors are concatenated to form a large feature vector. These vectors are then projected into a lower dimension using principal component analysis and the lower dimensional representation of the feature vector is considered as a video word. Therefore, each action category is modeled probabilistically by using the video words as the input data. By noting the number of occurrences $n(w_i, d_j)$ of a word $w_i$ from the codebook in a video sequence $d_j$, the joint probability $P(w_i, d_j)$ can be computed as $P(w_i, d_j) = P(d_j)\, P(w_i | d_j)$. From this joint probability density, the conditional probability density can be computed as
$$P(w_i | d_j) = \sum_{k=1}^{K} P(z_k | d_j)\, P(w_i | z_k)$$
where $z_k$ refers to an action category and $K$ is the total number of such categories. Estimation of the conditional probability density $P(w_i | z_k)$, which gives the probability of the video word $w_i$ belonging to a category $z_k$, is done by maximizing the objective function using the expectation maximization or EM algorithm. The objective function is given by
$$\prod_{i=1}^{M} \prod_{j=1}^{N} P(w_i | d_j)^{n(w_i, d_j)}$$
Once the conditional probability density $P(w | z)$ is learned, the posterior probability of an action category $P(z_k | w_i, d_j)$ can be known. Therefore, from a test video sequence containing a set of video words, this posterior probability for each video word is computed and the maximum of these probabilities gives the action category to which this video sequence belongs.
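A sketch of the response function of Eq. (2-4), with illustrative (assumed) filter parameters rather than the values used in [8]:

```python
# Minimal sketch of the interest-point response function of Eq. (2-4): a spatial
# Gaussian combined with a temporal quadrature pair of Gabor filters. The filter
# parameters (sigma, tau, omega) are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def response_function(video, sigma=2.0, tau=1.5, omega=0.6):
    """video: (T, H, W) grayscale sequence. Returns R with the same shape."""
    smoothed = gaussian_filter(video.astype(np.float32), sigma=(0, sigma, sigma))  # 2D Gaussian per frame
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)   # even temporal Gabor
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)   # odd temporal Gabor
    even = convolve1d(smoothed, h_ev, axis=0)
    odd = convolve1d(smoothed, h_od, axis=0)
    return even**2 + odd**2                                          # Eq. (2-4)
```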
The advantage of this algorithm is that no foreground segmentation is necessary, as the interest points are extracted from the response function defined directly on the image pixel intensity. These interest points are generated due to sudden changes in the spatial characteristics of local regions that are part of a complex action. But interest points can also be created by noise in the video sequence, which produces unwanted video words and may affect the classification accuracy.

2.1.4 Algorithm Based on 3D SIFT Descriptors and Bag of Words Model

Similar to the algorithm based on the bag of words model, this algorithm uses 3D
SIFT descriptors in place of the descriptors which were based on the gradient magnitude. Also, the interest regions are chosen at random rather than choosing the
local maxima regions of the response function based on Gabor space. The features
extracted from the 3D SIFT descriptors are the sub-histograms which are then considered as video words for the bag of words model [9]. Moreover, once the video
words are obtained, the video words are grouped according to their relationships and
these discovered groupings are used for the classification task.
For computation of the 3D SIFT descriptor, first the overall orientation for the
3D neighborhood surrounding an interest point should be calculated. By taking the
spatio-temporal gradient in that neighborhood, the orientation at each pixel or the
local orientation can be computed. An orientation histogram for that neighborhood
is then calculated, from which the dominant peak or the overall orientation is computed. This is used to rotate the 3D neighborhood about its interest point so that the dominant peaks of all the 3D neighborhoods are in the same direction, which in turn makes the features rotation invariant. After rotation, these 3D neighborhoods
are divided into sub-regions of a fixed size and the magnitude and orientation at each
pixel in a sub-region is computed. Using the gradient information in the sub-region,
a histogram is computed, and these histograms are called sub-histograms since they are associated with only a sub-region. The final descriptor is then obtained by vectorizing these sub-histograms and concatenating them to form the final vector, termed the 3D SIFT descriptor.
The selection of the interest regions is done by random sampling of the pixels at different locations, times, and scales. Then, the feature vectors or SIFT descriptors obtained from the neighborhood of every interest point are quantized using the K-means clustering algorithm. This leads to a predefined number of groups whose centers make up the video word vocabulary. The 3D SIFT descriptors from the videos are matched to each video word in the vocabulary and, for a space time cube, the frequency of each word in the vocabulary, known as a signature, is computed. Support vector machines (SVM) are used to train each action category using a modified version of the signatures and the classification is done on the basis of the largest distance from the action category.
The one drawback that this algorithm has is that the interest points are selected
at random. If the interest points are selected based on another set of features which
are directly correlated with the action, then the complexity of the classification can be reduced by using the signatures directly into the classifier without any modification.
Moreover, the features used to detect the interest points can be used as additional
features for classification, provided these features are directly related to the motion.

2.1.5 Algorithm Based on Trajectories

Another algorithm for action recognition considers the human action to be generated by a non-linear dynamical system where the state variables are defined by the
reference joints in the human body silhouette and their functions are defined by the
trajectories of these joints in the time domain [10]. Action features are derived from the properties of these trajectories by considering them as time series data. There are many methods available for studying time series data, but the one used in this algorithm analyzes the non-linear dynamics of human actions using concepts from the theory of chaotic systems [11].
The idea behind chaos theory is that there is some form of determinism in otherwise random data and that the determinism is due to some underlying non-linear dynamics. In other words, a chaotic time series is one which is apparently random in nature but has been generated by a deterministic process. Dynamical systems are represented by state space models with state variables $X(t) = [x_1(t)\; x_2(t) \ldots x_n(t)] \in \mathbb{R}^n$ defining the status at time $t$. Attractors are regions of phase space, the space spanned by the state variables, where the collection of all the trajectories, or the paths of the variables, settle down as time $t$ approaches infinity. These attractors are termed strange if they are not stable. So, the invariants of the dynamical system's attractor represent the non-linear nature of the system and its properties can be used in a classification problem. Here, the non-linear dynamical system which represents the human action generates the chaotic time series data which, in this case, are the trajectories of the reference joints. Therefore, by extracting the strange attractors, the human actions can be distinguished.
The first step is to obtain the trajectories of the reference joints, namely the
head, two hands, two legs, and the belly in the video sequence. The scale and translation invariance of the trajectories are obtained by normalizing the trajectories with
respect to the belly point. Each single-dimensional time series is converted to a multi-dimensional signal, thereby modifying the state of the system. The modified state space and the original state space have the same properties according to chaos theory, and the modified state space brings out the deterministic features. The modified phase space invariants are then extracted to distinguish the different attractors
generated by different human actions. The invariants extracted in this algorithm
are the Maximal Lyapunov Exponent, the Correlation Integral, and the Correlation Dimension. The Lyapunov exponent is considered a dynamical invariant which measures the exponential divergence of the trajectories in the phase space. The correlation integral quantifies the density of points while the correlation dimension measures the change in density. Along with these three features, the variance of the time series data is also included. So, a 4D feature is obtained from each time series and, if there are K time series obtained from K reference joints, then a total of K × 4 features are available for a human action. Identification of the action is done by first comparing the test feature vector with those in the database using some distance metric and then classifying the test feature vector using the K-Nearest Neighbor rule.
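As an illustration of how a single joint trajectory is turned into a multi-dimensional phase-space signal, the following delay-embedding sketch can be used; the embedding dimension and delay are placeholders that would normally be estimated from the data:

```python
# Minimal sketch of a delay (Takens) embedding that turns a 1D joint trajectory
# into a multi-dimensional phase-space signal. m and lag are assumed values.
import numpy as np

def delay_embed(x, m=3, lag=5):
    """x: 1D time series. Returns an (N - (m-1)*lag, m) matrix of embedded states."""
    n = len(x) - (m - 1) * lag
    return np.column_stack([x[i * lag : i * lag + n] for i in range(m)])
```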
This algorithm gives good recognition accuracies, but one drawback is that the features are extracted from the trajectories of joints, and to get a noise-free trajectory, good background segmentation and tracking are required. If the human silhouette is partially occluded in such a way that one of the joints is not visible, then the trajectory corresponding to that joint will be distorted, which may affect the extracted properties and hence may increase the misclassification rate.

2.1.6 Algorithm Based on PCA and HMM

The action features extracted here [12] are the Cartesian form of the optical flow velocity and the vectorized form of the human body silhouette. Since the vectorized silhouette is of a higher dimension, it is reduced to a lower dimensional feature space using PCA. Each action category is then modeled using hidden Markov models (HMMs), and the combination of the reduced silhouette feature vector and the optical flow vector is used in the training of these models for every possible viewing direction.
The silhouettes are extracted from the video by foreground segmentation, where the algorithm models the background pixel color value as a Gaussian. If a pixel's color value deviates from this model by more than a certain threshold, the pixel belongs to the foreground.
These silhouettes obtained at every frame of the video are normalized in size by bi-cubic interpolation. Then, PCA is performed on these normalized silhouettes by considering them as N-dimensional data points. The analysis is done by computing the mean and the covariance matrix of these data points and then computing the eigenvectors that span the variation in the silhouettes. So, at every frame, the silhouette is projected onto the eigenspace to get the lower dimensional feature vector representing that silhouette. To estimate the non-rigid motion of the human body at every pixel, the optical flow velocity is calculated. The action region is divided into K blocks and the average value of the optical flow motion field is extracted from each. The average values from all the blocks of an action region are concatenated to form the optical flow feature vector. For each frame in every action, the reduced silhouette feature vector and the optical flow feature vector are combined to form the final feature vector. Hidden Markov models are used in the modeling of the actions as they are useful in capturing the variations of time series data, and classification is done using the maximum likelihood approach.
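A minimal sketch of the PCA step on vectorized silhouettes, assuming they are stacked in an (N, H, W) array; the number of retained components k is an arbitrary choice here:

```python
# Minimal sketch of PCA on vectorized, size-normalized silhouettes. The array
# name `silhouettes` and the choice k=20 are assumptions.
import numpy as np

def pca_project(silhouettes, k=20):
    X = silhouettes.reshape(len(silhouettes), -1).astype(np.float32)  # N x (H*W) data points
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data gives the eigenvectors of the covariance matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvecs = Vt[:k]                                # top-k principal directions
    return Xc @ eigvecs.T, mean, eigvecs            # reduced features plus the projection basis
```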

2.1.7 Algorithm Based on Space Time Shapelets

Here, the concept of mid-level features known as space time shapelets is introduced, where these shapelets are local volumetric objects or local 3D shapes extracted from a space time shape. In other words, these shapelets characterize the local motion patterns formed by the action [13]. Thus, an action is represented by a combination of such shapelets and, because these shapelets represent local parts of the entire space time shape, they are more robust to partial occlusions. Extracting all the possible local volumes from each space time shape of the database and clustering these sub-volumes using K-means clustering provides the cluster centers, which are then considered as the space time shapelets.

The action feature vector extracted using the shapelets is the probability distribution of these shapelets given a particular voxel of a space time shape. Then, every voxel in the space time cube created by an action is represented by a feature vector given by $f_D(x) = [\,p_1\ p_2\ \ldots\ p_n\,]^T$ where $p_i$ is the probability of occurrence of the shapelet $i$ given the voxel $x$. Using the bag of words model where each word is a shapelet, the histogram for the dictionary of shapelets $D$ is computed, and this histogram is the final action feature vector given by
$$h_D(V) = \frac{1}{n} \sum_{\text{voxels}} f_D(x)$$
where $n$ is the number of shapelets. Two classifiers were used for comparison, one based on the nearest neighbor rule and the other on logistic regression, where the latter was found to give better results.

2.2 ACTION RECOGNITION ALGORITHMS BASED ON SHAPE DESCRIPTORS

This section explains some of the descriptors used to discriminate between 2D shapes. Later on, some action recognition frameworks are discussed which are based on directly using these descriptors individually or in combination. Some of the frameworks discussed use a combination of a shape descriptor and a motion field.

2.2.1

Some Common Shape Descriptors

This section provides an overview of the various shape descriptors which have been
used for recognizing human action patterns. The basic idea is to capture the variations of a 2D shape descriptor across time or capturing the variations extracted from
the space time shape using a 3D shape descriptor and use these variations as action
features.

Shape Descriptors Using Hu Moments
One of the most common moment-based 2D shape descriptors is the set of Hu moment invariants [14], which have been widely used in the recognition of visual patterns and characters. The Hu moments of a geometrical shape are a set of 7 moment values which are derived from the central moments of that shape and normalized by a scaling factor in such a way as to achieve the property of translation, scale, and rotation invariance. The central moments $\mu_{p,q}$ of a shape, which are obtained from the centroid $(x_{cent}, y_{cent})$ and regular moments $m_{p,q}$ of that shape, are translation invariant and so are used in the computation of the Hu moments. The central moments of a shape are
defined as
$$\mu_{p,q} = \iint (x - x_{cent})^{p} \, (y - y_{cent})^{q} \, \rho(x, y) \, dx \, dy \qquad (2\text{-}5)$$

where $\rho(x, y)$ is the probability density function of the shape under consideration. The scale invariance property is incorporated by using the theory of algebraic invariants to derive a scaling factor $\mu_{00}^{(p+q)/2+1}$. This scaling factor can be used to normalize the central moments of the shape with respect to its size. To obtain orientation invariance, the central moments of different orders are combined in accordance with the theory of orthogonal invariants to produce the Hu moments. Therefore, according to [16], the Hu moments, with their property of translation, scale, and rotation invariance, can be used as a region-based shape descriptor of a 2D shape with the probability density function given by the 2D binary image.
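For concreteness, the sketch below computes central and normalized central moments of a binary shape and the first Hu invariant; the remaining six invariants are built from the same normalized moments:

```python
# Minimal sketch of central and normalized central moments of a binary shape,
# and the first Hu invariant (eta20 + eta02) as an example.
import numpy as np

def central_moment(img, p, q):
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00        # centroid
    return (((x - xc) ** p) * ((y - yc) ** q) * img).sum()

def normalized_moment(img, p, q):
    mu00 = central_moment(img, 0, 0)
    return central_moment(img, p, q) / mu00 ** ((p + q) / 2 + 1)  # scale normalization, Eq. (2-5) moments

def hu1(img):
    return normalized_moment(img, 2, 0) + normalized_moment(img, 0, 2)  # first Hu invariant
```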

Fourier-Based Shape Descriptors
Fourier-based shape descriptors are boundary-based descriptors which, unlike the Hu moments, take into account only the pixels on the outer contour of the 2D binary shape [15], [16], and they are translation, scale, and rotation invariant. Here, the boundary of the shape can be described by four different shape signatures: the complex number representation, the centroid distance, the curvature function, and the curvature angular function. For a boundary coordinate point (x, y), the complex shape signature is given by
$$s = (x - x_c) + i(y - y_c) \qquad (2\text{-}6)$$
and the centroid distance function is given by
$$r = \sqrt{(x - x_c)^2 + (y - y_c)^2} \qquad (2\text{-}7)$$

By normalizing the coordinates of the boundary of the shape by the centroid, the shape signatures can be made translation invariant. The curvature and curvature angular functions are already translation and rotation invariant. The Discrete Fourier Transform (DFT) of these shape signatures is then used as the shape descriptor, with appropriate modifications to achieve scale invariance. For instance, in the case of the complex number shape signature, to obtain scale invariance, the average or DC value is ignored and all the other coefficients are scaled down by a factor equal to the first coefficient of the DFT.
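A small sketch of a Fourier descriptor built from the centroid-distance signature, assuming the boundary is given as an ordered (N, 2) array of contour points; the scale normalization mirrors the DC-coefficient scaling described above:

```python
# Minimal sketch of a Fourier descriptor from the centroid-distance signature.
# `boundary` is an assumed ordered (N, 2) array of (x, y) contour points.
import numpy as np

def fourier_descriptor(boundary, k=16):
    x, y = boundary[:, 0], boundary[:, 1]
    xc, yc = x.mean(), y.mean()
    r = np.sqrt((x - xc) ** 2 + (y - yc) ** 2)       # centroid-distance signature, Eq. (2-7)
    coeffs = np.fft.fft(r)
    mags = np.abs(coeffs[1:k + 1])                   # drop the DC term
    return mags / (np.abs(coeffs[0]) + 1e-8)         # scale by the first (DC) coefficient
```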

Shape Descriptors Using Zernike Moments
Zernike moments are another set of moments which are used for shape description.
These are in fact related to the normalized central moments of a shape and, so, are
translation and scale invariant. The main difference between Zernike moments and
Hu moments is that Zernike moments are generated with a rotation invariant property
while Hu moments are generated by combining different orders of normalized central
moments. In other words, the Zernike moments are obtained by the projection of the
binary shape onto a set of orthogonal functions with simple rotational properties and
these functions are known as the Zernike polynomials [17]. The Zernike polynomials are given by
$$V_{nl} = V_{nl}(\rho \sin\theta,\, \rho \cos\theta) = R_{nl}(\rho) \exp(il\theta) \qquad (2\text{-}8)$$
Using these polynomials, a binary image $f(x, y)$ can be represented as
$$f(x, y) = \sum_{n} \sum_{l} A_{nl} V_{nl}(\rho, \theta) \qquad (2\text{-}9)$$
and by definition, a complex Zernike moment is defined as
$$Z_{nl} = \frac{(n + 1)}{\pi} \iint f(x, y)\, [V_{nl}(\rho, \theta)]^{*} \, dx \, dy \qquad (2\text{-}10)$$

From these Zernike moments, shape descriptors can be built which are related to much higher order moments, with the advantage of translation, scale, and rotation invariance, and hence they can represent more variations in the binary shape than regular moments or Hu moments.

Shape Descriptor Based on Radon Transform
The R-Transform is a 2D shape descriptor which is derived from the Radon transform
and which is invariant to translation [18]. It is made scale invariant by using a suitable
scaling factor. The Radon transform is in fact a projection of the 2D binary shape onto a set of straight lines which are at a distance of $s$ from the centroid of the shape and oriented at an angle of $\alpha$ with respect to the x axis, with $s$ varying over $(-\infty, \infty)$ and $\alpha$ varying over $[0, \pi)$. The R-transform is computed by integrating the Radon transform over the $s$ axis. This R-transform is not exactly scale invariant, but it can be made so by scaling each value of the R-transform by the area enclosed between the R-transform curve and the $\alpha$ axis. However, compared to the Hu moments and Fourier descriptors, these R-Transforms are not rotation invariant; the R-transform of a rotated version of an image is only a shifted version of the R-transform of the original function. Therefore, rotation invariance can be achieved by taking only the magnitude of the Discrete Fourier Transform of the R-transform and scaling each coefficient by the DC or average coefficient value. The final shape descriptor is obtained by first applying a Chamfer distance transform to the binary image and then using the transform values to segment the shape into different levels. The R-Transform is then applied at each level and this set of R-Transforms is taken as a 2D shape descriptor. This shape descriptor is an example of combining two different shape descriptors which bring out different aspects of the shape into a single descriptor.
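The rotation-invariance step described above can be sketched as follows; since a rotation of the shape only circularly shifts the R-transform along α, the DFT magnitude is unaffected:

```python
# Minimal sketch of the rotation-invariance step: keep the magnitude of the
# DFT of the R-transform and scale each coefficient by the DC value.
# `r_transform` is an assumed 1D array over the angle alpha.
import numpy as np

def rotation_invariant_r(r_transform):
    spectrum = np.abs(np.fft.fft(r_transform))       # a shift in alpha changes only the phase
    return spectrum[1:] / (spectrum[0] + 1e-8)       # scale each coefficient by the DC value
```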

Shape Descriptor Based on Poisson's Equation
The shape descriptor based on Poisson’s equation is similar to the Euclidean distance
transform of a binary image where every internal point in the silhouette is assigned a
value based on its distance from a set of boundary points [19]. Here, the value of an
internal point is determined by the mean time taken by a set of particles at that point
to undergo a random walk process and hit the boundaries. These values at every
internal point of the silhouette $S$ can be determined from the solution $U$ of the Poisson equation $\Delta U(x, y) = -1$ under the condition $U(x, y) = 0$ at the boundary $\delta S$, where $\Delta U = U_{xx} + U_{yy}$ is the Laplacian of $U$. One difference between this shape descriptor and the distance transform is that, while the distance transform considers only the nearest boundary point to find the value of an interior point, the Poisson's equation based shape descriptor takes into account not only the nearest boundary point but also its neighboring boundary points. Hence, it brings out more global properties of the silhouette than the distance transform. The other difference is that this representation enables the segmentation of the silhouette into different parts by taking the gradient $\Phi = U + U_x^2 + U_y^2$. The gradient $\Phi$ gives higher values near concavities, which are often projections in a silhouette, and thus segmentation of shapes into different parts is feasible. Next, by taking the second derivatives of $U$, the local orientation and aspect ratio can be computed. Finally, the features for this shape descriptor are the moments of the binary shape, but weighted by functions which depend on the gradient, local orientation, and aspect ratio.
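To make the descriptor concrete, the sketch below solves ΔU = −1 on a binary silhouette with plain Jacobi iterations (the cited works use much faster multigrid solvers); it assumes the shape does not touch the image border:

```python
# Minimal sketch of solving Delta U = -1 with U = 0 on the boundary, using plain
# Jacobi iterations. Only meant to make the descriptor concrete; the references
# use geometric multigrid solvers instead. Assumes the shape stays inside the image.
import numpy as np

def poisson_descriptor(mask, iterations=2000):
    """mask: 2D binary silhouette. Returns U with Delta U = -1 inside the shape."""
    U = np.zeros_like(mask, dtype=np.float32)
    inside = mask.astype(bool)
    for _ in range(iterations):
        # average of the four neighbors plus the source term (grid spacing h = 1)
        nbr = (np.roll(U, 1, 0) + np.roll(U, -1, 0) +
               np.roll(U, 1, 1) + np.roll(U, -1, 1))
        U_new = 0.25 * (nbr + 1.0)
        U = np.where(inside, U_new, 0.0)             # enforce U = 0 outside and on the boundary
    return U
```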

Choice of Descriptors for Action Recognition Framework
All the shape descriptors described above can be used to extract the variation of the silhouettes across the video frames in either the 2D or the 3D case. By far the most sophisticated ones are the shape descriptors based on Poisson's equation and the Radon transform, as they give not only a boundary representation but also an internal representation of the silhouette. This enables the extraction of more localized features which would not have been possible with other shape descriptors like Hu moments, Zernike moments, and Fourier descriptors. In fact, a 3D version of the Poisson's equation based shape descriptor has been used to describe a space time shape, and global as well as local properties are extracted from this descriptor. They are then used as weighting functions in the computation of moments which are further used as action features [20]. The R-Transform is used in an action recognition framework where it is computed for silhouettes in key frames of the video sequence. This set of R-Transforms is later used to train HMMs and the trained models are then used to compute each action model's similarity with the input test sequence. In short, the Poisson shape descriptor and the Radon transform are the two shape descriptors which are recommended for action feature extraction.

2.2.2 Algorithm Based on Poisson's Equation Based Shape Descriptor

In this framework, the concept of a space time shape is introduced, where a space time shape is formed by the concatenation of silhouettes in the video frames. These space time shapes contain the human action and thus treat actions as 3D shapes. An extension of the Poisson shape descriptor to the 3D domain is used to extract various properties pertaining to the 3D shape such as local space-time saliency, action dynamics, shape structure, and local orientation [20]. These properties are used as action features, and the results prove their robustness towards partial occlusions, non-rigid deformations, large changes in scale and viewpoint, and low-quality video. The only constraint is that the background should be known beforehand and, in this case, the median background computed from the video sequence is used to segment out the background for silhouette extraction.
A 3D version of this descriptor is used for the internal representation of the
3D space time shape in which a value at an internal point reflects a much more
global aspect of the silhouette than the Euclidean distance transform. As mentioned
before, this is because the value is calculated not from just the nearest boundary point
but also from the set of neighboring boundary points. This descriptor is computed
by solving the equation $\Delta U(x, y, t) = -1$ subject to both the Dirichlet boundary condition $U(x, y, t) = 0$ at the bounding surface $\delta S$ and the Neumann boundary condition $U_t = 0$ applied only at the first and last frames of the video sequence. The numerical solutions are obtained by a simple "w-cycle" of a geometric multigrid solver [21] and the solution obtained can be used to extract global as well as local features. The local features are obtained by further processing of the solution, such as by taking its gradient and Hessian. The global features are the regular moments weighted by the local features.
Since human actions are described as a collection of moving parts, a particular
descriptor is defined which emphasizes the fast-moving parts with some importance
to parts which move at a moderate speed. This descriptor is a variant of the gradient
of $U$ defined by
$$w_\Phi(x, y, t) = 1 - \frac{\log(1 + \Phi(x, y, t))}{\displaystyle\max_{(x, y, t) \in S} \log(1 + \Phi(x, y, t))} \qquad (2\text{-}11)$$
where the gradient $\Phi(x, y, t) = U + \frac{3}{2} \|\nabla U\|^2$. This variant $w_\Phi(x, y, t)$ will be one
of the weights used for the computation of the global features. The other set of local features, such as local orientation and local aspect ratio, are extracted by first defining three types of local space time structures named stick, plate, and ball. A $3 \times 3$ Hessian matrix $H$ is applied to the solution of the shape descriptor $U$ with the matrix centered at each voxel location. Then, the eigenvalues of this matrix are extracted and, based on the ratios of these eigenvalues, the stick and plate structures $S_{st}(x, y, t)$ and $S_{pl}(x, y, t)$ are defined as exponential functions. The informative direction $D_j(x, y, t)$ is given by the projection of the three eigenvectors corresponding to the three largest eigenvalues onto the $x$, $y$, and $t$ axes, and it measures the deviations of these eigenvectors from those axes. Then, the local orientation features are the combination of the informative direction and the stick and plate structures, given by $w_{i,j}(x, y, t) = S_i(x, y, t) \times D_j(x, y, t)$ where $i \in \{st, pl\}$ and $j \in \{1, 2, 3\}$.
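A sketch of the weighting function of Eq. (2-11), assuming the 3D Poisson solution U and the space time shape mask are already available as (T, H, W) arrays:

```python
# Minimal sketch of the weighting function of Eq. (2-11), computed from an
# already solved 3D Poisson solution U over the space time shape `mask`.
import numpy as np

def w_phi(U, mask):
    gt, gy, gx = np.gradient(U)
    phi = U + 1.5 * (gt**2 + gy**2 + gx**2)           # Phi = U + (3/2) * ||grad U||^2
    log_phi = np.log1p(np.maximum(phi, 0.0))
    denom = log_phi[mask.astype(bool)].max() + 1e-8   # max over voxels inside the shape
    w = 1.0 - log_phi / denom
    return np.where(mask.astype(bool), w, 0.0)
```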
The global features are the weighted moments of the 3D shape using local features as the weights. For classification, the video sequences are first broken down
into space time cubes of predefined length with a predefined overlap between them.
The features extracted from the space time shapes of the test sequence are then compared with the features of the space time shapes in the database using the Euclidean distance metric and they are appropriately classified using the nearest neighbor rule. Although this algorithm is robust to partial occlusions, the memory required to store the features is large. Moreover, speed will not be an issue when dealing with videos of smaller frame size, since the computation time required for solving the Poisson's equation on a small area is comparable to other methods. But when the video frame size is as large as 256 × 256, the computation of the features using the Poisson's equation becomes slow and takes considerable time.

2.2.3 Algorithm Based on R-Transform

The R-Transform is used to represent the low-level features of a binary silhouette in a single frame, and these features extracted across a video sequence are used to train a set of HMMs to recognize the action [22]. The reported results justify the robustness of the algorithm to frame loss and disjoint silhouettes. HMMs are suited to the analysis of time series data and, since the low-level features are extracted at every frame, the action features can be described as a time series with the data being the low-level features varying across the video sequence. The R-Transform is translation invariant but not scale or rotation invariant. The rotation variance is ignored, as human actions rarely involve rotated silhouettes, and scale invariance is achieved by resizing the image to a normalized scale before the R-Transform is applied. The R-Transform is a 180-dimensional vector, from which a feature matrix of size 60 × 3 is formed. By applying PCA, the feature matrix is reduced in size to 2 × 3, which is concatenated to form a 6 × 1 vector. This vector, obtained from the silhouette of a single frame in a video sequence, is used to train the HMM to get a model for each action category.
The main advantage of this algorithm is that the computation of the R-Transform is linear in complexity and so the cost of computing the features is low. The only disadvantage is that the silhouettes need to be normalized with respect to scale, which often requires the use of interpolation methods.

2.2.4 Algorithm Based on Shape Descriptors and Optical Flow

In this framework, both the motion flow feature vectors as well as the global shape
flow feature vectors are extracted from image sequences and the combined feature
vectors are used to model each category by HMMs for multiple views [23]. The
motion flow feature vectors are given by the optical flow and the shape flow feature
vector is obtained by using suitable shape descriptors.

To capture the shape flow, a concatenation of the different features is used
such as Hu’s invariant moments, Zernike moments, flow deviations over the image
sequences in a space time shape, and global anthropometric variations. The flow
deviations are the mean absolute deviation of the silhouette image from the center
of mass of the silhouette in the x and y directions and the mean intensity of the
shape distribution of the space time shape. The mean absolute deviations help in discriminating between actions involving large body motions and actions having small body movements. The global anthropometric variations are the projection of the
silhouette onto the x and y axes. These four feature vectors are combined to form the
shape flow feature vector. The combined local-global(CLG) optic flow feature vector
is the combination of the global optical flow extracted by considering the silhouette
as a whole and the optical flow vectors extracted from the various blocks of the
silhouette image. Here, the silhouette action boundary is divided into four quadrants
or blocks with the origin at the center of mass of the silhouette and from each block,
the optical flow vectors are extracted. Therefore, the key features extracted from a
single frame in a space time cube are the combined vectors of both the shape flow and
the CLG optic flow vectors. Modeling of the various action categories is done by multi-dimensional hidden Markov models, where the HMM algorithms are modified to include the combined shape-CLG flow vectors. Classification of the action features is done using these hidden Markov models, where the model which gives the highest conditional probability is selected.
The use of optical flow vectors and shape descriptors such as Hu moments reduces the robustness of the algorithm to partial occlusions. Moreover, this algorithm uses a
combination of different shape descriptors such as Hu moments and Zernike moments
which is not very efficient. Although it is stated that the different shape descriptors
bring out different aspects of the shape, it is not very clear as to what those aspects
are.

2.2.5 Algorithm Based on Local Descriptors and Holistic Features

The algorithm proposed uses a combination of the local features in the form of
SIFT descriptors and the global features in the form of Zernike moments. Both the
local and global features emphasize the different aspects of actions [24]. For the
local features, both the 2D as well as the 3D SIFT descriptors are extracted from
neighborhood regions which have 2D SIFT interest points. The 2D SIFT descriptor
emphasizes the 2D silhouette shape and the 3D SIFT descriptor emphasizes the motion. The global or holistic features are the Zernike moments extracted from single
frames and from motion energy images.
First, consecutive frames of the video sequence are subtracted from each other to
remove the background and what remains are the difference images. It is from these
difference images that the interest points are obtained. Each of the difference images
is projected into scale space using a Gaussian kernel. This creates a pyramid-like
structure with each level representing the scale of the image. The whole scale space
is divided into octaves with each octave having a certain number of levels. At each
octave, the difference images projected into consecutive levels are subtracted to form
a difference of Gaussians (DOG). Each pixel in the DOG image is then compared to
its 8-neighbors in the same level and the 9-neighbors in the higher and lower levels.
If there is a significant difference, then, that pixel is considered as an interest point
referred to as a 2D SIFT interest point. It is from these interest points that the
2D SIFT descriptor feature vector and the 3D SIFT descriptor feature vector are
extracted. In both the extractions, the gradient magnitude and the orientation are
calculated at each scale or each level, and at the interest points, the orientation histogram of that interest region is computed. As mentioned in the previous subsection, the scale invariance is achieved by scaling the histogram of the interest region with
the magnitude of the gradient at the respective interest point. The interest regions
are rotated so that the dominant orientation is aligned in the same direction as the

dominant orientations of other interest regions. The whole region is then divided
into sub-regions on which the sub-histograms of the orientations are calculated. The
final 2D or 3D descriptor feature vector of that interest region is the concatenation of
these sub-histograms in that interest region.
The holistic features are the Zernike moments extracted from the space time
shape. Two variations are extracted. One is the set of Zernike moments extracted from the single frames, thereby providing the spatial variation. The bag-of-words approach along with the K-means clustering algorithm is applied to these moments to get the necessary holistic feature vector. The other is the set of Zernike
moments extracted from the motion energy image computed from the single frames.
The holistic features and the 2D/3D SIFT descriptors are applied to SVMs for classification. The one drawback this algorithm faces is that it may not be robust to
partial occlusions. This is because of the use of Zernike moments in describing the
global features which can vary a lot due to occlusions.

2.3 SUMMARY

Various action recognition algorithms have been discussed, some using motion fields such as optical flow and local features such as SIFT descriptors, while others use shape flow features extracted using shape descriptors. The emphasis has been on the algorithms using shape descriptors. In the next Chapter, the shape descriptor based on the Radon transform is explained in detail with illustrations.


CHAPTER 3
MULTI-LEVEL SHAPE REPRESENTATION
The algorithm proposed in this thesis is a small extension of the shape descriptor based on the multi-level Radon transform [18], from which the action features are extracted. This Chapter describes the Radon transform in detail and shows how it can be used to describe the spatial variations of the 2D silhouette of a frame in a video sequence.

3.1 RADON TRANSFORM BASED SHAPE DESCRIPTOR

This type of shape descriptor belongs neither to the category of region-based descriptors, where all the pixels within the boundary of the shape are considered, nor to that of contour-based descriptors, which take only the boundary pixels into account. The descriptor is based on the Radon transform, which computes the projection of a 2D shape onto a set of lines, each line being at a distance s from the centroid of the shape and oriented at an angle α from one of the co-ordinate axes. The Radon transform is a 2D image having co-ordinates s and α.
A slight variation of the Radon transform known as the R-Transform is extracted
from it and this is used as one of the components of the shape descriptor. In fact,
each component is obtained by projecting different levels of the binary shape into
the Radon space with each level obtained from the 2D Chamfer distance transform.
Thus the link between the internal structure and the boundary is emphasized while
computing the shape descriptor.

3.1.1 Definition of Radon Transform

As defined before, the Radon transform projects a 2D binary shape onto a set of lines
oriented at different angles and at different distances from the centroid of that shape.
In other words, the Radon transform is an integral transform where the function is

integrated over a set of lines. According to [32], the Radon transform of a density distribution ρ(x, y) is defined as

ℜρ(x, y) = ∫∫_{−∞}^{∞} ρ(x, y) δ(R − x cos θ − y sin θ) dx dy        (3-1)

where θ is the scanning direction or the direction of the projection line and R is the
distance of the projection line from the origin. Here, the density distribution ρ(x, y)
is the 2D binary image f (x, y) and the origin is the centroid of that binary shape. In
vector notation form where x is a vector with two components (x, y) and t is a unit
vector in the scanning direction, the Radon transform is given by

ℜρ(x) = ∫ ρ(x) δ(R − x · t) dx        (3-2)

3.1.2 Geometrical Interpretation of the Radon Transform

Consider a unit point mass at P (a, b) as shown in Figure 3.1a and Figure 3.1b.
Then, the density function ρ(x, y) given in terms of this point mass is ρ(x, y) = δ²(x − a, y − b). Let the Radon transform of this density function be I(a, b, R, θ), which will in turn be the impulse response of the operator ℜ. Let a line L be scanning in
a fixed direction given by θ across the x − y plane. At a particular distance R from
the origin, the line L passes through the point P (a, b) and let gθ (R) be the Radon
transform computed for that line L for a fixed θ at a fixed distance R. Since this line
L passes through the point P (a, b), gθ (R) will be a non-zero value. Then, by fixing
this distance R, the line is rotated with θ varying from 0 to π so that the contribution
of this point P(a, b) to the Radon transform ℜρ(x, y) is accumulated for all directions.
In other words, gθ (R) is calculated for all values of θ for a fixed R. By plotting the
points of the intersection of the line L and the perpendicular lines from the origin,
a circular locus is obtained. This locus of points defines the nature of the impulse

Figure 3.1: Geometric Interpretation of the Radon Transform as Shown in [32].

response I(a, b, R, θ) as a uniform ring delta function of unit line density. The above
interpretation is shown in Figure 3.1a. Now, consider a unit mass spread over a small
circular area of diameter τ surrounding the unit point mass P . As shown in Figure
3.1b, the line integral gθ (R) for a constant R and constant θ will be a narrow hump
of unit area and width τ . For a fixed R, the line L is rotated so that the contribution
of the unit mass centered at P to the Radon transform ℜρ(x, y) is accumulated for
all values of θ. The boundary of the Radon transform will then be a circular strip
of non-uniform width. The maximum width occurs at the position where the line
L, when extended, passes through both the point P and the origin O, as it is at this position that the maximum area can be seen from the line L. At all other positions, the
area of the point mass as seen from the line L is less than this maximum. Thus, it can
be stated that the mass at P (a, b) is distributed non-uniformly along the perimeter
of the circle with diameter OP . This gives the true nature of the impulse response
which is a ring delta with non-uniform line density.

3.1.3 Computation of the Radon Transform

If the density distribution is represented by an array of point masses in the Cartesian
co-ordinate system, then, the mass at P (a, b) is non-uniformly distributed along
the Radon transform boundary and the density at point Q is proportional to OQ.
Therefore, the computation of the Radon transform is done by first sub-dividing this
mass into smaller point masses and accumulating its contribution. It can be divided
in two ways, one with uniform spacing between the point masses on the Radon
boundary but each one with a different mass, the other with non-uniform spacing on
the Radon boundary but having the same mass. The line L is divided into smaller parts
and each point mass is allocated to the nearest part of the line. Allocations from
all the smaller point masses from all the points are accumulated to form the Radon
transform of the density distribution. The Radon transform can also be denoted
by T (s, α) where α refers to θ and s refers to R in the previous representation.
For computing the Radon transform of a binary image, it is taken as the density
distribution with the pixels considered as the point masses. The computation of
the Radon transform is done by applying the analogy of smaller point masses where
each pixel is divided into sub-pixels and then, the value of each subpixel is projected
into a line which is divided into a certain number of bins. If the projection of the
sub-pixel falls on to the center of the bin, the full value of the sub-pixel is used in the
computation. If the projection falls on the border of a bin, the value of the sub-pixel
is split between the current bin and the adjacent bin. This type of projection for a
single pixel is computed for all sets of lines oriented at different angles with the x
axis and at different distances from the centroid of the shape. The accumulation
of all these projections over all the pixels of the image gives the Radon transform of
the image. The definition of the Radon transform and its computation for a pixel
is illustrated in Figure 3.2a and Figure 3.2b where the 2D binary shape is projected
onto the line AA′ which is at a distance of s from the centroid of the shape and

Figure 3.2: Definition of Radon Transform and its Computation for a Pixel. (a) Definition; (b) Computation.

oriented at an angle of α. Since the 2D binary image is discrete, the maximum and
minimum values that s can take are finite, and these values depend on the size of the image.
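
As an illustration of the computation just described, a minimal MATLAB sketch using the radon function of the Image Processing Toolbox is given below. The file name silhouette.png and the 1-degree angular sampling are assumptions made only for this example; also note that radon measures the distance s from the center of the image rather than from the silhouette centroid, so the silhouette can be shifted to the image center beforehand if an exact match with the description above is needed.

    % Minimal sketch: projecting a binary silhouette into the Radon space.
    bw = imread('silhouette.png') > 0;     % hypothetical binary silhouette, foreground = 1
    alpha = 0:179;                         % projection angles in degrees
    [T, s] = radon(bw, alpha);             % T(s, alpha): accumulated projections onto the lines
    imagesc(alpha, s, T); colormap(gray);  % display the Radon transform image
    xlabel('\alpha (degrees)'); ylabel('s');
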
3.1.4 Application of the Radon Transform to Binary Images

The Radon transform is mainly used for the detection of lines and curves in an
image where this transformation will emphasize these straight lines or curves. In
other words, the Radon transform will contain high and low value pixels where the
high value gives the intensity of the straight line in the original image whose location
in the original is determined by its co-ordinates in the Radon transformed image.
As mentioned before, the Radon transform co-ordinates are (s, α) where s gives
the perpendicular distance of that straight line from the origin and α gives the
inclination of the line with the x axis of the original image. The Radon transform

for a square binary image is shown in Figure 3.3a. The maximum values denote that
the maximum projection of the square occurs along the diagonals. The detection
of lines property is illustrated in Figure 3.3b where the outline of the square binary
image is Radon transformed.
As shown in Figure 3.3a, the bright pixels in the Radon transform of the square
image occurring at co-ordinates (−25, 0◦ ),(+25, 0◦ ),(−25, 90◦ ) and (+25, 90◦ ) correspond to the horizontal bottom edge, the horizontal upper edge, the left vertical edge
and the right vertical edge respectively. By comparing the Radon transform of the
square image to that of its outline, a significant difference can be noticed. In Figure
3.3a, the highest values occur at co-ordinates (0, 45◦) and (0, 135◦) which represent
the diagonals of the square image and in Figure 3.3b, the highest values correspond
to the edges of the square outline. This asserts the fact that the computation of
the Radon transform not only considers the boundary pixels of the shape but also
the interior pixels. Further analysis of the Radon transform is shown in Figure 3.3c
where the interior values of the square are represented by a 2D distance transform
based on the Euclidean distance. The Radon transform of the distance transform is
not very different compared to that of the original with the difference being in the
intensity of the sinusoidal variations present in both the images. The co-ordinates of
the bright pixels are the same in the Radon transforms of both the square and the
distance transformed image but the intensity is greater at those locations in case of
the distance transformed one. This shows that the transform changes in accordance with the interior values of the shape. The analysis can also be applied to human binary
silhouette images and is illustrated in Figure 3.4.
Similar to the case of the square image, the Radon transform of the silhouette
image shown in Figure 3.4a differs from that of its outline and its distance transform shown in Figure 3.4b and Figure 3.4c. This is due to the different values of
the interior pixels. The Radon transform of the silhouette image has bright pixels

Figure 3.3: Projections of Squares into Radon Space. (a) Square Image; (b) Outline of Square Image; (c) Distance Transform of Square Image.

Figure 3.4: Projections of Human Silhouettes into Radon Space. (a) Human Silhouette Image; (b) Outline of Human Silhouette; (c) Distance Transform of Human Silhouette Image.

which correspond to the main axis. For the outline silhouette, the Radon transform consists of bright regions which correspond to the line segments that make up the boundary. In the case of the distance transform, where the interior pixels have non-zero values, the Radon transform has other regions which are brighter, and these regions correspond not only to the line segments that make up the boundary but also to the interior structure of the silhouette. Another observation is that the sinusoidal variations become more emphasized in the Radon transform domain when the interior pixels of the input image have larger values. This can be seen in the case of the silhouette outline image, where the corresponding Radon transform has sinusoidal variations which are less bright than those present in the Radon transform of the original image. In the case of the distance transformed image, the sinusoidal variations, even though brighter than those found in the outline, are darker when compared to those of the original.

3.1.5 Properties of Radon Transform

Some of the properties of the Radon transform applicable to shape representation
are discussed in this section. These properties are periodicity, symmetry, translation,
rotation, and scaling [18].
• Periodicity :- The Radon transform is periodic in α with a period being a multiple of 2π. This property is given by T (s, α) = T (s, α + 2kπ) where k  Z and
T (s, α) is the Radon transform. So, if the Radon transform is shifted by a
multiple of 2π along the α variable, the inverse of the shifted Radon transform
would be the same as the original image. By shifting in α, the set of lines on
which the original image gets projected on, is rotated by a certain offset and if
this offset is a multiple of 2π, the lines are back to their original positions.

• Symmetry :- The next property is symmetry which is given by T (s, α) =
T (−s, α ± π). As seen in Figure 3.3 and Figure 3.4, the Radon transform
is symmetric along the line s = 0 and about α = π. This property can be used
to reduce the number of computations and memory required for storing the
transform. In other words, only the upper or lower half is needed to reconstruct
the original image.
• Translation :- Let the image f (x, y) be translated by a vector p~ = (x0 , y0 ).
The Radon transform of the translated image is then given by T (s − x0 cos α −
y0 sin α, α). This tells us that the transform is translated along the s axis
by a value equal to the projection of the translation vector p~ onto the line
x cos α + y sin α. So, an image and its translated version will have different
Radon transforms and hence, to use these transforms as a suitable shape descriptor, the translation vector should be determined from the image and its
translated version and this predetermined vector should be used as the normalization factors.
• Rotation :- Another property is that of rotation where if the image is rotated
by θ0 , the Radon transform is shifted by the same amount along the α axis. In
other words, the rotated Radon transform is given by T (s, α + θ0 ). Again as in
the case of translation, the rotation angle between the original and the rotated
image should be determined and this angle should be used for normalization.
• Scaling :- The scaling of the image by a value k results in a Radon transform (1/|k|) T(ks, α) which is scaled not only in the amplitude but also in the s co-ordinate. As mentioned for the previous properties, this scaling coefficient must be determined for normalization purposes.
In short, if the original image is translated, rotated, and scaled, the corresponding
Radon transform would also be the translated, shifted, and scaled versions and the


Figure 3.5: R-transform of a Silhouette.

extraction of the translation vector, the rotation angle and the scaling coefficient
at the same time for normalization is extremely difficult. Therefore, the 2D Radon
transform cannot be used directly as a suitable shape descriptor. This led to a
modified version of the transform known as the R-Transform [18].

3.1.6 R-Transform

The R-Transform is formed by integrating the 2D Radon transform T (s, α) over the
variable s. The R-Transform, which is one-dimensional, is given by

Rf(α) = ∫_{−∞}^{∞} T²(s, α) ds        (3-3)

where T(s, α) is the 2D Radon transform of the image. A single value of the R-Transform can be interpreted as the summation of the projections of a set of lines
at different distances from the centroid but all oriented at a specific angle to the
x co-ordinate. In other words, the R-Transform is formed by keeping the α variable
constant and accumulating the projections on the set of lines all oriented in the
same direction. The R-Transform is shown in Figure 3.5. This R-Transform has the

Figure 3.6: Properties of R-Transform. (a) Translated Image; (b) Scaled Image; (c) Rotated Image.
following properties, which make it a suitable descriptor for 2D shapes [18]:
• Periodicity :- The R-Transform is periodic in nature with a period π unlike the
2D Radon transform which has the period 2π. This is given by the equation:
Rf (α ± π) = Rf (α).
• Rotation :- The R-Transform is not rotation invariant but the angle at which the
shape has been rotated can easily be extracted and used to normalize the R-Transform to make it rotation invariant. In fact, the R-Transform of the image rotated by an angle θ0 will be a translated version, shifted by the same amount. This is given by the equation Rf(α + θ0) = ∫_{−∞}^{∞} T²(s, α + θ0) ds. This property is illustrated in Figure 3.6c.
• Translation :- The R-Transform is translation invariant although the 2D Radon
transform is not. This removes the need to extract a translation factor for
normalization, and thus, satisfies one of the criteria for a shape descriptor.
The R-Transform of an image translated by a vector p⃗ = (x0, y0) is given by the equation ∫_{−∞}^{∞} T²(s − x0 cos α − y0 sin α, α) ds = Rf(α). This is illustrated in Figure 3.6a.
• Scaling :- The R-Transform is not scale invariant but can be made so by normalizing with a scaling factor. This transform extracted from a scaled version
of the image will also be scaled but only in amplitude unlike the 2D Radon
transform where the amplitude as well as the variable s was scaled. This
makes it easier to use a scaling factor for normalization which depends on the
R-Transform itself. Here, the scaling factor used in the normalization is the
area of the R-Transform and by using this scaling factor, scale invariance can
be achieved. The property is illustrated in Figure 3.6b.

Thus, the property of translation and scale invariance makes the R-transform
a suitable shape descriptor. Rotation invariance can be achieved by extracting the angle by which the image has been rotated from its original, and this angle can be computed from the two R-Transforms by a correlation based technique. But usually,
by normalizing the Discrete Fourier transform coefficients of the R-Transform by
the average value and retaining only the magnitude, the rotational dependence can
be removed. As shown in Figure 3.6, the scaled images and the translated images
have the same R-transform while the rotated image with a rotation of 30◦ has the
R-transform shifted to the right.
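
As a concrete illustration of Equation (3-3) and the invariance properties above, a minimal MATLAB sketch is given below. The summation over s in place of the integral, the normalization by the sum of the R-Transform as its "area", and the division of the DFT magnitudes by the DC coefficient as the "average value" are implementation assumptions.

    % Minimal sketch of the R-Transform of a binary silhouette bw (Equation 3-3).
    alpha = 0:179;
    T = radon(bw, alpha);        % 2D Radon transform T(s, alpha)
    R = sum(T.^2, 1);            % R(alpha): integrate T^2 over s for each angle
    R = R / sum(R);              % scale normalization by the area of the R-Transform
    F = abs(fft(R));             % DFT magnitude removes the rotation-induced shift
    X = F(2:end) / F(1);         % normalize by the DC coefficient
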

3.2 DISTANCE TRANSFORM BASED ON EUCLIDEAN DISTANCE

The distance transform is another type of shape descriptor which gives an internal
representation of the shape based on the Euclidean distance of a pixel from the nearest
boundary pixel. The interior pixel values depend on the type of the approximation of
the Euclidean distance metric used and this leads to different families of the distance
transform [29]. The approximation of the Euclidean distance between a particular
interior pixel and a neighboring pixel depends on the weights given to its neighbors. Let the weight given to the diagonal neighbors be d2 and to the horizontal/vertical
neighbors be d1. The distance between two arbitrary pixels p and q is then given by

Dd1,d2 = m2 × d2 + (m1 − m2) × d1        (3-4)

where m1 is the number of horizontal steps and m2 is the number of vertical steps
between the two pixels. Depending on the value of the (d1, d2) pair, different approximations are possible and they are known by different names. For (d1, d2) = (1, ∞),
the approximation is known as 4-neighbor distance or city-block distance and when

(d1, d2) = (1, 1), it is called 8-neighbor distance or the chessboard distance. The Euclidean distance approximations with d1 or d2 being a value other than 1 or ∞ are known as Chamfer distances. One example of the Chamfer distance approximation is the Quasi-Euclidean distance with (d1, d2) = (1, √2).
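
In MATLAB, these approximations are available directly through bwdist, as sketched below; the interior values are obtained by measuring, for every foreground pixel of a binary silhouette bw, the distance to the nearest background pixel. Note that the (3, 4) Chamfer weights used later in this thesis are not among bwdist's built-in options, so this snippet only illustrates the metric families named above.

    D4 = bwdist(~bw, 'cityblock');        % (d1, d2) = (1, inf): 4-neighbor / city-block distance
    D8 = bwdist(~bw, 'chessboard');       % (d1, d2) = (1, 1): 8-neighbor / chessboard distance
    Dq = bwdist(~bw, 'quasi-euclidean');  % (d1, d2) = (1, sqrt(2)): quasi-Euclidean distance
    De = bwdist(~bw);                     % exact Euclidean distance (default method)
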

3.2.1 Computation of the Distance Transform

The distance transform of an image is computed using local operations applied sequentially to pixels using certain types of masks [30]. The operational mask starts
its processing from the top-left most pixel and the image is scanned by this mask
in a forward raster scan manner. Each shift of the operational mask in the scan
involves the processing of the currently centered pixel using the neighboring pixels
where the current values of the neighborhood pixels are obtained from the processing
of the previous shift of the operational mask. If a pixel in an image is represented by
a(i, j) in the ith row and jth column and the new value is represented by a*(i, j), the sequential operation on a centered pixel can be summarized as

a*(i, j) = f(a*(i−1, j−1), a*(i−1, j), a*(i−1, j+1), a*(i, j−1), a(i, j), a(i, j+1), a(i+1, j−1), a(i+1, j), a(i+1, j+1))        (3-5)
where f (·) is the operation performed by the mask. The image on which the distance
transform is to be applied is usually a binary image. The two local operations used are

f1(a(i, j)) = 0,                                     if a(i, j) = 0
            = min(a(i−1, j) + 1, a(i, j−1) + 1),     if (i, j) ≠ (1, 1) and a(i, j) = 1
            = M + N,                                 if (i, j) = (1, 1) and a(1, 1) = 1        (3-6)

f2(a(i, j)) = min(a(i, j), a(i+1, j) + 1, a(i, j+1) + 1)        (3-7)
where the size of the image is M × N . Two scans of the image are done to compute
the distance transform, the first is the forward raster scan by the operation f1 and
the second is the reverse raster scan by the operation f2 . These operations mark the
pixels with values equal to their distance from the set of zero-valued pixels. The first
operator increments the values of the top vertical and left horizontal neighbors and
assigns the minimum value to the center pixel. The extreme conditions are when the
centered pixel has a value 0 or when the centered pixel is the top-left most pixel. The
image obtained after the first scan is a partial distance transform where the interior
pixel values give the distance from the nearest top or left boundary pixel. The
second operation in the reverse raster scan mode is applied to the output image of
the first scan and by centering the mask at every pixel, the bottom vertical and right
horizontal neighbors are incremented. The value of the centered pixel is then updated
by the minimum of the current value of the centered pixel and the updated values
of the corresponding neighbors. By varying the scale of the increments applied to
the vertical and horizontal neighbors, the approximations to the Euclidean distance
transform can be implemented. The horizontal increments are scaled by d1 and the
vertical increments are scaled by (d2 − d1) during the forward and reverse scans. The
different types of approximations for the Euclidean distance transform are mentioned
in [31].
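
A minimal MATLAB sketch of the two-pass computation in Equations (3-6) and (3-7) is given below for the simplest case d1 = 1 (the 4-neighbor distance); the guards that skip neighbors falling outside the image and the omission of the scaled increments needed for the other approximations are implementation assumptions.

    % Two-pass sequential distance transform of a binary image bw (foreground = 1).
    [M, N] = size(bw);
    D = double(bw) * (M + N);                 % foreground initialized to M + N, background to 0
    for i = 1:M                               % forward raster scan: operation f1
        for j = 1:N
            if bw(i, j)
                if i > 1, D(i, j) = min(D(i, j), D(i - 1, j) + 1); end
                if j > 1, D(i, j) = min(D(i, j), D(i, j - 1) + 1); end
            end
        end
    end
    for i = M:-1:1                            % reverse raster scan: operation f2
        for j = N:-1:1
            if bw(i, j)
                if i < M, D(i, j) = min(D(i, j), D(i + 1, j) + 1); end
                if j < N, D(i, j) = min(D(i, j), D(i, j + 1) + 1); end
            end
        end
    end
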

3.3 MULTI-LEVEL REPRESENTATION OF 2D SHAPES USING CHAMFER DISTANCE TRANSFORM AND R-TRANSFORM

The R-Transform gives a very compact representation of a shape and so, to include the variation of the interior structure, a distance transform along with the
R-Transform is used to give a multi-level representation of the shape. First, a distance transform is applied to the 2D shape and using the interior values obtained,
the shape is segmented into different levels as shown in Figure 3.7. Then, each level

Figure 3.7: 8-Level Segmentation of Silhouettes of a Dog and a Human. (a) Segmentation of Dog Silhouette as shown in [18]; (b) Segmentation of Human Silhouette; (c) Forward and Backward Masks used in [18].

is considered as a separate 2D shape. In other words, a certain number of shapes are
extracted from a single shape and each extracted shape gives a coarser representation of the original. Finally, the R-Transform is extracted from each level of the 2D
shape to get the multi-level representation. The distance transform used here is the
Chamfer distance measure with (d1, d2) = (3, 4).
The segmentation of the shape into different levels is done by a simple thresholding scheme. At a particular level, the pixels whose distance transform values are
greater than a predefined threshold are selected and the rest are discarded. The
selected values at every level of the shape are then given a constant value to get the
binary shape. The combination of all the R-Transforms extracted at each level gives
a complete description of the shape including the internal structure. An illustration
of 8-level segmentation applied to binary images of a dog and a human silhouette is shown in Figure 3.7a and Figure 3.7b along with the operational masks used to compute the distance transform. But this set of R-Transforms is not yet suitable for

shape representation, as it is still rotation variant, as explained in the previous section. A rotation of the shape by a certain angle results in a translatory shift of the R-Transform by the same amount. Therefore, the Discrete Fourier transform F is taken of the R-Transform at every level of the shape and only the magnitude of the coefficients is retained so as to remove the rotation variance. Let R_l be the discrete R-Transform of the 2D shape at level l, where l = 1, ..., L, and let F^l denote its Discrete Fourier transform. Then, the final multi-level representation of the shape is given by X = ( F¹(1)/F¹(0), ..., F¹(π)/F¹(0), ..., F^L(1)/F^L(0), ..., F^L(π)/F^L(0) ). This
multi-level representation of the shape obtained after the Radon transform and the
Discrete Fourier transform is the final shape descriptor which is translation, rotation
and scale invariant.
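
The complete multi-level computation can be sketched in MATLAB as follows; the use of bwdist with the quasi-Euclidean metric in place of the (3, 4) Chamfer weights, the uniform spacing of the thresholds, and the handling of empty levels are assumptions made only for illustration.

    % Multi-level R-Transform descriptor of a binary silhouette bw with L levels.
    L = 8;  alpha = 0:179;
    D = bwdist(~bw, 'quasi-euclidean');       % interior distance values (Chamfer-like)
    thr = linspace(0, max(D(:)), L + 1);      % thresholds defining the levels
    X = [];
    for l = 1:L
        level = D > thr(l);                   % coarser binary shape at level l
        if ~any(level(:)), X = [X, zeros(1, numel(alpha) - 1)]; continue; end
        T = radon(level, alpha);
        R = sum(T.^2, 1);                     % R-Transform of this level
        R = R / sum(R);                       % scale normalization
        F = abs(fft(R));                      % DFT magnitude for rotation invariance
        X = [X, F(2:end) / F(1)];             % concatenate the normalized coefficients
    end
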

3.4 SUMMARY

This Chapter gives a detailed explanation of the Radon transform and a brief
overview of the algorithm used in the computation of the Radon transform. The
various properties relating to shape description and the drawbacks which render the original Radon transform unfit for shape representation are discussed. Then, the R-Transform
was introduced which is a modified form of the original Radon transform and this
modification fulfills the necessary properties such as translation and scale invariance.
Finally, the concept of a multi-level representation of a shape is introduced where
this representation is derived from a combination of the Chamfer distance transform
and the R-Transform. The final shape descriptor is the Discrete Fourier transform
extracted from this multi-level representation which is not only translation and scale
invariant but also rotation invariant. In the next Chapter, an extension of the concept of multi-level representation is used to describe 3D space time shapes obtained
from the actions and this representation is used for action feature extraction.


CHAPTER 4
ACTION RECOGNITION FRAMEWORK
The action recognition algorithm proposed in this thesis is an extension of the 2D
multi-level Radon transform based shape descriptor. As previously indicated, the
distance transform is used to segment the silhouette but the difference is that instead of segmenting a 2D silhouette, a space time shape formed from an action is
segmented into different levels. The segmentation is done by first computing the distance transform of the 3D space time shape and then computing its gradient. This gradient serves as the basis for the multi-level space time shape representation.
The 3D distance transform used is based on the Euclidean distance transform which
can be approximated by different metrics such as the Chamfer distance transform
or the Quasi-Euclidean distance transform discussed in the previous Chapter. The
algorithm is as follows:
• Extraction of the silhouette by extracting the foreground from the video frame
by median background subtraction.
• Concatenation of a predefined number of silhouettes extracted from the previous step into space time shapes.
• Applying the 3D distance transform with anisotropic aspect ratio with more
weight given to the time axis.
• Computing the normalized gradient of the distance transform.
• Segmenting the space time shape into multiple levels using the normalized
gradient.
• Applying the R-Transform to each frame at each level of the space time cube to get a 3D feature vector.

• Extract the R-Translation vector from the coarsest level of the space time shape
and concatenate it with the 3D feature vector obtained from the previous step
to form the final feature vector.
• Classify the feature vector by comparing it with those features in the database
using the nearest neighbor approach.

4.1 SILHOUETTE EXTRACTION AND FORMATION OF SPACE TIME SHAPE

Extraction of the binary human silhouette from the video sequence is done by a
simple foreground segmentation where it is assumed that the video sequence has
a stationary background and the only object which moves in the video sequence
is the individual performing a certain action. Using a background model obtained
from the video sequence, every frame is compared with the background to segment
out the foreground pixels. Using morphological operations of dilation and erosion,
the holes in the foreground silhouette and noisy background pixels are removed.
The silhouettes extracted from a predefined number of consecutive frames are then
concatenated to form the space time shape.
Here, the median background image is used as the background model. In a
median background, each pixel value is the median of the corresponding pixel in all
the frames of the video sequence. The median background is preferred over the
mean background due to two reasons. One is that the median background can
be extracted directly from the video sequence containing the action which involves
movement of the entire silhouette across the frame and thus does not require a separate background video sequence. But for actions which do not contain a large movement of the entire silhouette and where the torso of the silhouette is almost stationary, a separate background video sequence should be used. The second reason

Figure 4.1: Mean and Median Backgrounds. (a) Mean Background; (b) Median Background.

is that the median of the pixels is not much affected by outliers in the pixel values when compared to their mean, where the outliers are caused by the movement of the person across the frame. The outlier pixel values in fact reduce the mean by a significant amount but not the median. This is illustrated in Figure 4.1 where the
mean background computed from a video sequence is a little darker than the median
background. Also, some portions in the background image have some slightly dark
patches and this is caused due to the movement of the person across the same region
in the video sequence. It can be seen that the median background image does not
have such dark patches and, moreover, the overall brightness is almost the same as
compared to each frame. It should be noted that this background segmentation is
not implemented in real time as the background model image is learned after making
a single pass across the video. Silhouette extraction is done from the second pass
onwards. Once the background image is learned, then, every frame in the video
sequence is compared with the background image by taking the absolute difference
between the two. The difference image is given by

diffImage_k(i, j) = |backgnd(i, j) − frame_k(i, j)|        (4-1)

Figure 4.2: Silhouette Extraction. (a) Sample Frame; (b) Extracted Silhouette.

where (i, j) refers to the pixel location at the ith row and jth column, backgnd is the background image, frame_k is the kth frame, and diffImage_k is the difference image computed from the kth frame. To get the complete human silhouette, the difference image is then thresholded to retain those pixels having a value greater than a certain threshold. The thresholding operation is given by

S_k(i, j) = 0,   if diffImage_k(i, j) < T
          = 255, otherwise        (4-2)

where Sk is the binary silhouette image and T is the threshold used. Since the background image and the video frames are color images, the difference image computed
is also a color image and the thresholding operation is performed for each color band
separately using different thresholds for each band. After thresholding, the silhouette may contain holes and unwanted protrusions and so, to remove these anomalies,
morphological operations such as dilation and erosion are used. The number of dilations performed on the silhouette is more than the number of erosions. This is to
ensure that no holes are present inside the silhouette as dilation helps in closing of
the holes. But, due to dilation, the silhouette size gets very large and so, erosions


Figure 4.3: Space-time Shapes of Jumping Jack and Walk Actions.

are required to bring the silhouette back to its original size. Usually, a rectangular
mask is used for both dilation and erosion processes. An illustration of the silhouette extraction operation is shown in Figure 4.2. The silhouette extracted from the
frame is slightly larger than the actual body of the person. The use of a morphology
operation is illustrated in Figure 4.4 where the number of dilations and erosions are
varied and how this variation affects the output of the video sequence. In Figure
4.4b, it is seen that without morphological operation, the silhouette has some holes
and imperfections. The morphological operation reduces the holes and imperfections
to some extent. The greater the number of dilations and erosions, the more the holes are reduced, but the more the shape of the silhouette gets distorted. So, a trade-off exists between
the extent of the holes and the distortion of the silhouette.
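
A minimal MATLAB sketch of this extraction step is given below for a grayscale video loaded into an array frames of size rows × cols × K; the thesis applies the thresholding to each color band with separate thresholds, which is omitted here, and the threshold value, the structuring element, and the dilation/erosion counts are assumptions chosen only for illustration.

    % Median background model and silhouette extraction (grayscale sketch).
    bg = median(double(frames), 3);                      % median background over all frames
    T  = 30;                                             % hypothetical threshold
    se = strel('rectangle', [3 3]);                      % rectangular mask for morphology
    sil = false(size(frames));
    for k = 1:size(frames, 3)
        d = abs(bg - double(frames(:, :, k)));           % difference image, Equation (4-1)
        s = d > T;                                       % thresholding, Equation (4-2)
        s = imerode(imdilate(imdilate(s, se), se), se);  % two dilations, one erosion
        sil(:, :, k) = s;
    end
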
Every space time shape has a certain overlap between the adjacent space time
shapes of a video sequence where this overlap is usually less than its length. The

Figure 4.4: Use of Morphological Operation of Dilation and Erosion. (a) Sample Frame; (b) No Morphology Operation; (c) Dilation and Erosion done once; (d) Dilation done twice and Erosion once; (e) Dilation and Erosion done twice; (f) 3 × 3 all-ones Mask Used for Morphology Operation.

length of a space time shape is the number of silhouette frames used in the concatenation. So, for every video sequence in the training set, a certain number of space
time shapes of length L and an overlap N are formed and action features are then
extracted from each of these space time shapes. A space time shape is illustrated in Figure 4.3.
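
The sliding-window formation of the space time shapes can be sketched as follows, where sil is the binary silhouette volume from the previous step (a hypothetical name) and the window length and shift are arbitrary values chosen only for illustration.

    % Cut the silhouette volume sil (rows x cols x K) into overlapping space time shapes.
    Lwin = 10;                                           % length of a space time shape (frames)
    step = 5;                                            % shift between shapes; overlap = Lwin - step
    shapes = {};
    for start = 1:step:(size(sil, 3) - Lwin + 1)
        shapes{end + 1} = sil(:, :, start:start + Lwin - 1);
    end
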

4.2 SEGMENTATION OF SPACE TIME SHAPE INTO DIFFERENT LEVELS

A space time shape can be considered as a concatenation of human body silhouettes extracted over a predefined number of frames with axes x, y and t where (x, y)
represent the frame axes and t represents the time axis. In the previous Chapter,
to describe a 2D binary image, more specifically a human silhouette, a 2D distance
transform was used to mark the interior pixels with values which are proportional
to its distance from the nearest boundary pixel. Here, an extension of the distance
transform to 3D is used where a voxel (a pixel in 3D) inside the volume spanned
by the space time shape is assigned a value which is proportional to the distance
between this voxel and the nearest boundary. The algorithm to compute the approximation of a 3D Euclidean distance is the 3-pass algorithm which is very similar
to the 2-pass used for 2D distance transforms. The difference between the 2D and
3D distance transform algorithm is that the minimum distance calculation is done
by finding the local minima of the lower envelope of the set of parabolas where each
parabola is defined on the basis of the Euclidean distance between the two points.
Also, when computing the 3D distance transform of a space time shape, the aspect
ratio is selected such that the time t axis gets more emphasis. A normalized gradient
of this distance transform is used to segment the space time shapes into different
levels. The number of levels and the interval between the levels are chosen so that
the most coarse level will be the concatenation of only the silhouette torsos without

the limbs obtained from each frame. In other words, the emphasis on the limbs is
reduced as the level becomes coarser and coarser.

4.2.1 Computation of 3D Distance Transform

The 3D distance transform is computed in three passes where each pass is associated
with a raster scan in a single dimension. In two of the passes, the distance transforms computed are bounded by the parabolas defined on the boundary voxels in the respective dimension. This type of distance transform measure is given by

D_f(p) = min_{q ∈ B} ((p − q)² + f(q))        (4-3)

where p is a non-boundary point, q is a boundary point, B is the boundary and
f(q) is the value of the distance measure associated with the boundary point q. It is seen that for every q ∈ B, the distance transform is bounded by the parabola rooted at (q, f(q)).
In short, the distance transform value at point p is the minimum of the lower envelope of the parabolas formed from every boundary point q. This idea can be used for the 3D
distance transform computation where the parabolas along a dimension are defined
from the boundary points along that dimension.
Let U (x, y, t) be the 3D distance transform computed for a space time shape
S(x, y, t). The 3D distance transform U (x, y, t) is computed by using the following
lemmas: Consider F(X, M) as the family of parabolas where each parabola is defined on a boundary voxel along the dimension X. This family is given by

F(X, M) = min_i (G_i(X) + (M − k_i)²)        (4-4)

where the scanning is done in (X, M ) plane along the X dimension, ki is the value of
the boundary voxel i along the dimension X and at fixed dimension M , and Gi (X)
is an arbitrary distance measure computed along the dimension X from each of the

boundary voxel i. Let Hk (X, M ) be one of the parabolas in the family of parabolas
F (X, M ). The scanning is done in three passes. In the first pass, the scanning
is done along the y axis in a raster scan manner to get intermediate distance
transform values. The algorithm used in this pass uses a 1D mask oriented along
the y axis for distance transform computation and its value reflects the shortest
distance of a voxel to the nearest boundary voxel which are along the y-axis or the
row axis. The boundary pixels on the left or right columns are not considered for
the computation during this pass. The second pass includes scanning along the x
axis direction in the raster scan manner. The distance transform computation at an
arbitrary voxel depends on the family of parabolas where each parabola is defined
at the boundary voxels occurring along the x-axis direction in the x − y plane. In
fact, this approximation to the distance transform is actually the minimum distance
of that particular voxel to the lower envelope of the family of parabolas F (x, y). In
the third pass, the scanning is done in the t-axis direction where again just like in
the second pass, the distance transform values depend on the family of parabolas
but, here, each parabola is defined on the boundary voxels which are occurring along
the t-axis direction in the y − t plane. The final 3D distance transform of the 3D
space time shape is obtained after these three passes. It should be noted that the
weights used in the computation of the distance transform can be included in each of the
passes. This provides the flexibility to incorporate non-isotropic aspect ratio so as
to emphasize the variation in a particular dimension.
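
Since MATLAB's bwdist does not expose per-axis weights, a minimal sketch of the non-isotropic 3D distance transform is given below using a simple approximation: the time axis is stretched by an integer factor before the transform and sampled back afterwards. The factor wt = 3 is an assumption, and this replication trick only approximates the weighted 3-pass computation described above.

    % Approximate anisotropic 3D distance transform of a space time shape S (logical volume).
    wt = 3;                                 % hypothetical weight emphasizing the time axis
    Ss = repelem(S, 1, 1, wt);              % replicate each frame so one time step spans wt voxels
    U  = bwdist(~Ss);                       % Euclidean distance to the nearest background voxel
    U  = U(:, :, ceil(wt / 2):wt:end);      % sample back to the original frame positions
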

4.2.2 Segmentation of a 3D Shape

Once the distance transform has been computed for the space time shape, the gradient of this distance transform is taken and this is used as the basis for segmenting
it into different levels. Some sample frames of the distance transformed image for
various aspect ratios are shown in Figure 4.5a, Figure 4.5b, and Figure 4.5c. It is

seen that the area covered by the torso part of the body has higher values than the
area covered by the limbs. By varying the aspect ratio, the different axes x,y and t
will have different emphasis on the computed distance transform. In other words, the
distance transform is tuned to the variation in the axes non-uniformly. Therefore, to
emphasize the time variation of a space time cube, the aspect ratio is selected such
that the time axis t is scaled more than the other axes. This is expected from the distance transformed space time shape, where the torso, being the interior part of the shape, will have higher values. But, human actions are distinguished by the variation of
the silhouette and these variations are more along the limbs than in the torso. So,
a better representation of the space time shape is required which emphasizes fast
moving parts so that the features extracted give the necessary variation to represent
the action. Thus, a normalized gradient of the distance transform is used and, as
shown in Figure 4.6a, Figure 4.6b and Figure 4.6c, the fast moving parts such as the
limbs have higher values compared to the torso region. The gradient of the space
time shape φ(x, y, t) is defined as

φ(x, y, t) = U(x, y, t) + K1 · ∂²U/∂x² + K2 · ∂²U/∂y² + K3 · ∂²U/∂t²        (4-5)

where U (x, y, t) is the distance transformed space time shape, Ki is the weight added
to the derivative taken along the ith axis. The weights associated with the gradients
along each of the axes are usually kept the same. It is seen that the proper variation
occurs in Figure 4.6c where the time axis has more emphasis. The fast moving parts, in this case the hands and legs, have high values, the regions surrounding the torso which are not so fast moving have moderate values, while the torso region, which moves very slowly with respect to the limbs, has very low values. Moreover,
this representation also contains concatenation of silhouettes from the previous frame
onto the current frame due to the gradient and so, the time nature is emphasized

Figure 4.5: Sample Frames of the 3D Distance Transformed Space-time Shapes with Various Aspect Ratios: (a) (3,4,10); (b) (5,7,11); (c) (1,1,20).

Figure 4.6: Sample Frames of the Normalized Gradient of the Space-time Shape with Various Aspect Ratios: (a) (3,4,10); (b) (5,7,11); (c) (1,1,20).

Figure 4.7: 8-Level Segmentation of Different Frames of a Space Time Shape of a Jumping Jack Action: (a) 2nd Frame; (b) 5th Frame; (c) 8th Frame.

in a single frame of the space time shape. In short, this representation of the space
time shape is tuned more towards the time variation where this variation is directly
related to the action being performed. The normalized log gradient is given by

L(x, y, t) = log(φ(x, y, t)) / max_{(x,y,t) ∈ S} (φ(x, y, t))        (4-6)

This normalized gradient is used to segment the space time shape into multiple
levels where, at each level, the features corresponding to the silhouette variations
in a frame are extracted. The statistics of L(x, y, t) such as the minimum value
and the standard deviation are computed and these statistics are used to define the

interval between adjacent levels. An illustration of 8-level segmentation of a space time shape for different frames is shown in Figure 4.7a, Figure 4.7b, and Figure
4.7c. The segmentation is done on each frame using the values of the normalized
gradient and, from each level, a particular set of features is extracted. In the next
section, the type of features extracted from the space time shape will be discussed.
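
A minimal MATLAB sketch of Equations (4-5) and (4-6) and of the level segmentation is given below, assuming S is the logical space time shape volume and U its 3D distance transform computed as above. The weights K1 = K2 = K3 = 1, the clamping of φ to a small positive value before the logarithm, and the exact form of the thresholds derived from the minimum and standard deviation of L are assumptions made only to illustrate the procedure.

    % Gradient-based representation (Equations 4-5, 4-6) and level segmentation.
    [Ux, Uy, Ut] = gradient(U);              % first derivatives of the 3D distance transform U
    Uxx = gradient(Ux);                      % d2U/dx2 (first output differentiates along x again)
    [~, Uyy] = gradient(Uy);                 % d2U/dy2
    [~, ~, Utt] = gradient(Ut);              % d2U/dt2
    phi = U + Uxx + Uyy + Utt;               % Equation (4-5) with K1 = K2 = K3 = 1
    Lmap = zeros(size(phi));
    v = max(phi(S), eps);                    % values inside the shape, clamped to stay positive
    Lmap(S) = log(v) / max(v);               % Equation (4-6)
    nlev = 8;
    thr = min(Lmap(S)) + (nlev:-1:1) * std(Lmap(S));   % thresholds from the statistics of L (assumed form)
    levels = cell(1, nlev);
    for l = 1:nlev
        levels{l} = S & (Lmap <= thr(l));    % coarser levels progressively drop the fast-moving limbs
    end
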

4.3 EXTRACTION OF ACTION FEATURES

There are two sets of features which are extracted from the segmented space time
shape. One is the set of translation invariant R-Transform features extracted at each
level and the other is the R-Translation vectors which are extracted from the coarsest
level. The R-Transform, mentioned in Chapter 3, describes the 2D shape variations
present in a frame at a certain level. The set of R-Transforms taken across the
frames of the space time shape at each level gives the variation which corresponds to
a particular action. The R-Translation vector taken at the coarsest level emphasizes
the translatory variation of the entire silhouette across the frames of the space time
shape while reducing the emphasis on the 2D shape variation to a large extent.

4.3.1 R-Transform Feature Set

The R-Transform feature set is the set consisting of elements where each element is
given by

R_{k,l}(α) = ∫_{−∞}^{∞} T²_{k,l}(s, α) ds        (4-7)

where T_{k,l}(s, α) is the 2D Radon transform of frame k of the space time shape at level l, and α ∈ [0, π) is the angle of inclination of the line onto which the silhouette in the frame is projected. For a space time shape containing K frames and for L number
of levels, the R-Transform feature set is a 3D matrix of size L × M × K where M is
the number of angles on which the projection is taken. Typically, M is taken as 180.

Figure 4.8: R-Transform Set for a Single Level of a Space Time Shape. (a) Jumping Jack; (b) Walk.
The surface plot for the R-Transform feature set for a single level is shown in Figure
4.8a and Figure 4.8b. This gives the variations of the silhouette body shape across
the frames. These variations differ from action to action and are independent of the
person performing that particular action. The reason is that these R-Transforms are
scale invariant, as explained in Chapter 3, and thus the variation of the silhouette
shape with person is removed to a great extent. A person who has a large silhouette
can be considered as a normalized silhouette shape which is scaled up and a person
who has small silhouette can be considered as a normalized silhouette which is scaled
down. Scale invariance property of the R-Transform removes these scaling variations
in the silhouette and only captures the variation in the silhouette due to the change
of shape with respect to time. Thus, these variations correspond only to the action, independent of the person who performed it. Moreover, the R-Transform is also
translation invariant, which means that the silhouette shape variations which
correspond to an action are captured irrespective of the position of the silhouette in
the frames. In other words, the silhouette is not required to be centered in the frame.
When the silhouette is rotated, the R-Transform gets shifted by a certain amount.
But, human actions seldom contain variations caused by the rotation of the human body and so there is no need to incorporate rotation invariance in the feature vector.
The human silhouettes either stay in place or move in a translatory manner when an
action is performed.

4.3.2 R-Translation Vector Set

The R-Transform feature set gives the variations of the silhouette shape across the
frames but removes the variation caused due to translation of the shape. Therefore,
to distinguish actions which have large translatory motions such as walk and
run actions from those which have very little translatory motion such as single hand
wave action, another set of features should be extracted which gives the translatory

variation while minimizing the time variation of the shape. This type of feature is
known as the R-Translation vector. This feature vector extracted from a frame k of
the space time shape at the coarsest level, is given by

RT_k(s) = ∫_{−π}^{π} T²_{k,1}(s, α) dα        (4-8)

where T_{k,1} is the 2D Radon transform of the centered silhouette present at the frame
k. The R-translation vector is obtained by integrating the 2D Radon transform over
the variable α. Before the extraction of the R-Translation vector, the silhouette in
every frame of the space time shape is shifted with respect to the position of the
silhouette in the first frame. The position of the silhouette in the first frame is given
by its centroid and the distance from this centroid to the center of the frame is
calculated. Then, the silhouette in every frame is shifted by this distance so that the
centroid of the silhouette in the first frame coincides with the center of the frame. The
R-Translation vector is then extracted from the modified silhouettes and the variation
in this vector across the frames gives the translation of the silhouette. The set of
R-Translation vectors extracted from the space time shape is a matrix of size K × M
where K is the number of frames and M refers to twice the maximum distance of the
projection line from the centroid of the silhouette where this projection line is used
in the Radon transform computation. The set of R-Translation vectors extracted
is illustrated in Figure 4.9a and Figure 4.9b. At every frame k, the figure shows a
Gaussian-like function having a peak at s = M/2 and these Gaussian-like functions
do not vary much across the frames for the jumping jack action but for the walk
action, there is a considerable variation. This shows that the jumping jack action
has less translatory motion than the walk action. The small variations that occur in
the R-Translation vectors of the jumping jack action are due to the time variations in the silhouette shape but, unlike the R-Transform feature set, the significance of those

Figure 4.9: R-Translation Vector Set for a Space Time Shape. (a) Jumping Jack; (b) Walk.
types of variations is given less emphasis in the R-Translation vector.
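
A minimal MATLAB sketch of Equation (4-8) is given below; the silhouettes are shifted by the offset that brings the centroid of the first frame's silhouette to the frame center, as described above, while the use of circshift with rounded offsets and a 1-degree angular sampling are implementation assumptions.

    % R-Translation vectors of a space time shape S (rows x cols x K) at the coarsest level.
    [rows, cols, K] = size(S);
    [r1, c1] = find(S(:, :, 1));                             % pixels of the first-frame silhouette
    offset = round([rows, cols] / 2 - [mean(r1), mean(c1)]); % centroid of frame 1 -> frame center
    RT = [];
    for k = 1:K
        f = circshift(S(:, :, k), offset);                   % same shift applied to every frame
        T = radon(f, 0:179);                                 % 2D Radon transform T(s, alpha)
        RT(:, k) = sum(T.^2, 2);                             % integrate T^2 over alpha, Equation (4-8)
    end

By the symmetry property T(s, α) = T(−s, α ± π), sampling α over [0°, 180°) captures the same projections as the [−π, π] range in Equation (4-8).
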

4.4 CLASSIFICATION OF ACTION FEATURES

Classification of the action features is done using the nearest neighbor approach
where a data point in the classifier is the action feature set extracted from a space
time shape. The extracted action features are the concatenation of the R-Transform
feature set and the R-Translation vector set which is done by first converting the
multi-dimensional R-Transform feature Set and the R-Translation vector set into
a single dimensional vectors and then, appending one with the other to form the
complete single dimensional action feature vector. This action feature vector, used
as a test vector, is then compared with the action feature vectors of the space time
shapes of each training action class using the Euclidean distance metric. The class
of the action feature vector which has the minimum distance from the test vector
is noted and the space time shape corresponding to this test vector is then grouped
under this class.
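A minimal MATLAB sketch of this classification step is shown below. It assumes the training action features have already been flattened into the rows of a matrix trainFeat with class labels trainLabel, and that rTransformFeatures and rTranslationVectors hold the two multi-dimensional feature sets of the test space time shape; all variable names are illustrative.

    % Flatten and concatenate the two feature sets into one action feature vector.
    testFeat = [rTransformFeatures(:); rTranslationVectors(:)]';   % 1 x D test vector

    % trainFeat : N x D matrix, one flattened action feature vector per row
    % trainLabel: N x 1 vector of action class indices (a1..a10 -> 1..10)
    diffs = trainFeat - repmat(testFeat, size(trainFeat, 1), 1);
    dists = sqrt(sum(diffs.^2, 2));          % Euclidean distance to every training sample

    [minDist, idx] = min(dists);             % nearest neighbour
    predictedClass = trainLabel(idx);        % class under which the test shape is grouped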

4.5 SUMMARY

This Chapter explains the action recognition algorithm proposed in this thesis. The
algorithm uses the concept of the space time shape and extends the idea of multi-level
representation using the modified R-Transform. The extension is to represent a space
time shape with a 3D distance transform and then use the normalized gradient of
this transform to segment the shape into different levels. Once segmented, the
R-Transform is extracted from every frame at every level to obtain the R-Transform
feature set. To capture the translatory variation of the silhouette across the frames,
another feature known as the R-Translation vector is defined and extracted from the
coarsest level. These two feature sets are concatenated to form the complete feature
vector, which is then used in a nearest neighbor classifier.
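As a loose sketch of the multi-level step summarized above, the MATLAB fragment below computes the 3D Euclidean distance transform of a binary space time shape, takes its gradient along the x, y and t axes, and splits the shape into a few levels by banding the normalized gradient magnitude. The number of levels and the equally spaced bands are illustrative assumptions; the exact segmentation rule is the one defined earlier in Chapter 4.

    % shape : H x W x K binary space time shape (true inside the silhouette volume)
    D = bwdist(~shape);                      % 3D Euclidean distance to the shape boundary

    [gx, gy, gt] = gradient(D);              % gradient along the x, y and t axes
    G = sqrt(gx.^2 + gy.^2 + gt.^2);
    G = G / max(G(:));                       % normalized gradient magnitude

    numLevels = 3;                           % assumed number of levels
    edges = linspace(0, 1, numLevels + 1);   % equally spaced bands (assumption)
    levels = cell(1, numLevels);
    for l = 1:numLevels
        % voxels of the shape whose normalized gradient falls inside band l
        levels{l} = shape & (G >= edges(l)) & (G <= edges(l + 1));
    end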


CHAPTER 5
EXPERIMENTAL RESULTS
The algorithm proposed in this thesis is implemented using MATLAB 7.4, OpenCV version 1.0, and C++ on a computer with 3.0
GB of RAM and a 2.00 GHz processor running Ubuntu 9.10 as the operating system. The
background segmentation using the median background model and the silhouette extraction by the morphological operations of dilation and erosion are implemented in C++
using the OpenCV vision library. The silhouettes are then loaded into the MATLAB environment, and the extraction of features is implemented using the functions
available in MATLAB. Testing of the extracted action features and the computation
of the recognition accuracy are performed in MATLAB using the nearest neighbor
classifier, which is also implemented in the same environment.
The algorithm is tested on the Weizmann dataset, which consists of 90 low-resolution video sequences, each having a single person performing a particular action. Each video sequence is taken at 50 fps with a frame size of 180 × 144. This
dataset contains 10 different action classes, with each action class containing 9 sample video sequences, each performed by a different person. The
action classes are “bend”, “jump-in-place”, “jumping-jack”, “jump-forward”, “run”,
“gallop-sideways”, “wave-one-hand”, “skip”, “wave-two-hands” and “walk”. Space
time shapes are extracted from each of the video sequences by using a window of
pre-defined length along the time axis. Each shift of the window by a pre-determined
amount gives rise to a space time shape. If the shift is less than the length of the
window, the space time shape will overlap with the previously extracted space time
shape, both belonging to the same video sequence.
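The sliding-window extraction described above can be sketched in MATLAB as follows, assuming the silhouettes of one video sequence are stored as a binary array sil of size H × W × numFrames; the variable names and the particular (length, overlap) values are illustrative.

    % sil : H x W x numFrames binary silhouette sequence of one video
    len     = 12;                    % window length along the time axis
    overlap = 6;                     % overlap with the previously extracted shape
    step    = len - overlap;         % shift of the window between consecutive shapes

    numFrames = size(sil, 3);
    shapes = {};
    for startF = 1:step:(numFrames - len + 1)
        % each window of len consecutive silhouettes is one space time shape
        shapes{end + 1} = sil(:, :, startF:startF + len - 1);
    end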
The training data set thus consists of the space time shapes extracted from each of
the video sequences in the database. For evaluation of the algorithm, a variation of
the “leave-one-out” procedure is used: to test a video sequence, the space time
shapes corresponding to that particular video sequence are taken as the testing data
and these same space time shapes are left out of the training data. Classification
is done by comparing the features extracted from the test space time shape with
those extracted from the training space time shapes and then using the nearest
neighbor rule. The evaluation of the algorithm is conducted in two phases. In the
first, the window size and overlap are varied and the recognition rates are plotted
for each combination. The second involves the comparison of this algorithm with
other methods or existing algorithms. The results are provided in the form of
confusion matrices and bar graphs, where the confusion matrix gives the recognition
rates for each set of parameters and the bar graphs give the overall accuracy
achieved by the algorithm. The confusion matrix gives the classification rates of
each of the action classes obtained by the algorithm. Each row of the matrix refers
to the actual action class of the space time shapes and each column refers to the
action class under which the space time shapes are classified. The quantity in each
entry gives the percentage of the space time shapes classified under the various
classes. The notation for each of the actions is given in Table 1.
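The per-video leave-one-out evaluation and the accumulation of the confusion matrix can be sketched as follows, assuming every space time shape i has already been reduced to a feature vector feat(i, :), with actual action label label(i) and source video index vid(i); all names are illustrative, and the simple nearest neighbour rule described above is used.

    % feat  : N x D matrix of action feature vectors (one per space time shape)
    % label : N x 1 actual action class index of each shape (1..10)
    % vid   : N x 1 index of the video sequence each shape was extracted from
    numClasses = 10;
    confusion = zeros(numClasses);           % rows: actual class, columns: predicted class

    for v = unique(vid)'                     % leave one video sequence out at a time
        testIdx  = (vid == v);
        trainIdx = ~testIdx;
        trainLabels = label(trainIdx);
        for i = find(testIdx)'
            d = sqrt(sum((feat(trainIdx, :) - ...
                repmat(feat(i, :), nnz(trainIdx), 1)).^2, 2));
            [dmin, k] = min(d);
            pred = trainLabels(k);
            confusion(label(i), pred) = confusion(label(i), pred) + 1;
        end
    end

    % convert the counts to row-wise percentages, as reported in the confusion matrices
    confusion = 100 * confusion ./ repmat(sum(confusion, 2), 1, numClasses);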

5.1 VARIATION OF SPACE TIME SHAPE LENGTH AND OVERLAP

In this experiment, the length of the space time shape and the overlap are varied.
Six combinations of (Length, Overlap) are taken: (6,3), (8,4), (10,5), (12,6), (14,7)
and (16,8). The experiment shows that the “bend”, “jump-in-place” and “single-hand-wave”
actions have very good accuracies of over 95%, while actions such
as “jumping-jack”, “two-hand-wave” and “walk” have an accuracy of over
92%. Actions such as “jump-forward”, “run” and “gallop-sideways” have a
fair accuracy of over 88%.

TABLE 1: Notations Representing each Action.

Notation   Action
a1         Bend
a2         Jump-In-Place
a3         Jumping-Jack
a4         Jump-Forward
a5         Run
a6         Gallop-Sideways
a7         Wave-One-Hand
a8         Skip
a9         Two-Hand-Wave
a10        Walk

The “skip” action, however, has a poor accuracy of around 60% for all lengths of the
space time shape. Here, the “skip” action is confused with the “walk” and “run”
actions because the variation of the silhouette shape for “skip” is similar to that of
“walk” and “run”, and the features extracted by the algorithm are not discriminative
enough to separate them. Despite the low accuracy for the “skip” action, the
algorithm obtains a fairly high overall classification accuracy of over 90%. The
average accuracy obtained with different space time shape lengths is shown in
Figure 5.1, both inclusive and exclusive of the “skip” action. The overall accuracy
obtained without the “skip” action is higher than that obtained with it and peaks
around 95%. Moreover, even with the “skip” action, the overall accuracy still peaks
around 92%, which is still good. Further analysis shows that the accuracy is better
for space time shapes of larger length, because longer space time shapes include
more time variation and hence provide better discrimination between action features.
The optimal length of the space time shape is found to be around 12 or 14, where
the overall accuracy is around 95% without the “skip” action and 92% with it.

Figure 5.1: Bar Graph Showing the Overall Accuracy Obtained with the Proposed
Algorithm for Different Lengths of the Space Time Shape. The Overlap is Half the
Length of the Space Time Shape.

The confusion matrices for the (12, 6) and (14, 7) cases are given in Table 2.
The variation in the accuracy of each type of action with increasing length and
overlap of the space time shape is illustrated in Figure 5.2. The “jump-in-place”
action has 100% accuracy for all lengths of the space time shape, while the “bend”
action also has 100% accuracy except for the (6,3) length-overlap combination. This
is because the action features extracted from these actions do not share similarities
with those extracted from the other actions. The “single-hand-wave” and
“two-hand-wave” actions have a good accuracy of above 92% for all lengths. The
“jump-forward” action has a fairly consistent accuracy of 88% for all lengths and is
often confused with the “skip” action due to feature similarity. The “jumping-jack”
action shows a consistent accuracy of around 92% and is sometimes confused with
the “two-hand-wave” action. The “walk” and “run” actions show consistent
accuracies of about 95% and 85% respectively and, as mentioned before, are
sometimes confused with the “skip” action. The “gallop-sideways” accuracy is less
consistent and oscillates with the length of the space time shape, because this action
also gets classified under the “skip” action due to similar features. The action which
shows a consistently poor accuracy with the proposed algorithm is the “skip” action,
as it very often gets classified under the “walk”, “gallop-sideways” or “run” actions.
The error rates of the actions with increasing length further illustrate these points,
as shown in Figure 5.3.

5.2 COMPARISON OF PROPOSED ALGORITHM WITH OTHER METHODS

In this section, the proposed algorithm is compared with other methods, each of
which uses a different shape descriptor. In effect, the comparison is between the
shape descriptors used for action recognition.
73

TABLE 2: Confusion Matrix Obtained with the Proposed Algorithm.

(a) (Length, Overlap) = (12, 6)

        a1      a2      a3      a4      a5      a6      a7      a8      a9      a10
a1    100.0     0       0       0       0       0       0       0       0       0
a2      0     100.0     0       0       0       0       0       0       0       0
a3      0       0     96.80     0       0       0       0       0      3.22     0
a4      0       0       0     88.10     0       0       0      11.9     0       0
a5      0       0       0       0     86.00     0       0      10       0       4
a6      0       0       0       0       0     96.40     0       0       0      3.63
a7      0      2.27     0       0       0       0     97.70     0       0       0
a8      0       0       0      3.45    15.5     0       0     65.50     0      15.5
a9      0      4.76    1.2      0       0       0       0       0     94.10     0
a10     0       0      2.2     1.1     1.1      0       0      2.2      0     93.40

(b) (Length, Overlap) = (14, 7)

        a1      a2      a3      a4      a5      a6      a7      a8      a9      a10
a1    100.0     0       0       0       0       0       0       0       0       0
a2      0     100.0     0       0       0       0       0       0       0       0
a3      0       0     94.90     0       0       0       0       0      5.13     0
a4      0       0       0     89.60     0       0       0      10.4     0       0
a5      0       0       0       0     85.40     0       0      9.76     0      4.9
a6      0       0       0       0       0     91.10     0       0       0      8.89
a7      0       0       0       0       0       0     98.60     0      1.4      0
a8      0       0       0      4.17   18.75     0       0     62.50     0      14.6
a9      0       0       0       0       0       0       0       0     100.0     0
a10     0       0      3.89     0      1.29     0       0       0       0     94.80

Figure 5.2: Accuracy for Different Combinations of (length, overlap) of Space Time Shape. (a) with skip; (b) without skip.

Figure 5.3: Error Rate for Different Combinations of (length, overlap) of Space Time Shape. (a) with skip; (b) without skip.

The recognition accuracies and error rates of the proposed algorithm have been compared
with methods in which the R-Transform and the Zernike moments are used as
shape descriptors. Each of these descriptors describes a 2D silhouette shape, and the
variation of the descriptor across the space time shape is taken as the action features. The
comparison of accuracies is shown as a bar graph in Figure 5.4. As seen, the lowest
accuracy is obtained by using the R-Transform directly on the 2D silhouette and
considering its variation as the action features. This is because the R-Transform only
describes the variation of the silhouette along the x − y plane. No time variation is
being captured by the R-Transform and the variation in the 2D silhouette description
across the space time shape emphasizes the time variation to a minimal extent. This
time variation does not represent the action performed by the person completely
and this can be seen in the low accuracy obtained. The same applies to the
2D Zernike moments, where again the time variation captured does not completely
pertain to the action, although they do achieve better accuracy than the R-Transform.
The better accuracy is because Zernike moment shape descriptors are region based
descriptors while the R-Transform is a boundary-based descriptor and so, the time
variation in the Zernike moment descriptor across the frames of the space time shape
captures the action with more emphasis than the time variation in the R-Transform
shape descriptor. In the proposed algorithm, one major difference from the above-mentioned
methods is that the time variation is captured before applying the shape descriptor
and this is done by taking the gradient of the distance transform across the x,y, and
t axes and then, using that to segment the space time shape into different levels.
This captures the action much more efficiently than the simple R-Transform method
and the Zernike moment method and hence, achieves much better accuracy.
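For reference, the R-Transform baseline compared against here simply applies the R-Transform to every 2D silhouette of the space time shape and stacks the results over the frames. A minimal MATLAB sketch of that baseline is given below; the one-degree angular sampling is an assumption.

    % shape : H x W x K binary space time shape
    theta = 0:179;                                 % projection angles in degrees
    K = size(shape, 3);
    Rfeat = zeros(K, numel(theta));
    for k = 1:K
        T = radon(double(shape(:, :, k)), theta);  % 2D Radon transform of the silhouette
        Rfeat(k, :) = sum(T.^2, 1);                % R-Transform: integrate T^2 over s
    end
    % the variation of the rows of Rfeat across the K frames is taken as the action feature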
The individual recognition rates achieved with the Zernike moments and the R-Transform
are shown in Table 3. As seen in the table, the individual recognition rates achieved
with the proposed algorithm are far better than those achieved with the Zernike
moments and the R-Transform.

Figure 5.4: Accuracy for Different Combinations of (length, overlap) of Space Time Shape for Different Shape Descriptors. (a) With Skip Action; (b) Without Skip Action.

TABLE 3: Confusion Matrix Obtained with Zernike Moments, R-Transform and Poisson’s Equation Shape Descriptors.

(a) Zernike Moments

        a1      a2      a3      a4      a5      a6      a7      a8      a9      a10
a1    98.10     0       0       0       0       0      1.85     0       0       0
a2      0     83.10     0       0       0       0      16.9     0       0       0
a3     1.07     0     91.40     0       0       0       0      2.15    1.08    4.3
a4      0       0       0     88.10     0       0       0      11.9     0       0
a5      0       4       4       2     50.00     4       2      10       0      24
a6      0      9.09     0       0       0     70.90    16.4    1.81     0      1.81
a7      0      7.96     0       0       0       0     50.00     0      42.1     0
a8      0       0      10.3    17.2    12.1     0      22.40   18.9    5.17    13.8
a9      0       0       0       0       0       0      26.2     0     73.80     0
a10     0       0       0       0      10.9    3.29    2.19    9.89    2.19    71.40
(b) R-Transform

        a1      a2      a3      a4      a5      a6      a7      a8      a9      a10
a1    87.00     0       0      5.56    1.85     0       0      5.56    50       0
a2      0     66.20    9.85     0       0      11.3    2.81    4.23    1.41    4.23
a3      0      9.68   71.00     0       0      2.15     0       0      14      3.23
a4     10.2    1.69     0     57.60    1.69    5.1     3.39    11.9     0      8.5
a5      4       4       0       2      32      12     22.00    20       0       4
a6      0      12.7    5.45    3.63    5.45   47.30    3.63     0       0      21.8
a7      0      2.27     0       0      7.95    2.27   86.40    1.13     0       0
a8     8.62    12.1    1.72    6.89    32.8    5.17    10.4    13.80    1.72    6.89
a9      0      4.76    13.1     0       0      1.19     0       0     78.60    2.38
a10     0      6.6     7.7     7.7     2.2    17.6     1.1     3.3     5.5    48.40
(c) Poisson Shape Descriptor

        a1      a2      a3      a4      a5      a6      a7      a8      a9      a10
a1    99.10     0       0       0       0       0       0       0      0.9      0
a2      0     100.0     0       0       0       0       0       0       0       0
a3      0       0     100.0     0       0       0       0       0       0       0
a4      0       0       0     89.20     0       0       0      10.8     0       0
a5      0       0       0       0     98.00     0       0      0.2      0       0
a6      0       0       0       0       0     100.0     0       0       0       0
a7      0      0.9     0.9      0       0       0     94.80     0      3.5      0
a8      0       0       0       0      2.9      0       0     97.10     0       0
a9      0       0      0.9      0       0       0      1.9      0     97.20     0
a10     0       0       0       0       0       0       0       0       0     100.0

This is especially evident for the “skip” action, where the recognition rate achieved
with the Zernike moments and the R-Transform is just under 20%, while that achieved
with the proposed algorithm is above 60%. Another comparison can be made with the
current state-of-the-art action recognition framework, which uses a Poisson-equation-based
shape descriptor to capture motion variations. The confusion matrix for this framework
is provided in Table 3c. It can be seen that the proposed algorithm achieves individual
action accuracies close to the state of the art, with some action accuracies exceeding it.
The only action with lower accuracy compared to the state of the art is the “skip” action.

5.3 SUMMARY

This Chapter evaluates the algorithm in two ways. The first is to vary the length
and overlap of the space time shape and to plot the average accuracies as a bar
graph along with the individual action accuracies and error rates. From these plots,
it is found that the best combinations of (length, overlap) are (12, 6) and (14, 7).
The second way of evaluation is to compare the algorithm with other methods, such
as those using moment-based shape descriptors or just the Radon transform based
shape descriptor for shape representation. The proposed algorithm shows comparatively
higher accuracy. The comparison is also made with another action recognition
framework which uses a Poisson’s equation based shape descriptor for shape
representation. The accuracies obtained with the proposed algorithm are comparable,
with some actions achieving better accuracy while one particular action has lower
accuracy.


CHAPTER 6
CONCLUSIONS AND FUTURE WORK
This thesis has presented an algorithm which uses the concept of multi-level
representation of a 3D shape for action classification. An action has been considered
as a space time shape, or 3D shape, and a multi-level representation based on the
gradient of the 3D distance transform and the Radon transform has been used to
extract the action features. Silhouettes from a video sequence containing a particular
action have been concatenated to form the space time shapes representing that action.
Action features were extracted from each level of the representation, and these
features, concatenated into a single feature vector, were used in a nearest neighbor
classifier for recognition.
The evaluation of the algorithm was performed by comparing its recognition
accuracies with those of methods which used shape descriptors such as Zernike
moments and the R-Transform to represent the space time shape. The results showed
higher accuracy rates for the proposed algorithm. A further comparison was made
with the action recognition framework which used a shape descriptor based on
Poisson’s equation for space time shape representation. The results showed comparable
accuracy rates for the majority of the actions, with some of them being higher for the
proposed algorithm. Another evaluation was performed by computing the accuracies
and error rates obtained with different lengths and overlaps of the space time shape.
It was found that the optimal space time shape parameters (length, overlap) were
(12, 6) and (14, 7). The database used for these evaluations was the Weizmann action
database, which contains 90 video sequences with 10 action classes.
Although the average accuracies are high, the accuracy for one particular action
obtained by the proposed algorithm is low, as the features extracted from the space
time shape corresponding to this action cannot be discriminated from those of other
similar actions. Future work will involve the extraction of more localized features so
that the average accuracy as well as the accuracy of each individual action is high.
The use of a more sophisticated classification technique is also being considered for
action recognition.




VITA
Personal Information:
Binu M Nair
Department of Electrical and Computer Engineering
Old Dominion University
Norfolk, VA 23529
Phone No : 757-288-3522
Email : [email protected]
Education:
Master of Science in Electrical and Computer Engineering
G.P.A 3.81
August 2010
Old Dominion University, Norfolk, VA
Bachelor of Technology in Electronics and Communication
Percentage 70 %
April 2007
Cochin University of Science and Technology, Cochin, India
Work Experience:
Research Assistant
Old Dominion University, Norfolk, VA
Fall 2008 - Spring 2010
Programmer Analyst
Mindtree Consulting Ltd., Bangalore, India
July 2007-April 2008
