Design and Implementation of a Highly Efficient Object
Tracking System Using Modified Mean Shift Tracking
Abstract
Object tracking has been widely applied to video retrieval, robotics control, traffic surveillance
and homing technologies. Many object tracking algorithms have been reported in the literature,
but the area still lacks an efficient algorithm that can not only track objects but at the same time
recognize the orientation and movement of the object. In this project work, an efficient object
tracking system is proposed based on the Modified Mean Shift Tracking (MMST) algorithm.
This project work addresses the problem of estimating the scale and orientation changes of the
target under the mean shift tracking framework. In the original mean shift tracking algorithm, the
position of the target can be well estimated, while the scale and orientation changes cannot be
adaptively estimated. Considering that the weight image derived from the target model and the
candidate model can represent the possibility that a pixel belongs to the target, this project work
shows that the original mean shift tracking algorithm can be derived using the zeroth- and
first-order moments of the weight image. With the zeroth-order moment and the Bhattacharyya
coefficient between the target model and the candidate model, a simple and effective method is
proposed to estimate the scale of the target. Then an approach, which utilizes the estimated area
and the second-order center moment, is proposed to adaptively estimate the width, height and
orientation changes of the target.
CHAPTER 1
1. Introduction
Real-time object tracking is a critical task in computer vision, and many algorithms have been
proposed to overcome the difficulties arising from noise, occlusions, clutters, and changes in the
foreground object and/or background environment [14]. Among various tracking methods, the
mean shift tracking algorithm is a popular one due to its simplicity and efficiency. The mean shift
algorithm was originally developed by Fukunaga and Hostetler [2] for data analysis, and later
Cheng [3] introduced it to the field of computer vision. Bradski [6] modified it and developed the
Continuously Adaptive Mean Shift (CAMSHIFT) algorithm for face tracking. Comaniciu and
Meer successfully applied the mean shift algorithm to image segmentation [8] and object
tracking [7, 9]. Some optimal properties of mean shift were discussed in [13, 15].
In the classical mean shift tracking algorithm [9], the estimation of scale and orientation
changes of the target is not solved. Although it is not robust, the CAMSHIFT algorithm [6], as
the earliest mean shift based tracking scheme, could actually deal with various types of
movements of the object. In CAMSHIFT, the moment of the weight image determined by the
target model was used to estimate the scale (also called area) and orientation of the object being
tracked. Based on Comaniciu et al.'s work in [9], many tracking schemes [10, 11, 17, 18, 23]
were proposed to solve the problem of target scale and/or orientation estimation. Collins [10]
adopted Lindeberg et al.'s scale space theory [19, 20] for kernel scale selection in mean-shift
based blob tracking. However, it cannot handle rotation changes of the target. An EM-shift
algorithm was proposed by Zivkovic and Krose in [11], which simultaneously estimates the
position of the local mode and the covariance matrix that can approximately describe the shape
of the local mode. In [23], a distance transform based asymmetric kernel is used to fit the object
shape through a scale adaptation followed by a segmentation process. Hu et al. [17] developed a
scheme to estimate the scale and orientation changes of the object by using spatial-color features
and a novel similarity measure function [12, 16].
In this project work, a Modified Mean Shift Tracking (MMST) algorithm is proposed
under the mean shift framework. Unlike CAMSHIFT, which uses the weight image determined
by the target model alone, the proposed MMST algorithm employs the weight image derived
from both the target model and the target candidate model in the target candidate region to
estimate the target scale and orientation. Such a weight image can be regarded as the density
distribution function of the object in the target candidate region, and the weight value of each
pixel represents the possibility that it belongs to the target. Using this density distribution
function, we can compute the moment features and then effectively estimate the width, height
and orientation of the object based on the zeroth-order moment, the second-order center moment
and the Bhattacharyya coefficient between the target model and the target candidate model.
CHAPTER 2
2. Literature Review
Among various tracking methods, the mean shift tracking algorithm is a popular one due to its
simplicity and efficiency.
The mean shift algorithm was originally developed by Fukunaga and Hostetler [2] for
data analysis. In their paper, nonparametric density gradient estimation using a generalized
kernel approach is investigated. Conditions on the kernel functions are derived to guarantee
asymptotic unbiasedness, consistency, and uniform consistency of the estimates. The results are
generalized to obtain a simple mean-shift estimate that can be extended in a nearest-neighbor
approach. Applications of gradient estimation to pattern recognition are presented using
clustering and intrinsic dimensionality problems, with the ultimate goal of providing further
understanding of these problems in terms of density gradients.
Cheng [3] introduced the mean shift algorithm to the field of computer vision. In his
paper, mean shift, a simple iterative procedure that shifts each data point to the average of the
data points in its neighborhood, is generalized and analyzed. This generalization makes some
k-means-like clustering algorithms its special cases. It is shown that mean shift is a mode-seeking
process on a surface constructed with a "shadow" kernel. For Gaussian kernels, mean shift is a
gradient mapping. Convergence is studied for mean shift iterations. Cluster analysis is treated as
a deterministic problem of finding a fixed point of mean shift that characterizes the data.
Applications in clustering and Hough transform were demonstrated. Mean shift is also
considered as an evolutionary strategy that performs multistart global optimization.
Bradski [6] modified the mean shift algorithm developed by Cheng [3] and developed the
Continuously Adaptive Mean Shift (CAMSHIFT) algorithm for face tracking. As a first step
towards a perceptual user interface, a computer vision color tracking algorithm was developed
and applied towards tracking human faces. Computer vision algorithms that are intended to form
part of a perceptual user interface must be fast and efficient. They must be able to track in real
time yet not absorb a major share of computational resources: other tasks must be able to run
while the visual interface is being used. The new algorithm developed here was based on a robust
non-parametric technique for climbing density gradients to find the mode (peak) of probability
distributions, called the mean shift algorithm. In this case, the goal was to find the mode of a
color distribution within a video scene. Therefore, the mean shift algorithm was modified to deal
with dynamically changing color probability distributions derived from video frame sequences.
The modified algorithm was called the Continuously Adaptive Mean Shift (CAMSHIFT)
algorithm. CAMSHIFT's tracking accuracy was compared against a Polhemus tracker. Tolerance
to noise and distractors, as well as performance, was studied. CAMSHIFT was then used as a
computer interface for
controlling commercial computer games and for exploring immersive 3D graphic worlds.
Comaniciu and Meer successfully applied the mean shift algorithm to image segmentation
[7] and object tracking. They developed a new method for real-time tracking of non-rigid
objects seen from a moving camera. The central computational module is based on the mean
shift iterations and finds the most probable target position in the current frame. The dissimilarity
between the target model (its color distribution) and the target candidates was expressed by a
metric derived from the Bhattacharyya coefficient. The theoretical analysis of the approach
showed that it relates to the Bayesian framework while providing a practical, fast and efficient
solution. The capability of the tracker to handle, in real time, partial occlusions, significant
clutter, and target scale variations was demonstrated for several image sequences.
Comaniciu and Meer [8] then modified their approach and developed a general
nonparametric technique for the analysis of a complex multimodal feature space and the
delineation of arbitrarily shaped clusters. The basic computational module of the technique was
an old pattern recognition procedure: the mean shift. For discrete data, they proved the
convergence of a recursive mean shift procedure to the nearest stationary point of the underlying
density function and, thus, its utility in detecting the modes of the density. The relation of the
mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust
M-estimators of location was also established. Algorithms for two low-level vision tasks,
discontinuity-preserving smoothing and image segmentation, were described as applications. In
those algorithms, the only user-set parameter was the resolution of the analysis, and either
gray-level or color images are accepted as input. Extensive experimental results illustrated their
excellent performance.
Comaniciu et al. [9] addressed vision-based tracking, a challenging engineering problem
and one of the hot research areas in machine vision. At that time, kernel-based tracking using the
Bhattacharyya similarity measure was shown to be an efficient technique for non-rigid object
tracking through a sequence of images. In their paper they presented a robust and efficient
tracking approach for targets having large motions compared to their sizes. Their tracking
approach was based on calculating the Gaussian pyramids of the images and then applying the
mean shift algorithm at each pyramid level for tracking the target. Model-based tracking often
suffers abrupt changes in the target model, which is compensated by model updates of the target.
This leads to a very efficient and robust nonparametric tracking algorithm; the new method was
easily able to track fast moving targets and is more robust and environment independent
compared to the original kernel-based object tracking.
Collins [10] noted that the mean-shift algorithm is an efficient technique for tracking 2D
blobs through an image. Although the scale of the mean-shift kernel is a crucial parameter, there
was previously no clean mechanism for choosing or updating the scale while tracking blobs that
are changing in size. He adapted Lindeberg's (1998) theory of feature scale selection, based on
local maxima of differential scale-space filters, to the problem of selecting kernel scale for
mean-shift blob tracking. He showed that a difference of Gaussian (DOG) mean-shift kernel
enables efficient tracking of blobs through scale space. Using this kernel requires generalizing
the mean-shift algorithm to handle images that contain negative sample weights.
Zivkovic and Krose [11] observed that the iterative procedure called 'mean shift' is a
simple robust method for finding the position of a local mode (local maximum) of a kernel-based
estimate of a density function. A new robust algorithm was developed that presented a natural
extension of the mean-shift procedure. The new algorithm simultaneously estimates the position
of the local mode and the covariance matrix that describes the approximate shape of the local
mode. They applied the new method to develop a new 5-degrees-of-freedom (DOF)
color-histogram-based non-rigid object tracking algorithm.
Yang et al. [12] noted that the mean shift algorithm has achieved considerable success in
object tracking due to its simplicity and robustness. It finds local minima of a similarity measure
between the color histograms or kernel density estimates of the model and target image. The
most typically used similarity measures are the Bhattacharyya coefficient and the
Kullback-Leibler divergence. In practice, these approaches face three difficulties. First, the
spatial information of the target is lost when the color histogram is employed, which precludes
the application of more elaborate motion models. Second, the classical similarity measures are
not very discriminative. Third, the sample-based classical similarity measures require a
calculation that is quadratic in the number of samples, making real-time performance difficult.
To deal with these difficulties they proposed a simple-to-compute and more discriminative
similarity measure in spatial-feature spaces. The new similarity measure allows the mean shift
algorithm to track more general motion models in an integrated way. To reduce the complexity
of the computation to linear order they employed the improved fast Gauss transform. This leads
to a very efficient and robust nonparametric spatial-feature tracking algorithm. The algorithm
was tested on several image sequences and shown to achieve robust and reliable frame-rate
tracking.
Carreira-Perpinan [15] noted that the mean-shift algorithm, based on ideas proposed by
Fukunaga and Hostetler, is a hill-climbing algorithm on the density defined by a finite mixture or
a kernel density estimate. Mean shift can be used as a nonparametric clustering method and has
attracted recent attention in computer vision applications such as image segmentation and
tracking. He showed that, when the kernel is Gaussian, mean shift is an
expectation-maximization (EM) algorithm and, when the kernel is non-Gaussian, mean shift is a
generalized EM algorithm. This implies that mean shift converges from almost any starting point
and that, in general, its convergence is of linear order. For Gaussian mean shift, he showed:
1) the rate of linear convergence approaches 0 (superlinear convergence) for very narrow or very
wide kernels, but is often close to 1 (thus, extremely slow) for intermediate widths, and exactly 1
(sublinear convergence) for widths at which modes merge; 2) the iterates approach the mode
along the local principal component of the data points from the inside of the convex hull of the
data points; and 3) the convergence domains are non-convex, can be disconnected, and show
fractal behavior. He suggested ways of accelerating mean shift based on the EM interpretation.
Hu et al. [17] developed an enhanced mean-shift tracking algorithm using a joint
spatial-color feature and a novel similarity measure function. The target image was modeled with
the kernel density estimation, and new similarity measure functions were developed using the
expectation of the estimated kernel density. With these new similarity measure functions, two
similarity-based mean-shift tracking algorithms were derived. To enhance robustness,
weighted-background information was added into the proposed tracking algorithm. Further, to
cope with the object deformation problem, the principal components of the variance matrix were
computed to update the orientation of the tracking object, and the corresponding eigenvalues
were used to monitor the scale of the object. Their experimental results showed that the new
similarity-based tracking algorithms can be implemented in real time and were able to track the
moving object with an automatic update of the orientation and scale changes.
Quast and Kaup [23] developed a new technique for object tracking based on the mean
shift method. Instead of using a symmetric kernel as in traditional mean shift tracking, the
developed tracking algorithm uses an asymmetric kernel which is retrieved from an object mask.
During the mean shift iterations, not only is the new object position located, but the kernel scale
is also altered according to the object scale, providing an initial adaptation of the object shape.
The final shape of the kernel is then obtained by segmenting the area inside and around the
adapted kernel and distinguishing the object segments from the non-object segments. Thus, the
object shape is tracked very well even if the object is performing out-of-plane rotations.
CHAPTER 3
3. Problem Identification and Objective of Project Work
Significant research effort has focused on video-based motion tracking [1] [2] [3] [4] and has
attracted the interest of industry. Performance evaluation of motion tracking is important not
only for the comparison and further development of algorithms by researchers, but also for the
commercialization and standardization of the technology. A lot of work has already been done
on the development of object tracking systems, and one of the most important approaches among
them is mean shift tracking. Although the mean shift tracking algorithm is able to provide good
results, it cannot deal with the scale and orientation of targets. So the most important problem
identified for this project is to modify the mean shift tracking algorithm to resolve the scale and
orientation problems of a target tracking system.
The objective of this project is the design and implementation of a highly efficient object
tracking system in MATLAB using modified mean shift tracking, which can efficiently handle
scale and orientation changes during tracking of an object.
CHAPTER 4
4. What is Object Tracking?
4.1 INTRODUCTION
Capturing video is becoming increasingly easy. Machines that see and understand their
environment already exist, and their development is accelerated by advances both in microelectronics and in video analysis algorithms. Now, many opportunities have opened for the
development of richer applications in various areas such as video surveillance, content creation,
personal communications, robotics and natural human–machine interaction.
One fundamental feature essential for machines to see, understand and react to the
environment is their capability to detect and track objects of interest. The process of estimating
over time the location of one or more objects using a camera is referred to as object tracking. The
rapid improvement both in quality and resolution of imaging sensors, and the dramatic increase
in computational power in the past decade have favored the creation of new algorithms and
applications using object tracking.
Figure 4.1 Examples of targets for object tracking: (left) people, (right) faces.
The definition of the object of interest depends on the specific application at hand. For example,
in a building surveillance application, targets may be people (Figure 4.1 (left)), whereas in an
interactive gaming application, targets may be the hands or the face of a person (Figure 4.1
(right)).
This chapter covers the fundamental steps for the design of a tracker and provides the
mathematical formulation for the object tracking problem.
4.2 THE DESIGN OF A VIDEO TRACKER
Video cameras capture information about objects of interest in the form of sets of image pixels.
By modeling the relationship between the appearance of the target and its corresponding pixel
values, a video tracker estimates the location of the object over time. The relationship between an
object and its image projection is very complex and may depend on more factors than just the
position of the object itself, thus making object tracking a difficult task. In this section, we first
discuss the main challenges in object tracking and then we review the main components into
which a video-tracking algorithm can be decomposed.
4.2.1 Challenges
The main challenges that have to be taken into account when designing and operating a tracker
are related to the similarity of appearance between the target and other objects in the scene, and
to appearance variations of the target itself.
Figure 4.2 Examples of clutter in object tracking. Objects in the background (red boxes) may
share similar color (left) or shape (right) properties with the target and therefore distract the
tracker from the desired object of interest (green boxes). Left: image from the Birchfield head
tracking dataset. Right: Surveillance scenario from PETS-2001 dataset.
The appearance of other objects and of the background may be similar to the appearance of the
target and therefore may interfere with its observation. In such a case, image features extracted
from non-target image areas may be difficult to discriminate from the features that we expect the
target to generate. This phenomenon is known as clutter. Figure 4.2 shows an example of color
ambiguity that can distract a tracker from the real target. This challenge can be dealt with by
using multiple features weighted by their reliability.
In addition to the tracking challenge due to clutter, object tracking is made difficult by changes
of the target appearance in the image plane that are due to one or more of the following factors:
Changes in pose. A moving target varies its appearance when projected onto the image
plane, for example when rotating (Figure 4.3(a)–(b)).
Ambient illumination. The direction, intensity and color of the ambient light influence the
appearance of the target. Moreover, changes in global illumination are often a challenge
in outdoor scenes. For example, ambient light changes when clouds obscure the sun.
Also, the angles between the light direction and the normal to the object surface vary with
the object pose, thus affecting how we see the object through the camera lens.
Noise. The image acquisition process introduces into the image signal a certain degree of
noise, which depends on the quality of the sensor. Observations of the target may be
corrupted and therefore affect the performance of the tracker.
Figure 4.3 Examples of target appearance changes that make object tracking difficult. (a)–(b) A
target (the head) changes its pose and therefore its appearance as seen by the camera. Bottom
row: Two examples of target occlusions. (c) The view of the target is occluded by static objects
in the scene. (d) The view of the target is occluded by another moving object in the scene;
reproduced with permission of HOSDB.
Occlusions. A target may fail to be observed when partially or totally occluded by other
objects in the scene. Occlusions are usually due to:
i. a target moving behind a static object, such as a column, a wall, or a desk (Figure 4.3(c)), or
ii. other moving objects obscuring the view of a target (Figure 4.3(d)).
To address this challenge, different approaches can be applied that depend on the expected level
of occlusion:
i. Partial occlusions that affect only a small portion of the target area can be dealt with by the
target appearance model or by the target detection algorithm itself. The invariance properties
of some global feature representation methods (e.g. the histogram) are appropriate to deal
with occlusions. Also, the replacement of a global representation with multiple localized
features that encode information for a small region of the target may increase the robustness
of a video tracker.
ii. Information on the target appearance is not sufficient to cope with total occlusions. In this
challenging scenario, track continuity can be achieved via higher-level reasoning or through
multi-hypothesis methods that keep propagating the tracking hypotheses over time.
Information about typical motion behaviours and pre-existing occlusion patterns can also be
used to propagate the target trajectory in the absence of valid measurements. When the target
reappears from the occlusion, the propagation of multiple tracking hypotheses and
appearance modeling can provide the necessary cues to reinitialize a track.
A summary of the main challenges in object tracking is presented in Figure 4.4.
Figure 4.4 The main challenges in object tracking are due to temporal variations of the target
appearance and to appearance similarity with other objects in the scene.
4.2.2 Main components for object tracking
In order to address the challenges discussed in the previous section, we identify five main logical
components of a video tracker (Figure 4.5):
1. The definition of a method to extract relevant information from an image area occupied by a
target. This method can be based on motion classification, change detection, object
classification or simply on extracting low-level features such as color or gradient, or
mid-level features such as edges or interest points.
2. The definition of a representation for encoding the appearance and the shape of a target (the
state). This representation defines the characteristics of the target to be used by the tracker.
In general, the representation is a trade-off between accuracy of the description
(descriptiveness) and invariance: it should be descriptive enough to cope with clutter and to
discriminate false targets, while allowing a certain degree of flexibility to cope with changes
of target scale, pose, illumination and partial occlusions.
3. The definition of a method to propagate the state of the target over time. This step
recursively uses information from the feature extraction step or from the already available
state estimates to form the trajectory. This task links different instances of the same object
over time and has to compensate for occlusions, clutter, and local and global illumination
changes.
Figure 4.5 The video-tracking pipeline. The flowchart shows the main logical components of a
tracking algorithm.
4. The definition of a strategy to manage targets appearing and disappearing from the imaged
scene. This step, also referred to as track management, initializes the track for an incoming
object of interest and terminates the trajectory associated with a disappeared target. When a
new target appears in the scene (target birth), the tracker must initialize a new trajectory. A
target birth usually happens:
at the image boundaries (at the edge of the field of view of the camera),
at specific entry areas (e.g. doors),
in the far-field of the camera (when the size of the projection onto the image plane increases
and the target becomes visible), or
when a target spawns from another target (e.g. a driver parking a car and then stepping out).
Similarly, a trajectory must be terminated (target death) when the target:
leaves the field of view of the camera, or
disappears at a distance or inside another object (e.g. a building).
In addition to the above, it is desirable to terminate a trajectory when the tracking performance is
expected to degrade under a predefined level, thus generating a track loss condition.
5. The extraction of meta-data from the state in a compact and unambiguous form to be
used by the specific application, such as video annotation, scene understanding and
behaviour recognition.
In the next sections we will discuss in detail the first four components and specific solutions used
in popular video trackers.
4.3 PROBLEM FORMULATION
This section introduces a formal general definition of the video-tracking problem that will be
used throughout this work. We first formulate the single-target tracking problem and then extend
the definition to multiple simultaneous target tracking.
4.3.1 Single-target tracking
Let $I = \{I_k : k \in \mathbb{N}\}$ represent the frames of a video sequence, with $I_k \in E_I$ being the frame
(image plane) at time $k$, defined in $E_I$, the space of all possible images.
Tracking a single target using monocular video can be formulated as the estimation of a
time series
$$x = \{x_k : k \in \mathbb{N}\} \qquad (4.1)$$
over the set of discrete time instants indexed by $k$, based on the information in $I$. The vectors
$x_k \in E_s$ are the states of the target and $E_s$ is the state space. The time series $x$ is also known as
the trajectory of the target in $E_s$. The information encoded in the state $x_k$ depends on the
application. $I_k$ may be mapped onto a feature (or observation) space $E_o$ that highlights
information relevant to the tracking problem. The observation generated by a target is encoded in
$z_k \in E_o$. In general, $E_o$ has a lower dimensionality than that of the original image space, $E_I$
(Figure 4.6).
The operations that are necessary to transform the image space $E_I$ to the observation
space $E_o$ are referred to as feature extraction.
Video trackers propagate the information in the state $x_k$ over time using the extracted features. A
localization strategy defines how to use the image features to produce an estimate of the target
state $x_k$.
We can group the information contained in $x_k$ into three classes:
1. Information on the target location and shape. The positional and shape information
depends on the type of object we want to track and on the amount (and quality) of the
information we can extract from the images.
Figure 4.6 The flow of information between vector spaces in object tracking. The information
extracted from the images is used to recursively estimate the state of the target (Key: $E_I$: the
space of all possible images; $E_o$: feature or observation space; $E_s$: state space; $k$: time index).
2. Information on the target appearance. Encoding appearance information in the state helps
in modeling appearance variations over time.
3. Information on the temporal variation of shape or appearance. The parameters of this
third class are usually first or higher order derivatives of the other parameters, and are
optional.
Note that some elements of the state $x_k$ may not be part of the final output required by the
specific application. This extra information is used as it may be beneficial to the performance of
the tracker itself. For example, tracking appearance variations through a set of state parameters
may help in coping with out-of-plane rotations. Nevertheless, as adding parameters to the state
increases the complexity of the estimator, it is usually advisable to keep the dimensionality of $x_k$
as low as possible. Figure 4.7 shows examples of states describing location and an approximation
of the shape of a target.
Figure 4.7 Example of state definitions for different video-tracking tasks.
When the goal is tracking an object on the image plane, the minimal form of $x_k$ will represent
the position of a point in $I_k$, described by its vertical and horizontal coordinates, that is
$$x_k = (u_k, v_k) \qquad (4.2)$$
Similarly, one can bound the target area with a rectangle or ellipse, defining the state $x_k$ as
$$x_k = (u_k, v_k, h_k, w_k, \theta_k) \qquad (4.3)$$
where $y_k = (u_k, v_k)$ defines the centre, $h_k$ the height, $w_k$ the width and (optionally) $\theta_k$ the
clockwise rotation. More complex representations such as chains of points on a contour can be
used.
4.4 APPLICATIONS OF OBJECT TRACKING
4.4.1 INTRODUCTION
Tracking objects of interest in video is at the foundation of many applications, ranging from
video production to remote surveillance, and from robotics to interactive immersive games.
Video trackers are used to improve our understanding of large video datasets from medical
and security applications; to increase productivity by reducing the amount of manual labor that
is necessary to complete a task and to enable natural interaction with machines.
In this chapter we offer an overview of current and upcoming applications that use object
tracking. Although the boundaries between these applications are somewhat blurred, they can be
grouped in six main areas:
Media production and augmented reality.
Medical applications and biological research.
Surveillance and business intelligence.
Robotics and unmanned vehicles.
Tele-collaboration and interactive gaming.
Art installations and performances.
Specific examples of these applications will be covered in the following sections.
4.4.2
MEDIA PRODUCTION AND AUGMENTED REALITY
Object tracking is an important element in post-production and motion capture for the movie and
broadcast industries.
Match moving is the augmentation of original shots with additional computer graphics elements
and special effects, which are rendered into the movie. In order to consistently add these new
elements to subsequent frames, the rendering procedure requires knowledge of 3D information
on the scene. This information can be estimated by a camera tracker, which computes over time
the camera position, orientation and focal length. The 3D estimate is derived from the analysis of
a large set of 2D trajectories of salient image features that the object tracking algorithm identifies
in the frames [1, 2]. An example of tracking patches and points is shown in Figure 4.8, where
low-level 2D trajectories are used to estimate higher-level 3D information. Figure 4.9 shows two
match-moving examples where smoke special effects and additional objects (a boat and a
building) are added to the real original scenes. A related application is virtual product placement,
which includes a specific product to be advertised in a video or wraps a logo or a specific texture
around an existing real object captured in the scene.
Figure 4.8 Example of a camera tracker that uses the information obtained by tracking image
patches. Reproduced with permission of the Oxford Metrics Group.
Figure 4.9 Match-moving examples for special effects and object placement in a dynamic scene.
Top: smoke and other objects are added to a naval scene. Bottom: the rendering of a new
building is added to an aerial view. Reproduced with permission of the Oxford Metrics Group.
Another technology based on object tracking and used by media production houses is motion
capture. Motion capture systems are used to animate virtual characters from the tracked motion
of real actors. Although markerless motion capture is receiving increasing attention, most
motion-capture systems track a set of markers attached to an actor's body and limbs to estimate
their poses (Figure 4.10). Specialized motion-capture systems recover the movements of real
actors in 3D from the tracked markers. Then the motion of the markers is mapped onto characters
generated by computer graphics.
Figure 4.10 Examples of motion capture using a marker-based system. Left: retro-reflective
markers to be tracked by the system. Right: visualization of the motion of a subject. Reproduced
with permission of the Oxford Metrics Group.
Object tracking is also used for the analysis and the enhancement of sport events. As shown in
the example of Figure 4.11, a tracking algorithm can estimate the position of players in the
field in order to gather statistics about a game (e.g. a football match). Statistics and enhanced
visualizations aid the commentators, coaches and supporters in highlighting team tactics and
player performance.
Figure 4.11 Object tracking applied to media production and enhanced visualization of sport
events. Animations of the real scene can be generated from different view-points based on
tracking data. Moreover, statistics regarding player positions are automatically gathered and may
be presented as overlay or animation. Reproduced with permission of Mediapro.
4.4.3 MEDICAL APPLICATIONS AND BIOLOGICAL RESEARCH
The motion-capture tools described in the previous section are also used for the analysis of
human motion to improve the performance of athletes (Figure 4.12(a)–(b)) and for the analysis
of the gait of a patient [3] to assess the condition of the joints and bones (Figure 4.12(c)). In
general, object tracking has been increasingly used by medical systems to aid diagnosis and to
speed up the operator's task. For example, automated algorithms track the ventricular motion in
ultrasound images [4–6]. Moreover, object tracking can estimate the position of particular soft
tissues [7] or of instruments such as needles [8, 9] and bronchoscopes [10] during surgery.
In biological research, tracking the motion of non-human organisms allows one to
analyze and to understand the effects of specific drugs or the effects of ageing [11–15].
Figure 4.13 shows two application examples where video tracking is used in biological research.
Figure 4.12 Example of object tracking for medical and sport analysis applications. Motion
capture is used to analyze the performance of a golfer (a) and of a rower (b), and to analyze the
gait of a patient (c). Reproduced with permission of the Oxford Metrics Group.
Figure 4.13 Examples of object tracking for medical research. Automated tracking of the position
of Escherichia coli bacteria (left) and of Caenorhabditis elegans worms (right). Left: reproduced
from [14]; right: courtesy of Gavriil Tsechpenakis, IUPUI.
4.4.4 SURVEILLANCE AND BUSINESS INTELLIGENCE
Video tracking is a desirable tool used in automated video surveillance for security,
assisted living and business intelligence applications. In surveillance systems, tracking can be
used either as a forensic tool or as a processing stage prior to algorithms that
classify
behaviours [16]. Moreover, video-tracking software combined with other video analytical tools
can be used to redirect the attention of human operators towards events of interest. Smart
surveillance systems can be deployed in a variety of different indoor and outdoor environments
such as roads, airports, ports, railway stations, public and private buildings (e.g. Schools, banks
and casinos). Examples of video surveillance systems (Figure 4.14) are the IBM Smart
Surveillance System.
Figure 4.14 Examples of object tracking in surveillance applications. (a)– (b): General Electric
intelligent video platform; (c): Object Video surveillance platform. The images are reproduced
with permission of General Electric Company (a, b) and Object Video (c).
Figure 4.15 Examples of object tracking for intelligent retail applications. Screenshots from
IntelliVid software (American Dynamics).
Object tracking may also serve as an observation and measurement tool in retail environments
(e.g. retail intelligence), such as supermarkets, where the position of customers is tracked over
time [22] (Figure 4.15). Trajectory data combined with information from the point of sales (till)
is used to build behavioral models describing where customers spend their time in the shop, how
they interact with products depending on their location, and what items they buy. By analyzing
this information, the marketing team can improve the product placement in the retail space.
Moreover, gaze tracking in front of billboards can be used to automatically select the type of
advertisement to show or to dynamically change its content based on the attention or the
estimated marketing profile of a person, based for example on the estimated gender and age.
4.4.5 ROBOTICS AND UNMANNED VEHICLES
Another application area that extensively uses video-tracking algorithms is robotics. Robotic
technology includes the development of humanoid robots, automated PTZ cameras and
unmanned aerial vehicles (UAVs). Intelligent vision via one or more cameras mounted on the
robots provides information that is used to interact with or navigate in the environment. Also,
environment exploration and mapping [23], as well as human–robot interaction via gesture
recognition, rely on object tracking [24]. The problem of estimating the global motion of robots
and unmanned vehicles is related to the camera-tracking problem discussed in Section 4.4.2.
While tracking algorithms for media production can be applied offline, video trackers for
robotics need to simultaneously localize in real time the position of the robot (i.e. of the camera)
and to generate a map of the environment. 3D localization information is generated by tracking
the position of prominent image features such as corners and edges [25, 26], as shown in
Figure 4.8.
Figure 4.16 Example of object tracking from an Unmanned Aerial Vehicle. Reproduced with
permission of the Oxford Metrics Group.
Information on the 3D position is also used to generate a 3D mesh approximating the structure of
surrounding objects and the environment. In particular, UAVs make extensive use of object
tracking to find the position of specific objects on the ground (Figure 4.16) as well as to enable
automated landing.
4.4.6 TELE-COLLABORATION AND INTERACTIVE GAMING
Standard webcams are already shipped with tracking software that localises and follows the face
of a user for on-desk video conferencing. Moreover, video-based gaze tracking is used to
simulate eye contact among attendees of a meeting to improve the effectiveness of interaction in
video-conferencing [27]. Object tracking technology for lecture rooms is available that uses a set
of PTZ cameras to follow the position of the lecturer [27–30]. The PTZ cameras exploit the
trajectory information in real-time to guide the pan, tilt and zoom parameters of the camera. To
improve tracking accuracy, information from an array of microphones may also be fused with the
information from the camera [31].
Video tracking is also changing the way we send control to machines. This natural
interaction modality is being used in interactive games. For example, the action of pressing a
button on the controller is replaced by a set of more intuitive gestures performed by the user in
front of the camera [32] (Figure 4.17). Likewise, in pervasive games, where the experience
extends to the physical world, vision-based tracking refines positional data from the Global
Positioning System (GPS) [33].
4.4.7 ART INSTALLATIONS AND PERFORMANCES
Object tracking
is increasingly being used in art
installations and performances where
interaction is enabled by the use of video cameras and often by projection systems. The
interactivity can be used to enhance the narrative of a piece or to create unexpected actions
or reactions of the environment.
Figure 4.17 Examples of gesture interfaces for interactive gaming.
For example, tracking technology enables interaction between museum goers and visual
installations (Figure 4.18, left). Also, someone in a group can be selectively detected and then
tracked over time while a light or an ‘animated’ shadow is projected next to the selected person
(Figure 4.18, right). Interactive art based on object tracking can also enable novel forms of
communication between distant people. For example the relative position of a tracked object and
a human body may drive a set of lighting effects [34].
Figure 4.18 Examples of object tracking applied to interactive art installations. A person interacts
with objects visualized on a large display (left, reproduced with permission of Alterface). An
animated shadow (right) appears next to a person being tracked.
CHAPTER 5
5. Project Methodology
This section presents the description of the proposed Modified Mean Shift Tracking algorithm.
Figure (5.1) shows the proposed methodology with the help of a block diagram representation.
The processing blocks are, in order: Start; Read AVI Video File in MATLAB; Estimate Weight
Images for Target Scale Change; Estimate Target Area for Tracking; Analyze Movement
Features in Mean Shift Tracking; Estimation of Width, Height and Orientation of Target;
Determining the Candidate Region for Next Frame; Stop.
Figure (5.1) Project Methodology.
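Since the first block of Figure (5.1) is reading an AVI file in MATLAB, a minimal sketch of that
step is given below; the file name is a placeholder and error handling is omitted.

    % Minimal sketch: read an AVI file frame by frame ('input.avi' is a
    % placeholder name).
    v = VideoReader('input.avi');
    frames = {};
    while hasFrame(v)
        frames{end+1} = readFrame(v);   % RGB frame, H-by-W-by-3 uint8
    end
    fprintf('Read %d frames of size %d x %d\n', numel(frames), v.Height, v.Width);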
The detailed description of the project work is as follows:
5.1 Mean Shift Tracking Algorithm
5.1.1 Target Representation
In object tracking, a target is usually defined as a rectangle or an ellipsoidal region in the image.
Currently, a widely used target representation is the color histogram because of its independence
of scaling and rotation and its robustness to partial occlusions [9, 21]. Denote by
$\{X_i^*\}_{i=1\cdots n}$ the normalized pixels in the target region, which is supposed to be centered at the
origin point and to have $n$ pixels. The probability of the feature $u$ $(u = 1, 2, \ldots, m)$ in the target
model is computed as [9]
$$\hat{q}_u = C \sum_{i=1}^{n} k\left(\left\| X_i^* \right\|^2\right) \delta\left[ b(X_i^*) - u \right] \qquad (5.1)$$
where $\hat{q}$ is the target model, $\hat{q}_u$ is the probability of the $u$th element of $\hat{q}$, $\delta$ is the Kronecker
delta function, $b(X_i^*)$ associates the pixel $X_i^*$ to the histogram bin, and $k(x)$ is an isotropic kernel
profile. Constant $C$ is a normalization function defined by
$$C = \left[ \sum_{i=1}^{n} k\left(\left\| X_i^* \right\|^2\right) \right]^{-1} \qquad (5.2)$$
Similarly, the probability of the feature u in the target candidate model from the candidate region
centered at position y is given by
$$\hat{p}_u(y) = C_h \sum_{i=1}^{n_h} k\left(\left\| \frac{y - X_i}{h} \right\|^2\right) \delta\left[ b(X_i) - u \right] \qquad (5.3)$$
$$C_h = \left[ \sum_{i=1}^{n_h} k\left(\left\| \frac{y - X_i}{h} \right\|^2\right) \right]^{-1} \qquad (5.4)$$
where $\hat{p}(y)$ is the target candidate model, $\hat{p}_u(y)$ is the probability of the $u$th element of $\hat{p}(y)$,
$\{X_i\}_{i=1\cdots n_h}$ are the pixels in the target candidate region centered at $y$, $h$ is the bandwidth, and
$C_h$ is the normalization function, which is independent of $y$ [9].
In order to calculate the likelihood of the target model and the candidate model, a metric based
on the Bhattacharyya coefficient [1] is defined by using the two normalized histograms $\hat{p}(y)$
and $\hat{q}$ as follows:
$$\rho[\hat{p}(y), \hat{q}] = \sum_{u=1}^{m} \sqrt{\hat{p}_u(y)\, \hat{q}_u} \qquad (5.5)$$
The distance between $\hat{p}(y)$ and $\hat{q}$ is then defined as
$$d[\hat{p}(y), \hat{q}] = \sqrt{1 - \rho[\hat{p}(y), \hat{q}]} \qquad (5.6)$$
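To make Eqs. (5.1)-(5.6) concrete, the sketch below computes a kernel-weighted gray-level
histogram with an Epanechnikov profile and the Bhattacharyya coefficient between two such
histograms. It is a simplified single-channel version under our own assumptions (the helper
names targetModel and bhattacharyya are ours); the project itself may use color histograms.

    % Sketch of Eqs. (5.1)-(5.2): kernel-weighted gray-level histogram.
    function q = targetModel(patch, m)
        % patch: grayscale region (double, values in [0,255]); m: bin count
        [H, W] = size(patch);
        [c, r] = meshgrid(1:W, 1:H);
        x1 = (c - (W+1)/2) / (W/2);           % normalized coordinates X_i*
        x2 = (r - (H+1)/2) / (H/2);
        k  = max(1 - (x1.^2 + x2.^2), 0);     % Epanechnikov profile k(||X_i*||^2)
        bins = min(floor(patch/(256/m)) + 1, m);  % b(X_i*): bin index per pixel
        q = accumarray(bins(:), k(:), [m 1]);
        q = q / sum(q);                       % normalization constant C
    end

    % Sketch of Eqs. (5.5)-(5.6): Bhattacharyya coefficient and distance.
    function [rho, d] = bhattacharyya(p, q)
        rho = sum(sqrt(p .* q));              % Eq. (5.5)
        d   = sqrt(1 - rho);                  % Eq. (5.6)
    end

The candidate model $\hat{p}(y)$ of Eqs. (5.3)-(5.4) is obtained by the same computation applied to the
candidate region centered at $y$.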
Minimizing the distance $d[\hat{p}(y), \hat{q}]$ in Eq. (5.6) is equivalent to maximizing the Bhattacharyya
coefficient $\rho[\hat{p}(y), \hat{q}]$ in Eq. (5.5). The optimization process is iterative and is initialized with
the target position in the previous frame, denoted by $y_0$. By using the Taylor expansion around
$\hat{p}_u(y_0)$, the linear approximation of the Bhattacharyya coefficient in Eq. (5.5) can be obtained as
$$\rho[\hat{p}(y), \hat{q}] \approx \frac{1}{2} \sum_{u=1}^{m} \sqrt{\hat{p}_u(y_0)\, \hat{q}_u} + \frac{C_h}{2} \sum_{i=1}^{n_h} w_i\, k\left(\left\| \frac{y - X_i}{h} \right\|^2\right) \qquad (5.7)$$
where
$$w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(y_0)}}\, \delta\left[ b(X_i) - u \right] \qquad (5.8)$$
Since the first term in Eq. (5.7) is independent of $y$, minimizing the distance in Eq. (5.6)
amounts to maximizing the second term in Eq. (5.7). In the mean shift iteration, the estimated
target moves from $y$ to a new position $y_1$, which is defined as
$$y_1 = \frac{\sum_{i=1}^{n_h} X_i\, w_i\, g\left(\left\| \frac{y_0 - X_i}{h} \right\|^2\right)}{\sum_{i=1}^{n_h} w_i\, g\left(\left\| \frac{y_0 - X_i}{h} \right\|^2\right)} \qquad (5.9)$$
When we choose the kernel $k(x)$ with the Epanechnikov profile, we have $g(x) = -k'(x) = 1$, and
Eq. (5.9) reduces to [9]
$$y_1 = \frac{\sum_{i=1}^{n_h} X_i\, w_i}{\sum_{i=1}^{n_h} w_i} \qquad (5.10)$$
By using Eq. (5.10), the mean shift tracking algorithm finds in the new frame the most similar
region to the object. From Eq. (5.10) it can be observed that the key parameters in the mean shift
tracking algorithm are the weights $w_i$. In this project we will focus on the analysis of $w_i$, with
which the scale and orientation of the tracked target can be well estimated, and then a scale and
orientation adaptive mean shift tracking algorithm can be developed.
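Under the Epanechnikov assumption of Eq. (5.10), one mean shift iteration amounts to
computing the weight image of Eq. (5.8) and taking its weighted centroid. A minimal sketch
follows (the function name is ours, and the returned position is in the local coordinates of the
candidate patch):

    % Sketch of Eqs. (5.8) and (5.10) for one grayscale candidate patch.
    function y1 = meanShiftStep(patch, q, p, m)
        % q: target histogram, p: candidate histogram (both m-by-1)
        [H, W] = size(patch);
        bins  = min(floor(patch/(256/m)) + 1, m);
        ratio = sqrt(q ./ max(p, eps));      % sqrt(q_u / p_u(y0)) per bin
        w = ratio(bins);                     % weight image w_i, Eq. (5.8)
        [c, r] = meshgrid(1:W, 1:H);
        y1 = [sum(c(:).*w(:)), sum(r(:).*w(:))] / sum(w(:));   % Eq. (5.10)
    end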
5.2 Modified Mean Shift Tracking for Scale and Orientation of Target
In this section, we first analyze how to calculate adaptively the scale and orientation of the target
in sub-sections 5.2.1 ~ 5.2.5, then in sub-section 5.2.6, a modified mean shift tracking (MMST)
algorithm for scale and orientation of target is presented.
The enlarging or shrinking of the target is usually a gradual process in consecutive
frames. Thus we can assume that the scale change of the target is smooth and this assumption
holds reasonably well in most video sequences. If the scale of the target changes abruptly in
adjacent frames, no general tracking algorithm can track it effectively. With this assumption, we
can make a small modification of the original mean shift tracking algorithm. Suppose that we
have estimated the area of the target (the area estimation will be discussed in sub-section 5.2.2)
in the previous frame, in the current frame we let the window size or the area of the target
candidate region be a little bigger than the estimated area of the target. Therefore, no matter how
the scale and orientation of the target change, it should be still in this bigger target candidate
region in the current frame. Now the problem turns to how to estimate the real area and
orientation from the target candidate region.
5.2.1 The Weight Images for Target Scale Changing
In the CAMSHIFT and the mean shift tracking algorithms, the estimation of the target location is
actually obtained by using a weight image [10, 24]. In CAMSHIFT, the weight image is
determined using a hue-based object histogram where the weight of a pixel is the probability of
its hue in the object model. In the mean shift tracking algorithm, by contrast, the weight image is
defined by Eq. (5.8) where the weight of a pixel is the square root of the ratio of its color
probability in the target model to its color probability in the target candidate model.
Moreover, it is not accurate to use the weight image by CAMSHIFT to estimate the location of
the target, and the mean shift tracking algorithm can have better estimation results. That is to say,
the weight image in the mean shift tracking algorithm is more reliable than that in the
CAMSHIFT algorithm.
Fig. 5.2: Weight images in CAMSHIFT [6] and mean shift tracking [9] algorithms when the object
scale changes. (a) A synthesized target with three gray levels. (b) A target candidate window that
is bigger than the target. (c), (f) and (i) are the target candidate regions enclosed by the target
candidate window (dashed box) when the scale of the target decreases, keeps invariant and
increases, respectively. (d), (g) and (j) are respectively the weight images of the target candidate
regions in (c), (f) and (i) calculated by CAMSHIFT. (e), (h) and (k) are respectively the weight
images of the target candidate regions in (c), (f) and (i) calculated by mean shift tracking.
As in the CAMSHIFT algorithm, in the MMST scheme to be developed, the scale and
orientation of the target will be estimated by using the moment features [4-6] of the weight
image. Since those moment features depend only on the weight image, a properly calculated
weight image could lead to accurate moment features and consequently good estimates of the
target changes. Therefore, let's analyze the weight images in the CAMSHIFT and mean shift
tracking methods for the development of the MMST algorithm.
As mentioned at the beginning of Section 5.2, we will track the target in a larger candidate
region than its size to ensure that the target will be within this candidate region when the tracking
process ends. With this strategy, let’s compare the weight images in CAMSHIFT and mean shift
tracking under different scale changes by using the following experiments. Figure 5.2-(a) shows
a synthesized target that has three gray levels. Figure 5.2-(b) shows the candidate region that is a
little bigger than the target. Figures 5.2-(c), (f) and (i) are the tracking results when the scale of
the synthesized target decreases, keeps invariant and increases, respectively. Figures 5.2-(d), (g)
and (j) illustrate the weight images calculated by the CAMSHIFT algorithm in the three cases,
while Figures 5.2-(e), (h) and (k) illustrate the weight images calculated by the mean shift
tracking algorithm in the three cases.
From Figure 5.2, we can see clearly the difference of the weight images between CAMSHIFT
and mean shift tracking. First, the weight image in the CAMSHIFT algorithm is constant and it
only depends on the target model, while the weight image in the mean shift tracking algorithm
changes dynamically with the scale changes of the target. Second, the weight image is closely
related to the target scale change in mean shift tracking. The closer the real scale of the target is
to the candidate region, the closer the weight image approaches 1. That is to say, the weight
image in mean shift tracking can be a good indicator of the scale change of the target. However,
the weight image in CAMSHIFT does not reflect this. Based on the above observation and
analysis, we could consider the weight image in the mean shift tracking algorithm as a density
distribution function of the target, where the weight value of a pixel reflects the possibility that it
belongs to the target. In the following sections, we can see that the scale and orientation of the
target can be well estimated by using this density distribution function together with the moment
features of the weight image.
5.2.2 Estimating the Target Area
Since the weight value of a pixel in the target candidate region represents the probability that it
belongs to the target, the sum of the weights of all pixels, i.e., the zeroth-order moment, can be
considered as the weighted area of the target in the target candidate region:
$$M_{00} = \sum_{i=1}^{n_h} w_i \qquad (5.11)$$
In mean shift tracking, the target is usually in the big target candidate region. Due to the
existence of the background features in the target candidate region, the probability of the target
features is less than that in the target model. So Eq. (5.8) will enlarge the weights of target pixels
and suppress the weight of background pixels. Thus, the pixels from the target will contribute
more to target area estimation, while the pixels from the background will contribute less. This
can be clearly seen in Figures 5.2-(e), 5.2-(h) and 5.2-(k).
On the other hand, the Bhattacharyya coefficient (referring to Eq. (5.5)) is an indicator of the
similarity between the target model $\hat{q}$ and the target candidate model $\hat{p}(y)$. A smaller
Bhattacharyya coefficient means that there are more features from the background and fewer
features from the target in the target candidate region, and vice versa. If we take $M_{00}$ as the
estimation of the target area, then according to Eq. (5.11), when the weights from the target
become bigger, the estimation error by taking $M_{00}$ as the area of the target will be bigger, and
vice versa. Therefore, the Bhattacharyya coefficient is a good indicator of how reliable it is to
take $M_{00}$ as the target area. Table 1 lists the real area of the target in Figure 5.2 and the
estimation error by taking $M_{00}$ as the target area. We can see that with the increase of the
Bhattacharyya coefficient, the estimation accuracy by taking $M_{00}$ as the target area will also
increase (i.e., the estimation error will decrease). Based on the above analysis, we see that the
Bhattacharyya coefficient can be used to adjust $M_{00}$ in estimating the target area, denoted by
$A$. We propose the following equation to estimate it:
$$A = c(\rho)\, M_{00} \qquad (5.12)$$
where $c(\rho)$ is a monotonically increasing function with respect to the Bhattacharyya coefficient
$\rho$ $(0 \le \rho \le 1)$. As can be seen in Figures 5.2-(e), 5.2-(h) and 5.2-(k) and Table 1, $M_{00}$ is always
greater than the real target area, and it monotonically approaches the real target area as $\rho$
increases. Thus we require that $c(\rho)$ be monotonically increasing and reach its maximum 1
when $\rho$ is 1. Such a correction function $c(\rho)$ makes it possible to shrink $M_{00}$ back to the real
target scale. There are alternative candidate functions for $c(\rho)$, such as the linear function
$c(\rho) = \rho$, a Gaussian function, etc. Here we choose the exponential function as $c(\rho)$ based on
our experimental experience:
$$c(\rho) = \exp\left(\frac{\rho - 1}{\sigma}\right) \qquad (5.13)$$
From Eqs. (5.12) and (5.13) we can see that when $\rho$ approaches the upper bound 1, i.e., when
the target candidate model approaches the target model, $c(\rho)$ approaches 1 and in this case it is
more reliable to use $M_{00}$ as the estimation of the target area. When $\rho$ decreases, i.e. the
candidate model is not identical to the target model, $M_{00}$ will be much bigger than the target
area, but $c(\rho)$ is less than 1, so that $A$ avoids being biased too much from the real target area.
When $\rho$ approaches 0, i.e., the tracked target gets lost, $c(\rho)$ will be very small, so that $A$ is
close to zero.
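A minimal sketch of the area estimate of Eqs. (5.11)-(5.13), assuming the weight image w and
the Bhattacharyya coefficient rho have already been computed:

    % Sketch of Eqs. (5.11)-(5.13): Bhattacharyya-corrected target area.
    M00   = sum(w(:));                   % zeroth-order moment, Eq. (5.11)
    sigma = 1.5;                         % within the range 1-2 suggested below
    A     = exp((rho - 1)/sigma) * M00;  % A = c(rho)*M00, Eqs. (5.12)-(5.13)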
Table 1. The area estimation (pixels) of the target under different scale changes, listing for each
tracking result the real area of the target, the background area, the Bhattacharyya coefficient,
$M_{00}$, and the estimated area $A$ under different $\sigma$ ($\sigma$ = 1.5, 1, 0.8, 0.5) together with the relative
estimation error (%) in comparison with $M_{00}$.
Table 1 lists the area estimation results of the target by using Eq. (5.12) under different scale
changes in Figures 5.2-(e), 5.2-(h) and 5.2-(k). Though an optimal value of σ should be adaptive
to the video content, by our experimental experiences it was found that when the target model is
appropriately defined (containing not too many background features), setting σ between 1 and 2
can achieve very robust tracking results for most of the testing video sequences.
5.2.3 The Moment Features in Mean Shift Tracking
In this sub-section, we analyze the moment features in mean shift tracking and then combine
them with the estimated target area to further estimate the width, height and orientation of the
target in the next sub-section. Like in CAMSHIFT, we can easily calculate the moments of the
weight image as follows:
$$M_{10} = \sum_{i=1}^{n_h} x_{i,1}\, w_i, \qquad M_{01} = \sum_{i=1}^{n_h} x_{i,2}\, w_i \qquad (5.14)$$
$$M_{20} = \sum_{i=1}^{n_h} x_{i,1}^2\, w_i, \qquad M_{02} = \sum_{i=1}^{n_h} x_{i,2}^2\, w_i, \qquad M_{11} = \sum_{i=1}^{n_h} x_{i,1} x_{i,2}\, w_i \qquad (5.15)$$
where the pair $(x_{i,1}, x_{i,2})$ is the coordinate of pixel $i$ in the candidate region. Comparing
Eq. (5.10) with Eqs. (5.11) and (5.14), we can find that $y_1$ is actually the ratio of the first order
moment to the zeroth order moment:
$$\bar{x}_1 = \frac{M_{10}}{M_{00}}, \qquad \bar{x}_2 = \frac{M_{01}}{M_{00}}, \qquad y_1 = (\bar{x}_1, \bar{x}_2) \qquad (5.16)$$
where $(\bar{x}_1, \bar{x}_2)$ represents the centroid of the target candidate region. The second order center
moment can describe the shape and orientation of an object. By using Eqs. (5.10), (5.11), (5.15)
and (5.16), we can obtain the second order center moments as follows:
$$\mu_{20} = \frac{M_{20}}{M_{00}} - \bar{x}_1^2, \qquad \mu_{02} = \frac{M_{02}}{M_{00}} - \bar{x}_2^2, \qquad \mu_{11} = \frac{M_{11}}{M_{00}} - \bar{x}_1 \bar{x}_2 \qquad (5.17)$$
Eq. (5.17) can be rewritten as the following covariance matrix in order to estimate the width,
height and orientation of the target:
$$\mathrm{Cov} = \begin{bmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{bmatrix} \qquad (5.18)$$
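The moment computations of Eqs. (5.14)-(5.18) translate directly into MATLAB; a sketch over
the weight image w (the coordinate convention is our own):

    % Sketch of Eqs. (5.14)-(5.18): moments of the weight image and Cov.
    [H, W]   = size(w);
    [x1, x2] = meshgrid(1:W, 1:H);            % pixel coordinates (x_i1, x_i2)
    M00 = sum(w(:));
    M10 = sum(x1(:).*w(:));    M01 = sum(x2(:).*w(:));        % Eq. (5.14)
    M20 = sum(x1(:).^2.*w(:)); M02 = sum(x2(:).^2.*w(:));
    M11 = sum(x1(:).*x2(:).*w(:));                            % Eq. (5.15)
    xb1 = M10/M00;  xb2 = M01/M00;            % centroid, Eq. (5.16)
    mu20 = M20/M00 - xb1^2;   mu02 = M02/M00 - xb2^2;
    mu11 = M11/M00 - xb1*xb2;                                 % Eq. (5.17)
    Cov  = [mu20, mu11; mu11, mu02];                          % Eq. (5.18)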
5.2.4 Estimating the Width, Height and Orientation of the Target
By using the estimated area (sub-section 5.2.2) and the moment features (sub-section 5.2.3), the
width, height and orientation of the target can be well estimated. The covariance matrix in Eq.
(5.18) can be decomposed by using the singular value decomposition (SVD) [22] as follows
\[ \mathrm{Cov} = U \times S \times U^{T} \tag{5.19} \]

where $U = \begin{bmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{bmatrix}$ and $S = \begin{bmatrix} \lambda_1^{2} & 0 \\ 0 & \lambda_2^{2} \end{bmatrix}$. Here $\lambda_1^{2}$ and $\lambda_2^{2}$ are the eigenvalues of Cov. The vectors $(u_{11}, u_{21})^{T}$ and $(u_{12}, u_{22})^{T}$ represent, respectively, the orientations of the two main axes of the real target in the target candidate region.
Because the weight image is a reliable density distribution function, the orientation estimation of the target provided by matrix U is more reliable than that of CAMSHIFT. Moreover, in the CAMSHIFT algorithm, λ1 and λ2 are directly used as the width and height of the target, which is actually improper. Next, we present a new scheme to estimate the width and height of the target more accurately.
Suppose that the target is represented by an ellipse, for which the lengths of the semi-major axis and semi-minor axis are denoted by a and b, respectively. Instead of using λ1 and λ2 directly as the width a and height b, it has been shown that the ratio of λ1 to λ2 can well approximate the ratio of a to b, i.e., λ1/λ2 ≈ a/b. Thus we can set a = kλ1 and b = kλ2, where k is a scale factor. Since we have estimated the target area A, there is $\pi a b = \pi (k\lambda_1)(k\lambda_2) = A$, which gives $k = \sqrt{A/(\pi \lambda_1 \lambda_2)}$. Then it can be easily derived that

\[ a = \sqrt{\frac{\lambda_1 A}{\pi \lambda_2}} \tag{5.20} \]

\[ b = \sqrt{\frac{\lambda_2 A}{\pi \lambda_1}} \tag{5.21} \]
Now the covariance matrix becomes

\[ \mathrm{Cov} = U \times \begin{bmatrix} a^{2} & 0 \\ 0 & b^{2} \end{bmatrix} \times U^{T} \tag{5.22} \]
The adjustment of the covariance matrix Cov in Eq. (5.22) is a key step of the proposed algorithm. It should be noted that the EM-like algorithm by Zivkovic and Krose [11] iteratively estimates the covariance matrix for each frame based on the mean shift tracking algorithm. Unlike the EM-like algorithm, our algorithm combines the area of the target, i.e., A, with the covariance matrix to estimate the width, height and orientation of the target. In Section 6.1, we list the estimated width, height and orientation of the synthetic ellipse sequence in Figure 6.1, together with the relative estimation errors, obtained by the developed MMST algorithm. It can be seen that the estimation accuracy is very satisfactory.
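Eq. (5.19) calls for an SVD; for the symmetric 2×2 covariance matrix this coincides with an eigendecomposition, which is what the minimal sketch below uses. The function name estimate_ellipse is illustrative, and the covariance matrix and area are assumed to come from the earlier sketches:

import numpy as np

def estimate_ellipse(cov, area):
    """Estimate semi-axes (a, b) and orientation from Cov and area A (Eqs. 5.19-5.22)."""
    eigvals, U = np.linalg.eigh(cov)           # ascending eigenvalues; Cov = U S U^T
    lam2, lam1 = np.sqrt(np.abs(eigvals))      # lambda_2 <= lambda_1
    a = np.sqrt(lam1 * area / (np.pi * lam2))  # semi-major axis, Eq. (5.20)
    b = np.sqrt(lam2 * area / (np.pi * lam1))  # semi-minor axis, Eq. (5.21)
    major = U[:, 1]                            # eigenvector of the larger eigenvalue
    theta = np.arctan2(major[1], major[0])     # orientation of the major axis
    cov_adjusted = U @ np.diag([b**2, a**2]) @ U.T  # Eq. (5.22); column order matches eigvals
    return a, b, theta, cov_adjusted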
5.2.5 Determining the Candidate Region in Next Frame
Once the location, scale and orientation of the target are estimated in the current frame, we need
to determine the location of the target candidate region in the next frame. Based on Eq. (5.22), we define the following covariance matrix to represent the size of the target candidate region in the next frame:

\[ \mathrm{Cov}_2 = U \times \begin{bmatrix} (a + \Delta d)^{2} & 0 \\ 0 & (b + \Delta d)^{2} \end{bmatrix} \times U^{T} \tag{5.23} \]

where Δd is the increment of the target candidate region in the next frame. The position of the initial target candidate region is then defined by the following ellipse region:

\[ \left\{ \mathbf{x} : (\mathbf{x} - \mathbf{y}_1)^{T}\, \mathrm{Cov}_2^{-1}\, (\mathbf{x} - \mathbf{y}_1) \le 1 \right\} \tag{5.24} \]
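Continuing the sketch, Eqs. (5.23) and (5.24) might be implemented as follows; next_candidate_cov and in_candidate_region are illustrative names, and U is assumed to be the eigenvector matrix with columns ordered by ascending eigenvalue, as returned by np.linalg.eigh in the previous sketch:

import numpy as np

def next_candidate_cov(U, a, b, delta_d=10.0):
    """Enlarge the target ellipse by delta_d to get Cov2 for the next frame (Eq. 5.23)."""
    return U @ np.diag([(b + delta_d)**2, (a + delta_d)**2]) @ U.T

def in_candidate_region(x, center, cov2):
    """Test whether pixel x lies inside the candidate ellipse (Eq. 5.24)."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return d @ np.linalg.solve(cov2, d) <= 1.0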
5.2.6 Implementation of the MMST Algorithm
Based on the above analyses in sub-sections 5.2.1 ~ 5.2.5, the scale and orientation of the target
can be estimated and then a scale and orientation adaptive mean shift tracking algorithm, i.e. the
MMST algorithm, can be developed. The implementation of the whole algorithm is summarized
as follows.
Algorithm of Modified Mean Shift Tracking
1) Initialization: calculate the target model $\hat{q}$ and initialize the position y0 of the target candidate model in the previous frame.
2) Initialize the iteration number k ← 0.
3) Calculate the target candidate model $\hat{p}(y_0)$ in the current frame.
4) Calculate the weight vector $\{w_i\}_{i=1 \cdots n}$ using Eq. (5.8).
5) Calculate the new position y1 of the target candidate model using Eq. (5.10).
6) Let d ← ||y1 − y0||, y0 ← y1. Set the error threshold ε (default 0.1) and the maximum iteration number N (default 15).
   If (d < ε or k ≥ N), stop and go to step 7;
   otherwise, k ← k + 1 and go to step 3.
7) Estimate the width, height and orientation of the target candidate model using Eq. (5.22).
8) Estimate the initial target candidate region for the next frame using Eq. (5.24).
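As a rough end-to-end illustration, steps 2)-8) above can be condensed into the following Python sketch. It is a simplified, hypothetical implementation: it uses an axis-aligned rectangular candidate window and a uniform kernel instead of the ellipse region of Eq. (5.24), ignores image borders, and reuses the compute_covariance, estimate_area and estimate_ellipse helpers sketched in the previous sub-sections:

import numpy as np

BINS = 16  # RGB quantization per channel, as used in the experiments of Chapter 6

def color_model(patch, n_bins=BINS):
    """Quantized RGB color histogram model (uniform kernel for simplicity)."""
    idx = (patch.reshape(-1, 3).astype(int) * n_bins) // 256
    flat = (idx[:, 0] * n_bins + idx[:, 1]) * n_bins + idx[:, 2]
    hist = np.bincount(flat, minlength=n_bins**3).astype(float)
    return hist / hist.sum(), flat

def mmst_frame(frame, q_model, y0, half_w, half_h, eps=0.1, max_iter=15, sigma=1.5):
    """One frame of the simplified MMST loop (steps 2-8); y0 is the (x, y) center."""
    y0 = np.asarray(y0, dtype=float)
    for k in range(max_iter):
        cx0, cy0 = int(y0[0]), int(y0[1])
        patch = frame[cy0 - half_h:cy0 + half_h, cx0 - half_w:cx0 + half_w]
        p_model, flat = color_model(patch)
        # pixel weights w_i = sqrt(q / p) over the candidate window (Eq. 5.8)
        w = np.sqrt(q_model[flat] / np.maximum(p_model[flat], 1e-12))
        w = w.reshape(patch.shape[:2])
        (mx, my), cov = compute_covariance(w)          # centroid + Cov, Eqs. (5.14)-(5.18)
        y1 = np.array([cx0 - half_w + mx, cy0 - half_h + my])  # new position, Eq. (5.10)
        d, y0 = np.linalg.norm(y1 - y0), y1
        if d < eps:
            break
    rho = np.sum(np.sqrt(p_model * q_model))           # Bhattacharyya coefficient
    A = estimate_area(w, rho, sigma)                   # corrected area, Eq. (5.12)
    a, b, theta, _ = estimate_ellipse(cov, A)          # Eqs. (5.19)-(5.22)
    return y0, a, b, theta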
CHAPTER 6
6. Experimental Results and Discussions
This section evaluates the developed MMST algorithm in comparison with the original mean shift algorithm (i.e., mean shift tracking with a fixed scale), the adaptive scale algorithm [9] and the EM-shift algorithm [11, 25]. The adaptive scale algorithm and the EM-shift algorithm are two representative schemes to address the scale and orientation changes of targets under the mean shift framework. Because the weight image estimated by CAMSHIFT is not reliable, it is prone to errors in estimating the scale and orientation of the object, so CAMSHIFT is not used in the experiments.
We selected the RGB color space as the feature space and quantized it into 16×16×16 bins for a fair comparison between the different algorithms. It should be noted that other color spaces, such as HSV, can also be used in MMST. One synthetic video sequence and four real video sequences are used in the experiments.
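For concreteness, the quantization and model comparison used throughout the experiments can be exercised as follows (a minimal sketch reusing the illustrative color_model helper from the sketch at the end of Section 5.2.6; the patches are random stand-ins, not real frames):

import numpy as np

# hypothetical patches; in practice they are cropped from the tracked frames
target_patch = np.random.randint(0, 256, (40, 60, 3), dtype=np.uint8)
candidate_patch = np.random.randint(0, 256, (40, 60, 3), dtype=np.uint8)

q, _ = color_model(target_patch)       # 16x16x16-bin target model
p, _ = color_model(candidate_patch)    # 16x16x16-bin candidate model
rho = np.sum(np.sqrt(p * q))           # Bhattacharyya coefficient in [0, 1]
print("Bhattacharyya coefficient:", round(rho, 3))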
6.1 Experiments on a Synthetic Sequence
We first use a synthetic ellipse sequence to verify the efficiency of the proposed MMST algorithm. As shown in Figure 6.1-(d), the window size of the initial target (blue ellipse) is 59×89. We select Δd = 10 in the developed MMST algorithm, so that the window size of the initial target candidate region (red ellipse in Figure 6.1-(b)) is 79×109 in frame 1. For the other frames in the MMST results, the external ellipses represent the target candidate regions, which are used to estimate the real targets, i.e., the inner ellipses. The experimental results show that the developed MMST algorithm reliably tracks the ellipse through scale and orientation changes. Meanwhile, the results of the fixed-scale mean shift are not good because of the significant scale and orientation changes of the object. The adaptive scale algorithm does not estimate the orientation change of the target and yields poor tracking results. The EM-shift algorithm fails to correctly estimate the scale and orientation of the synthetic ellipse, although the target in this sequence is very simple.
Fig. 6.1: Tracking results of the synthetic ellipse sequence by different tracking algorithms; panel (d) shows the developed MMST algorithm. The red ellipses represent the target candidate region while the blue ellipses represent the estimated target region. Frames 1, 20, 30, 40, 50 and 70 are displayed.
Table 2 lists the estimated width, height and orientation of the ellipse in this sequence by the MMST scheme. The orientation is calculated as the angle between the major axis and the x-axis. The first frame of the sequence was used to define the target model and the remaining frames were used for testing. It can be seen that the developed MMST method achieves good estimation accuracy for the scale and orientation of the target.
Table 2. The estimation results and accuracy of the width, height and orientation of the ellipse by the developed MMST method (semi-major length a in pixels; orientation in degrees).

Frame No. | Real Length | Estimated Length | Error (%) | Estimated Angle | Error (%)
20        | 45          | 46.13            | 2.51      | 95.26           | 0.27
30        | 39          | 41.25            | 5.77      | 145.03          | 0.02
40        | 26          | 27.03            | 3.97      | 14.68           | 2.13
50        | 24          | 24.72            | 3.00      | 63.38           | 2.49
60        | 36          | 37.93            | 5.36      | 114.70          | 0.26
70        | 44          | 45.12            | 2.55      | 165.01          | 0.01
Average error over 71 frames: 3.50 (length), 1.47 (orientation)
6.2 Experiments on Real Video Sequences
The developed MMST algorithm is then tested on four real video sequences. The first video is a torch sequence recorded at home (Figure 6.3), where the object has clear scale and orientation changes; to show the efficiency of the developed MMST algorithm, Figure 6.3 displays frames 20, 40 and 80. The second video is a palm sequence (Figure 6.4) of 26 frames, where the object also has clear scale and orientation changes; the target scale and orientation estimated by the MMST algorithm are accurate.
Fig. 6.3: Tracking results of the torch sequence by the MMST algorithm. Frames 20, 40 and 80 are displayed.
The third video is a car sequence where the scale of the object (a white car) increases gradually
as shown in Figure 6.5. The experimental results show that the developed MMST algorithm
estimates the scale changes more accurately than the adaptive scale and EM-shift algorithms.
Fig. 6.4: Tracking results of the palm sequence by the MMST algorithm. Frames 5, 15 and 26 are displayed.
Fig. 6.5: Tracking results of the car sequence by different tracking algorithms; panel (c) shows the developed MMST algorithm. Frames 15, 40, 60 and 75 are displayed.
The last experiment is on a card reader sequence, which is challenging because the object is small and undergoes scale and orientation changes. The object exhibits large scale changes with partial occlusion. The MMST scheme works much better than the competing schemes in estimating the scale and orientation of the target.
Fig. 6.6: Tracking results of the card reader sequence by the MMST algorithm. Frames 10, 40, 50, 60 and 82 are displayed.
Table 3. The average number of iterations by different methods on the four sequences.

Sequence          | Fixed-scale mean shift | Adaptive scale | EM-shift | MMST
Synthetic ellipse | 2.34                   | 13.62          | 6.27     | 2.59
Car sequence      | 3.82                   | 11.25          | 6.34     | 3.34
Table 3 lists the average numbers of iterations by different schemes on the video sequences. The average number of iterations of the developed MMST is approximately equal to that of the original mean shift algorithm with fixed scale. The iteration number of the adaptive scale algorithm is the highest because it runs the mean shift algorithm three times. The main factor affecting the convergence speed of the EM-shift and MMST algorithms is the computation of the covariance matrix: EM-shift estimates it in each iteration, while MMST estimates it only once per frame. So MMST is faster than EM-shift.
In general, the developed MMST algorithm, which is motivated by the CAMSHIFT algorithm [6], extends the mean shift algorithm to targets with large scale and orientation variations. It inherits the simplicity and effectiveness of the original mean shift algorithm while adapting to the scale and orientation changes of the target.
CHAPTER 7
7. Conclusions
By analyzing the moment features of the weight image of the target candidate region and the Bhattacharyya coefficient, we developed a scale and orientation adaptive mean shift tracking (MMST) algorithm. It effectively solves the problem of robustly estimating the scale and orientation changes of the target under the mean shift tracking framework.
The weight of a pixel in the candidate region represents its probability of belonging to the target, while the zeroth order moment of the weight image represents the weighted area of the
candidate region. By using the zeroth order moment and the Bhattacharyya coefficient between
the target model and the candidate model, a simple and effective method to estimate the target
area was proposed. Then a new approach, which is based on the area of the target and the
corrected second order center moments, was proposed to adaptively estimate the width, height
and orientation changes of the target.
The developed MMST method inherits the merits of mean shift tracking, such as
simplicity, efficiency and robustness. Extensive experiments were performed and the results
showed that MMST can reliably track the objects with scale and orientation changes, which is
difficult to achieve by other state-of-the-art schemes. In future research, we will focus on how to detect and use the true shape of the target, instead of an ellipse or a rectangle model, for more robust tracking.
CHAPTER 8
8. References
1) Kailath T.: ‘The Divergence and Bhattacharyya Distance Measures in Signal Selection’,
IEEE Trans. Communication Technology, 1967, 15, (1), pp. 52-60.
2) Fukunaga K., Hostetler L. D.: 'The Estimation of the Gradient of a Density Function, with
Applications in Pattern Recognition’, IEEE Trans. on Information Theory, 1975, 21, (1),
pp. 32-40.
3) Cheng Y.: 'Mean Shift, Mode Seeking, and Clustering', IEEE Trans. Pattern Anal.
Machine Intell., 1995, 17, (8), pp. 790-799.
4) Mukundan R., Ramakrishnan K. R.: ‘Moment Functions in Image Analysis: Theory and
Applications’, World Scientific, Singapore, 1996.
5) Wren C., Azarbayejani A., Darrell T., Pentland A.: ‘Pfinder: Real-Time Tracking of the
Human Body’, IEEE Trans. Pattern Anal. Machine Intell, 1997, 19, (7), pp. 780-785.
6) Bradski G.: ‘Computer Vision Face Tracking for Use in a Perceptual User Interface’, Intel
Technology Journal, 1998, 2(Q2), pp. 1-15.
7) Comaniciu D., Ramesh V., Meer P.: ‘Real-Time Tracking of Non-Rigid Objects Using
Mean Shift’. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hilton Head, SC,
June, 2000, vol. 2, pp. 142-149.
8) Comaniciu D., Meer P.: ‘Mean Shift: a Robust Approach toward Feature Space Analysis’, IEEE
Trans Pattern Anal. Machine Intell., 2002, 24, (5), pp. 603-619.
9) Comaniciu D., Ramesh V., Meer P.: 'Kernel-Based Object Tracking', IEEE Trans. Pattern Anal.
Machine Intell., 2003, 25, (5), pp. 564-577.
10) Collins R.: ‘Mean-Shift Blob Tracking through Scale Space’, Proc. IEEE Conf. Computer Vision
and Pattern Recognition, Wisconsin, USA, 2003, pp. 234-240.
11) Zivkovic Z., Krose B.: ‘An EM-like Algorithm for Color-Histogram-Based Object Tracking’,
Proc. IEEE Conf. Computer Vision and Pattern Recognition, Washington, DC, USA, 2004, vol.1,
pp. 798-803.
12) Yang C., Duraiswami R., Davis L.: 'Efficient Mean-Shift Tracking via a New Similarity Measure',
Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Diego, CA, 2005, vol. 1,
pp.176-183.
13) Fashing M., Tomasi C.: ‘Mean Shift is a Bound Optimization’, IEEE Trans. Pattern Anal.
Machine Intell., 2005, 27, (3), pp. 471-474.
14) Yilmaz A., Javed O., Shah M.: ‘Object Tracking: a Survey’, ACM Computing Surveys, 2006, 38,
(4), Article 13.
15) Carreira-Perpinan M. A. ‘Gaussian Mean-Shift is an EM Algorithm’, IEEE Trans. Pattern Anal.
Machine Intell., 2007, 29, (5), pp. 767-776.
16) Birchfield S., Rangarajan S.: 'Spatiograms versus Histograms for Region-Based Tracking', Proc.
IEEE Conf. on Computer Vision and Pattern Recognition, 2005, vol. 2, pp. 1158-1163.
17) Hu J., Juan C., Wang J.: ‘A spatial-color mean-shift object tracking algorithm with scale and
orientation estimation’, Pattern Recognition Letters, 2008, 29, (16), pp. 2165-2173.
18) Srikrishnan V., Nagaraj T., Chaudhuri S.: 'Fragment Based Tracking for Scale and Orientation
Adaption', Proc. Indian Conf. on Computer Vision, Graphics & Image Processing, 2008, pp.
328-335.
19) Lindeberg T.: 'Feature Detection with Automatic Scale Selection', International Journal of
Computer Vision, 1998, 30, (2), pp. 79-116.
20) Bretzner L., Lindeberg T.: ‘Qualitative Multi-Scale Feature Hierarchies for Object
Tracking’, Journal of Visual Communication and Image Representation, 2000, 11, (2),
pp.115-129.
21) Nummiaro K., Koller-Meier E., Gool L. V.: ‘An Adaptive Color-Based Particle Filter’,
Image and Vision Computing, 2003, 21, (1), pp. 99-110.
22) Horn R. A., Johnson C. R.: 'Topics in Matrix Analysis', Cambridge University Press, U.K.,
1991.
23) Quast K., Kaup A.: ‘Scale and Shape adaptive Mean Shift Object Tracking in Video
Sequences’, Proc. European Signal Processing Conference, Glasgow, Scotland, 2009, pp.
1513-1517.