What Motivated Data Mining? Why Is It Important?
Data mining has attracted a great deal of attention in the information
industry and in
society as a whole in recent years, due to the wide availability of huge
amounts of data
and the imminent need for turning such data into useful information and
knowledge.
The information and knowledge gained can be used for applications ranging
from market analysis, fraud detection, and customer retention, to production
control and science exploration.
Data can now be stored in many different kinds of databases and information
repositories. One data repository architecture that has emerged is the data
warehouse (Section 1.3.2), a repository of multiple heterogeneous data sources
organized under a unified schema at a single site in order to facilitate
management decision making. Data warehouse technology includes data cleaning,
data integration, and on-line analytical processing (OLAP), that is, analysis
techniques with functionalities such as
summarization, consolidation, and aggregation as well as the ability to view
information from different angles. Although OLAP tools support
multidimensional analysis and decision making, additional data analysis tools
are required for in-depth analysis, such as data classification, clustering, and
the characterization of data changes over time. In
addition, huge volumes of data can be accumulated beyond databases and
data warehouses. Typical examples include the World Wide Web and data
streams, where data flow in and out like streams, as in applications like video
surveillance, telecommunication, and sensor networks. The effective and
efficient analysis of data in such different forms becomes a challenging task.
The fast-growing, tremendous amount of data, collected and stored in large
and numerous data repositories, has far exceeded our human ability for
comprehension without powerful tools. In addition, consider expert system
technologies, which typically rely on users or domain experts to manually
input knowledge into knowledge bases. Unfortunately, this procedure is
prone to biases and errors, and is extremely time-consuming and costly. Data
mining tools perform data analysis and may uncover important data
patterns, contributing greatly to business strategies, knowledge bases, and
scientific and medical research. The widening gap between data and
information calls for a systematic development of data mining tools.
What Is Data Mining?
Simply stated, data mining refers to extracting or “mining” knowledge from
large amounts of data. Many people treat data mining as a synonym for another
popularly
used term, Knowledge Discovery from Data, or KDD. Alternatively, others
view data mining as simply an essential step in the process of knowledge
discovery. Knowledge discovery as a process is depicted in Figure 1.4 and
consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved
from the database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for
instance)
5. Data mining (an essential process where intelligent methods are applied in
order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge, based on some interestingness measures; Section 1.5)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)
Data mining is the process of discovering interesting knowledge from large
amounts of data stored in databases, data warehouses, or other information
repositories.
Based on this view, the architecture of a typical data mining system may have
the following major components (Figure 1.5):
Database, data warehouse, World Wide Web, or other information repository:
This is one or a set of databases, data warehouses, spreadsheets, or other
kinds of information repositories. Data cleaning and data integration
techniques may be performed on the data.
Database or data warehouse server: The database or data warehouse
server is responsible
for fetching the relevant data, based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search
or evaluate the interestingness of resulting patterns. Such knowledge can
include concept hierarchies, used to organize attributes or attribute values
into different levels of abstraction. Knowledge such as user beliefs, which
can be used to assess a pattern's interestingness based on its unexpectedness,
may also be included. Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata (e.g., describing data
from multiple heterogeneous sources).
Data mining engine: This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as characterization,
association analysis, classification, prediction, cluster analysis, outlier
analysis, and evolution analysis.
Pattern evaluation module: This component typically employs interestingness
measures (Section 1.5) and interacts with the data mining modules so as to
focus the search toward interesting patterns. It may use interestingness
thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data
mining method used. For efficient data mining, it is highly recommended to
push the evaluation of pattern interestingness as deep as possible into the
mining process so as to confine the search to only the interesting patterns.
User interface: This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying a data
mining task, providing information to help focus the search, and performing
exploratory data mining based on the intermediate data mining results. In
addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the
patterns in different forms.
From a data warehouse perspective, data mining can be viewed as an advanced
stage of on-line analytical processing (OLAP).
Data Mining Functionalities
Data mining functionalities are used to specify the kind of patterns to be
found in data mining tasks. In general, data mining tasks can be classified
into two categories: descriptive and predictive. Descriptive mining tasks
characterize the general properties of the data in the database. Predictive
mining tasks perform inference on the current data in order to make
predictions.
1. Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. For example, in the
AllElectronics store, classes of items for sale include computers and printers,
and concepts of customers include bigSpenders and budgetSpenders. These
descriptions can be derived via
(1) data characterization, by summarizing the data of the class under study
(often called the target class) in general terms. There are several methods
for effective data summarization and characterization. These include simple
data summaries based on statistical measures and plots, the data cube-based
OLAP roll-up operation (used to perform user-controlled data summarization
along a specified dimension), and an attribute-oriented induction technique
(used to perform data generalization and characterization without
step-by-step user interaction).
The output of data characterization can be presented in various forms.
Examples include pie charts, bar charts, curves, multidimensional data
cubes, and multidimensional tables, including crosstabs. The resulting
descriptions can also be presented as generalized relations or in rule form
(called characteristic rules).
(2) data discrimination, by comparison of the target class with one or a set of
comparative classes (often called the contrasting classes), or
(3) both data characterization and discrimination.
2. Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently
in data. There are many kinds of frequent patterns, including itemsets,
subsequences, and substructures.
A frequent itemset typically refers to a set of items that frequently appear
together in a transactional data set, such as milk and bread. A frequently
occurring subsequence, such as the pattern that customers tend to purchase
first a PC, followed by a digital camera, and then a memory card, is a
(frequent) sequential pattern. A substructure can refer to different
structural forms, such as graphs, trees, or lattices, which may be combined
with itemsets or subsequences. If a substructure occurs frequently, it is
called a (frequent) structured pattern. Mining frequent patterns leads to the
discovery of interesting associations and correlations within data.
An example of such a rule, mined from the AllElectronics transactional
database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of
50% means that if a customer buys a computer, there is a 50% chance that she
will buy software as well. A 1% support means that 1% of all of the
transactions under analysis showed that computer and software were purchased
together.
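The support and confidence of a rule of this kind can be computed directly
from transaction counts. The following is a minimal sketch, not a full
association rule miner; the function name support_and_confidence and the four
toy transactions are assumptions made for illustration and do not come from
the AllElectronics data.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = fraction of all transactions containing both item sets
    confidence = fraction of transactions containing the antecedent that
                 also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions
                 if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: four transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]
support, confidence = support_and_confidence(
    transactions, {"computer"}, {"software"})
print(support, confidence)
```

Here one of four transactions contains both items, and one of the two
transactions containing a computer also contains software.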
3. Classification and Prediction
Classification is the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose class label is unknown.
There are many methods for constructing classification models, such as
Bayesian classification, support vector machines, and k-nearest neighbor
classification.
Whereas classification predicts categorical (discrete, unordered) labels,
prediction models continuous-valued functions. That is, it is used to predict
missing or unavailable numerical data values rather than class labels.
4. Cluster Analysis
“What is cluster analysis?” Unlike classification and prediction, which
analyze class-labeled data objects, clustering analyzes data objects without
consulting a known class label.
In general, the class labels are not present in the training data simply
because they are not known to begin with. Clustering can be used to generate
such labels. The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass similarity.
That is, clusters of objects are formed so that objects within a cluster have
high similarity in comparison to one another, but are very dissimilar to
objects in other clusters. Each cluster that is formed can be viewed as a
class of objects, from which rules can be derived.
5. Outlier Analysis
A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are outliers. Most data
mining methods discard outliers as noise or exceptions. However, in some
applications such as fraud detection, the rare events can be more interesting
than the more regularly occurring ones. The analysis of outlier data is
referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that
are a substantial distance from any other cluster are considered outliers.
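As one illustration of the statistical approach, the sketch below flags values
whose z-score exceeds a cutoff. It assumes the data are roughly normally
distributed; the function name z_score_outliers, the default threshold of 2.0,
and the sample values are all assumptions chosen for the example, not values
prescribed by the text.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean (a simple test assuming near-normal data)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be flagged.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one value far from the rest.
print(z_score_outliers([10, 11, 9, 10, 12, 10, 95]))
```

A single extreme value also inflates the mean and the standard deviation, so
on small samples a lower threshold or a more robust statistic may be needed.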
6. Evolution Analysis
Data evolution analysis describes and models regularities or trends for
objects whose behavior changes over time. Although this may include
characterization,
association and correlation analysis, classification, prediction, or clustering of
data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
Major Issues in Data Mining
1. Mining different kinds of knowledge in databases: Because different users
can be interested in different kinds of knowledge, data mining should cover a
wide spectrum of data analysis and knowledge discovery tasks.
2. Interactive mining of knowledge at multiple levels of abstraction: Because
it is difficult to know exactly what can be discovered within a database, the
data mining process should be interactive.
3. Incorporation of background knowledge: Background knowledge, or
information regarding the domain under study, may be used to guide the
discovery process and allow discovered patterns to be expressed in concise
terms and at different levels of abstraction.
4. Data mining query languages and ad hoc data mining: Relational query
languages (such as SQL) allow users to pose ad hoc queries for data retrieval.
In a similar vein, high-level data mining query languages need to be developed
to allow users to describe ad hoc data mining tasks.
5. Presentation and visualization of data mining results: Discovered
knowledge should be expressed in high-level languages, visual representations,
or other expressive forms so that the knowledge can be easily understood and
directly usable by humans.
6. Handling noisy or incomplete data: The data stored in a database may
reflect noise, exceptional cases, or incomplete data objects. When mining data
regularities, these objects may confuse the process, causing the knowledge
model constructed to overfit the data. As a result, the accuracy of the
discovered patterns can be poor. Data cleaning methods and data analysis
methods that can handle noise are required, as well as outlier mining methods
for the discovery and analysis of exceptional cases.
7. Pattern evaluation (the interestingness problem): A data mining system can
uncover thousands of patterns. Many of the patterns discovered may be
uninteresting to the given user, either because they represent common
knowledge or lack novelty. Several challenges remain regarding the development
of techniques to assess the interestingness of discovered patterns,
particularly with regard to subjective measures that estimate the value of
patterns with respect to a given user class, based on user beliefs or
expectations.
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning
(or data cleansing) routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
Methods used:
Imagine that you need to analyze AllElectronics sales and customer data. You
note that many tuples have no recorded value for several attributes, such as
customer income. How can you go about filling in the missing values for this
attribute? Let’s look at the following methods:
1. Ignore the tuple: This is usually done when the class label is missing.
This method is not very effective, unless the tuple contains several
attributes with missing values.
2. Fill in the missing value manually: In general, this approach is
time-consuming and may not be feasible given a large data set with many
missing values.
3. Use a global constant to fill in the missing value: Replace all missing
attribute values by the same constant, such as a label like “Unknown”.
4. Use the attribute mean to fill in the missing value: For example, suppose
that the average income of AllElectronics customers is $56,000. Use this value
to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk,
replace the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism,
or decision tree induction. For example, using the other customer attributes
in your data set, you may construct a decision tree to predict the missing
values for income.
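Methods 4 and 5 above can be sketched in a few lines. This is an illustrative
sketch only: tuples are represented as dictionaries with None marking a
missing value, and the names fill_with_mean, fill_with_class_mean, and the
sample customers are assumptions made for the example. The class-mean version
also assumes every class has at least one known value.

```python
from statistics import mean


def fill_with_mean(rows, attr):
    """Method 4: replace missing values of `attr` with the overall mean."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: overall}) if r[attr] is None else dict(r)
            for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Method 5: replace missing values with the mean of the same class."""
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    class_means = {label: mean(vals) for label, vals in known.items()}
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left unchanged
        if r[attr] is None:
            r[attr] = class_means[r[class_attr]]
        filled.append(r)
    return filled


# Hypothetical customer tuples; None marks a missing income.
customers = [
    {"income": 40000, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 70000, "risk": "low"},
    {"income": 90000, "risk": "low"},
    {"income": None, "risk": "low"},
]
print(fill_with_class_mean(customers, "income", "risk"))
```

Using the class mean keeps the filled-in value closer to what is typical for
comparable tuples, at the cost of requiring a class label for every tuple.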
“What is noise?” Noise is a random error or variance in a measured variable.
Given a numerical attribute such as, say, price, how can we “smooth” out the
data to remove the noise? Let’s look at the following data smoothing
techniques:
1. Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing.
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin. Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are identified
as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
2. Regression: Data can be smoothed by fitting the data to a function, such
as with regression. Linear regression involves finding the “best” line to fit
two attributes (or variables), so that one attribute can be used to predict
the other. Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit to a
multidimensional surface.
3. Clustering: Outliers may be detected by clustering, where similar values
are organized into groups, or “clusters.” Intuitively, values that fall
outside of the set of clusters may be considered outliers.
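The binning technique listed above can be sketched as follows. This is a
minimal illustration using equal-frequency bins; the function names and the
nine price values are assumptions for the example, and ties in boundary
smoothing are resolved toward the lower boundary, which is a choice made here
rather than something the text specifies.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size`."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Hypothetical price values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Larger bins smooth more aggressively, since each value is replaced by a
statistic computed over a wider neighborhood.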
Data integration combines data from multiple sources into a coherent data
store, as in
data warehousing. These sources may include multiple databases, data
cubes, or flat files.
There are a number of issues to consider during data integration. Schema
integration and object matching can be tricky. How can equivalent real-world
entities from multiple data sources be matched up? This is referred to as the
entity identification problem.
For example, how can the data analyst or the computer be sure that
customer_id in one database and cust_number in another refer to the same
attribute? Examples of metadata for each attribute include the name, meaning,
data type, and range of values permitted for the attribute, and null rules for
handling blank, zero, or null values. Such metadata can be used to help avoid
errors in schema integration. The metadata may also be used to help transform
the data (e.g., where data codes for pay_type in one database may be “H” and
“S”, and 1 and 2 in another). Hence, this step also relates to data cleaning,
as described earlier.
Redundancy is another important issue. An attribute (such as annual revenue,
for instance) may be redundant if it can be “derived” from another attribute
or set of attributes.
Some redundancies can be detected by correlation analysis. Given two
attributes, such analysis can measure how strongly one attribute implies the
other, based on the available data. For numerical attributes, we can evaluate
the correlation between two attributes, A and B, by computing the correlation
coefficient (also known as Pearson’s product moment coefficient, named after
its inventor, Karl Pearson). This is

\[
r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B}
        = \frac{\sum_{i=1}^{N} (a_i b_i) - N \bar{A} \bar{B}}{N \sigma_A \sigma_B}
\]

where $N$ is the number of tuples, $a_i$ and $b_i$ are the respective values
of A and B in tuple $i$, $\bar{A}$ and $\bar{B}$ are the respective mean
values of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard
deviations of A and B (as defined in Section 2.2.2), and $\sum (a_i b_i)$ is
the sum of the AB cross-product (that is, for each tuple, the value for A is
multiplied by the value for B in that tuple). Note that
$-1 \le r_{A,B} \le +1$. If $r_{A,B}$ is greater than 0, then A and B are
positively correlated, meaning that the values of A increase as the values of
B increase. The higher the value, the stronger the correlation (i.e., the more
each attribute implies the other). Hence, a higher value may indicate that A
(or B) may be removed as a redundancy. If the resulting value is equal to 0,
then A and B are uncorrelated, that is, there is no linear correlation between
them. If the resulting value is less than 0, then A and B are negatively
correlated, where the values of one attribute increase as the values of the
other attribute decrease. This means that each attribute discourages the
other.
Scatter plots can also be used to view correlations between attributes.
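The correlation coefficient can be computed from two equal-length lists of
attribute values. The sketch below uses the cross-product form with population
standard deviations (dividing by N); the function name pearson_r and the
sample values are assumptions for illustration, and the function does not
handle an attribute with zero variance.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric sequences.

    Computes (sum(a_i * b_i) - N * mean_a * mean_b) / (N * sigma_a * sigma_b)
    using population standard deviations.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sigma_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))
    return (cross - n * mean_a * mean_b) / (n * sigma_a * sigma_b)


# Hypothetical attribute values: b is exactly twice a.
a = [1, 2, 3, 4]
b = [2, 4, 6, 8]
print(pearson_r(a, b))
```

A result close to +1 or -1 for two attributes suggests that one of them may be
a candidate for removal as redundant.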
In addition to detecting redundancies between attributes, duplication should
also be detected at the tuple level (e.g., where there are two or more
identical tuples for a given unique data entry case). The use of denormalized
tables (often done to improve performance by avoiding joins) is another source
of data redundancy. Inconsistencies often arise between various duplicates,
due to inaccurate data entry or updating some but not all of the occurrences
of the data.
A third important issue in data integration is the detection and resolution of
data value conflicts. For example, for the same real-world entity, attribute
values from different sources may differ. This may be due to differences in
representation, scaling, or encoding. For instance, a weight attribute may be
stored in metric units in one system and British imperial units in another.
When matching attributes from one database to another during integration,
special attention must be paid to the structure of the data. This is to ensure
that any attribute functional dependencies and referential constraints in the
source system match those in the target system. For example, in one system, a
discount may be applied to the order, whereas in another system it is applied
to each individual line item within the order.
The semantic heterogeneity and structure of data pose great challenges in
data integration. Careful integration of the data from multiple sources can
help reduce and avoid redundancies and inconsistencies in the resulting data
set.
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data.
For example, the daily sales data may be aggregated so as to compute monthly
and annual total amounts. This step is typically used in constructing a data
cube for analysis of the data at multiple granularities.
Generalization of the data, where low-level or “primitive” (raw) data are
replaced by higher-level concepts through the use of concept hierarchies. For
example, categorical attributes, like street, can be generalized to
higher-level concepts, like city or country. Similarly, values for numerical
attributes, like age, may be mapped to higher-level concepts, like youth,
middle-aged, and senior.
Normalization, where the attribute data are scaled so as to fall within a
small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
Attribute construction (or feature construction), where new attributes are
constructed and added from the given set of attributes to help the mining
process.
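Aggregation and generalization, as listed above, can be illustrated with a
short sketch. The date format ("YYYY-MM-DD" strings), the function names, and
the age cut points of 30 and 60 are all assumptions made for this example; the
text does not specify where youth, middle-aged, and senior begin and end.

```python
from collections import defaultdict


def monthly_totals(daily_sales):
    """Aggregate (date, amount) pairs into totals keyed by "YYYY-MM"."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount  # the first 7 characters are year-month
    return dict(totals)


def generalize_age(age):
    """Map a numeric age to a higher-level concept (cut points are assumed)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"


# Hypothetical daily sales records.
sales = [("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 25.0)]
print(monthly_totals(sales))
print([generalize_age(a) for a in (25, 45, 70)])
```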
Normalization is particularly useful for classification algorithms involving
neural networks, or distance measurements such as nearest-neighbor
classification and clustering. There are many methods for data normalization.
We study three: min-max normalization, z-score normalization, and
normalization by decimal scaling.
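Min-max normalization performs a linear transformation on the original data.
Suppose that $\mathit{min}_A$ and $\mathit{max}_A$ are the minimum and maximum
values of an attribute, A. Min-max normalization maps a value, $v$, of A to
$v'$ in the range $[\mathit{new\_min}_A, \mathit{new\_max}_A]$ by computing

\[
v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}
     (\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A
\]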
In z-score normalization (or zero-mean normalization), the values for an
attribute, A, are normalized based on the mean and standard deviation of A. A
value, $v$, of A is normalized to $v'$ by computing

\[
v' = \frac{v - \bar{A}}{\sigma_A}
\]

where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation,
respectively, of attribute A. This method of normalization is useful when the
actual minimum and maximum of attribute A are unknown, or when there are
outliers that dominate the min-max normalization.
Normalization by decimal scaling normalizes by moving the decimal point of
values of attribute A. The number of decimal points moved depends on the
maximum absolute value of A. A value, $v$, of A is normalized to $v'$ by
computing

\[
v' = \frac{v}{10^{j}}
\]

where $j$ is the smallest integer such that $\max(|v'|) < 1$.
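The three normalization methods can be sketched side by side. This is an
illustrative sketch, not a library implementation: the function names and
sample values are assumptions, z_score uses the population standard deviation,
decimal_scaling restricts j to non-negative integers, and none of the
functions guards against a constant attribute.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))
print(z_score([1, 2, 3]))
print(decimal_scaling([-986, 917]))
```

In the last call the maximum absolute value is 986, so j is 3 and every value
is divided by 1,000.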
In attribute construction, new attributes are constructed from the given
attributes and added in order to help improve the accuracy and understanding
of structure in high-dimensional data. For example, we may wish to add the
attribute area based on the attributes height and width. By combining
attributes, attribute construction can discover missing information about the
relationships between data attributes that can be useful for knowledge
discovery.
Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the
integrity of the original data. That is, mining on the reduced data set should
be more efficient yet produce the same (or almost the same) analytical
results.
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the
data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant or redundant
attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce
the data set size.
4. Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which
need store only the model parameters instead of the actual data) or
nonparametric methods such as clustering, sampling, and the use of
histograms.
5. Discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. Data
discretization is a form of numerosity reduction that is very useful for the
automatic generation of concept hierarchies.
Data Discretization and Concept Hierarchy Generation
Data discretization techniques can be used to reduce the number of values
for a given continuous attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace actual data
values. Replacing numerous values of a continuous attribute by a small
number of interval labels thereby reduces and simplifies the original
data. This leads to a concise, easy-to-use, knowledge-level representation of
mining results.
Discretization techniques can be categorized based on how the discretization
is performed, such as whether it uses class information or which direction it
proceeds (i.e., top-down vs. bottom-up). If the discretization process uses
class information, then we say it is supervised discretization. Otherwise, it is
unsupervised. If the process starts by first finding one or a few points (called
split points or cut points) to split the entire attribute range, and then repeats
this recursively on the resulting intervals, it is called top-down discretization
or splitting. This contrasts with bottom-up discretization or merging, which
starts by considering all of the continuous values as potential split-points,
removes some by merging neighborhood values to form intervals, and then
recursively applies this process to the resulting intervals. Discretization can
be performed recursively on an attribute to provide a hierarchical or
multiresolution partitioning of the attribute values, known as a concept
hierarchy.
A concept hierarchy for a given numerical attribute defines a discretization of
the attribute. Concept hierarchies can be used to reduce the data by
collecting and replacing low-level concepts (such as numerical values for the
attribute age) with higher-level concepts (such as youth, middle-aged, or
senior). Although detail is lost by such data generalization, the generalized
data may be more meaningful and easier to interpret. This contributes to a
consistent representation of data mining results among multiple mining
tasks, which is a common requirement. In addition, mining on a reduced data
set requires fewer input/output operations and is more efficient than mining
on a larger, ungeneralized data set. Because of these benefits, discretization
techniques and concept hierarchies are typically applied before data mining
as a preprocessing step, rather than during mining.
Discretization and Concept Hierarchy Generation for Numerical Data
It is difficult and laborious to specify concept hierarchies for numerical
attributes because of the wide diversity of possible data ranges and the
frequent updates of data values. Such manual specification can also be quite
arbitrary.
Concept hierarchies for numerical attributes can be constructed automatically
based on data discretization. We examine the following methods: binning,
histogram analysis, entropy-based discretization, χ²-merging, cluster
analysis, and discretization by intuitive partitioning. In general, each
method assumes that the values to be discretized are sorted in ascending
order.
Binning is a top-down splitting technique based on a specified number of
bins. Binning methods are also used as discretization methods for numerosity
reduction and concept hierarchy generation. These techniques can be applied
recursively to the resulting partitions in order to generate concept
hierarchies. Binning does not use class information and is therefore an
unsupervised discretization technique. It is sensitive to the user-specified
number of bins, as well as the presence of outliers.
Like binning, histogram analysis is an unsupervised discretization technique
because it does not use class information. Histograms partition the values for
an attribute into disjoint ranges called buckets. The histogram analysis
algorithm can be applied recursively to each partition in order to
automatically generate a multilevel concept hierarchy, with the procedure
terminating once a prespecified number of concept levels has been reached.
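The recursive use of histograms can be sketched with equal-width buckets. This
is one possible reading of the procedure, under stated assumptions: buckets
have equal width, empty buckets are dropped, and recursion stops after a fixed
number of levels or when a bucket holds a single distinct value. The function
names and sample values are hypothetical.

```python
def equal_width_buckets(values, n_buckets):
    """Partition values into n_buckets disjoint ranges of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets


def build_hierarchy(values, n_buckets, levels):
    """Recursively bucket the values, producing nested lists `levels` deep."""
    if levels == 0 or len(set(values)) <= 1:
        return sorted(values)
    return [build_hierarchy(b, n_buckets, levels - 1)
            for b in equal_width_buckets(values, n_buckets) if b]


# Hypothetical attribute values.
print(build_hierarchy([1, 2, 5, 9, 10], n_buckets=3, levels=1))
```

Each additional level splits every bucket again, so the nesting depth of the
result corresponds to the number of concept levels requested.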
Entropy-based discretization is a supervised, top-down splitting technique. It
explores class distribution information in its calculation and determination of
split-points (data values for partitioning an attribute range). To discretize a
numerical attribute, A, the method selects the value of A that has the
minimum entropy as a split-point, and recursively partitions the resulting
intervals to arrive at a hierarchical discretization. Such discretization forms a
concept hierarchy for A.
Let D consist of data tuples defined by a set of attributes and a class-label
attribute. The class-label attribute provides the class information per tuple.
The basic method for entropy-based discretization of an attribute A within
the set is as follows:
1. Each value of A can be considered as a potential interval boundary or
split-point to partition the range of A. That is, a split-point for A can partition
the tuples in D into two subsets satisfying the conditions A ≤ split_point and
A > split_point, respectively, thereby creating a binary discretization.
2. Entropy-based discretization, as mentioned above, uses information
regarding the class label of tuples. Suppose we want to classify the tuples in
D by partitioning on attribute A and some split-point. Ideally, we would like
this partitioning to result in an exact classification of the tuples. For example,
if we had two classes, we would hope that all of the tuples of, say, class C1
will fall into one partition, and all of the tuples of class C2 will fall into the
other partition. However, this is unlikely. For example, the first partition may
contain many tuples of C1, but also some of C2. How much more information
would we still need for a perfect classification, after this partitioning? This
amount is called the expected information requirement for classifying a tuple
in D based on partitioning by A. It is given by

\[
\mathit{Info}_A(D) = \frac{|D_1|}{|D|} \mathit{Entropy}(D_1)
                   + \frac{|D_2|}{|D|} \mathit{Entropy}(D_2)
\]

where $D_1$ and $D_2$ correspond to the tuples in D satisfying the conditions
A ≤ split_point and A > split_point, respectively; $|D|$ is the number of
tuples in D, and so on. The entropy function for a given set is calculated
based on the class distribution of the tuples in the set. For example, given
$m$ classes, $C_1, C_2, \ldots, C_m$, the entropy of $D_1$ is

\[
\mathit{Entropy}(D_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)
\]

where $p_i$ is the probability of class $C_i$ in $D_1$, determined by dividing
the number of tuples of class $C_i$ in $D_1$ by $|D_1|$, the total number of
tuples in $D_1$. Therefore, when selecting a split-point for attribute A, we
want to pick the attribute value that gives the minimum expected information
requirement (i.e., $\min(\mathit{Info}_A(D))$). This would result in the
minimum amount of expected information (still) required to perfectly classify
the tuples after partitioning by A ≤ split_point and A > split_point.
3. The process of determining a split-point is recursively applied to each
partition obtained, until some stopping criterion is met, such as when the
minimum information requirement on all candidate split-points is less than a
small threshold, ε, or when the number of intervals is greater than a
threshold, max_interval.
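Selecting a single split-point as described in steps 1 and 2 can be sketched
as follows. The sketch evaluates every distinct value of A except the largest
(which would leave the second partition empty) and returns the one with the
minimum expected information requirement. The function names and the sample
data are assumptions for illustration; the recursion and stopping criteria of
step 3 are not included.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, info) minimizing the expected information
    requirement, or None if there is only one distinct value."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = None
    for split in sorted(set(values))[:-1]:
        d1 = [lab for v, lab in pairs if v <= split]  # A <= split_point
        d2 = [lab for v, lab in pairs if v > split]   # A >  split_point
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if best is None or info < best[1]:
            best = (split, info)
    return best


# Hypothetical attribute values and class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
print(best_split(values, labels))
```

In this toy data the split at 3 separates the two classes exactly, so the
remaining information requirement is zero.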
Interval Merging by X2 Analysis
this employs a bottom-up approach by finding the best neighboring intervals
and then merging these to form larger intervals, recursively. The method is
supervised in that it uses class information. The basic notion is that for
accurate discretization, the relative class frequencies should be fairly
consistent within an interval.
ChiMerge proceeds as follows. Initially, each distinct value of a numerical
attribute A is considered to be one interval. X2 tests are performed for every
pair of adjacent intervals.
Adjacent intervals with the least X2 values are merged together, because low
X2 values for
a pair indicate similar class distributions. This merging process proceeds
a predefined stopping criterion is met.
The χ² statistic tests the hypothesis that two adjacent intervals for a given
attribute are independent of the class. Low χ² values for an interval pair
indicate that the intervals are independent of the class and can, therefore,
be merged.
The stopping criterion is typically determined by three conditions. First,
merging stops when the χ² values of all pairs of adjacent intervals exceed
some threshold. Second, the number of intervals cannot be over a prespecified
max-interval, such as 10 to 15. Finally, recall that the premise behind
ChiMerge is that the relative class frequencies should be fairly consistent
within an interval.
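A rough sketch of the bottom-up merging loop follows. The text does not give the formula for the statistic, so the sketch assumes the usual Pearson χ² computed on the 2-by-m table of class counts for two adjacent intervals. It implements only the first two stopping conditions (threshold and maximum number of intervals). The names `chi2_pair` and `chimerge` and their parameters are hypothetical.

```python
from collections import Counter

def chi2_pair(counts_a, counts_b, classes):
    """Pearson chi-square for the 2-by-m class-count table of two adjacent intervals."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row.values())
        for c in classes:
            col_total = counts_a[c] + counts_b[c]
            expected = row_total * col_total / total
            if expected > 0:  # skip classes absent from both intervals
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, threshold, max_intervals):
    """Merge adjacent intervals bottom-up; return the lower bounds of the final intervals."""
    classes = sorted(set(labels))
    per_value = {}
    for v, lab in zip(values, labels):
        per_value.setdefault(v, Counter())[lab] += 1
    # Initially, each distinct value is its own interval.
    intervals = [(v, per_value[v]) for v in sorted(per_value)]
    while len(intervals) > 1:
        scores = [chi2_pair(intervals[k][1], intervals[k + 1][1], classes)
                  for k in range(len(intervals) - 1)]
        k = min(range(len(scores)), key=scores.__getitem__)
        # Stop once every pair exceeds the threshold and the interval budget is met.
        if scores[k] > threshold and len(intervals) <= max_intervals:
            break
        merged = (intervals[k][0], intervals[k][1] + intervals[k + 1][1])
        intervals[k:k + 2] = [merged]
    return [lower for lower, _ in intervals]

bounds = chimerge([1, 2, 3, 10, 11, 12], list("aaabbb"), threshold=1.0, max_intervals=6)
```

In practice the threshold would be derived from a chosen significance level of the χ² distribution; here it is simply passed in as a number.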
The process of grouping a set of physical or abstract objects into classes of
similar objects is called clustering. A cluster is a collection of data objects
that are similar to one another within the same cluster and are dissimilar to
the objects in other clusters. Although classification is an effective means for
distinguishing groups or classes of objects, it requires the often costly
collection and labeling of a large set of training tuples or patterns, which the
classifier uses to model each group. It is often more desirable to proceed in
the reverse direction: First partition the set of data into groups based on data
similarity (e.g., using clustering), and then assign labels to the relatively
small number of groups. Additional advantages of such a clustering-based
process are that it is adaptable to changes and helps single out useful
features that distinguish different groups. By automated clustering, we can
identify dense and sparse regions in object space and, therefore, discover
overall distribution patterns and interesting correlations among data
attributes. Cluster analysis has been widely used in numerous applications,
including market research, pattern recognition, data analysis, and image
processing. In business, clustering can help marketers discover distinct
groups in their customer bases and characterize customer groups based on
purchasing patterns.
Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers may be more
interesting than common cases. Applications of outlier detection include the
detection of credit card fraud and the monitoring of criminal activities in
electronic commerce. For example, exceptional cases in credit card
transactions, such as very expensive and frequent purchases, may be of
interest as possible fraudulent activity. As a data mining function, cluster
analysis can be used as a stand-alone tool to gain insight into the distribution
of data, to observe the characteristics of each cluster, and to focus on a
particular set of clusters for further analysis. Alternatively, it may serve as a
preprocessing step for other algorithms, such as characterization, attribute
subset selection, and classification, which would then operate on the
detected clusters and the selected attributes or features.
In machine learning, clustering is an example of unsupervised learning.
Unlike classification, clustering and unsupervised learning do not rely on
predefined classes and class-labeled training examples.
The following are typical requirements of clustering in data mining:
Scalability: Many clustering algorithms work well on small data sets
containing fewer than several hundred data objects; however, a large
database may contain millions of objects.
Clustering on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are
designed to cluster interval-based (numerical) data. However, applications
may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms
determine clusters based on Euclidean or Manhattan distance measures.
Algorithms based on such distance measures tend to find spherical clusters
with similar size and density. However, a cluster could be of any shape. It is
important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input
parameters: Many clustering algorithms require users to input certain
parameters in cluster analysis (such as the number of desired clusters). The
clustering results can be quite sensitive to input parameters. Parameters are
often difficult to determine, especially for data sets containing high-
dimensional objects. This not only burdens users, but it also makes the
quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world databases contain outliers
or missing, unknown, or erroneous data. Some clustering algorithms are
sensitive to such data and may lead to clusters of poor quality.
Incremental clustering and insensitivity to the order of input
records: Some clustering algorithms cannot incorporate newly inserted data
(i.e., database updates) into existing clustering structures and, instead, must
determine a new clustering from scratch. Some clustering algorithms are
sensitive to the order of input data. That is, given a set of data objects, such
an algorithm may return dramatically different clusterings depending on the
order of presentation of the input objects. It is important to develop
incremental clustering algorithms and algorithms that are insensitive to the
order of input.
High dimensionality: A database or a data warehouse can contain several
dimensions or attributes. Many clustering algorithms are good at handling
low-dimensional data, involving only two to three dimensions. Human eyes
are good at judging the quality of clustering for up to three dimensions.
Finding clusters of data objects in high dimensional space is challenging,
especially considering that such data can be sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform
clustering under various kinds of constraints.
Interpretability and usability: Users expect clustering results to be
interpretable, comprehensible, and usable. That is, clustering may need to
be tied to specific semantic interpretations and applications. It is important
to study how an application goal may influence the selection of clustering
features and methods.
Types of Data in Cluster Analysis
Suppose that a data set to be clustered contains n objects, which may
represent persons, houses, documents, countries, and so on. Main memory-based
clustering algorithms typically operate on either of the following two data
structures:
Data matrix (or object-by-variable structure): This represents n objects,
such as persons, with p variables (also called measurements or attributes),
such as age, height, weight, gender, and so on. The structure is in the form
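of a relational table, or n-by-p matrix (n objects × p variables):
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$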
Dissimilarity matrix (or object-by-object structure): This stores a
collection of proximities that are available for all pairs of n objects. It is often
represented by an n-by-n table:
$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
where d(i, j) is the measured difference or dissimilarity between objects i and
j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i
and j are highly similar or “near” each other, and becomes larger the more
they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, only the lower triangle
of the table needs to be shown.
The rows and columns of the data matrix represent different entities, while
those of the dissimilarity matrix represent the same entity. Thus, the data
matrix is often called a two-mode matrix, whereas the dissimilarity matrix is
called a one-mode matrix. Many clustering algorithms operate on a
dissimilarity matrix. If the data are presented in the form of a data matrix, it
can first be transformed into a dissimilarity matrix before applying such
clustering algorithms.
In this section, we discuss how object dissimilarity can be computed for
objects described by interval-scaled variables; by binary variables; by
categorical, ordinal, and ratio-scaled variables; or combinations of these
variable types.
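The transformation from a data matrix to a dissimilarity matrix mentioned above might look like the following sketch. The helper `dissimilarity_matrix` is hypothetical and accepts any pairwise measure. The `abs_diff` function is only a stand-in (a plain sum of absolute differences), used here so the example runs on its own.

```python
def dissimilarity_matrix(data, dist):
    """Return the lower-triangular n-by-n table of d(i, j) for the rows of data.

    Row i holds d(i, 0), ..., d(i, i); the upper triangle is omitted because
    d(i, j) = d(j, i).
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]

def abs_diff(x, y):
    # Stand-in dissimilarity (assumption): sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

# A small data matrix: 3 objects described by 2 variables each.
data = [[1.0, 2.0], [4.0, 6.0], [1.0, 2.0]]
table = dissimilarity_matrix(data, abs_diff)
```

Storing only the lower triangle halves the memory needed for large n, at the cost of a slightly less convenient lookup for pairs with i < j.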
Interval-scaled variables are continuous measurements of a roughly linear
scale. Typical examples include weight and height, latitude and longitude
coordinates (e.g., when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis. For example,
changing measurement units from meters to inches for height, or from
kilograms to pounds for weight, may lead to a very different clustering
structure. In general, expressing a variable in smaller units will lead to a
larger range for that variable, and thus a larger effect on the resulting
clustering structure. To help avoid dependence on the choice of
measurement units, the data should be standardized. Standardizing
measurements attempts to give all variables an equal weight. This is
particularly useful when given no prior knowledge of the data.
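To standardize measurements, one choice is to convert the original
measurements to unitless variables. Given measurements for a variable f,
this can be performed as follows.
1. Calculate the mean absolute deviation, s_f:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),$$
where x_{1f}, ..., x_{nf} are n measurements of f, and m_f is the mean value
of f, that is, $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$.
2. Calculate the standardized measurement, or z-score:
$$z_{if} = \frac{x_{if} - m_f}{s_f}.$$
The mean absolute deviation, s_f, is more robust to outliers than the
standard deviation, σ_f. When computing the mean absolute deviation, the
deviations from the mean (i.e., |x_{if} − m_f|) are not squared; hence, the
effect of outliers is somewhat reduced.
There are more robust measures of dispersion, such as the median absolute
deviation. However, the advantage of using the mean absolute deviation is
that the z-scores of outliers do not become too small; hence, the outliers
remain detectable.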
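A minimal sketch of this standardization, assuming s_f is the average of |x_if − m_f| over the n measurements and the z-score is (x_if − m_f) / s_f. The function name `standardize` is hypothetical, and raising an error for constant variables is a design choice of this sketch, not something the text prescribes.

```python
def standardize(xs):
    """Convert the measurements of one variable to z-scores using the
    mean absolute deviation instead of the standard deviation."""
    n = len(xs)
    m = sum(xs) / n                        # mean value m_f
    s = sum(abs(x - m) for x in xs) / n    # mean absolute deviation s_f
    if s == 0:
        raise ValueError("all measurements are identical; z-scores are undefined")
    return [(x - m) / s for x in xs]

z = standardize([2.0, 4.0, 6.0, 8.0])
```

Because the result is unitless, rescaling the input (for example, meters to inches) leaves the z-scores unchanged.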
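After standardization, or without standardization in certain applications, the
dissimilarity (or similarity) between the objects described by interval-scaled
variables is typically computed based on the distance between each pair of
objects. The most popular distance measure is Euclidean distance, which is
defined as
$$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2},$$
where i = (x_{i1}, x_{i2}, ..., x_{in}) and j = (x_{j1}, x_{j2}, ..., x_{jn})
are two n-dimensional data objects.
Another well-known metric is Manhattan (or city block) distance, defined as
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|.$$
Minkowski distance is a generalization of both Euclidean distance and
Manhattan distance. It is defined as
$$d(i,j) = \left(|x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p\right)^{1/p},$$
where p is a positive integer. Such a distance is also called the Lp norm in
some literature. It represents the Manhattan distance when p = 1 (i.e., L1
norm) and Euclidean distance when p = 2 (i.e., L2 norm).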
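As a quick illustration, the Minkowski distance can be written as one function whose p = 1 and p = 2 cases give the Manhattan and Euclidean distances. This sketch assumes the usual definition: the p-th root of the sum of absolute coordinate differences raised to the power p. The function name `minkowski` is hypothetical.

```python
def minkowski(i, j, p):
    """Minkowski (Lp) distance between two objects with the same number of variables."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    if p < 1:
        raise ValueError("p must be a positive integer")
    return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1 / p)

x, y = (1.0, 2.0), (4.0, 6.0)
manhattan = minkowski(x, y, 1)   # |1-4| + |2-6|
euclidean = minkowski(x, y, 2)   # sqrt(3^2 + 4^2)
```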
A binary variable has only two states: 0 or 1, where 0 means that the
variable is absent, and 1 means that it is present. Treating binary variables
as if they are interval-scaled can lead to misleading clustering results.
Therefore, methods specific to binary data are necessary for computing
dissimilarities.
One approach involves computing a dissimilarity matrix from the given
binary data. If all binary variables are thought of as having the same weight,
we have the 2-by-2 contingency table of
Table 7.1, where q is the number of variables that equal 1 for both objects i
and j, r is the number of variables that equal 1 for object i but that are 0 for
object j, s is the number of variables that equal 0 for object i but equal 1 for
object j, and t is the number of variables that equal 0 for both objects i and j.
The total number of variables is p, where p = q+r+s+t.
Types of binary variables:
A binary variable is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome
should be coded as 0 or 1. One such example could be the attribute gender
having the states male and female.
A binary variable is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a disease
test. By convention, we shall code the most important outcome, which is
usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV
negative). Given two asymmetric binary variables, the agreement of two 1s
(a positive match) is then considered more significant than that of two 0s (a
negative match). Therefore, such binary variables are often considered
“monary” (as if having one state). The dissimilarity based on such variables
is called asymmetric binary dissimilarity, where the number of negative
matches, t, is considered unimportant and thus is ignored in the computation:
$$d(i,j) = \frac{r + s}{q + r + s}.$$
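The counts q, r, s, and t and the two kinds of binary dissimilarity can be sketched as follows. The asymmetric form drops t from the denominator, as described above. The symmetric form shown here, (r + s) / (q + r + s + t), is the common simple-matching definition and is stated as an assumption, since the text does not spell it out. All function names are hypothetical.

```python
def contingency(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1 for both objects
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)  # 1 for i, 0 for j
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)  # 0 for i, 1 for j
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)  # 0 for both objects
    return q, r, s, t

def symmetric_binary_dissimilarity(i, j):
    # Assumed simple-matching form: mismatches over all p variables.
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_dissimilarity(i, j):
    # Negative matches (t) are ignored.
    q, r, s, _ = contingency(i, j)
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (r + s) / (q + r + s)

a = [1, 0, 1, 0, 0, 0]
b = [1, 0, 1, 0, 1, 0]
counts = contingency(a, b)
```

For these two vectors the asymmetric measure is larger than the symmetric one, because the three shared 0s no longer count as agreement.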
A categorical variable is a generalization of the binary variable in that it can
take on more than two states. For example, map color is a categorical
variable that may have, say, five states: red, yellow, green, pink, and blue.
The dissimilarity between two objects i and j can be computed based on the
ratio of mismatches:
$$d(i,j) = \frac{p - m}{p},$$
where m is the number of matches (i.e., the number of variables for which i
and j are in the same state), and p is the total number of variables.
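A short sketch of a mismatch-ratio dissimilarity for categorical variables, assuming d(i, j) = (p − m) / p with m and p as defined above. The function name and the example attribute values are made up for illustration.

```python
def categorical_dissimilarity(i, j):
    """Ratio of mismatching variables between two objects described by
    categorical variables (assumed form: (p - m) / p)."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    p = len(i)                                   # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matches
    return (p - m) / p

# Hypothetical objects described by (map color, size, shape).
d = categorical_dissimilarity(["red", "small", "round"], ["red", "large", "round"])
```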