Data Mining
Data Mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
by using pattern recognition technologies as well as
statistical and mathematical techniques.
Data Mining Functionalities
• Discrimination
– Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions (in image processing)
• Association (correlation and causality)
– age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”)
– buys(X, “computer”) ⇒ buys(X, “software”)
• Classification and Prediction
– Finding models that describe and distinguish classes or
concepts for future prediction
– Prediction of some unknown or missing values
Data Mining Functionalities
• Cluster detection
– Group data to form new classes, e.g., cluster customers based on
similarity of some sort
– Based on the principle: maximizing the intra-class similarity and
minimizing the inter-class similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general
behavior of the data
– It can be considered as noise or exception but is quite useful in
fraud detection, rare events analysis
Supervised vs. Unsupervised Learning
• Supervised learning (e.g. classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations or actual outcome
– New data is classified based on the model obtained from
the training set
• Unsupervised learning (e.g. clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc., the aim is
to establish the existence of classes, clusters, links or
associations in the data
Classification
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set
and the values (class labels) of a target / class attribute and
uses it for classifying new data
• Typical Applications of Classification
– credit approval
– target marketing
– medical diagnosis
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is known as the training
set
– The model is represented as classification rules, decision trees, a
trained neural network or mathematical formulae
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting can
occur
• Model usage: for classifying future or unknown objects
Classification Process: Model Construction
Training data:

NAME   RANK             YEARS   Ind. Chrg.
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The training data are fed to a classification algorithm, which produces the
classifier (model), here expressed as a rule:

IF rank = ‘professor’ OR years > 6
THEN Ind. Chrg. = ‘yes’
Classification Process: Use the Model
(Testing & Prediction)
The classifier built in the previous step is first applied to testing data
(with known labels, to estimate accuracy) and then to unseen data.

Testing data:

NAME      RANK             YEARS   Ind. Chrg.
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Ind. Chrg.? The rule above predicts ‘yes’.
Data Preparation for Classification and
Prediction
• Data cleaning
– Preprocess data in order to reduce noise and handle missing
values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
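A brief sketch of these three preparation steps, assuming pandas and scikit-learn; the DataFrame and its columns are made up purely for illustration.

```python
# Sketch of data cleaning, relevance analysis, and transformation.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, None, 47], "income": [30.0, 42.0, 42.0]})

# Data cleaning: handle missing values (here: fill with the column median).
df = df.fillna(df.median(numeric_only=True))

# Relevance analysis: drop constant (uninformative) attributes.
X = VarianceThreshold(threshold=0.0).fit_transform(df)

# Data transformation: normalize attributes to zero mean / unit variance.
X = StandardScaler().fit_transform(X)
print(X)
```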
Decision Tree Classification Algorithms
Decision Tree Algorithms
[Example decision tree (figure): respondents of a movie-viewing survey are
routed by whether they saw particular films (Independence Day, Birdcage,
Courage Under Fire, Nutty Professor) and by age splits at 27 and 41; the
leaves assign income groups 1–4.]
Classification by Decision Tree Induction
• Decision tree
– A tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Training Dataset
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Output: A Decision Tree for
“buys_computer”
age?
├── <=30   → student?
│             no  → no
│             yes → yes
├── 31…40  → yes
└── >40    → credit_rating?
              excellent → no
              fair      → yes
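The same tree can be read as a nested set of attribute tests. A small sketch in Python (representing the tuple as a dict is purely an illustration choice):

```python
# The "buys_computer" decision tree from the slide, written as nested tests.
def buys_computer(t):
    if t["age"] == "<=30":
        return "yes" if t["student"] == "yes" else "no"
    elif t["age"] == "31…40":
        return "yes"
    else:  # age > 40
        return "yes" if t["credit_rating"] == "fair" else "no"

print(buys_computer({"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # yes
```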
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Some algorithms assume attributes to be categorical or ordinal (if
continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
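A compact sketch of this greedy, top-down induction in plain Python. The `select` argument stands in for whatever attribute-selection heuristic is used (e.g., information gain) and is an assumption of the sketch, not a fixed part of the algorithm as stated on the slide.

```python
from collections import Counter

def build_tree(samples, attributes, select=None):
    """Greedy top-down decision tree induction (sketch).
    samples: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in samples]
    # Stopping: all samples in one class, or no attributes left -> majority vote.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick a test attribute via the supplied heuristic (default: first attribute).
    attr = select(samples, attributes) if select else attributes[0]
    tree = {attr: {}}
    for v in {x[attr] for x, _ in samples}:   # partition recursively on each outcome
        subset = [(x, y) for x, y in samples if x[attr] == v]
        rest = [a for a in attributes if a != attr]
        tree[attr][v] = build_tree(subset, rest, select)
    return tree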
Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued
attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values
for each attribute
– Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information needed to decide whether an arbitrary
example in S belongs to P or N is defined as

I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))
Information Gain in Decision
Tree Induction
• Assume that using attribute A a set S will be
partitioned into subsets {S1, S2, …, Sk}
– If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subsets Si, is

E(A) = Σ_{i=1..k} ((pi + ni)/(p + n)) · I(pi, ni)
• The encoding information that would be gained by
branching on A is
Gain(A) = I(p, n) − E(A)
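These formulas translate directly into code. A small sketch in Python; the function names are mine, and ties to any particular algorithm (ID3/C4.5) are only by analogy to the slide.

```python
from math import log2

def info(p, n):
    """I(p, n): expected information to classify a mixture of p and n examples."""
    total = p + n
    return sum(-(c / total) * log2(c / total) for c in (p, n) if c)  # 0·log2(0) := 0

def expected_info(partitions):
    """E(A) for a split into subsets, each given as a (p_i, n_i) pair."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * info(p, n) for p, n in partitions)

def gain(p, n, partitions):
    """Gain(A) = I(p, n) - E(A)."""
    return info(p, n) - expected_info(partitions)
```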
Attribute Selection by Information
Gain Computation
• Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
• Splitting on age gives subsets with (2 yes, 3 no), (4 yes, 0 no) and (3 yes, 2 no), so

E(age) = (5/14)·I(2, 3) + (4/14)·I(4, 0) + (5/14)·I(3, 2) = 0.69

Gain(age) = I(9, 5) − E(age) = 0.94 − 0.69 = 0.25
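As a check, the same numbers fall out of a few lines of Python; this is a sketch, with the `info` helper mirroring I(p, n) from the earlier slide.

```python
from math import log2

def info(p, n):
    t = p + n
    return sum(-(c / t) * log2(c / t) for c in (p, n) if c)

# buys_computer: 9 "yes", 5 "no"; splitting on age gives
# age <= 30: (2 yes, 3 no), 31…40: (4 yes, 0 no), > 40: (3 yes, 2 no).
parts = [(2, 3), (4, 0), (3, 2)]
e_age = sum((p + n) / 14 * info(p, n) for p, n in parts)
print(round(info(9, 5), 3))          # I(9, 5)   ~ 0.940
print(round(e_age, 3))               # E(age)    ~ 0.694
print(round(info(9, 5) - e_age, 3))  # Gain(age) ~ 0.246
```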
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is
defined as

gini(T) = 1 − Σ_{j=1..n} p_j²

where p_j is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2
respectively, the gini index of the split data is defined as

gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

• The attribute that provides the smallest gini_split(T) is chosen to
split the node
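The two formulas as a small Python sketch (the function names and the toy labels are mine):

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted gini index of a binary split T -> (T1, T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A split that produces two pure subsets has gini_split = 0 (the best possible).
print(gini(["yes", "yes", "no", "no"]))          # 0.5
print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0
```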
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Results in poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
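In practice both ideas map onto tree-growing hyperparameters. A sketch with scikit-learn on its built-in iris sample data; the specific parameter names belong to that library and are not terms from the slides.

```python
# Sketch: prepruning via growth limits, postpruning via cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: stop splitting when nodes get small or the tree gets deep.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Postpruning: grow fully, then prune back with cost-complexity parameter alpha;
# a separate data set (or cross-validation) would be used to pick the best alpha.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print("depths:", pre.get_depth(), post.get_depth())
```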
Approaches to Determine the Final
Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is
minimized
Classification Accuracy: Estimating
Error Rates
• Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test
set (1/3)
– used for data sets with a large number of samples
• Cross-validation
– divide the data set into k subsamples
– use k−1 subsamples as training data and one subsample as
test data: k-fold cross-validation
– for data sets of moderate size
• Bootstrapping (leave-one-out)
– for small data sets
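A short sketch of the first two estimates with scikit-learn; the 2/3-1/3 split and the 10 folds follow the slide, while the iris sample data and the decision tree learner are just convenient stand-ins.

```python
# Sketch: holdout (training-and-testing) and k-fold cross-validation estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: independent training (2/3) and test (1/3) sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
holdout_acc = DecisionTreeClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation: each sample is used exactly once as test data.
cv_acc = cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()

print(f"holdout accuracy: {holdout_acc:.3f}, 10-fold CV accuracy: {cv_acc:.3f}")
```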
Boosting
• Boosting increases classification accuracy
– Applicable to decision trees
– Learn a series of classifiers, where each
classifier in the series pays more attention to the
examples misclassified by its predecessor
• Boosting requires only linear time and constant
space
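A minimal boosting sketch using scikit-learn's AdaBoost, whose default base learner is a depth-1 decision tree; the choice of AdaBoost and of the iris data is an illustration of the idea, not the specific method the slide has in mind.

```python
# Sketch: a boosted series of shallow trees; each round reweights the training
# examples that the previous classifier misclassified.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)        # a single weak classifier
boosted = AdaBoostClassifier(n_estimators=50)      # 50 boosted weak classifiers

print("single stump:", cross_val_score(stump, X, y, cv=5).mean())
print("boosted     :", cross_val_score(boosted, X, y, cv=5).mean())
```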
Decision Trees (Strengths & Weaknesses)
• Generates understandable rules
• Scalability: Fast classification of new cases - can
classify data sets with millions of examples and
hundreds of attributes with reasonable speed
• Handles both continuous and categorical values
• Indicates which attributes (fields) matter most for the classification
• Comparable classification accuracy with other
methods
• Expensive to train (but still better than most other
classification methods)
• Uses rectangular regions
• Accuracy can decrease due to over-fitting of data