Data Mining
Data Mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
by using pattern recognition technologies as well as
statistical and mathematical techniques.
Data Mining Functionalities
• Discrimination
– Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions (in image processing)
• Association (correlation and causality)
– age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”)
– buys(X, “computer”) ⇒ buys(X, “software”)
• Classification and Prediction
– Finding models that describe and distinguish classes or
concepts for future prediction
– Prediction of some unknown or missing values
Data Mining Functionalities
• Cluster detection
– Group data to form new classes, e.g., cluster customers based on
similarity of some sort
– Based on the principle: maximizing the intra-class similarity and
minimizing the inter-class similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general
behavior of the data
– It can be considered as noise or exception but is quite useful in
fraud detection, rare events analysis
Supervised vs. Unsupervised Learning
• Supervised learning (e.g. classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations or actual outcome
– New data is classified based on the model obtained from
the training set
• Unsupervised learning (e.g. clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc., the aim is
to establish the existence of classes, clusters, links or
associations in the data
Classification
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set
and the values (class labels) of a target / class attribute and
uses it for classifying new data
• Typical Applications of Classification
– credit approval
– target marketing
– medical diagnosis
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is known as the training
set
– The model is represented as classification rules, decision trees, a
trained neural network or mathematical formulae
– Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting can
occur
• Model usage: for classifying future or unknown objects
Classification Process: Model Construction
Training data:

NAME   RANK             YEARS   Ind. Chrg.
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The training data are fed to a classification algorithm, which produces the
classifier (model), here expressed as a rule:

IF rank = ‘professor’ OR years > 6
THEN Ind. Chrg. = ‘yes’
Classification Process: Use the Model
(Testing & Prediction)
The classifier built in the previous step is first applied to testing data
(with known labels, to estimate accuracy) and then to unseen data.

Testing data:

NAME      RANK             YEARS   Ind. Chrg.
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Ind. Chrg.? The rule above predicts ‘yes’.
Data Preparation for Classification and
Prediction
• Data cleaning
– Preprocess data in order to reduce noise and handle missing
values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
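A brief sketch of these three preparation steps, assuming pandas and scikit-learn; the DataFrame and its columns are made up purely for illustration.

```python
# Sketch of data cleaning, relevance analysis, and transformation.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, None, 47], "income": [30.0, 42.0, 42.0]})

# Data cleaning: handle missing values (here: fill with the column median).
df = df.fillna(df.median(numeric_only=True))

# Relevance analysis: drop constant (uninformative) attributes.
X = VarianceThreshold(threshold=0.0).fit_transform(df)

# Data transformation: normalize attributes to zero mean / unit variance.
X = StandardScaler().fit_transform(X)
print(X)
```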
Decision Tree Classification Algorithms
Decision Tree Algorithms
[Example decision tree (figure): respondents of a movie-viewing survey are
routed by whether they saw particular films (Independence Day, Birdcage,
Courage Under Fire, Nutty Professor) and by age splits at 27 and 41; the
leaves assign income groups 1–4.]
Classification by Decision Tree Induction
• Decision tree
– A tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Training Dataset
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Output: A Decision Tree for
“buys_computer”
age?
├── <=30   → student?
│             no  → no
│             yes → yes
├── 31…40  → yes
└── >40    → credit_rating?
              excellent → no
              fair      → yes
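The same tree can be read as a nested set of attribute tests. A small sketch in Python (representing the tuple as a dict is purely an illustration choice):

```python
# The "buys_computer" decision tree from the slide, written as nested tests.
def buys_computer(t):
    if t["age"] == "<=30":
        return "yes" if t["student"] == "yes" else "no"
    elif t["age"] == "31…40":
        return "yes"
    else:  # age > 40
        return "yes" if t["credit_rating"] == "fair" else "no"

print(buys_computer({"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # yes
```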
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Some algorithms assume attributes to be categorical or ordinal (if
continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
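A compact sketch of this greedy, top-down induction in plain Python. The `select` argument stands in for whatever attribute-selection heuristic is used (e.g., information gain) and is an assumption of the sketch, not a fixed part of the algorithm as stated on the slide.

```python
from collections import Counter

def build_tree(samples, attributes, select=None):
    """Greedy top-down decision tree induction (sketch).
    samples: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in samples]
    # Stopping: all samples in one class, or no attributes left -> majority vote.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick a test attribute via the supplied heuristic (default: first attribute).
    attr = select(samples, attributes) if select else attributes[0]
    tree = {attr: {}}
    for v in {x[attr] for x, _ in samples}:   # partition recursively on each outcome
        subset = [(x, y) for x, y in samples if x[attr] == v]
        rest = [a for a in attributes if a != attr]
        tree[attr][v] = build_tree(subset, rest, select)
    return tree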
Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued
attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values
for each attribute
– Can be modified for categorical attributes
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information needed to decide whether an arbitrary
example in S belongs to P or N is defined as

I(p, n) = −(p/(p+n))·log2(p/(p+n)) − (n/(p+n))·log2(n/(p+n))
Information Gain in Decision
Tree Induction
• Assume that using attribute A a set S will be
partitioned into subsets {S1, S2, …, Sk}
– If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subsets Si, is

E(A) = Σ_{i=1..k} ((pi + ni)/(p + n)) · I(pi, ni)
• The encoding information that would be gained by
branching on A is
Gain(A) = I(p, n) − E(A)
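These formulas translate directly into code. A small sketch in Python; the function names are mine, and ties to any particular algorithm (ID3/C4.5) are only by analogy to the slide.

```python
from math import log2

def info(p, n):
    """I(p, n): expected information to classify a mixture of p and n examples."""
    total = p + n
    return sum(-(c / total) * log2(c / total) for c in (p, n) if c)  # 0·log2(0) := 0

def expected_info(partitions):
    """E(A) for a split into subsets, each given as a (p_i, n_i) pair."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * info(p, n) for p, n in partitions)

def gain(p, n, partitions):
    """Gain(A) = I(p, n) - E(A)."""
    return info(p, n) - expected_info(partitions)
```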
Attribute Selection by Information
Gain Computation
• Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
• Splitting on age gives subsets with (2 yes, 3 no), (4 yes, 0 no) and (3 yes, 2 no), so

E(age) = (5/14)·I(2, 3) + (4/14)·I(4, 0) + (5/14)·I(3, 2) = 0.69

Gain(age) = I(9, 5) − E(age) = 0.94 − 0.69 = 0.25
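As a check, the same numbers fall out of a few lines of Python; this is a sketch, with the `info` helper mirroring I(p, n) from the earlier slide.

```python
from math import log2

def info(p, n):
    t = p + n
    return sum(-(c / t) * log2(c / t) for c in (p, n) if c)

# buys_computer: 9 "yes", 5 "no"; splitting on age gives
# age <= 30: (2 yes, 3 no), 31…40: (4 yes, 0 no), > 40: (3 yes, 2 no).
parts = [(2, 3), (4, 0), (3, 2)]
e_age = sum((p + n) / 14 * info(p, n) for p, n in parts)
print(round(info(9, 5), 3))          # I(9, 5)   ~ 0.940
print(round(e_age, 3))               # E(age)    ~ 0.694
print(round(info(9, 5) - e_age, 3))  # Gain(age) ~ 0.246
```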
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is
defined as

gini(T) = 1 − Σ_{j=1..n} p_j²

where p_j is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2
respectively, the gini index of the split data is defined as

gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

• The attribute that provides the smallest gini_split(T) is chosen to
split the node
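The two formulas as a small Python sketch (the function names and the toy labels are mine):

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted gini index of a binary split T -> (T1, T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A split that produces two pure subsets has gini_split = 0 (the best possible).
print(gini(["yes", "yes", "no", "no"]))          # 0.5
print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0
```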
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Results in poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
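In practice both ideas map onto tree-growing hyperparameters. A sketch with scikit-learn on its built-in iris sample data; the specific parameter names belong to that library and are not terms from the slides.

```python
# Sketch: prepruning via growth limits, postpruning via cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Prepruning: stop splitting when nodes get small or the tree gets deep.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Postpruning: grow fully, then prune back with cost-complexity parameter alpha;
# a separate data set (or cross-validation) would be used to pick the best alpha.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print("depths:", pre.get_depth(), post.get_depth())
```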
Approaches to Determine the Final
Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is
minimized
Classification Accuracy: Estimating
Error Rates
• Partition: Training-and-testing
– use two independent data sets, e.g., training set (2/3), test
set (1/3)
– used for data sets with a large number of samples
• Cross-validation
– divide the data set into k subsamples
– use k−1 subsamples as training data and one subsample as
test data: k-fold cross-validation
– for data sets of moderate size
• Bootstrapping (leave-one-out)
– for small data sets
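A short sketch of the first two estimates with scikit-learn; the 2/3-1/3 split and the 10 folds follow the slide, while the iris sample data and the decision tree learner are just convenient stand-ins.

```python
# Sketch: holdout (training-and-testing) and k-fold cross-validation estimates.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: independent training (2/3) and test (1/3) sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
holdout_acc = DecisionTreeClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation: each sample is used exactly once as test data.
cv_acc = cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()

print(f"holdout accuracy: {holdout_acc:.3f}, 10-fold CV accuracy: {cv_acc:.3f}")
```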
Boosting
• Boosting increases classification accuracy
– Applicable to decision trees
– Learn a series of classifiers, where each
classifier in the series pays more attention to the
examples misclassified by its predecessor
• Boosting requires only linear time and constant
space
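A minimal boosting sketch using scikit-learn's AdaBoost, whose default base learner is a depth-1 decision tree; the choice of AdaBoost and of the iris data is an illustration of the idea, not the specific method the slide has in mind.

```python
# Sketch: a boosted series of shallow trees; each round reweights the training
# examples that the previous classifier misclassified.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)        # a single weak classifier
boosted = AdaBoostClassifier(n_estimators=50)      # 50 boosted weak classifiers

print("single stump:", cross_val_score(stump, X, y, cv=5).mean())
print("boosted     :", cross_val_score(boosted, X, y, cv=5).mean())
```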
Decision Trees (Strengths & Weaknesses)
• Generates understandable rules
• Scalability: Fast classification of new cases - can
classify data sets with millions of examples and
hundreds of attributes with reasonable speed
• Handles both continuous and categorical values
• Indicates which attributes (fields) matter most for the classification
• Comparable classification accuracy with other
methods
• Expensive to train (but still better than most other
classification methods)
• Uses rectangular regions
• Accuracy can decrease due to over-fitting of data