Business Intelligence & Data Mining-9


Association Rules

Market Basket Analysis

Association Rules
• Usually applied to market baskets but other
applications are possible
• Useful Rules contain novel and actionable
information: e.g. On Thursdays grocery customers are
likely to buy diapers and beer together
• Trivial Rules contain already known information: e.g.
People who buy maintenance agreements are the ones
who have also bought large appliances
• Some novel rules may not be useful: e.g. New
hardware stores most commonly sell toilet rings

Association Rule: Basic Concepts
• Given: (1) a set of transactions, (2) each transaction is a
set of items (e.g. purchased by a customer in a visit)
• Find: (all?) rules that correlate the presence of one set
of items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories
also get automotive services done

• Applications
– Retailing (what other products should the store stock up on?)
– Attached mailings in direct marketing
– Market basket analysis (what do people buy together?)
– Catalog design (which items should appear next to each other?)

What Is Association Rule Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.

• Examples
– Rule form: “Body → Head [support, confidence]”
– buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) → grade(x, “CS”, “A”) [1%, 75%]

Rule Measures: Support and Confidence
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]

Find all the rules X & Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {X & Y & Z}
– confidence, c: conditional probability that a transaction containing {X & Y} also contains Z

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

With minimum support 50% and minimum confidence 50%, we have
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
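
A minimal Python sketch of how these two measures are computed from a transaction list (the transactions and rules are the ones from the table above; function names are illustrative):

```python
# Transactions from the table above.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))        # 0.5    -> A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))   # 0.666  -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}, transactions))   # 1.0    -> C => A has 100% confidence
```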

Mining Association Rules—An Example
(Min. support 50%, min. confidence 50%)

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle (Agrawal & Srikant, 1994):
any subset of a frequent itemset must also be frequent

Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that
have minimum support
– A subset of a frequent itemset must also be a frequent
itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be
frequent itemsets

– Iteratively find frequent itemsets with cardinality from 1
to k (k-itemset)

• Use the frequent itemsets to generate association
rules.
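
For the second step, a short sketch of rule generation from already-mined frequent itemsets (it assumes the itemsets and their supports are available as a dict, e.g. from an Apriori pass like the one sketched after the next slide; names are illustrative):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Return (lhs, rhs, support, confidence) for every rule lhs => rhs whose
    confidence meets min_conf. `frequent` maps frozenset(itemset) -> support."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]          # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules

# With the frequent itemsets of the example above:
frequent = {frozenset({"A"}): 0.75, frozenset({"B"}): 0.5,
            frozenset({"C"}): 0.5, frozenset({"A", "C"}): 0.5}
for lhs, rhs, sup, conf in generate_rules(frequent, min_conf=0.5):
    print(set(lhs), "=>", set(rhs), f"support={sup:.0%}", f"confidence={conf:.1%}")
# {'A'} => {'C'} support=50% confidence=66.7%
# {'C'} => {'A'} support=50% confidence=100.0%
```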

The Apriori Algorithm
• Generate C1: all unique 1-itemsets
• Generate L1: all 1-itemsets with minimum support
• Join Step: Ck is generated by forming the Cartesian product of
Lk-1 with L1, since any (k-1)-itemset that is not frequent
cannot be a subset of a frequent k-itemset
• Prune Step: Lk is generated by selecting from Ck those itemsets
with minimum support
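
A compact Python sketch of these steps: Ck is generated by extending Lk-1 with frequent 1-items as described above, and Lk is obtained by keeping the candidates with minimum support after a database scan (function and variable names are illustrative):

```python
from collections import defaultdict

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets.
    `transactions` is a list of sets; `min_support` is a fraction, e.g. 0.5."""
    n = len(transactions)
    # C1 / L1: all unique items, then keep those with minimum support.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    l1, lk = list(frequent), list(frequent)

    k = 2
    while lk:
        # Join step: extend each frequent (k-1)-itemset with each frequent 1-item.
        # (The textbook join of Lk-1 with itself yields fewer candidates; this
        #  simpler variant, as described on the slide, is still complete.)
        ck = {a | b for a in lk for b in l1 if len(a | b) == k}
        # Apriori principle: drop candidates with an infrequent (k-1)-subset.
        ck = {c for c in ck
              if all(c - frozenset([item]) in frequent for item in c)}
        # Scan D: count supports, then keep candidates with minimum support (Lk).
        counts = defaultdict(int)
        for t in transactions:
            for c in ck:
                if c <= t:
                    counts[c] += 1
        lk = [c for c, cnt in counts.items() if cnt / n >= min_support]
        frequent.update({c: counts[c] / n for c in lk})
        k += 1
    return frequent
```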

The Apriori Algorithm — Example
Database D (minimum support = 2 transactions, i.e. 50%):

TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1 (candidate 1-itemsets with support counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidate 2-itemsets), scan D for counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets):
{1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidate 3-itemsets): {1 3 5}, {2 3 5}

Scan D → L3 (frequent 3-itemsets):
{2 3 5}: 2
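
Running the sketch from the previous slide on database D reproduces L1, L2 and L3 (a usage example; it assumes the `apriori` function defined earlier):

```python
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # TIDs 100, 200, 300, 400
frequent = apriori(D, min_support=0.5)              # min support = 2 of 4 transactions
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.0%}")
# [1] 50%, [2] 75%, [3] 75%, [5] 75%
# [1, 3] 50%, [2, 3] 50%, [2, 5] 75%, [3, 5] 50%
# [2, 3, 5] 50%
```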

Is Apriori Fast Enough? —
Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent ( k – 1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and matching to collect counts for the candidate
itemsets

• The bottleneck of Apriori: candidate generation
– Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate roughly 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g. {a1, a2, …, a100}, one
    needs to generate 2^100 ≈ 10^30 candidates
– Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the longest pattern

Methods to Improve Apriori’s Efficiency
• Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the threshold
cannot be frequent
• Transaction reduction: a transaction that does not contain
any frequent k-itemset is useless in subsequent scans (see the sketch after this list)
• Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
• Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent
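
A minimal sketch of the transaction-reduction idea from the list above (illustrative names; it would be applied between passes of an Apriori-style miner):

```python
def reduce_transactions(transactions, frequent_k):
    """Drop transactions containing no frequent k-itemset: such transactions
    cannot contain any frequent (k+1)-itemset, so later scans can skip them."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k)]

# With the L2 itemsets of the running example every transaction survives;
# on larger databases many transactions drop out after a few passes.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
L2 = [frozenset(s) for s in ({1, 3}, {2, 3}, {2, 5}, {3, 5})]
print(len(reduce_transactions(D, L2)))   # 4
```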

How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
– The total number of candidates can be very huge
– One transaction may contain many candidates

• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction
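
A toy Python sketch of this structure: interior nodes hash one item to a child, leaves store candidate itemsets with counts. For brevity, the subset function here enumerates a transaction's k-subsets and looks each one up; the full Apriori subset function instead descends the tree directly. All names are illustrative.

```python
from itertools import combinations

class HashTree:
    """Toy hash tree for k-item candidates (candidates are sorted tuples)."""
    def __init__(self, k, fanout=5, leaf_max=3, depth=0):
        self.k, self.fanout, self.leaf_max, self.depth = k, fanout, leaf_max, depth
        self.children = None   # {bucket: HashTree} once this node becomes interior
        self.itemsets = {}     # leaf storage: candidate -> count

    def insert(self, cand):
        if self.children is None:                      # leaf node
            self.itemsets[cand] = 0
            if len(self.itemsets) > self.leaf_max and self.depth < self.k:
                old, self.itemsets, self.children = self.itemsets, {}, {}
                for c in old:                          # split: push candidates down
                    self._down(c)
        else:                                          # interior node
            self._down(cand)

    def _down(self, cand):
        b = hash(cand[self.depth]) % self.fanout
        child = self.children.setdefault(
            b, HashTree(self.k, self.fanout, self.leaf_max, self.depth + 1))
        child.insert(cand)

    def count(self, transaction):
        """Increment the count of every stored candidate contained in `transaction`."""
        for cand in combinations(sorted(transaction), self.k):
            node = self
            while node is not None and node.children is not None:
                node = node.children.get(hash(cand[node.depth]) % node.fanout)
            if node is not None and cand in node.itemsets:
                node.itemsets[cand] += 1

tree = HashTree(k=2)
for cand in [(1, 3), (2, 3), (2, 5), (3, 5)]:          # C2 of the running example
    tree.insert(cand)
for t in [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]:
    tree.count(t)
# Leaf counts now: (1,3): 2, (2,3): 2, (2,5): 3, (3,5): 2
```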

Criticism of Support and Confidence
• Example 1 (Aggarwal & Yu, PODS ’98):
– Among 5000 students
• 3000 play basketball
• 3750 eat cereal
• 2000 both play basketball and eat cereal
– play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the
overall percentage of students eating cereal is 75% which is higher than
66.7%.
– not play basketball ⇒ eat cereal [35%, 87.5%] lower support but
higher confidence!
– play basketball ⇒ not eat cereal [20%, 33.3%] is more informative,
although with lower support and confidence

            | basketball | not basketball | sum (row)
cereal      | 2000       | 1750           | 3750
not cereal  | 1000       | 250            | 1250
sum (col.)  | 3000       | 2000           | 5000

Criticism of Support and Confidence
• Example 2:
– X and Y: positively
correlated,
– X and Z, Y and Z: negatively
correlated
– support and confidence of
X=>Z dominates

• We need a measure of
dependent or correlated events
• P(B|A)/P(B) is called the lift
of rule A => B

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Rule  | Support | Confidence
X ⇒ Y | 25%     | 50%
X ⇒ Z | 37.5%   | 75%
Y ⇒ Z | 12.5%   | 50%

Other Interestingness Measures: lift
• Lift = P(B|A) / P(B) = P(A ∧ B) / (P(A) · P(B))
  – takes both P(A) and P(B) into consideration
  – P(A ∧ B) = P(A) · P(B) if A and B are independent events
  – A and B are negatively correlated if lift < 1
  – A and B are positively correlated if lift > 1

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Rule  | Support | Lift
X ⇒ Y | 25%     | 2
X ⇒ Z | 37.5%   | 0.86
Y ⇒ Z | 12.5%   | 0.57
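
A quick check of these numbers in Python (a sketch; the helper names are made up):

```python
# The eight rows of the table above, recorded as the set of items that hold in each row.
rows = [{"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"}, {"X", "Z"},
        {"Z"}, {"Z"}, {"Z"}, {"Z"}]

def p(items, rows):
    """Empirical probability that all `items` hold in a row."""
    return sum(set(items) <= r for r in rows) / len(rows)

def lift(a, b, rows):
    """lift(A => B) = P(A ∧ B) / (P(A) · P(B)); 1 means A and B are independent."""
    return p(set(a) | set(b), rows) / (p(a, rows) * p(b, rows))

print(lift({"X"}, {"Y"}, rows))   # 2.0   -> positively correlated
print(lift({"X"}, {"Z"}, rows))   # ~0.86 -> negatively correlated
print(lift({"Y"}, {"Z"}, rows))   # ~0.57 -> negatively correlated
```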

Extensions

Multiple-Level Association Rules
• Items often form a hierarchy.
• Items at the lower level are
expected to have lower
support.
• Rules regarding itemsets at
appropriate levels could be
quite useful.
• Transaction database can be
encoded based on
dimensions and levels

[Item hierarchy example: food at the top, splitting into milk and bread; milk into skim and 2%, bread into wheat and white, with brand-level items such as Fraser and Sunset at the lowest level]

Mining Multi-Level Associations
• A top-down, progressive deepening approach:
– First find high-level strong rules (Ancestors):

milk → bread [20%, 60%].
– Then find their lower-level “weaker” rules (Descendants):
2% milk → wheat bread [6%, 50%].

• Variations of mining multiple-level association rules.
– Level-crossed association rules:
2% milk → Wonder wheat bread
– Association rules with multiple, alternative hierarchies:
2% milk → Wonder bread

Multi-level Association: Uniform
Support vs. Reduced Support
• Uniform Support: the same minimum support for all levels
  + One minimum support threshold. No need to examine itemsets containing
    any item whose ancestors do not have minimum support.
  – Lower-level items do not occur as frequently, so if the support threshold is
    • too high ⇒ miss low-level associations
    • too low ⇒ generate too many high-level associations

• Reduced Support: reduced minimum support at lower levels
  – Needs modification to the basic algorithm

Uniform Support
Multi-level mining with uniform support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced Support
Multi-level mining with reduced support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
(Skim Milk fails the uniform 5% threshold but passes the reduced 3% threshold.)

Multi-level Association:
Redundancy Filtering
• Some rules may be redundant due to “ancestor”
relationships between items.
• Example
– milk ⇒ wheat bread [support = 8%, confidence = 70%]
– 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the
“expected” value, based on the rule’s ancestor.
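
• Worked example (assuming, hypothetically, that 2% milk accounts for about one quarter of milk sales): the expected support of the second rule is roughly 8% × 1/4 = 2%. Since its actual support is 2% and its confidence (72%) is close to the ancestor's 70%, the second rule adds no new information and can be filtered out.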

Multi-Level Mining: Progressive
Deepening
• A top-down, progressive deepening approach:
– First mine high-level frequent items:
milk (15%), bread (10%)
– Then mine their lower-level “weaker” frequent itemsets:
2% milk (5%), wheat bread (4%)

• Different min-support thresholds across levels lead to different algorithms:
  – If adopting the same min-support across all levels,
    then reject an itemset t if any of t's ancestors is infrequent.
  – If adopting reduced min-support at lower levels,
    then examine only those descendants whose ancestors are frequent and
    whose own support exceeds the reduced min-support.
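
A sketch of this progressive-deepening scheme in Python (the item-to-ancestor map, the two thresholds and the `apriori` function from the earlier sketch are all illustrative assumptions):

```python
# Hypothetical level-2 -> level-1 mapping.
ancestor = {"2% milk": "milk", "skim milk": "milk",
            "wheat bread": "bread", "white bread": "bread"}

def to_level1(transaction):
    """Encode a transaction at level 1 by replacing each item with its ancestor."""
    return {ancestor.get(item, item) for item in transaction}

def multilevel_frequent(transactions, min_sup_level1, min_sup_level2):
    # Level 1: mine frequent high-level items with the higher threshold.
    level1 = apriori([to_level1(t) for t in transactions], min_sup_level1)
    frequent_ancestors = {next(iter(s)) for s in level1 if len(s) == 1}
    # Level 2: keep only items whose ancestor is frequent, then mine again
    # with the reduced threshold.
    filtered = [{item for item in t if ancestor.get(item, item) in frequent_ancestors}
                for t in transactions]
    level2 = apriori(filtered, min_sup_level2)
    return level1, level2
```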

Sequence pattern mining
• Sequence/event databases
  – Consist of sequences of values or events changing with time
  – Data is recorded at regular intervals
  – Sequences often show characteristic behaviour:
    • trend, cycle, seasonal, irregular

• Examples
– Financial: stock price, inflation
– Biomedical: blood pressure
– Meteorological: precipitation

Two types of sequence data
• Event series
– Record events that happen at certain times
– E.g. network logins

• Time series
– Record changes of certain (typically numeric)
values over time
– E.g. stock price movements, blood pressure

Event series
• Series can be represented in two ways:
– As a sequence (string) of events.
• Empty space if no events occur at a certain time
• Hard to represent multiple events

– As a set of tuples: {(time, event)}
• Allows for multiple events at the same time

Types of interesting info
• Which events happen often (not too interesting)
• What groups of events happen often
– People who rent “Star Wars” also rent “Star Trek”

• What sequences of events happen often
– Renting “Star Wars”, then “Empire Strikes Back”, then
“Return of the Jedi” in that order

• Association of events within a time window
– People who rent “Star Wars” tend to rent “Empire
Strikes Back” within one week

Similarity/Difference with
Association Rules
• Similarities:
– Groups of events : frequent item sets
– Associations : Association rules

• Differences:
– Notion of (time) windows:
• People who rent “Star Wars” tend to rent “Empire Strikes
Back” within one week

– Ordering of events is important

Episodes
• A partially ordered sequence of events
– Serial episode: B follows A (A → B)
– Parallel episode: A and B both occur, in either order
– General (composite) episode: the order between A and B is unknown
  or immaterial, but both A and B precede C

Sub-episode / super-episode
• If A, B & C occur within a time window:
  – “A & B” is a sub-episode of “A, B & C”
  – “A, B & C” is a super-episode of A, B, C, “A & B”, and “B & C”

Frequent episodes / Episode Rules
• Frequent episodes
– Find episodes that appear often

• Episode rules
– Used to emphasize the effect of events on
episodes
– Support/confidence as defined in association
rules

• Example (window size = 11)
AB-C-DEABE-F-A-DFECDAABBCDE

Episode Rules : Example
[Episode rule: the episode {A, B} on the left implies the episode {A, B, C} on the right]
Window size 10: Support 4%, Confidence 80%

• Meaning: given that the episode on the left appears, the episode on
the right appears 80% of the time.
• This essentially says that when (A, B) appears, then C also appears
(within the given window size).

Mining episode rules
• Apriori principle for episodes
  – If an episode is frequent, then all of its sub-episodes are frequent
• Thus an Apriori-based algorithm can be applied
• However, there are a few tricky issues

Mining episode rules
• Recognizing episodes in sequences
  – Parallel episodes: standard association-rule techniques
  – Serial/general episodes: finite-state-machine-based construction
    • Alternative: count parallel episodes first, then use them to
      generate candidate episodes of the other types

• Counting the number of windows
  – A single event occurrence appears in w windows when the window size is w
  – This is fine when the sequence is long, as the ratios even out
  – However, when the sequence is short, the “edge” windows can dominate
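
A small sketch of window counting for a serial episode. The event sequence is the example string from the earlier slide; note that only windows lying fully inside the sequence are counted here, whereas the usual episode-mining definition also counts windows that partially overlap the sequence ends, which is exactly where the “edge” effect above comes from (names are illustrative):

```python
def windows_with_serial_episode(sequence, episode, w):
    """Count sliding windows of width w in which `episode` occurs as a
    subsequence (events in the given order). `sequence` is a string of
    single-letter events; '-' marks a time point with no event."""
    hits = 0
    for start in range(len(sequence) - w + 1):
        window = sequence[start:start + w]
        pos = 0
        for event in window:              # greedy left-to-right match
            if pos < len(episode) and event == episode[pos]:
                pos += 1
        hits += pos == len(episode)
    return hits

seq = "AB-C-DEABE-F-A-DFECDAABBCDE"       # example sequence from the earlier slide
print(windows_with_serial_episode(seq, "AB", 11))   # windows where A is later followed by B
```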
