1. DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 5: Association Rule Mining
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
2. Topics
Association rule mining
Mining single-dimensional association rules
Mining multilevel association rules
Other measurements: interest and conviction
Association rule mining to correlation analysis
3. What is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction
databases, relational databases, and other information
repositories.
Applications:
Basket data analysis, cross-marketing, catalog design,
clustering, classification, etc.
Examples of the rule form "Body ⇒ Head [support, confidence]", i.e., "Antecedent ⇒ Consequent [support, confidence]":
buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
4. A typical example of association rule mining is
market basket analysis.
5. Rule Measures: Support/Confidence
Find all the rules "Antecedent ⇒ Consequent" with minimum support and confidence:
support, s: probability that a transaction contains {A ∪ C}
confidence, c: conditional probability that a transaction containing A also contains C
Let min. sup. = 50% and min. conf. = 50%. For the transactional database below:
A ⇒ C (s=50%, c=66.7%)
C ⇒ A (s=50%, c=100%)
Support = 50% means that 50% of all transactions under analysis show that A and C are purchased together. Confidence = 66.7% means that 66.7% of the customers who purchased A also bought C.
Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.

Transactional database:
Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F
6. Rule Measures: Support/Confidence as Probability

TransID   Items Bought
T001      A,B,C
T002      A,C
T003      A,D
T004      B,E,F

Rule: A ⇒ C
support(A ⇒ C) = P({A ∪ C})
confidence(A ⇒ C) = P(C|A) = P({A ∪ C})/P({A})

Itemset frequencies: A = 3, B = 2, C = 2, AB = 1, AC = 2, BC = 1, ABC = 1
A ⇒ B (s = 1/4 = 25%, c = 1/3 = 33.3%)
B ⇒ A (s = 1/4 = 25%, c = 1/2 = 50%)
A ⇒ C (s = 2/4 = 50%, c = 2/3 = 66.7%)
C ⇒ A (s = 2/4 = 50%, c = 2/2 = 100%)
A, B ⇒ C (s = 1/4 = 25%, c = 1/1 = 100%)
A, C ⇒ B (s = 1/4 = 25%, c = 1/2 = 50%)
B, C ⇒ A (s = 1/4 = 25%, c = 1/1 = 100%)

(Venn diagram: customers who buy beer (A), customers who buy diapers (C), and customers who buy both.)
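To make the two measures concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that recomputes them over the toy database above:

    # Toy transactional database from the slide (T001..T004)
    transactions = [
        {"A", "B", "C"},  # T001
        {"A", "C"},       # T002
        {"A", "D"},       # T003
        {"B", "E", "F"},  # T004
    ]

    def support(itemset, db):
        # Fraction of transactions containing every item of `itemset`
        itemset = set(itemset)
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        # P(consequent | antecedent) = sup(A and C together) / sup(A)
        both = set(antecedent) | set(consequent)
        return support(both, db) / support(antecedent, db)

    print(support({"A", "C"}, transactions))       # 0.5   -> s = 50%
    print(confidence({"A"}, {"C"}, transactions))  # 0.667 -> c = 66.7%
    print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> c = 100%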
7. Association Rule: Support/Confidence for Relational Tables
In case each transaction is a row in a relational table, find all rules that correlate the presence of one set of attributes with that of another set of attributes.

outlook    temp.  humidity  windy  sponsor  play-time  play
sunny      hot    high      True   Sony     85         Y
sunny      hot    high      False  HP       90         Y
overcast   hot    normal    True   Ford     63         Y
rainy      mild   high      True   Ford     5          N
rainy      cool   low       False  HP       56         Y
sunny      hot    low       True   Sony     25         N
rainy      cool   normal    True   Nokia    5          N
overcast   mild   high      True   Honda    86         Y
rainy      mild   low       False  Ford     78         Y
overcast   hot    high      True   Sony     74         Y

If temperature = hot then humidity = high (s = 3/10, c = 3/5)
If windy = true and play = Y then humidity = high and outlook = overcast (s = 2/10, c = 2/4)
If windy = true and play = Y and humidity = high then outlook = overcast (s = 2/10, c = 2/3)
8. Association Rule Mining: Types
Boolean vs. quantitative associations (based on the types of values handled), and single- vs. multi-dimensional rules:
SQLServer ^ DMBooks ⇒ DBMiner [0.2%, 60%]
buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
Single-level vs. multilevel analysis:
What brands of beers are associated with what brands of diapers?
Various extensions:
Maxpatterns and closed itemsets
9. An Example (Single-Dimensional Boolean Association Rule Mining)
Min. support 50%, min. confidence 50%.
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.7%
The Apriori principle: any subset of a frequent itemset must be frequent.

Transaction ID   Items Bought        Frequent Itemset   Support
2000             A,B,C               {A}                75%
1000             A,C                 {B}                50%
4000             A,D                 {C}                50%
5000             B,E,F               {A,C}              50%
10. Two Steps in Mining Association Rules
A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Step 1: Find the frequent itemsets: the sets of items that have minimum support.
Step 2: Use the frequent itemsets to generate association rules.
11. Find the Frequent Itemsets: The Apriori Algorithm
Join step: Ck is generated by joining Lk-1 with itself.
Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
  L1 = {frequent 1-itemsets};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
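A runnable condensation of this pseudocode in plain Python (a sketch under my own naming; min_support is taken as an absolute count rather than a fraction):

    from itertools import combinations

    def apriori(db, min_support):
        # db: list of transactions (sets); returns {itemset: support count}
        items = sorted({item for t in db for item in t})
        counts = {frozenset([i]): sum(i in t for t in db) for i in items}
        Lk = {s: c for s, c in counts.items() if c >= min_support}  # L1
        frequent, k = dict(Lk), 1
        while Lk:
            # Join step: unions of frequent k-itemsets that have size k+1
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # Prune step: every k-subset of a candidate must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k))}
            # One scan of the database to count the surviving candidates
            counts = {c: sum(c <= t for t in db) for c in candidates}
            Lk = {s: c for s, c in counts.items() if c >= min_support}
            frequent.update(Lk)
            k += 1
        return frequent

    db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(db, min_support=2))
    # {A}: 3, {B}: 2, {C}: 2, {A, C}: 2 -- the frequent itemsets of slide 9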
13. How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
14. Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 × L3
  abc joined with abd gives abcd
  acd joined with ace gives acde
Pruning:
  acde is removed because ade is not in L3
Result: C4 = {abcd}
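The join and prune steps can be sketched in Python on this exact example (itemsets kept as sorted tuples; an illustration, not the lecture's code):

    from itertools import combinations

    L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]  # sorted itemsets

    # Join step: pair itemsets that agree on their first k-2 = 2 items
    C4 = [p + (q[-1],) for p in L3 for q in L3
          if p[:-1] == q[:-1] and p[-1] < q[-1]]
    print(C4)  # [('a','b','c','d'), ('a','c','d','e')], i.e. abcd and acde

    # Prune step: drop candidates with an infrequent 3-subset
    L3_set = set(L3)
    C4 = [c for c in C4 if all(s in L3_set for s in combinations(c, 3))]
    print(C4)  # [('a','b','c','d')] -- acde removed, since ade is not in L3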
15. How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
The total number of candidates can be huge.
One transaction may contain many candidates.
Method:
Candidate itemsets are stored in a hash tree.
A leaf node of the hash tree contains a list of itemsets and counts.
An interior node contains a hash table.
A subset function finds all the candidates contained in a transaction.
16. Subset Function
The subset function finds all the candidates contained in a transaction: (1) generate a hash tree over the candidate itemsets, then (2) hash each item of a transaction down the tree and increment the count of every candidate reached at the leaves.

C2 = { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }

Database:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

(Figure: hash tree over C2, with counts accumulated at the leaves as each transaction is hashed through it.)
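The hash tree is an optimization of the following naive check, which tests every candidate against every transaction directly; this plain-Python sketch shows what the subset function must compute on the example above:

    C2 = [frozenset(c) for c in ({1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5})]
    db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]  # TIDs 100, 200, 300, 400

    # Count, for each candidate, the transactions that contain it
    counts = {c: sum(c <= t for t in db) for c in C2}
    for c in C2:
        print(sorted(c), counts[c])
    # {1,2}:1  {1,3}:2  {1,5}:1  {2,3}:2  {2,5}:3  {3,5}:2

Hashing a transaction's items down the tree visits only the leaves whose candidates could possibly be contained in it, instead of looping over all candidates as done here.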
17. Is Apriori Fast Enough? — Performance Bottlenecks
The core of the Apriori algorithm:
Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.
Use database scans and pattern matching to collect counts for the candidate itemsets.
The bottleneck of Apriori is candidate generation.
Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
Multiple scans of the database: Apriori needs (n+1) scans, where n is the length of the longest pattern.
18. Mining Frequent Patterns Without Candidate
Generation
Compress a large database into a compact,
Frequent-Pattern tree (FP-tree) structure
highly condensed, but complete for frequent pattern
mining
avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern
mining method
A divide-and-conquer methodology: decompose mining
tasks into smaller ones
Avoid candidate generation: sub-database test only!
19. Construct FP-tree from a Transaction DB

TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o}              {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}
min_support = 0.5

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Order frequent items in descending frequency: f:4, c:4, a:3, b:3, m:3, p:3.
3. Scan the DB again and construct the FP-tree.

(FP-tree figure: root {} with the main path f:4 -> c:3 -> a:3 -> m:2 -> p:2, side branches a:3 -> b:1 -> m:1 and f:4 -> b:1, and a second path c:1 -> b:1 -> p:1; the header table links each item to its nodes.)
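A minimal FP-tree construction sketch in Python (class and field names are mine; the header table maps each item to a list of its nodes rather than a node-link chain, and the f/c tie at frequency 4 is broken alphabetically, whereas the slide puts f first; neither choice affects the result):

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fp_tree(db, min_count):
        # Pass 1: count items and keep the frequent ones
        freq = defaultdict(int)
        for t in db:
            for item in t:
                freq[item] += 1
        freq = {i: c for i, c in freq.items() if c >= min_count}
        root, header = Node(None, None), defaultdict(list)
        # Pass 2: insert each transaction with items in descending frequency
        for t in db:
            ordered = sorted((i for i in t if i in freq),
                             key=lambda i: (-freq[i], i))
            node = root
            for item in ordered:
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header

    db = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
          set("bcksp"), set("afcelpmn")]
    root, header = build_fp_tree(db, min_count=3)  # min_support 0.5 of 5 rows
    print({i: sum(n.count for n in header[i]) for i in header})
    # f:4, c:4, a:3, b:3, m:3, p:3 -- matches the header table on the slide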
20. Mining Frequent Patterns using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent pattern path using the FP-tree
Method
For each item, construct its conditional pattern-base, and then its
conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path (single
path will generate all the combinations of its sub-paths, each of which is a
frequent pattern)
Benefits: completeness and compactness
Completeness: never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining.
Compactness: reduces irrelevant information (infrequent items are gone), orders items in descending frequency (more frequent items are more likely to be shared), and is smaller than the original database.
22. Step 2: Construct Conditional FP-trees
For each pattern base:
Accumulate the count for each item in the base.
Construct the FP-tree over the frequent items of the pattern base.
Example: the m-conditional pattern base is fca:2, fcab:1. Accumulated counts are f:3, c:3, a:3, b:1; b is infrequent, so the m-conditional FP-tree is the single path {} -> f:3 -> c:3 -> a:3.
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.
23. Mining Frequent Patterns by Creating Conditional Pattern Bases

Item   Conditional pattern base      Conditional FP-tree
p      {(fcam:2), (cb:1)}            {(c:3)}|p
m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}       empty
a      {(fc:3)}                      {(f:3, c:3)}|a
c      {(f:3)}                       {(f:3)}|c
f      empty                         empty
24. Step 3: Recursively Mine the Conditional FP-trees
Starting from the m-conditional FP-tree ({} -> f:3 -> c:3 -> a:3), recursion yields:
the am-conditional FP-tree: {} -> f:3 -> c:3
the cm-conditional FP-tree: {} -> f:3
the cam-conditional FP-tree: {} -> f:3
25. Single FP-tree Path Generation
Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.
Example: the m-conditional FP-tree is the single path {} -> f:3 -> c:3 -> a:3 (from the m-conditional pattern base fca:2, fcab:1), so all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.
26. FP-growth vs. Apriori: Scalability with the Support Threshold
(Chart: run time in seconds, 0 to 100, against support threshold, 0 to 3%, on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime; Apriori's run time rises steeply as the support threshold decreases, while FP-growth's stays low.)
27. CHARM – Mining Closed Association Rules
Instead of the horizontal DB format, the vertical format is used. Instead of traditional frequent itemsets, closed frequent itemsets are mined.

Horizontal DB              Vertical DB
Transaction   Items        Item   Tidset
1             ABDE         A      1345
2             BCE          B      123456
3             ABDE         C      2456
4             ABCE         D      1356
5             ABCDE        E      12345
6             BCD
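The payoff of the vertical format is that support counting becomes tidset intersection, as in this small sketch (my encoding of the table above):

    # Vertical DB from the slide: item -> tidset
    vertical = {
        "A": {1, 3, 4, 5},
        "B": {1, 2, 3, 4, 5, 6},
        "C": {2, 4, 5, 6},
        "D": {1, 3, 5, 6},
        "E": {1, 2, 3, 4, 5},
    }

    def tidset(itemset):
        # Support of an itemset = size of the intersection of its tidsets
        return set.intersection(*(vertical[i] for i in itemset))

    print(tidset("AB"), len(tidset("AB")) / 6)      # {1, 3, 4, 5}  0.67
    print(tidset("ABDE"), len(tidset("ABDE")) / 6)  # {1, 3, 5}     0.5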
28. CHARM – Frequent Itemsets and Their Supports
The example database (vertical DB above, min. support = 0.5) and its frequent itemsets:

Support   Itemsets
1.00      B
0.83      BE, E
0.67      A, C, D, AB, AE, BC, BD, ABE
0.50      AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
29. CHARM – Closed Itemsets
Closed frequent itemsets and their corresponding frequent itemsets:

Closed Itemset   Tidset    Sup.   Freq. Itemsets
B                123456    1.00   B
BE               12345     0.83   BE, E
ABE              1345      0.67   ABE, AB, AE, A
BD               1356      0.67   BD, D
BC               2456      0.67   BC, C
ABDE             135       0.50   ABDE, ABD, ADE, BDE, AD, DE
BCE              245       0.50   BCE, CE
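Closedness can be checked directly from tidsets: an itemset is closed when no item outside it appears in every one of its transactions. A brute-force Python sketch (helper names are mine) reproduces the table above:

    from itertools import combinations

    items = "ABCDE"
    vertical = {"A": {1, 3, 4, 5}, "B": {1, 2, 3, 4, 5, 6}, "C": {2, 4, 5, 6},
                "D": {1, 3, 5, 6}, "E": {1, 2, 3, 4, 5}}

    def tidset(itemset):
        return set.intersection(*(vertical[i] for i in itemset))

    def closure(itemset):
        # Largest itemset with the same tidset: add every item whose tidset
        # covers all of this itemset's transactions
        t = tidset(itemset)
        return frozenset(i for i in items if t <= vertical[i])

    frequent = [s for k in range(1, 6) for s in combinations(items, k)
                if len(tidset(s)) >= 3]            # min. support 0.5 of 6 rows
    closed = {closure(f) for f in frequent}
    print(sorted("".join(sorted(c)) for c in closed))
    # ['ABDE', 'ABE', 'B', 'BC', 'BCE', 'BD', 'BE'] -- the 7 closed itemsets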
30. The CHARM Algorithm

CHARM(I × T, minsup):
1. Nodes = { Ij × t(Ij) : Ij ∈ I and |t(Ij)| ≥ minsup }
2. CHARM-EXTEND(Nodes, C)

CHARM-EXTEND(Nodes, C):
3. for each Xi × t(Xi) in Nodes
4.     NewN = ∅ and X = Xi
5.     for each Xj × t(Xj) in Nodes, with f(j) > f(i)
6.         X' = X ∪ Xj and Y = t(Xi) ∩ t(Xj)
7.         CHARM-PROPERTY(Nodes, NewN)
8.     if NewN ≠ ∅ then CHARM-EXTEND(NewN, C)
9.     C = C ∪ {X}  // if X is not subsumed

CHARM-PROPERTY(Nodes, NewN):
1. if (|Y| ≥ minsup) then
2.     if t(Xi) = t(Xj) then           // Property 1
3.         Remove Xj from Nodes
4.         Replace all Xi with X'
5.     else if t(Xi) ⊂ t(Xj) then      // Property 2
6.         Replace all Xi with X'
7.     else if t(Xi) ⊃ t(Xj) then      // Property 3
8.         Remove Xj from Nodes
9.         Add X' × Y to NewN
10.    else if t(Xi) ≠ t(Xj) then      // Property 4
11.        Add X' × Y to NewN

(Search tree: below the root ∅ are A×1345, B×123456, C×2456, D×1356, E×12345; extending A gives AB×1345 and then ABE×1345, with ABC×45, ABD×135 and ABDE×135; extending B gives BC×2456, BD×1356, BE×12345, with BCD×56, BCE×245, BDE×135.)
34. Mining Multilevel Association Rules from Transactional Databases
Multiple-Level Association Rules
Items often form a hierarchy.
Items at the lower levels are expected to have lower support.
Rules regarding itemsets at the appropriate levels could be quite useful.
A transaction database can be encoded based on dimensions and levels.
We can explore shared multi-level mining.

TID   Items
T1    {1121, 1122, 1212}
T2    {1222, 1121, 1122, 1213}
T3    {1124, 1213}
T4    {1111, 1211, 1232, 1221, 1223}

(Concept hierarchy: Food (1) splits into Milk (11) and Bread (12); Milk into Skim (111) and 2% (112), Bread into Wheat (121) and White (122); brands sit at the leaves, e.g. Fraser (1121), Sunset (1124), Wonder (1213, 1222).)
35. Mining Multi-Level Associations
A top-down, progressive deepening approach:
First find high-level strong rules:
  milk ⇒ bread [20%, 60%]
Then find their lower-level "weaker" rules:
  2% milk ⇒ wheat bread [6%, 50%]
Variations on mining multiple-level association rules:
Level-crossed association rules:
  2% milk ⇒ Wonder wheat bread [3%, 60%]
Association rules with multiple, alternative hierarchies:
  2% milk ⇒ Wonder bread [8%, 72%]
36. Multi-level Association: Redundancy Filtering
Some rules may be redundant due to “ancestor”
relationships between items.
Example
milk ⇒ wheat bread [s=8%, c=70%]
2% milk ⇒ wheat bread [s=2%, c=72%]
We say the first rule is an ancestor of the second
rule.
A rule is redundant if its support is close to the
“expected” value, based on the rule’s ancestor.
37. Multi-Level Mining: Progressive Deepening
A top-down, progressive deepening approach:
First mine high-level frequent items:
  milk (15%), bread (10%)
Then mine their lower-level "weaker" frequent itemsets:
  2% milk (5%), wheat bread (4%)
Different min_support thresholds across levels lead to different algorithms:
If adopting the same min_support across all levels, then toss itemset t if any of t's ancestors is infrequent.
If adopting reduced min_support at lower levels, then examine only those descendants whose ancestors are frequent/non-negligible.
38. Problem of Confidence
Example (Aggarwal & Yu, PODS98): among 5000 students,
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

              basketball   not basketball   sum(row)
cereal        2000         1750             3750
not cereal    1000         250              1250
sum(col.)     3000         2000             5000
39. Interest/Lift/Correlation
Interest (or lift, correlation) takes both P(A) and P(B) into consideration:
  lift(A ⇒ B) = P(A ∪ B) / (P(A) × P(B))
P(A ∪ B) = P(A) × P(B) if A and B are independent events.
A and B are negatively correlated if the value is less than 1; otherwise, A and B are positively correlated.
Using the contingency table on the previous slide:
  lift(play basketball ⇒ eat cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(play basketball ⇒ not eat cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
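Both lift values can be verified in a few lines (a quick check, reusing the contingency table of slide 38):

    n = 5000
    p_bb, p_cereal, p_not_cereal = 3000 / n, 3750 / n, 1250 / n
    lift_cereal = (2000 / n) / (p_bb * p_cereal)          # P(bb, cereal) / ...
    lift_not_cereal = (1000 / n) / (p_bb * p_not_cereal)  # P(bb, no cereal) / ...
    print(round(lift_cereal, 2), round(lift_not_cereal, 2))  # 0.89 1.33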
40. Conviction
Conviction (Brin, 1997):
  conviction(A ⇒ B) = (1 − support(B)) / (1 − confidence(A ⇒ B)),  with 0 ≤ conv(A ⇒ B) ≤ ∞
A and B are statistically independent if and only if conv(A ⇒ B) = 1.
0 < conv(A ⇒ B) < 1 if and only if P(B|A) < P(B): B is negatively correlated with A.
1 < conv(A ⇒ B) < ∞ if and only if P(B|A) > P(B): B is positively correlated with A.
Using the same contingency table:
  conviction(play basketball ⇒ eat cereal) = (1 − 3750/5000) / (1 − 0.667) = 0.25/0.333 = 0.75
  conviction(play basketball ⇒ not eat cereal) = (1 − 1250/5000) / (1 − 0.333) = 0.75/0.667 = 1.125
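And the same quick check for conviction, matching the values worked out above:

    def conviction(sup_b, conf_ab):
        # conv(A => B) = (1 - sup(B)) / (1 - conf(A => B))
        return (1 - sup_b) / (1 - conf_ab)

    print(round(conviction(3750 / 5000, 2000 / 3000), 3))  # 0.75  (< 1: negative)
    print(round(conviction(1250 / 5000, 1000 / 3000), 3))  # 1.125 (> 1: positive)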
41. From Association Mining to Correlation Analysis
Strong rules are not necessarily interesting. Example: of 10,000 transactions,
6,000 customer transactions include computer games
7,500 customer transactions include videos
4,000 customer transactions include both computer games and videos
Suppose that a data mining program for discovering association rules is run on the data, using a min_sup of 30% and a min_conf of 60%. The following association rule is discovered:
  buys(X, "computer games") ⇒ buys(X, "videos") [s=40%, c=66%]
  (s = 4000/10000, c = 4000/6000)
42. A Misleading "Strong" Association Rule
  buys(X, "computer games") ⇒ buys(X, "videos") [support=40%, confidence=66%]
This rule is misleading because the probability of purchasing videos is 75% (> 66%).
In fact, computer games and videos are negatively associated: the purchase of one of these items actually decreases the likelihood of purchasing the other. We could therefore easily make unwise business decisions based on this rule.
43. From Association Analysis to Correlation Analysis
To help filter out misleading "strong" associations, use correlation rules:
  A ⇒ B [support, confidence, correlation]
Lift is a simple correlation measure, given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated.
  lift(A, B) = P(A ∪ B) / (P(A)P(B)) = P(B|A) / P(B) = conf(A ⇒ B) / sup(B)
If lift(A, B) < 1, then the occurrence of A is negatively correlated with the occurrence of B.
If lift(A, B) > 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
44. From Association Analysis to Correlation Analysis (Cont.)
Example: correlation analysis using lift.
  buys(X, "computer games") ⇒ buys(X, "videos") [support=40%, confidence=66%]
The lift of this rule is P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89, so there is a negative correlation between the occurrences of {game} and {video}.
Exercise: is the following rule misleading, if 85% of customers buy milk?
  buy walnuts ⇒ buy milk [1%, 80%]
46. Feb 26, 2011 (14:00)
Quiz I
Star-net Query (Multidimensional Table)
Data Cube Computation (Memory Calculation)
Data Preprocessing (Normalization, Smoothing by binning)
Association Rule Mining