2. 2
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Milk and Bread ?!
What are the subsequent purchases after buying a PC?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis etc
4. 4
Basic Concepts: Frequent Patterns
itemset: A set of one or more
items
k-itemset X = {x1, …, xk}
(absolute) support, or, support
count of X: Frequency or
occurrence of an itemset X
(relative) support, s, is the
fraction of transactions that
contains X (i.e., the probability
that a transaction contains X)
An itemset X is frequent if X’s
support is no less than a minsup
threshold
Customer
buys diaper
Customer
buys both
Customer
buys beer
Tid Items bought
10 Coffee, Nuts, Snacks
20 Milk, Coffee, Diaper
30 Beer, Bread
40 Tea, Sugar, Milk
50 Nuts, Coffee, Sugar, Bread, Milk
5. 5
Basic Concepts: Association Rules
Find all the rules X Y with
minimum support and confidence
support, s, probability that a
transaction contains X and Y
confidence, c, conditional
probability that a transaction
having X also contains Y
Freq. Pat.: Beer:1, Nuts:2, Diaper:1,
Coffee:3,Milk:3 {Coffee, Milk}:2
Association rules: (many more!)
milk Coffee
Snacks Beer
Tid Items bought
10 Coffee, Nuts, Snacks
20 Milk, Coffee, Diaper
30 Beer, Bread
40 Tea, Sugar, Milk
50 Nuts, Coffee, Sugar, Bread, Milk
7. 7
Scalable Frequent Itemset Mining Methods
Apriori: A Candidate Generation-and-Test Approach
FPGrowth: A Frequent Pattern-Growth Approach
8. 8
The Downward Closure Property and Scalable
Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {Coffee, Milk, Bread} is frequent, so is {Coffee,
Milk}
i.e., every transaction having {Coffee, Milk, Bread}
also contains {Coffee, Milk}
9. 9
Apriori: A Candidate Generation & Test Approach
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k
frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
12. 12
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that
are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
13. 13
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
18. 18
Construct FP-tree from a Transaction Database
{}
f:4 c:1
b:1
p:1
b:1
c:3
a:3
b:1
m:2
p:2 m:1
Header Table
Item frequency head
f 4
c 4
a 3
b 3
m 3
p 3
min_support = 3
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
1. Scan DB once, find
frequent 1-itemset (single
item pattern)
2. Sort frequent items in
frequency descending
order, f-list
3. Scan DB again, construct
FP-tree
19.
20.
21. 21
Find Patterns Having P From P-conditional Database
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item
p
Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
Conditional pattern bases
itemcond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
{}
f:4 c:1
b:1
p:1
b:1
c:3
a:3
b:1
m:2
p:2 m:1
Header Table
Item frequency head
f 4
c 4
a 3
b 3
m 3
p 3
22. Practice Question
Implement and analyse Association rules for the following dataset with
min_support = 60% and confidence =80%
Find all the frequent item sets using Apriori Algorithm. List the strong association
rules.
23. Generate Fp-Tree to find frequent
itemsets for following dataset
Tid Items
1 A,B,E
2 B,D
3 B,C
4 A,B,D
5 A,C
6 B,C
7 A,C
8 A,B,C,E
9 A,B,C
Conditional
Pattern Base
Conditional FP Tree Frequent Pattern
Generated
24. 24
Benefits of the FP-tree Structure
Completeness
Preserve complete information for frequent pattern
mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Items in frequency descending order: the more
frequently occurring, the more likely to be shared
Never be larger than the original database (not count
node-links and the count field)
25. 25
The Frequent Pattern Growth Mining Method
Idea: Frequent pattern growth
Recursively grow frequent patterns by pattern and
database partition
Method
For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
Repeat the process on each newly created conditional
FP-tree
Until the resulting FP-tree is empty, or it contains only
one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
26. Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data is unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in the
data
27. Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
Numeric Prediction
models continuous-valued functions, i.e., predicts unknown or
missing values
Typical applications
Credit/loan approval:
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category it is
Prediction Problems: Classification vs. Numeric
Prediction
28. Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by
the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees, or mathematical
formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the classified result from
the model
Accuracy rate is the percentage of test set samples that are correctly
classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called validation (test) set
29. Process (1): Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classifier
(Model)
30. Process (2): Using the Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
31.
32. Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct
— prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
34. Bayes’ Theorem: Basics
Total probability Theorem:
Bayes’ Theorem:
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., posteriori probability):
the probability that the hypothesis holds given the observed data
sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given
that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
)
(
)
1
|
(
)
(
i
A
P
M
i i
A
B
P
B
P
)
(
/
)
(
)
|
(
)
(
)
(
)
|
(
)
|
( X
X
X
X
X P
H
P
H
P
P
H
P
H
P
H
P
35. Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a
hypothesis H, P(H|X), follows the Bayes’ theorem
Informally, this can be viewed as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
)
(
/
)
(
)
|
(
)
(
)
(
)
|
(
)
|
( X
X
X
X
X P
H
P
H
P
P
H
P
H
P
H
P
37. NAÏVE BAYES EXAMPLE
(OUTLOOK, TEMP)---PLAY(YES/NO)
YES NO P(YES) P(NO)
SUNNY 2 3
OVERCAST 4 0
RAINY 3 2
TOTAL
YES NO P(YES) P(NO)
HOT 2 2
MILD 4 2
COLD 3 1
TOTAL
OUTLOOK
TEMPERATURE
YES 9
NO 5
PLAYS TENNIS YES OR NO
38. Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes
Credit_rating = Fair)
age income student
credit_rating
buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
39. Naïve Bayes Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
age income student
credit_rating
buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
40.
41. Decision Tree Induction: An Example
age?
overcast
student? credit rating?
<=30 >40
no yes yes
yes
31..40
fair
excellent
yes
no
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Training data set: Buys_computer
Resulting tree:
42. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
45. Attribute Selection Measure:
Information Gain (ID3)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Information needed (after using A to split D into v partitions) to classify
D:
Information gained by branching on attribute A
)
(
log
)
( 2
1
i
m
i
i p
p
D
Info
)
(
|
|
|
|
)
(
1
j
v
j
j
A D
Info
D
D
D
Info
(D)
Info
Info(D)
Gain(A) A