Association Rule & Apriori Algorithm
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database
and data mining community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how items purchased by
customers are related.
Bread → Milk [sup = 5%, conf = 100%]
What Is Association Mining?
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Given a set of transactions, find rules that will predict the occurrence of an
item based on the occurrences of other items in the transaction.
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication means co-occurrence,
not causality!
Association rule mining
Basket Data
Retail organizations, e.g., supermarkets, collect and store massive
amounts of sales data, called basket data.
A record consists of
a transaction date
the items bought
Alternatively, basket data may consist of items bought by a customer over a
period.
• Items frequently purchased together:
Bread → PeanutButter
Example Association Rule
90% of transactions that purchase bread and butter also purchase milk
“IF” part = antecedent
“THEN” part = consequent
“Item set” = the items (e.g., products) comprising the antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in common)
Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID (transaction ID)
• A transactional dataset: A set of transactions
Transaction data: a set of documents
• A text document data set. Each document is treated as a “bag” of
keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
Definition: Frequent Itemset
• Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
• Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
• Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
• An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
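To make these definitions concrete, here is a minimal Python sketch (the name `support_count` is our own, not from the slides) that computes the support count σ and the support fraction s over the five transactions in the table above:

```python
# The five market-basket transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): the number of transactions that contain every item of X."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))                      # 2
print(support_count(X, transactions) / len(transactions))  # 0.4, i.e. 2/5
```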
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t :
• t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
• I: itemset
{cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}
• T: set of transactions
1 {{cucumber, parsley, onion, tomato, salt, bread},
2 {tomato, cucumber, parsley},
3 {tomato, cucumber, olives, onion, parsley},
4 {tomato, cucumber, onion, bread},
5 {tomato, salt, onion},
6 {bread, cheese},
7 {tomato, cheese, cucumber},
8 {bread, butter}}
The model: Association rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset
Rule strength measures
• Support: The rule holds with support sup in T (the transaction data set) if
sup% of transactions contain X ∪ Y.
• sup = probability that a transaction contains X ∪ Y, i.e., Pr(X ∪ Y)
(Percentage of transactions that contain X ∪ Y)
• Confidence: The rule holds in T with confidence conf if conf% of transactions
that contain X also contain Y.
• conf = conditional probability that a transaction containing X also contains Y,
i.e., Pr(Y | X)
(Ratio of the number of transactions that contain X ∪ Y to the number that
contain X)
• An association rule is a pattern that states that when X occurs, Y occurs with
a certain probability.
Support and Confidence
• Support count: The support count of an itemset X, denoted by X.count, in a
data set T is the number of transactions in T that contain X. Assume T has n
transactions.
• Then,
$\text{support} = \frac{(X \cup Y).count}{n}$
$\text{confidence} = \frac{(X \cup Y).count}{X.count}$
Goal: Find all rules that satisfy the user-specified minimum support (minsup)
and minimum confidence (minconf).
Example:
{Milk, Diaper} → Beer
$s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$
$c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67$
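Reusing `transactions` and `support_count` from the earlier sketch, the same two numbers fall out directly:

```python
# Reusing `transactions` and `support_count` from the sketch above.
antecedent, consequent = {"Milk", "Diaper"}, {"Beer"}

s = support_count(antecedent | consequent, transactions) / len(transactions)
c = (support_count(antecedent | consequent, transactions)
     / support_count(antecedent, transactions))
print(s, round(c, 2))  # 0.4 0.67
```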
• Association Rule
An implication expression of the form
X → Y, where X and Y are itemsets
Example:
{Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
Support (s)
• Fraction of transactions that contain both X and Y
Confidence (c)
• Measures how often items in Y appear in transactions that contain X
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition: Association Rule
Can minimum support and minimum confidence be determined automatically
when mining association rules?
• For the minimum support, it all depends on the dataset. Usually, you may start
with a high value, and then decrease it until you find a value that generates
enough patterns.
• For the minimum confidence, it is a little easier because it represents the
confidence you want in the rules. So usually, something like 60% is used. But it
also depends on the data.
• In terms of performance, when minsup is higher you will find fewer patterns and
the algorithm is faster. For minconf, when it is set higher, there will be fewer
patterns, but the algorithm may not be faster because many algorithms don't use
minconf to prune the search space. So setting these parameters also depends on
how many rules you want.
An example
• Transaction data
• Assume:
minsup = 30%
minconf = 80%
• An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
t1: Bread, Chicken, Milk
t2: Bread, Cheese
t3: Cheese, Boots
t4: Bread, Chicken, Cheese
t5: Bread, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
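As a quick check, a few lines of Python (the `count` helper is our own) reproduce the support and confidence claimed for {Chicken, Clothes, Milk} over t1–t7:

```python
# Transactions t1..t7 from this slide.
T = [
    {"Bread", "Chicken", "Milk"},
    {"Bread", "Cheese"},
    {"Cheese", "Boots"},
    {"Bread", "Chicken", "Cheese"},
    {"Bread", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def count(X):
    """Support count: transactions containing every item of X."""
    return sum(1 for t in T if X <= t)

itemset = {"Chicken", "Clothes", "Milk"}
print(count(itemset), "/", len(T))              # 3 / 7 -> sup = 3/7 >= 30%
print(count(itemset), "/", count({"Clothes"}))  # 3 / 3 -> conf = 100%
```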
Basic Concept: Association Rules
• Let min_support = 50%, min_conf = 50%:
• A → C (50%, 66.7%)
• C → A (50%, 100%)
[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]
Transaction-id Items bought
10 A, B, C
20 A, C
30 A, D
40 B, E, F
Frequent pattern Support
{A} 75%
{B} 50%
{C} 50%
{A, C} 50%
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all
rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
• Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Frequent Itemset Generation
• Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database
• Match each transaction against every candidate
• Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
[Figure: each of the N transactions (of width w) is matched against the list of M candidate itemsets]
Brute-force approach:
Given d items, there are 2^d possible candidate itemsets
Computational Complexity
• Given d unique items:
• Total number of itemsets = $2^d$
• Total number of possible association rules:
$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$
If d = 6, R = 602 rules
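The closed form can be sanity-checked in Python; `total_rules` is a hypothetical helper that evaluates the double sum directly:

```python
from math import comb

def total_rules(d):
    """Evaluate R = sum_{k=1}^{d-1} C(d,k) * sum_{j=1}^{d-k} C(d-k,j)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(total_rules(6))   # 602
print(3**6 - 2**7 + 1)  # 602 -- agrees with the closed form 3^d - 2^(d+1) + 1
```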
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
If an itemset is frequent, each of its subsets is frequent as well.
• This property belongs to a special category of properties called antimonotonicity, in the
sense that if a set cannot pass a test, all of its supersets will fail the same test as well.
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
• An itemset X is closed in a data set D if there exists no proper super-
itemset Y* such that Y has the same support count as X in D.
*(Y is a proper super-itemset of X if X is a proper sub-itemset of Y, that is, if X ⊂ Y. In other
words, every item of X is contained in Y, but there is at least one item of Y that is not in X.)
• An itemset X is a closed frequent itemset in set D if X is both closed and
frequent in D.
• An itemset X is a maximal frequent itemset (or max-itemset) in a data set
D if X is frequent, and there exists no super-itemset Y such that X ⊂ Y and
Y is frequent in D.
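A minimal sketch of these two definitions, assuming a hypothetical `frequent_counts` dictionary that maps every frequent itemset (as a frozenset) to its support count:

```python
# Hypothetical support counts for the frequent itemsets of some data set D.
frequent_counts = {
    frozenset({"a2"}): 3, frozenset({"a5"}): 3,
    frozenset({"a2", "a5"}): 3,
}

def is_closed(X, counts):
    """X is closed: no proper super-itemset Y has the same support count."""
    return not any(X < Y and counts[Y] == counts[X] for Y in counts)

def is_maximal(X, counts):
    """X is maximal: no proper super-itemset Y is frequent at all."""
    return not any(X < Y for Y in counts)

X = frozenset({"a2"})
# {a2} is neither closed nor maximal: {a2, a5} is frequent with the same count.
print(is_closed(X, frequent_counts), is_maximal(X, frequent_counts))  # False False
```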
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
• Complete search: M = 2^d
• Use pruning techniques to reduce M
• Reduce the number of transactions (N)
• Reduce size of N as the size of itemset increases
• Used by DHP (Direct Hashing and Pruning) and vertical-based mining
algorithms
• Reduce the number of comparisons (NM)
• Use efficient data structures to store the candidates or transactions
• No need to match every candidate against every transaction
Many mining algorithms
• There are a large number of them!!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same.
• Given a transaction data set T, a minimum support, and a minimum
confidence, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules although their
computational efficiencies and memory requirements may be different.
• We study only one: the Apriori Algorithm
• The algorithm uses a level-wise search, where k-itemsets are used to explore
(k+1)-itemsets
• In this algorithm, frequent subsets are extended one item at a time (this step
is known as the candidate generation process)
• Then groups of candidates are tested against the data.
• It identifies the frequent individual items in the database and extends them
to larger and larger item sets as long as those itemsets appear sufficiently
often in the database.
• The Apriori algorithm determines frequent itemsets that can be used to
derive association rules, which highlight general trends in the database.
The Apriori algorithm
• The Apriori algorithm takes advantage of the fact that any subset of a
frequent itemset is also a frequent itemset.
• i.e., if {l1,l2} is a frequent itemset, then {l1} and {l2} should be frequent itemsets.
• The algorithm can therefore reduce the number of candidates being
considered by exploring only the itemsets whose support count is greater
than the minimum support count.
• Any candidate itemset can be pruned if it has an infrequent subset.
The Apriori algorithm
• So we build a Candidate list of k-itemsets and then extract a
Frequent list of k-itemsets using the support count.
• After that, we use the Frequent list of k-itemsets to determine
the Candidate and Frequent lists of (k+1)-itemsets.
• We use Pruning to do that.
• We repeat until we have an empty Candidate or Frequent list of k-
itemsets.
• Then we return the list of (k-1)-itemsets.
How do we do that?
KEY CONCEPTS
• Frequent Itemsets: All the itemsets whose support meets the minimum
support (denoted by Lk for the set of frequent k-itemsets).
• Apriori Property: Any subset of frequent itemset must be frequent.
• Join Operation: To find Lk , a set of candidate k-itemsets is generated by
joining Lk-1 with itself.
APRIORI ALGORITHM EXAMPLE
The Apriori Algorithm: Pseudo Code
Apriori’s Candidate Generation
• For k=1, C1 = all 1-itemsets.
• For k>1, generate Ck from Lk-1 as follows:
– The join step
Ck = (k-2)-way join of Lk-1 with itself
If both {a1, …,ak-2, ak-1} & {a1, …, ak-2, ak} are in Lk-1,
then add {a1, …,ak-2, ak-1, ak} to Ck
(We keep items sorted).
– The prune step
Remove {a1, …,ak-2, ak-1, ak} if it contains a non-frequent (k-1) subset
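A compact Python sketch of this join-and-prune step (the function name `apriori_gen` follows the classic pseudocode; representing itemsets as sorted tuples is our implementation choice):

```python
def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Itemsets are sorted tuples; L_prev is a set of such tuples.
    """
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # Join step: first k-2 items equal, last items in sorted order.
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                c = a + (b[k - 2],)
                # Prune step: drop c if any (k-1)-subset is not frequent.
                if all(c[:i] + c[i + 1:] in L_prev for i in range(k)):
                    candidates.add(c)
    return candidates
```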
Example – Finding frequent itemsets
Dataset D
TID Items
T100 a1 a3 a4
T200 a2 a3 a5
T300 a1 a2 a3 a5
T400 a2 a5
1. scan D C1: a1:2, a2:3, a3:3, a4:1, a5:3
 L1: a1:2, a2:3, a3:3, a5:3
 C2: a1a2, a1a3, a1a5, a2a3, a2a5, a3a5
2. scan D  C2: a1a2:1, a1a3:2, a1a5:1, a2a3:2, a2a5:3,
a3a5:2
 L2: a1a3:2, a2a3:2, a2a5:3, a3a5:2
 C3: a1a2a3, a2a3a5
 Pruned C3: a1a2a3
3. scan D  L3: a2a3a5:2
minSup=0.5
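Wrapping `apriori_gen` from the sketch above in a level-wise loop reproduces this trace (L1, L2, L3) on dataset D; the driver is our own minimal version, not the deck's full pseudocode:

```python
D = [{"a1", "a3", "a4"},
     {"a2", "a3", "a5"},
     {"a1", "a2", "a3", "a5"},
     {"a2", "a5"}]
min_count = 2  # minSup = 0.5 over 4 transactions

def count(c):
    """Support count of a candidate tuple c."""
    return sum(1 for t in D if set(c) <= t)

L = {(i,) for t in D for i in t if count((i,)) >= min_count}  # L1
k = 2
while L:
    print(sorted(L))      # prints L1, then L2, then L3 = {('a2','a3','a5')}
    C = apriori_gen(L, k)  # join + prune, as sketched above
    L = {c for c in C if count(c) >= min_count}
    k += 1
```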
The order of items can make a difference in the process
Dataset D
TID Items
T100 1 3 4
T200 2 3 5
T300 1 2 3 5
T400 2 5
minSup=0.5
1. scan D ⇒ C1: 1:2, 2:3, 3:3, 4:1, 5:3
⇒ L1: 1:2, 2:3, 3:3, 5:3
⇒ C2: 12, 13, 15, 23, 25, 35
2. scan D ⇒ C2: 12:1, 13:2, 15:1, 23:2, 25:3, 35:2
Suppose the order of items is: 5, 4, 3, 2, 1
⇒ L2: 31:2, 32:2, 52:3, 53:2
⇒ C3: 321, 532
⇒ Pruned C3: 532 (321 is removed because its subset 21 is infrequent)
3. scan D ⇒ L3: 532:2
Generating Association Rules
From frequent itemsets
• Procedure 1:
• Suppose we have the list of frequent itemsets
• Generate all nonempty subsets for each frequent itemset I
• For I = {1,3,5}, all nonempty subsets are {1,3},{1,5},{3,5},{1},{3},{5}
• For I = {2,3,5}, all nonempty subsets are {2,3},{2,5},{3,5},{2},{3},{5}
• Procedure 2:
• For every nonempty subset S of I, output the rule:
S → (I - S)
• If support_count(I)/support_count(S) >= min_conf,
where min_conf is the minimum confidence threshold
• Let us assume:
• minimum confidence threshold is 60%
Generating Association Rules
From frequent itemsets
Association Rules with confidence
• R1 : 1,3 -> 5
– Confidence = sc{1,3,5}/sc{1,3} = 2/3 = 66.66% (R1 is selected)
• R2 : 1,5 -> 3
– Confidence = sc{1,5,3}/sc{1,5} = 2/2 = 100% (R2 is selected)
• R3 : 3,5 -> 1
– Confidence = sc{3,5,1}/sc{3,5} = 2/3 = 66.66% (R3 is selected)
• R4 : 1 -> 3,5
– Confidence = sc{1,3,5}/sc{1} = 2/3 = 66.66% (R4 is selected)
• R5 : 3 -> 1,5
– Confidence = sc{3,1,5}/sc{3} = 2/4 = 50% (R5 is REJECTED)
• R6 : 5 -> 1,3
– Confidence = sc{5,1,3}/sc{5} = 2/4 = 50% (R6 is REJECTED)
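A short sketch of Procedure 2 that reproduces these six confidences; the support-count table `sc` is back-solved from the fractions shown above, so treat it as assumed data rather than part of the deck's dataset:

```python
from itertools import combinations

# Support counts implied by the confidences on this slide (assumed data).
sc = {frozenset({1, 3, 5}): 2, frozenset({1, 3}): 3, frozenset({1, 5}): 2,
      frozenset({3, 5}): 3, frozenset({1}): 3, frozenset({3}): 4,
      frozenset({5}): 4}

I, min_conf = frozenset({1, 3, 5}), 0.60

for r in range(1, len(I)):                       # every nonempty proper subset S
    for S in map(frozenset, combinations(I, r)):
        conf = sc[I] / sc[S]                     # conf(S -> I - S)
        verdict = "selected" if conf >= min_conf else "REJECTED"
        print(f"{set(S)} -> {set(I - S)}: conf = {conf:.2%} ({verdict})")
```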
How to efficiently generate rules?
• In general, confidence does not have an anti-monotone property:
c(ABC→D) can be larger or smaller than c(AB→D)
• But confidence of rules generated from the same itemset has an anti-monotone property
• e.g., L = {A,B,C,D}:
c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
Rule generation for Apriori Algorithm
[Figure: lattice of rules generated from a frequent itemset; when a rule is found to have low confidence, the rules below it in the lattice are pruned]
• A candidate rule is generated by merging two rules that share the same
prefix in the rule consequent
• join (CD=>AB, BD=>AC)
would produce the candidate rule,
D=>ABC
• Prune rule D=>ABC if its subset
AD=>BC does not have high confidence
Rule generation for Apriori Algorithm
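A minimal sketch of the merge just described; `merge_consequents` is a hypothetical helper that joins two rules from the same itemset whose consequents share a common prefix (itemsets as sorted tuples):

```python
def merge_consequents(itemset, c1, c2):
    """Join two rules from `itemset` whose m-item consequents share the
    first m-1 items; return the merged (antecedent, consequent) or None."""
    if c1[:-1] == c2[:-1] and c1[-1] < c2[-1]:
        consequent = c1 + (c2[-1],)
        antecedent = tuple(sorted(set(itemset) - set(consequent)))
        return antecedent, consequent
    return None

# join(CD=>AB, BD=>AC): the consequents AB and AC share the prefix A,
# so the merge yields D=>ABC, exactly as on the slide.
print(merge_consequents(("A", "B", "C", "D"), ("A", "B"), ("A", "C")))
# (('D',), ('A', 'B', 'C'))
```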