2. • Data Mining: uncovering and discovering hidden and potentially useful information from your data
• Descriptive Information
– Find patterns that are human-interpretable
– Ex: Clustering, Association Rule Mining
• Predictive Information
– Find the value of an attribute using the values of other attributes
– Ex: Classification, Regression
Descriptive & Predictive Information/Model
3. • A typical and widely used example of an Association Rules application is market basket analysis
– Frequent patterns are patterns that appear frequently in a data set
– Ex: Milk and Bread, which appear together frequently in a transaction data set
• Other names: Frequent Item Set Mining, Association Rule Mining, Market Basket Analysis, Link Analysis, etc.
Introduction
5. • Association Rules: describe association relationships among the attributes in a set of relevant data
• Goal: to find relationships between objects which are frequently used together
• Association rule mining first finds all sets of items (itemsets) that have support greater than the minimum support
– It then uses those large itemsets to generate the desired rules that have confidence greater than the minimum confidence
Association Rules
6. • If the customer buys milk then he may also buy cereal, or if the customer buys a tablet computer then he may also buy a case (cover)
• There are two basic criteria that association rules use: Support and Confidence
– They identify the relationships and rules generated by analysing data for frequent if/then patterns
– Association rules usually need to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time
7. • Rule: X ⇒ Y
• Support: probability that a transaction contains both X and Y = applicability of the rule
– Support = P(X ∪ Y) = freq(X, Y) / N, where N is the number of transactions
• Confidence: conditional probability that a transaction containing X also contains Y = strength of the rule
– Confidence = freq(X, Y) / freq(X)
• Coverage = support; Accuracy = confidence
Concepts
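To make these two formulas concrete, here is a minimal Python sketch (our own helper names, not from the lecture) that computes support and confidence over transactions represented as frozensets:

```python
from typing import FrozenSet, List

Items = FrozenSet[str]

def freq(itemset: Items, transactions: List[Items]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x: Items, y: Items, transactions: List[Items]) -> float:
    """Support of X => Y: fraction of all N transactions containing X and Y."""
    return freq(x | y, transactions) / len(transactions)

def confidence(x: Items, y: Items, transactions: List[Items]) -> float:
    """Confidence of X => Y: freq(X, Y) / freq(X)."""
    return freq(x | y, transactions) / freq(x, transactions)
```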
8. • Both Confidence and Support should be large
• By convention, confidence and support values are written as percentages (%)
• Item Set: a set of items
• k-Item Set: an item set that contains k items
– {A,B} is a 2-item set
• Frequency, Support Count, Count: the number of transactions that contain the item set
• Frequent Itemsets: itemsets that occur frequently (more often than the minimum support)
– A set of all items in a store: I = {i1, i2, i3, …, im}
– A set of all transactions (transaction database T): T = {t1, t2, t3, t4, …, tn}
• Each ti is a set of items such that ti ⊆ I
• Each transaction ti has a transaction ID (TID)
Concepts
10. Example:

TID   Items Bought
2000  A, B, C
1000  A, C
4000  A, D
5000  B, E, F

Minimum Support = 50%, Minimum Confidence = 50%
A ⇒ C (Sup = 50%, Conf = 66.6%)
C ⇒ A (Sup = 50%, Conf = 100%)
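Using the helpers sketched above, the two rules on this slide can be checked directly: freq(A, C) = 2 out of N = 4 transactions, freq(A) = 3, and freq(C) = 2.

```python
transactions = [frozenset("ABC"), frozenset("AC"),
                frozenset("AD"), frozenset("BEF")]
A, C = frozenset("A"), frozenset("C")

print(support(A, C, transactions))     # 0.5      -> Sup = 50%
print(confidence(A, C, transactions))  # 0.666... -> Conf(A => C) = 66.6%
print(confidence(C, A, transactions))  # 1.0      -> Conf(C => A) = 100%
```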
11. • Naïve method for finding association rules:
– Use the separate-and-conquer method
– Treat every possible combination of attribute values as a separate class
• Two problems:
– Computational complexity
– The resulting number of rules (which would have to be pruned on the basis of support and confidence)
• But: we can look for high-support rules directly!
Association Rules
12. • Support: the number of instances correctly covered by an association rule
– The same as the number of instances covered by all tests in the rule (LHS and RHS!)
• Item: one test/attribute-value pair
• Item set: all items occurring in a rule
• Goal: only rules that exceed a pre-defined support
– ⇒ Do it by finding all item sets with the given minimum support and generating rules from them!
Item Sets
13. Example: Weather data

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
14. Item sets for weather data (with minimum support of two):

1-item sets: Outlook = Sunny (5); Temperature = Cool (4); …
2-item sets: Outlook = Sunny, Temperature = Hot (2); Outlook = Sunny, Humidity = High (3); …
3-item sets: Outlook = Sunny, Temperature = Hot, Humidity = High (2); Outlook = Sunny, Humidity = High, Windy = False (2); …
4-item sets: Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2); Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2); …

In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets, and 0 five-item sets
15. • Once all item sets with minimum support have been generated, we can turn them into rules
• Example:
– Humidity = Normal, Windy = False, Play = Yes (4)
• Seven (2^N − 1, with N = 3 items) potential rules; each fraction is the rule's confidence:
– If Humidity = Normal and Windy = False then Play = Yes 4/4
– If Humidity = Normal and Play = Yes then Windy = False 4/6
– If Windy = False and Play = Yes then Humidity = Normal 4/6
– If Humidity = Normal then Windy = False and Play = Yes 4/7
– If Windy = False then Play = Yes and Humidity = Normal 4/8
– If Play = Yes then Humidity = Normal and Windy = False 4/9
– If – then Humidity = Normal and Windy = False and Play = Yes 4/14
Generating rules from an item set
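A short sketch of this enumeration (our own code; the weather items would be encoded as attribute=value strings): every subset s of an item set I, including the empty set but excluding I itself, yields one candidate rule s ⇒ I − s, giving the 2^N − 1 rules above, each with confidence count(I) / count(s).

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions):
    """Yield every candidate rule (lhs, rhs, confidence) from one item set."""
    count_i = sum(1 for t in transactions if itemset <= t)
    for k in range(len(itemset)):        # LHS sizes 0 .. N-1 => 2^N - 1 rules
        for lhs in map(frozenset, combinations(itemset, k)):
            # the empty LHS covers every transaction, as in the last rule above
            count_lhs = sum(1 for t in transactions if lhs <= t)
            yield lhs, itemset - lhs, count_i / count_lhs
```

For the three-item set above this yields exactly the seven rules listed, with confidences ranging from 4/4 down to 4/14.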
16. • Rules with support ≥ 2 and confidence = 100%:
Rules for weather data

    Association rule                                 Sup.  Conf.
1   Humidity=Normal, Windy=False ⇒ Play=Yes          4     100%
2   Temperature=Cool ⇒ Humidity=Normal               4     100%
3   Outlook=Overcast ⇒ Play=Yes                      4     100%
4   Temperature=Cool, Play=Yes ⇒ Humidity=Normal     3     100%
…   …                                                …     …
58  Outlook=Sunny, Temperature=Hot ⇒ Humidity=High   2     100%

• In total:
– 3 rules with support four
– 5 with support three
– 50 with support two
17. • Item set:
– Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)
• Resulting rules (all with 100% confidence):
– Temperature = Cool, Windy = False ⇒ Humidity = Normal, Play = Yes
– Temperature = Cool, Windy = False, Humidity = Normal ⇒ Play = Yes
– Temperature = Cool, Windy = False, Play = Yes ⇒ Humidity = Normal
• Due to the following “frequent” item sets:
– Temperature = Cool, Windy = False (2)
– Temperature = Cool, Humidity = Normal, Windy = False (2)
– Temperature = Cool, Windy = False, Play = Yes (2)
Example rules from the same set
18. • How can we efficiently find all frequent item sets?
• Finding one-item sets is easy
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
– If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
– In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
⇒ Compute k-item sets by merging (k−1)-item sets
Generating item sets efficiently
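A minimal sketch of this merge-and-prune step (our own function name; it assumes item sets are kept in sorted order, matching the "lexicographically ordered" note later in these slides): two (k−1)-item sets are joined when they share their first k−2 items, and a resulting k-item candidate is kept only if all of its (k−1)-subsets are frequent.

```python
from itertools import combinations

def generate_candidates(prev_frequent):
    """Merge frequent (k-1)-item sets (a set of frozensets, all the same size)
    into candidate k-item sets, pruning any candidate that has an infrequent
    (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    k_minus_1 = len(prev[0]) if prev else 0
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:                        # join: same (k-2)-prefix
                cand = frozenset(a) | frozenset(b)
                if all(frozenset(sub) in prev_frequent  # prune infrequent subsets
                       for sub in combinations(sorted(cand), k_minus_1)):
                    candidates.add(cand)
    return candidates
```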
19. • Given: five three-item sets
– (A B C), (A B D), (A C D), (A C E), (B C D)
• Candidate four-item sets:
– (A B C D) OK, because (A C D) and (B C D) are frequent
– (A C D E) Not OK, because (C D E) is not frequent
• Final check by counting instances in the dataset!
• (k−1)-item sets are stored in a hash table
Example
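Running the sketch above on the five three-item sets of this slide gives exactly the result described:

```python
f3 = {frozenset(s) for s in ["ABC", "ABD", "ACD", "ACE", "BCD"]}
print(generate_candidates(f3))
# {frozenset({'A', 'B', 'C', 'D'})}
# (A C D E) never survives: its subset (C D E) is not frequent
```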
20. • 2 steps:
– Find all itemsets that have minimum support (frequent itemsets, also called large itemsets)
– Use the frequent itemsets to generate rules
• Key idea: every subset of a frequent itemset must also be a frequent itemset
– If {I1, I2} is a frequent itemset, then {I1} and {I2} must also be frequent itemsets
• An iterative approach to find frequent itemsets
Apriori Algorithm
21. Apriori Algorithm Example 2 (Minimum Support Count = 2):

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5
500  1 3 5

Candidate List of 1-itemsets:
Itemset  Support
{1}      3
{2}      3
{3}      4
{4}      1
{5}      4

Frequent List of 1-itemsets ({4} dropped: support < 2):
Itemset  Support
{1}      3
{2}      3
{3}      4
{5}      4

Candidate List of 2-itemsets:
Itemset  Support
{1,2}    1
{1,3}    3
{1,5}    2
{2,3}    2
{2,5}    3
{3,5}    3

Frequent List of 2-itemsets ({1,2} dropped: support < 2):
Itemset  Support
{1,3}    3
{1,5}    2
{2,3}    2
{2,5}    3
{3,5}    3

A subset of a frequent itemset must also be a frequent itemset
22. Apriori Algorithm Example (Minimum Support Count = 2):

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5
500  1 3 5

Frequent List of 2-itemsets:
Itemset  Support
{1,3}    3
{1,5}    2
{2,3}    2
{2,5}    3
{3,5}    3

Candidate List of 3-itemsets (a subset of a frequent itemset must also be a frequent itemset):
Itemset  2-subsets          All in F2?
{1,2,3}  {1,2},{1,3},{2,3}  No
{1,2,5}  {1,2},{1,5},{2,5}  No
{1,3,5}  {1,3},{1,5},{3,5}  Yes
{2,3,5}  {2,3},{2,5},{3,5}  Yes

Frequent List of 3-itemsets:
Itemset  Support
{1,3,5}  2
{2,3,5}  2
23. Apriori Algorithm Example (Minimum Support Count = 2):

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5
500  1 3 5

Frequent List of 3-itemsets:
Itemset  Support
{1,3,5}  2
{2,3,5}  2

Candidate List of 4-itemsets (a subset of a frequent itemset must also be a frequent itemset):
Itemset    3-subsets                        All in F3?
{1,2,3,5}  {1,2,3},{1,2,5},{1,3,5},{2,3,5}  No

Itemset    Support
{1,2,3,5}  1

Frequent List of 4-itemsets: empty (no candidate has large enough support), so we stop here
24. • The Apriori algorithm takes advantage of the fact that any subset of a frequent itemset is also a frequent itemset
• The algorithm can therefore reduce the number of candidates being considered by only exploring the itemsets whose support count meets the minimum support count
• Any itemset that has an infrequent subset can be pruned as infrequent
Apriori Algorithm
25. • Build a Candidate List of k-itemsets and then extract a Frequent List of k-itemsets using the support count
• After that, we use the Frequent List of k-itemsets to determine the Candidate and Frequent Lists of (k+1)-itemsets
• We use pruning to do that
• We repeat until we have an empty Candidate or Frequent List of k-itemsets
– Then we return the lists of itemsets found up to size k−1
Algorithm
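Combining this loop with the `generate_candidates` sketch from earlier, a compact (unoptimized) Apriori might look like the following; running it on the five-transaction example reproduces the frequent lists of slides 21–23:

```python
def apriori(transactions, min_count):
    """Return every frequent itemset as a dict {itemset: support count}."""
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while candidates:
        # count support of each candidate and keep those meeting min_count
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        candidates = generate_candidates(set(level))  # next k from frequent k-1
    return frequent

transactions = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
                frozenset({1, 2, 3, 5}), frozenset({2, 5}),
                frozenset({1, 3, 5})]
for itemset, count in apriori(transactions, 2).items():
    print(sorted(itemset), count)
```

Note that with the prefix-based join, the candidate {1,2,3,5} of slide 23 is never even generated, because {1,3,5} and {2,3,5} do not share a two-item prefix; the slides generate it and then discard it, arriving at the same empty Frequent List of 4-itemsets.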
26. • Now we have the list of frequent itemsets
• Generate all nonempty proper subsets for each frequent itemset I
– For I = {1,3,5}, all nonempty proper subsets are {1,3},{1,5},{3,5},{1},{3},{5}
– For I = {2,3,5}, all nonempty proper subsets are {2,3},{2,5},{3,5},{2},{3},{5}
Generate Association Rules

Frequent List of 3-itemsets:
Itemset  Support
{1,3,5}  2/5
{2,3,5}  2/5
27. • For a rule X ⇒ Y, Confidence = freq(X, Y) / freq(X)
• For every nonempty proper subset s of I, output the rule:
s ⇒ (I − s)
if Confidence ≥ min_confidence,
where min_confidence is the minimum confidence threshold
• Let us assume the minimum confidence threshold is 60%
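A sketch of this final step (our own code, reusing `apriori` and `transactions` from above): for each frequent itemset I of size ≥ 2, every nonempty proper subset s is tried as a rule body, and the rule is kept when freq(I) / freq(s) ≥ min_confidence. freq(s) can simply be looked up in the frequent dict, since every subset of a frequent itemset is itself frequent.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Yield (lhs, rhs, confidence) for rules s => I - s with conf >= min_conf."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):        # nonempty proper subsets s
            for s in map(frozenset, combinations(itemset, k)):
                conf = count / frequent[s]      # freq(I) / freq(s)
                if conf >= min_conf:
                    yield sorted(s), sorted(itemset - s), conf

for lhs, rhs, conf in generate_rules(apriori(transactions, 2), 0.6):
    print(f"{lhs} => {rhs}  conf = {conf:.0%}")
```

For I = {1,3,5}, for example, {1} ⇒ {3,5} survives (conf = 2/3 ≈ 67%), while {3} ⇒ {1,5} does not (conf = 2/4 = 50%).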
Data mining is normally done using a data warehouse.
Hidden information: information that is not very obvious, not directly visible, but interpreted in some manner; it is discovered.
Information in the data mining process is categorized into two broad categories.
Descriptive information is something a human can understand.
Association rules are like classification rules
The way supermarkets are designed, the way the layout is designed, even the way catalogs are designed, is based on association rules.
If X implies Y, that means if a customer buys item X then he will also buy item Y.
N = number of transactions.
Support: the frequency of X and Y appearing together, over N.
Confidence: the frequency of X and Y appearing together, over the frequency of X (how many times X was bought).
A transaction is a subset of the items; it can contain one or more items.
Terms:
Transaction database: transactions 1 to 5.
Item database: items A, B, C, D, and E.
If a customer buys item A, then he also buys item D.
It should appear at least 2 times.
The first row of the table, for example, shows that there are five days when outlook = sunny, two of which have temperature = hot; in fact, on both of those days humidity = high and play = no as well.
Total number of instances.
Once all item sets with the required coverage have been generated, the next step is to turn each into a rule, or a set of rules.
Some item sets will produce more than one rule; others will produce none.
Coverage: for the final rule (no antecedent), the denominator is the total number of instances.
Final rules for weather data: 58 rules.
A 4-item set which has coverage 2.
Lexicographically ordered!
Ex: a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset:
Clothes ⇒ Milk, Chicken [sup = 3/7, conf = 3/3]
A combination of a frequent item set; order is not important.
Use the support count as a fraction, so we get a value between 0 and 1.
All possible combinations.
So we stop here.
If support is large enough we count FI4, and we check the FIs of the previous itemsets.
Pruning step.