Association Rule Basics
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Lecture outline



   What is association rule mining?
   Frequent itemsets, support, and confidence
   Mining association rules
   The “Apriori” algorithm
   Rule generation




                    Prof. Pier Luca Lanzi
Example of market-basket transactions

 Basket 1: Bread, Peanuts, Milk, Fruit, Jam
 Basket 2: Bread, Jam, Soda, Chips, Milk, Fruit
 Basket 3: Steak, Jam, Soda, Chips, Bread
 Basket 4: Jam, Soda, Peanuts, Milk, Fruit
 Basket 5: Jam, Soda, Chips, Milk, Bread
 Basket 6: Fruit, Soda, Chips, Milk
 Basket 7: Fruit, Soda, Peanuts, Milk
 Basket 8: Fruit, Peanuts, Cheese, Yogurt
What is association mining?



 Finding frequent patterns, associations, correlations, or
  causal structures among sets of items or objects in
  transaction databases, relational databases, and other
  information repositories

 Applications
    Basket data analysis
    Cross-marketing
    Catalog design
    …




What is association rule mining?

TID  Items                                    Examples
 1   Bread, Peanuts, Milk, Fruit, Jam         {bread} → {milk}
 2   Bread, Jam, Soda, Chips, Milk, Fruit     {soda} → {chips}
 3   Steak, Jam, Soda, Chips, Bread           {bread} → {jam}
 4   Jam, Soda, Peanuts, Milk, Fruit
 5   Jam, Soda, Chips, Milk, Bread
 6   Fruit, Soda, Chips, Milk
 7   Fruit, Soda, Peanuts, Milk
 8   Fruit, Peanuts, Cheese, Yogurt


 Given a set of transactions, find rules that will predict the
  occurrence of an item based on the occurrences of other
  items in the transaction


Definition: Frequent Itemset

 Itemset
     A collection of one or more items, e.g., {milk, bread, jam}
     k-itemset: an itemset that contains k items

 Support count (σ)
     Frequency of occurrence of an itemset
     σ({Milk, Bread}) = 3
     σ({Soda, Chips}) = 4

 Support (s)
     Fraction of transactions that contain an itemset
     s({Milk, Bread}) = 3/8
     s({Soda, Chips}) = 4/8

 Frequent Itemset
     An itemset whose support is greater than or equal to a minsup threshold

 (The counts refer to the eight transactions in the TID table above.)
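These definitions are straightforward to check in code. Below is a minimal Python sketch, not part of the original lecture; the transaction list transcribes the eight baskets from the table above, and the helper names (`support_count`, `support`) are my own:

```python
# The eight market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread"}, TRANSACTIONS))  # 3
print(support({"Soda", "Chips"}, TRANSACTIONS))        # 0.5
```

The `itemset <= t` test is Python's subset operator on sets, which matches the definition of "transaction t contains itemset X" exactly.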
What is an association rule?

 An implication of the form X → Y, where X and Y are itemsets

 Example: {bread} → {milk}

 Rule evaluation metrics: support & confidence

 Support (s)
     Fraction of transactions
     that contain both X and Y:
     s(X → Y) = σ(X ∪ Y) / |T|
 Confidence (c)
     Measures how often items in Y
     appear in transactions that
     contain X:
     c(X → Y) = σ(X ∪ Y) / σ(X)




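The two metrics can be checked on the example rule {bread} → {milk}: bread appears in 4 of the 8 baskets, and bread together with milk in 3. A small sketch (my own helper names, transactions transcribed from the table above):

```python
TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset):
    return sum(1 for t in TRANSACTIONS if itemset <= t)

def rule_support(X, Y):
    """s(X -> Y): fraction of transactions containing both X and Y."""
    return support_count(X | Y) / len(TRANSACTIONS)

def confidence(X, Y):
    """c(X -> Y): how often Y appears in transactions that contain X."""
    return support_count(X | Y) / support_count(X)

print(rule_support({"Bread"}, {"Milk"}))  # 3/8 = 0.375
print(confidence({"Bread"}, {"Milk"}))    # 3/4 = 0.75
```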
Support and Confidence

[Venn diagram: one circle for "Customer buys Bread" (X), one for "Customer buys Milk" (Y); the overlap, "Customer buys both", is the set of transactions counted by σ(X ∪ Y), the quantity that appears in both metrics]
What is the goal?



 Given a set of transactions T, the goal of association rule
  mining is to find all rules having
     support ≥ minsup threshold
     confidence ≥ minconf threshold

 Brute-force approach:
     List all possible association rules
     Compute the support and confidence for each rule
     Prune rules that fail the minsup and minconf thresholds

 Brute-force approach is computationally prohibitive!




Mining Association Rules

             {Bread, Jam} → {Milk}    s=0.4 c=0.75
             {Milk, Jam} → {Bread}    s=0.4 c=0.75
             {Bread} → {Milk, Jam}    s=0.4 c=0.75
             {Jam} → {Bread, Milk}    s=0.4 c=0.6
             {Milk} → {Bread, Jam}    s=0.4 c=0.5

 All the above rules are binary partitions of the same itemset:

                        {Milk, Bread, Jam}

 Rules originating from the same itemset have identical
  support but can have different confidence
 We can decouple the support and confidence requirements!



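These rules can be recomputed from the eight-basket table. A Python sketch (my own helpers, not from the lecture) enumerates every binary partition of {Milk, Bread, Jam}; the shared support comes out as 3/8 = 0.375, which the slide rounds to 0.4, and the enumeration also surfaces the sixth partition, {Bread, Milk} → {Jam}, with confidence 1.0:

```python
from itertools import combinations

# the eight market-basket transactions from the earlier slides
TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset):
    return sum(1 for t in TRANSACTIONS if itemset <= t)

itemset = frozenset({"Milk", "Bread", "Jam"})
s = support_count(itemset) / len(TRANSACTIONS)  # identical for every rule

# each non-empty proper subset X gives one binary partition X -> (itemset - X)
rules = {}
for r in range(1, len(itemset)):
    for X in combinations(sorted(itemset), r):
        X = frozenset(X)
        rules[X] = support_count(itemset) / support_count(X)
        print(sorted(X), "->", sorted(itemset - X),
              f"s={s:.3f} c={rules[X]:.2f}")
```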
Mining Association Rules:
Two-Step Approach

 Frequent itemset generation
     Generate all itemsets whose support ≥ minsup

 Rule generation
     Generate high-confidence rules from each frequent itemset
     Each rule is a binary partitioning of a frequent itemset

 Frequent itemset generation is computationally expensive




Frequent Itemset Generation

The candidate itemset lattice for d = 5 items:

                        null
    A      B      C      D      E
    AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
    ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
      ABCD   ABCE   ABDE   ACDE   BCDE
                       ABCDE

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

 Brute-force approach:
     Each itemset in the lattice is a candidate frequent itemset
     Count the support of each candidate by scanning the database:
     each of the N transactions (of average width w) is matched
     against each of the M candidate itemsets
     Complexity ~ O(NMw) => expensive, since M = 2^d
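The blow-up in M is easy to see even on the toy database: the eight baskets contain 10 distinct items, so the brute-force lattice already holds 2^10 − 1 = 1023 non-empty candidates. A short sketch (my own code, not from the lecture):

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

items = sorted({i for t in TRANSACTIONS for i in t})
d = len(items)  # 10 distinct items in the example database

# every non-empty subset of the item universe is a candidate: M = 2^d - 1
candidates = [frozenset(c)
              for r in range(1, d + 1)
              for c in combinations(items, r)]

# brute force: match each of the N transactions against all M candidates
supports = {c: sum(1 for t in TRANSACTIONS if c <= t) for c in candidates}
print(len(candidates))  # 1023
```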
Computational Complexity

 Given d unique items:
     Total number of itemsets = 2^d
     Total number of possible association rules:

         R = 3^d − 2^(d+1) + 1

     For d = 6, there are 602 rules

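The closed form can be verified by brute-force enumeration: a rule is any pair of non-empty, disjoint itemsets (X, Y). A small sketch (my own code) counts them directly and compares against the formula:

```python
from itertools import combinations

def rule_count(d):
    """Count rules X -> Y with X, Y non-empty and disjoint over d items."""
    items = range(d)
    count = 0
    for r in range(1, d + 1):
        for X in combinations(items, r):
            remaining = d - r
            # any non-empty subset of the remaining items can be the consequent
            count += 2 ** remaining - 1
    return count

# matches the closed form R = 3^d - 2^(d+1) + 1
for d in range(1, 8):
    assert rule_count(d) == 3 ** d - 2 ** (d + 1) + 1

print(rule_count(6))  # 602
```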
Frequent Itemset Generation Strategies

 Reduce the number of candidates (M)
     Complete search: M = 2^d
     Use pruning techniques to reduce M
 Reduce the number of transactions (N)
     Reduce the size of N as the size of the itemsets increases
 Reduce the number of comparisons (NM)
     Use efficient data structures to store the candidates or
     transactions
     No need to match every candidate against every transaction




Reducing the Number of Candidates

 Apriori principle
     If an itemset is frequent, then all of
     its subsets must also be frequent

 The Apriori principle holds due to the following property of the
  support measure:

      ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

 The support of an itemset never exceeds the support of its subsets
 This is known as the anti-monotone property of support




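The anti-monotone property can be verified exhaustively on the toy database: for every itemset Y up to size 3 and every immediate subset X of Y, the support count of Y never exceeds that of X. A sketch (my own code):

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(X):
    return sum(1 for t in TRANSACTIONS if X <= t)

items = sorted({i for t in TRANSACTIONS for i in t})

# sigma(Y) <= sigma(X) for every immediate subset X of every itemset Y
for r in range(2, 4):
    for Y in combinations(items, r):
        Y = frozenset(Y)
        for X in combinations(Y, r - 1):
            assert support_count(Y) <= support_count(frozenset(X))

print("anti-monotone property holds on this database")
```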
Illustrating Apriori Principle

[Itemset lattice over {A, B, C, D, E}: once an itemset such as {A, B} is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the lattice: by the Apriori principle none of them can be frequent, so their support need never be counted]
How does the Apriori principle work?

Minimum Support = 4

Items (1-itemsets)
Item      Count
Bread     4
Peanuts   4
Milk      6
Fruit     6
Jam       5
Soda      6
Chips     4
Steak     1
Cheese    1
Yogurt    1

2-itemsets
2-Itemset        Count
Bread, Jam       4
Peanuts, Fruit   4
Milk, Fruit      5
Milk, Jam        4
Milk, Soda       5
Fruit, Soda      4
Jam, Soda        4
Soda, Chips      4

3-itemsets
3-Itemset           Count
Milk, Fruit, Soda   4

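The tables above can be reproduced level by level. In the sketch below (my own code), candidate 2-itemsets are built from the frequent items only, and candidate 3-itemsets keep only those whose 2-subsets are all frequent: this is the Apriori principle at work. With minimum support count 4 it recovers 7 frequent items, 8 frequent pairs, and the single frequent triple {Milk, Fruit, Soda}:

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]
MINSUP = 4  # minimum support count, as on the slide

def count(itemset):
    return sum(1 for t in TRANSACTIONS if itemset <= t)

items = sorted({i for t in TRANSACTIONS for i in t})

# level 1: Steak, Cheese and Yogurt (count 1) drop out
frequent_items = [i for i in items if count(frozenset({i})) >= MINSUP]
L1 = [frozenset({i}) for i in frequent_items]

# level 2: candidates use frequent items only, never the pruned ones
C2 = [frozenset(p) for p in combinations(frequent_items, 2)]
L2 = [c for c in C2 if count(c) >= MINSUP]

# level 3: join frequent pairs, then prune candidates with an infrequent 2-subset
C3 = {a | b for a in L2 for b in L2 if len(a | b) == 3}
L2_set = set(L2)
C3 = [c for c in C3 if all(frozenset(s) in L2_set for s in combinations(c, 2))]
L3 = [c for c in C3 if count(c) >= MINSUP]

print(len(L1), len(L2), len(L3))  # 7 8 1
print(sorted(L3[0]))              # ['Fruit', 'Milk', 'Soda']
```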
Apriori Algorithm



 Let k=1
 Generate frequent itemsets of length 1
 Repeat until no new frequent itemsets are identified
     Generate length (k+1) candidate itemsets from length k
     frequent itemsets
     Prune candidate itemsets containing subsets of length k
     that are infrequent
     Count the support of each candidate by scanning the DB
     Eliminate candidates that are infrequent, leaving only
     those that are frequent




The Apriori Algorithm

      Ck: candidate itemsets of size k
      Lk: frequent itemsets of size k

      L1 = {frequent items};
      for (k = 1; Lk ≠ ∅; k++) do begin
          Ck+1 = candidates generated from Lk;
          for each transaction t in database do
              increment the count of all candidates in Ck+1
              that are contained in t
          Lk+1 = candidates in Ck+1 with min_support
      end
      return ∪k Lk;




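The pseudocode translates almost line for line into Python. The sketch below is my own rendering (helper names assumed, not from the lecture); applied to the eight-basket database with a minimum support count of 4 it returns the same frequent itemsets as the worked example:

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def apriori(transactions, minsup_count):
    """Return {frequent itemset: support count}, over all sizes k."""
    def count(c):
        return sum(1 for t in transactions if c <= t)

    items = sorted({i for t in transactions for i in t})
    # L1 = {frequent items}
    Lk = [frozenset({i}) for i in items if count(frozenset({i})) >= minsup_count]
    result = {c: count(c) for c in Lk}
    k = 1
    while Lk:  # for (k = 1; Lk != empty; k++)
        # join step: C(k+1) is generated by joining Lk with itself
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step: a candidate survives only if all its k-subsets are frequent
        Lk_set = set(Lk)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk_set for s in combinations(c, k))}
        # scan the database, keep candidates with min_support
        Lk = [c for c in candidates if count(c) >= minsup_count]
        result.update({c: count(c) for c in Lk})
        k += 1
    return result  # the union over k of Lk

frequent = apriori(TRANSACTIONS, 4)
print(frequent[frozenset({"Milk", "Fruit", "Soda"})])  # 4
```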
The Apriori Algorithm



 Join Step
     Ck is generated by joining Lk-1 with itself

 Prune Step
     Any (k-1)-itemset that is not frequent cannot be a subset
     of a frequent k-itemset




Efficient Implementation of Apriori in SQL



 Hard to get good performance out of pure SQL (SQL-92)
  based approaches alone
 Make use of object-relational extensions like UDFs, BLOBs,
  Table functions etc.
     Get orders of magnitude improvement
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating
  association rule mining with relational database systems:
  Alternatives and implications. In SIGMOD’98




How to Improve Apriori's Efficiency

 Hash-based itemset counting: A k-itemset whose corresponding hashing
  bucket count is below the threshold cannot be frequent
 Transaction reduction: A transaction that does not contain any frequent k-
  itemset is useless in subsequent scans
 Partitioning: Any itemset that is potentially frequent in DB must be frequent
  in at least one of the partitions of DB
 Sampling: mine a subset of the given data with a lowered support threshold, plus a
  method to determine the completeness of the result
 Dynamic itemset counting: add new candidate itemsets only when all of
  their subsets are estimated to be frequent




Rule Generation

 Given a frequent itemset L, find all non-empty subsets f ⊂ L
  such that f → L − f satisfies the minimum confidence
  requirement
 If {A,B,C,D} is a frequent itemset, the candidate rules are:
         ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD,
         C→ABD, D→ABC, AB→CD, AC→BD, AD→BC, BC→AD,
         BD→AC, CD→AB

 If |L| = k, then there are 2^k − 2 candidate association rules
  (ignoring L → ∅ and ∅ → L)




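The count 2^k − 2 follows because every non-empty proper subset of L gives exactly one candidate rule. A quick enumeration for L = {A,B,C,D} (my own sketch) confirms the 14 rules listed above:

```python
from itertools import combinations

L = frozenset({"A", "B", "C", "D"})

# every non-empty proper subset f of L gives one candidate rule f -> L - f
rules = [(frozenset(f), L - frozenset(f))
         for r in range(1, len(L))
         for f in combinations(sorted(L), r)]

for X, Y in rules:
    print("".join(sorted(X)), "->", "".join(sorted(Y)))

print(len(rules))  # 14 = 2^4 - 2
```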
How to efficiently generate rules from
frequent itemsets?

 Confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)

 But the confidence of rules generated from the same itemset does
  have an anti-monotone property

     For example, with L = {A,B,C,D}:

             c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

     Confidence is anti-monotone with respect to the number
     of items on the right-hand side of the rule




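The chain can be observed on the grocery data using the frequent itemset {Milk, Fruit, Soda} instead of the abstract {A,B,C,D}. Moving items from the antecedent to the consequent keeps the numerator σ(X ∪ Y) fixed while the denominator σ(X) can only grow, so confidence can only shrink. A sketch (my own code):

```python
TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(X):
    return sum(1 for t in TRANSACTIONS if X <= t)

def confidence(X, Y):
    return support_count(X | Y) / support_count(X)

# same itemset {Milk, Fruit, Soda}, consequent grows from 1 item to 2:
c1 = confidence({"Milk", "Fruit"}, {"Soda"})  # 4/5 = 0.8
c2 = confidence({"Milk"}, {"Fruit", "Soda"})  # 4/6 ~= 0.67
print(c1, c2)
```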
Rule Generation for Apriori Algorithm

[Lattice of the rules generated from the frequent itemset {A,B,C,D}, from ABCD=>∅ at the top down to ∅=>ABCD at the bottom. If BCD=>A is found to be a low-confidence rule, then every rule below it whose consequent contains A (CD=>AB, BD=>AC, BC=>AD, D=>ABC, C=>ABD, B=>ACD) can be pruned without computing its confidence]
Rule Generation for Apriori Algorithm

 A candidate rule is generated by merging two rules that share
  the same prefix in the rule consequent

 For example, join(CD=>AB, BD=>AC)
  would produce the candidate
  rule D=>ABC

 Prune the rule D=>ABC if its
  subset AD=>BC does not have
  high confidence




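The level-wise scheme can be sketched end to end: start with size-1 consequents, keep only the confident rules, and build larger consequents by merging the kept ones, so that supersets of a pruned consequent are never generated. This is my own simplified rendering, not the lecture's code; run on the frequent itemset {Milk, Fruit, Soda} with minconf = 0.75 it keeps exactly the three rules with a single-item consequent:

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(X):
    return sum(1 for t in TRANSACTIONS if X <= t)

def generate_rules(itemset, minconf):
    """Level-wise rule generation: grow consequents, prune low confidence."""
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset({i}) for i in itemset]  # level 1
    while consequents and len(consequents[0]) < len(itemset):
        kept = []
        for Y in consequents:
            X = itemset - Y
            conf = support_count(itemset) / support_count(X)
            if conf >= minconf:
                rules.append((X, Y, conf))
                kept.append(Y)
            # else: supersets of Y as consequents are never generated below
        k = len(consequents[0])
        # merge kept consequents that overlap in all but one item
        consequents = list({a | b for a in kept for b in kept
                            if len(a | b) == k + 1})
    return rules

rules = generate_rules({"Milk", "Fruit", "Soda"}, minconf=0.75)
for X, Y, c in rules:
    print(sorted(X), "->", sorted(Y), round(c, 2))
print(len(rules))  # 3
```

The three surviving rules all predict a single item ({Fruit, Soda}=>​{Milk} even has confidence 1.0); all rules with two-item consequents have confidence 4/6 and are pruned.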
Effect of Support Distribution

 Many real data sets have a skewed support distribution

[Figure: support distribution of a retail data set, in which a few items are very frequent while most items appear in only a small fraction of the transactions]
How to set the appropriate minsup?



 If minsup is set too high, we could miss itemsets involving
  interesting rare items (e.g., expensive products)

 If minsup is set too low, it is computationally expensive
  and the number of itemsets is very large

 A single minimum support threshold may not be effective





Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
Lecture13 - Association Rules
Lecture13 - Association RulesLecture13 - Association Rules
Lecture13 - Association Rules
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
 
Data warehouse architecture
Data warehouse architecture Data warehouse architecture
Data warehouse architecture
 
Association rules
Association rulesAssociation rules
Association rules
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Distributed database
Distributed databaseDistributed database
Distributed database
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Web Mining & Text Mining
Web Mining & Text MiningWeb Mining & Text Mining
Web Mining & Text Mining
 
Fp growth
Fp growthFp growth
Fp growth
 
Lect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysisLect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysis
 

Andere mochten auch

1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reductionKrish_ver2
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretizationKrish_ver2
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 
Concept description characterization and comparison
Concept description characterization and comparisonConcept description characterization and comparison
Concept description characterization and comparisonric_biet
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reductionShatakirti Er
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reductionmrizwan969
 
Different type of databases
Different type of databasesDifferent type of databases
Different type of databasesShwe Yee
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssasUmar Ali
 
Big data and Social Media Analytics
Big data and Social Media AnalyticsBig data and Social Media Analytics
Big data and Social Media AnalyticsSimplify360
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methodsKrish_ver2
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering AlgorithmLino Possamai
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 

Andere mochten auch (20)

1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Social Media Analytics
Social Media Analytics Social Media Analytics
Social Media Analytics
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Concept description characterization and comparison
Concept description characterization and comparisonConcept description characterization and comparison
Concept description characterization and comparison
 
Dimensionality reduction
Dimensionality reductionDimensionality reduction
Dimensionality reduction
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Different type of databases
Different type of databasesDifferent type of databases
Different type of databases
 
Clique
Clique Clique
Clique
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
 
Big data and Social Media Analytics
Big data and Social Media AnalyticsBig data and Social Media Analytics
Big data and Social Media Analytics
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
3.4 density and grid methods
3.4 density and grid methods3.4 density and grid methods
3.4 density and grid methods
 
Apriori algorithm
Apriori algorithmApriori algorithm
Apriori algorithm
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 

Ähnlich wie Data Mining: Association Rules Basics

Lecture 04 Association Rules Basics
Lecture 04 Association Rules BasicsLecture 04 Association Rules Basics
Lecture 04 Association Rules BasicsPier Luca Lanzi
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesPier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
AssociationRule.pdf
AssociationRule.pdfAssociationRule.pdf
AssociationRule.pdfWailaBaba
 
ASSOCIATION Rule plus MArket basket Analysis.pptx
ASSOCIATION Rule plus MArket basket Analysis.pptxASSOCIATION Rule plus MArket basket Analysis.pptx
ASSOCIATION Rule plus MArket basket Analysis.pptxSherishJaved
 
Rules of data mining
Rules of data miningRules of data mining
Rules of data miningSulman Ahmed
 
DM -Unit 2-PPT.ppt
DM -Unit 2-PPT.pptDM -Unit 2-PPT.ppt
DM -Unit 2-PPT.pptraju980973
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061badirh
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Conceptsdataminers.ir
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
MODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptxMODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptxnikshaikh786
 
BIM Data Mining Unit4 by Tekendra Nath Yogi
 BIM Data Mining Unit4 by Tekendra Nath Yogi BIM Data Mining Unit4 by Tekendra Nath Yogi
BIM Data Mining Unit4 by Tekendra Nath YogiTekendra Nath Yogi
 
UNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).pptUNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).pptRaviKiranVarma4
 
Association Rule mining
Association Rule miningAssociation Rule mining
Association Rule miningMegha Sharma
 
Association Rule Mining || Data Mining
Association Rule Mining || Data MiningAssociation Rule Mining || Data Mining
Association Rule Mining || Data MiningIffat Firozy
 
Market Basket Analysis
Market Basket AnalysisMarket Basket Analysis
Market Basket AnalysisSandeep Prasad
 
Association 04.03.14
Association   04.03.14Association   04.03.14
Association 04.03.14rahulmath80
 

Ähnlich wie Data Mining: Association Rules Basics (20)

Lecture 04 Association Rules Basics
Lecture 04 Association Rules BasicsLecture 04 Association Rules Basics
Lecture 04 Association Rules Basics
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association Rules
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
AssociationRule.pdf
AssociationRule.pdfAssociationRule.pdf
AssociationRule.pdf
 
ASSOCIATION Rule plus MArket basket Analysis.pptx
ASSOCIATION Rule plus MArket basket Analysis.pptxASSOCIATION Rule plus MArket basket Analysis.pptx
ASSOCIATION Rule plus MArket basket Analysis.pptx
 
Data mining arm-2009-v0
Data mining arm-2009-v0Data mining arm-2009-v0
Data mining arm-2009-v0
 
Rules of data mining
Rules of data miningRules of data mining
Rules of data mining
 
DM -Unit 2-PPT.ppt
DM -Unit 2-PPT.pptDM -Unit 2-PPT.ppt
DM -Unit 2-PPT.ppt
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
MODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptxMODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptx
 
BIM Data Mining Unit4 by Tekendra Nath Yogi
 BIM Data Mining Unit4 by Tekendra Nath Yogi BIM Data Mining Unit4 by Tekendra Nath Yogi
BIM Data Mining Unit4 by Tekendra Nath Yogi
 
UNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).pptUNIT 3.2 -Mining Frquent Patterns (part1).ppt
UNIT 3.2 -Mining Frquent Patterns (part1).ppt
 
Association Rule mining
Association Rule miningAssociation Rule mining
Association Rule mining
 
Association Rule Mining || Data Mining
Association Rule Mining || Data MiningAssociation Rule Mining || Data Mining
Association Rule Mining || Data Mining
 
Market Basket Analysis
Market Basket AnalysisMarket Basket Analysis
Market Basket Analysis
 
Intake 37 DM
Intake 37 DMIntake 37 DM
Intake 37 DM
 
Association 04.03.14
Association   04.03.14Association   04.03.14
Association 04.03.14
 
06 fp basic
06 fp basic06 fp basic
06 fp basic
 

Data Mining: Association Rules Basics

  • 1. Association Rule Basics Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
  • 2. Lecture outline 2  What is association rule mining?  Frequent itemsets, support, and confidence  Mining association rules  The “Apriori” algorithm  Rule generation Prof. Pier Luca Lanzi
  • 3. Example of market-basket transactions 3 Bread Bread Steak Jam Peanuts Jam Jam Soda Milk Soda Soda Peanuts Fruit Chips Chips Milk Jam Milk Bread Fruit Fruit Jam Fruit Fruit Fruit Soda Soda Soda Peanuts Chips Chips Peanuts Cheese Milk Milk Milk Yogurt Bread Prof. Pier Luca Lanzi
  • 4. What is association mining? 4  Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories  Applications Basket data analysis Cross-marketing Catalog design … Prof. Pier Luca Lanzi
  • 5. What is association rule mining? 5 TID Items Examples 1 Bread, Peanuts, Milk, Fruit, Jam {bread} {milk} 2 Bread, Jam, Soda, Chips, Milk, Fruit {soda} {chips} 3 Steak, Jam, Soda, Chips, Bread {bread} {jam} 4 Jam, Soda, Peanuts, Milk, Fruit 5 Jam, Soda, Chips, Milk, Bread 6 Fruit, Soda, Chips, Milk 7 Fruit, Soda, Peanuts, Milk 8 Fruit, Peanuts, Cheese, Yogurt  Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Prof. Pier Luca Lanzi
  • 6. Definition: Frequent Itemset 6  Itemset TID Items A collection of one or more items, e.g., {milk, bread, jam} 1 Bread, Peanuts, Milk, Fruit, Jam k-itemset, an itemset that 2 Bread, Jam, Soda, Chips, Milk, Fruit contains k items  Support count ( ) 3 Steak, Jam, Soda, Chips, Bread Frequency of occurrence 4 Jam, Soda, Peanuts, Milk, Fruit of an itemset ({Milk, Bread}) = 3 5 Jam, Soda, Chips, Milk, Bread ({Soda, Chips}) = 4 6 Fruit, Soda, Chips, Milk  Support Fraction of transactions 7 Fruit, Soda, Peanuts, Milk that contain an itemset 8 Fruit, Peanuts, Cheese, Yogurt s({Milk, Bread}) = 3/8 s({Soda, Chips}) = 4/8  Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold Prof. Pier Luca Lanzi
  • 7. What is an association rule? 7  Implication of the form X Y, where X and Y are itemsets  Example, {bread} {milk}  Rule Evaluation Metrics, Suppor & Confidence  Support (s) Fraction of transactions that contain both X and Y  Confidence (c) Measures how often items in Y appear in transactions that contain X Prof. Pier Luca Lanzi
  • 8. Support and Confidence 8 Customer buys both Customer buys Milk Customer buys Bread Prof. Pier Luca Lanzi
  • 9. What is the goal? 9  Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold confidence ≥ minconf threshold  Brute-force approach: List all possible association rules Compute the support and confidence for each rule Prune rules that fail the minsup and minconf thresholds  Brute-force approach is computationally prohibitive! Prof. Pier Luca Lanzi
  • 10. Mining Association Rules 10 {Bread, Jam} {Milk} s=0.4 c=0.75 {Milk, Jam} {Bread} s=0.4 c=0.75 {Bread} {Milk, Jam} s=0.4 c=0.75 {Jam} {Bread, Milk} s=0.4 c=0.6 {Milk} {Bread, Jam} s=0.4 c=0.5  All the above rules are binary partitions of the same itemset: {Milk, Bread, Jam}  Rules originating from the same itemset have identical support but can have different confidence  We can decouple the support and confidence requirements! Prof. Pier Luca Lanzi
  • 11. Mining Association Rules: 11 Two Step Approach  Frequent Itemset Generation Generate all itemsets whose support minsup  Rule Generation Generate high confidence rules from frequent itemset Each rule is a binary partitioning of a frequent itemset  Frequent itemset generation is computationally expensive Prof. Pier Luca Lanzi
  • 12. Frequent Itemset Generation 13 null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Given d items, there are 2d possible ABCDE candidate itemsets Prof. Pier Luca Lanzi
  • 13. Frequent Itemset Generation 14  Brute-force approach: Each itemset in the lattice is a candidate frequent itemset Count the support of each candidate by scanning the database TID Items 1 Bread, Peanuts, Milk, Fruit, Jam candidates 2 Bread, Jam, Soda, Chips, Milk, Fruit 3 Steak, Jam, Soda, Chips, Bread N 4 Jam, Soda, Peanuts, Milk, Fruit 5 Jam, Soda, Chips, Milk, Bread M 6 Fruit, Soda, Chips, Milk 7 Fruit, Soda, Peanuts, Milk 8 Fruit, Peanuts, Cheese, Yogurt w Match each transaction against every candidate Complexity ~ O(NMw) => Expensive since M = 2d Prof. Pier Luca Lanzi
  • 14. Computational Complexity 15  Given d unique items: Total number of itemsets = 2d Total number of possible association rules: For d=6, there are 602 rules Prof. Pier Luca Lanzi
  • 15. Frequent Itemset Generation Strategies 16  Reduce the number of candidates (M) Complete search: M=2d Use pruning techniques to reduce M  Reduce the number of transactions (N) Reduce size of N as the size of itemset increases  Reduce the number of comparisons (NM) Use efficient data structures to store the candidates or transactions No need to match every candidate against every transaction Prof. Pier Luca Lanzi
Reducing the Number of Candidates                        18

 • Apriori principle: if an itemset is frequent, then all of its subsets must
   also be frequent
 • The Apriori principle holds due to the following property of the support
   measure: for any itemsets X and Y,

       X ⊆ Y  ⟹  s(X) ≥ s(Y)

 • The support of an itemset never exceeds the support of its subsets
 • This is known as the anti-monotone property of support
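The anti-monotone property is easy to check numerically on the example transactions (an illustrative sketch): growing an itemset can never increase its support count.

```python
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

# Each superset's count is bounded by its subsets' counts
chain = [{"Milk"}, {"Milk", "Fruit"}, {"Milk", "Fruit", "Soda"}]
print([support_count(s) for s in chain])   # [6, 5, 4], non-increasing
```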
Illustrating the Apriori Principle                       19

 [Figure: the itemset lattice over items A–E; once an itemset is found to be
  infrequent, all of its supersets in the lattice are pruned]
How does the Apriori principle work?                     20

 Minimum Support count = 4

   Items (1-itemsets)        2-itemsets
   Item      Count           2-Itemset        Count
   Bread     4               Bread, Jam       4
   Peanuts   4               Peanuts, Fruit   4
   Milk      6               Milk, Fruit      5
   Fruit     6               Milk, Jam        4
   Jam       5               Milk, Soda       5
   Soda      6               Fruit, Soda      4
   Chips     4               Jam, Soda        4
   Steak     1               Soda, Chips      4
   Cheese    1
   Yogurt    1               3-itemsets
                             3-Itemset            Count
                             Milk, Fruit, Soda    4
Apriori Algorithm                                        21

 • Let k=1
 • Generate frequent itemsets of length 1
 • Repeat until no new frequent itemsets are identified:
     Generate length (k+1) candidate itemsets from length-k frequent itemsets
     Prune candidate itemsets containing subsets of length k that are infrequent
     Count the support of each candidate by scanning the DB
     Eliminate candidates that are infrequent, leaving only those that are frequent
The Apriori Algorithm                                    22

 Ck: candidate itemsets of size k
 Lk: frequent itemsets of size k

   L1 = {frequent items};
   for (k = 1; Lk != ∅; k++) do begin
       Ck+1 = candidates generated from Lk;
       for each transaction t in database do
           increment the count of all candidates in Ck+1 that are contained in t
       Lk+1 = candidates in Ck+1 with min_support
   end
   return ∪k Lk;
The Apriori Algorithm                                    23

 • Join Step: Ck is generated by joining Lk-1 with itself
 • Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a
   frequent k-itemset
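The join and prune steps can be sketched in Python (an illustrative implementation, not the course's reference code); run on the example transactions with minimum support count 4 it reproduces the levels of slide 20.

```python
from itertools import combinations

transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]
MIN_COUNT = 4  # minimum support as an absolute count, as in slide 20

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

def apriori_gen(Lk):
    """Join Lk with itself, then prune candidates with an infrequent subset."""
    prev = sorted(sorted(s) for s in Lk)
    frequent = set(Lk)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            # join step: merge itemsets that agree on their first k-1 items
            if prev[i][:-1] == prev[j][:-1]:
                cand = frozenset(prev[i]) | frozenset(prev[j])
                # prune step: every k-subset of the candidate must be frequent
                if all(frozenset(sub) in frequent
                       for sub in combinations(cand, len(prev[i]))):
                    candidates.add(cand)
    return candidates

def apriori():
    items = set().union(*transactions)
    Lk = {frozenset([i]) for i in items
          if support_count(frozenset([i])) >= MIN_COUNT}
    levels = []
    while Lk:
        levels.append(Lk)
        Ck = apriori_gen(Lk)
        Lk = {c for c in Ck if support_count(c) >= MIN_COUNT}
    return levels

levels = apriori()
print([len(level) for level in levels])   # [7, 8, 1]
print(levels[2])                          # {frozenset({'Fruit', 'Milk', 'Soda'})}
```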
Efficient Implementation of Apriori in SQL               24

 • Hard to get good performance out of pure SQL (SQL-92) based approaches alone
 • Make use of object-relational extensions like UDFs, BLOBs, table functions,
   etc., to get orders of magnitude improvement
 • S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
   with relational database systems: Alternatives and implications. In SIGMOD'98
How to Improve Apriori's Efficiency                      25

 • Hash-based itemset counting: a k-itemset whose corresponding hash bucket
   count is below the threshold cannot be frequent
 • Transaction reduction: a transaction that does not contain any frequent
   k-itemset is useless in subsequent scans
 • Partitioning: any itemset that is potentially frequent in DB must be
   frequent in at least one of the partitions of DB
 • Sampling: mine on a subset of the given data with a lowered support
   threshold, plus a method to determine the completeness
 • Dynamic itemset counting: add new candidate itemsets only when all of
   their subsets are estimated to be frequent
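The transaction-reduction idea can be sketched on the example data (an illustration, not from the slides): once the frequent 3-itemsets are known, only transactions containing at least one of them can possibly support a frequent 4-itemset, so the rest can be skipped in later scans.

```python
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

# The only frequent 3-itemset at minsup count 4 (slide 20)
L3 = [frozenset({"Milk", "Fruit", "Soda"})]

# Transaction reduction: a transaction with no frequent 3-itemset cannot
# contain any frequent 4-itemset, so drop it before the next scan
reduced = [t for t in transactions if any(f <= t for f in L3)]
print(len(transactions), len(reduced))   # 8 4
```

Here half the database is eliminated before the scan for 4-itemsets even begins.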
Rule Generation                                          26

 • Given a frequent itemset L, find all non-empty subsets f ⊂ L such that
   f → L – f satisfies the minimum confidence requirement
 • If {A,B,C,D} is a frequent itemset, the candidate rules are:
     ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC,
     AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB
 • If |L| = k, then there are 2^k – 2 candidate association rules
   (ignoring L → ∅ and ∅ → L)
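Enumerating these binary partitions is straightforward (a minimal sketch): every non-empty proper subset of L becomes an antecedent, the remainder the consequent.

```python
from itertools import combinations

def candidate_rules(L):
    """All rules f -> L - f for non-empty proper subsets f of L."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):
        for subset in combinations(sorted(L), r):
            f = frozenset(subset)
            rules.append((f, L - f))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))   # 2^4 - 2 = 14 candidate rules
```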
How to efficiently generate rules from frequent itemsets?    28

 • Confidence does not have an anti-monotone property in general:
   c(ABC→D) can be larger or smaller than c(AB→D)
 • But the confidence of rules generated from the same itemset does have an
   anti-monotone property, e.g., for L = {A,B,C,D}:

       c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)

 • Confidence is anti-monotone with respect to the number of items on the
   right-hand side of the rule
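The reason is visible in the formula: all rules from the same itemset L share the numerator s(L), while shrinking the antecedent can only raise its support, i.e. the denominator. A quick numeric check on the example data (an illustrative sketch):

```python
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

def confidence(antecedent, consequent):
    return support_count(antecedent | consequent) / support_count(antecedent)

# Both rules come from L = {Milk, Fruit, Soda}, so the numerator is s(L) = 4;
# moving Fruit to the consequent grows the denominator from 5 to 6
print(confidence({"Milk", "Fruit"}, {"Soda"}))    # 4/5 = 0.8
print(confidence({"Milk"}, {"Fruit", "Soda"}))    # 4/6 ≈ 0.667
```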
Rule Generation for Apriori Algorithm                    29

 [Figure: the lattice of rules generated from the frequent itemset {A,B,C,D},
  from ABCD→∅ at the top down to A→BCD at the bottom; when a rule such as
  BCD→A is found to have low confidence, all rules whose consequent is a
  superset of its consequent (CD→AB, BD→AC, BC→AD, D→ABC, C→ABD, B→ACD)
  are pruned]
Rule Generation for Apriori Algorithm                    30

 • A candidate rule is generated by merging two rules that share the same
   prefix in the rule consequent
 • For example, join(CD→AB, BD→AC) produces the candidate rule D→ABC
 • Prune rule D→ABC if its subset rule AD→BC does not have high confidence
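This level-wise merge-and-prune procedure can be sketched as follows (an illustrative Python version, not the course's reference code, run on the example data with a hypothetical minconf of 0.7):

```python
transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]
MIN_CONF = 0.7  # hypothetical minimum confidence threshold for this sketch

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

def gen_rules(L):
    """Start from 1-item consequents; merge the consequents of surviving
    rules to build larger ones, pruning via the anti-monotone confidence."""
    L = frozenset(L)
    s_L = support_count(L)
    rules = []
    H = [frozenset([item]) for item in L]   # candidate consequents
    while H and len(H[0]) < len(L):
        # keep only consequents whose rule meets the confidence threshold
        H = [h for h in H if s_L / support_count(L - h) >= MIN_CONF]
        rules += [(L - h, h) for h in H]
        # merge consequents that differ by a single item (cf. the join step)
        H = list({h1 | h2 for h1 in H for h2 in H
                  if len(h1 | h2) == len(h1) + 1})
    return rules

for antecedent, consequent in gen_rules({"Milk", "Fruit", "Soda"}):
    print(sorted(antecedent), "->", sorted(consequent))
# Only the three rules with 1-item consequents survive; every 2-item
# consequent is pruned, since its confidence is 4/6 < 0.7
```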
Effect of Support Distribution                           31

 • Many real data sets have a skewed support distribution

 [Figure: support distribution of a retail data set]
How to set the appropriate minsup?                       32

 • If minsup is set too high, we could miss itemsets involving interesting
   rare items (e.g., expensive products)
 • If minsup is set too low, mining becomes computationally expensive and the
   number of itemsets becomes very large
 • A single minimum support threshold may not be effective