



Seminar web data extraction: Mining uncertain data

Sebastiaan van Schaik
Sebastiaan.van.Schaik@comlab.ox.ac.uk

20 January 2011






  Introduction

    Focus of this presentation: mining of frequent patterns and
    association rules from (uncertain) data.

    Example applications:
            discover regularities in customer transactions;
            analysing log files: determine how visitors use a website;

    Based on:
            Mining Uncertain Data with Probabilistic Guarantees[9] (KDD 2010);
            Frequent Pattern Mining with Uncertain Data[1] (KDD 2009);
            A Tree-Based Approach for Frequent Pattern Mining from Uncertain
            Data[6] (PAKDD 2008).




  Introduction & running example
    Frequent pattern (itemset): a set of items that occurs sufficiently often.
    Example: {fever, headache}

    Association rule: a set of items implying another set of items.
    Example: {fever, headache} ⇒ {nausea}


         Patient   Diagnosis
    t1   Cheng     {severe cold}
    t2   Andrey    {yellow fever, haemochromatosis}
    t3   Omer      {schistosomiasis, syringomyelia}
    t4   Tim       {Wilson’s disease}
    t5   Dan       {Hughes-Stovin syndrome}
    t6   Bas       {Henoch-Schönlein purpura}

              Running example: patient diagnosis database


  Measuring ‘interestingness’: support & confidence
    Support of an itemset X:
    sup(X): the number of entries (rows, transactions) that contain X.

    Confidence of an association rule X ⇒ Y:

        conf(X ⇒ Y) = sup(X ∪ Y) / sup(X)
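    For concreteness, here is a minimal Python sketch of these two measures
    over an exact (non-probabilistic) transaction database; the toy
    transactions are invented for illustration:

        def sup(X, db):
            """Number of transactions that contain every item of X."""
            return sum(1 for t in db if X <= t)

        def conf(X, Y, db):
            """Confidence of the association rule X => Y."""
            return sup(X | Y, db) / sup(X, db)

        db = [{"fever", "headache", "nausea"},
              {"fever", "headache"},
              {"headache"}]
        print(sup({"fever", "headache"}, db))               # 2
        print(conf({"fever", "headache"}, {"nausea"}, db))  # 0.5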








  Finding association rules: Apriori (1)

    Agrawal et al. introduced Apriori in 1994[2] to mine association rules:
     1   Find all frequent itemsets Xi in database D (Xi is frequent iff
         sup(Xi) > minsup):
               1   Candidate generation: generate all possible itemsets of length k
                   (starting k = 1) based on frequent itemsets of length k − 1;
               2   Test candidates, discard infrequent itemsets;
               3   Repeat with k = k + 1.


    Important observation: every subset X′ of a frequent itemset X is itself
    frequent (the Apriori property). This is used to prune candidates before
    step 2; a Python sketch of the level-wise loop follows the example below.

    Example: if X′ = {fever} is not frequent in database D, then
    X = {fever, headache} cannot be frequent.
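    The following minimal Python sketch is illustrative only (item encoding
    and I/O details are simplified); it shows the level-wise candidate
    generation with Apriori-property pruning:

        from itertools import combinations

        def apriori(db, minsup):
            """db: list of transactions (sets of items).
            Returns a dict mapping each frequent itemset to its support."""
            # k = 1: count single items
            items = {i for t in db for i in t}
            freq = {frozenset([i]): s for i in items
                    if (s := sum(1 for t in db if i in t)) > minsup}
            result = dict(freq)
            k = 2
            while freq:
                # candidate generation: join frequent (k-1)-itemsets
                prev = list(freq)
                cands = {a | b for a in prev for b in prev if len(a | b) == k}
                # Apriori-property pruning: all (k-1)-subsets must be frequent
                cands = {c for c in cands
                         if all(frozenset(s) in freq
                                for s in combinations(c, k - 1))}
                # test the surviving candidates against the database
                freq = {c: s for c in cands
                        if (s := sum(1 for t in db if c <= t)) > minsup}
                result.update(freq)
                k += 1
            return result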






  Finding association rules: Apriori (2)

    Apriori continued:
     2   Extract association rules from the set of frequent itemsets X.
         For each Xi ∈ X:
           1   Generate all non-empty subsets S of Xi. For each S:
           2   Test the confidence of the rule S ⇒ (Xi − S)
               (a sketch follows the example below).


    Example: itemset X = {fever, headache, nausea} is frequent, test:
            {fever, headache} ⇒ {nausea}
            {fever, nausea} ⇒ {headache}
            {nausea, headache} ⇒ {fever}
            {fever} ⇒ {headache, nausea}
            (. . . )
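    A minimal Python sketch of this enumeration (illustrative only; it
    reuses the sup/conf helpers defined earlier):

        from itertools import combinations

        def rules_from_itemset(X, db, minconf):
            """Yield rules S => (X - S) whose confidence reaches minconf."""
            X = frozenset(X)
            for r in range(1, len(X)):
                for S in map(frozenset, combinations(X, r)):
                    c = conf(S, X - S, db)   # conf() as defined earlier
                    if c >= minconf:
                        yield S, X - S, c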




  Introduction to uncertain data

    Data might be uncertain, for example:
           Location detection using multiple RFID sensors (triangulation);
           Sensor readings (temperature, humidity) are noisy;
           Face recognition;
           Patient diagnosis.



    Challenge: how do we model uncertainty and
    take it into account when mining frequent
    itemsets and association rules?







  Existential probabilities

    Existential probability: a probability attached to each item in a
    tuple, expressing how likely it is that the item truly belongs to that
    tuple.

    Important assumption: tuple and item independence!



         Patient   Diagnosis (including existential probabilities)
    t1   Cheng     { 0.9 : a,   0.72 : d,   0.718 : e,   0.8 : f }
    t2   Andrey    { 0.9 : a,   0.81 : c,   0.718 : d,   0.72 : e }
    t3   Omer      { 0.875 : b, 0.857 : c }
    t4   Tim       { 0.9 : a,   0.72 : d,   0.718 : e }
    t5   Dan       { 0.875 : b, 0.857 : c,  0.05 : d }
    t6   Bas       { 0.875 : b, 0.1 : f }

         Simplified probabilistic diagnosis database (adapted from [6])






  Possible worlds

        D = {t1, t2, . . . , tn}   (n transactions)
        tj = {(p(j,1), i1), . . . , (p(j,m), im)}   (m items in each transaction)

    D can be expanded into possible worlds: W = {W1, . . . , W_(2^(nm))}.

         Patient   Diagnosis (including prob.)
    t1   Cheng     { 0.9 : a,   0.72 : d,   0.718 : e,   0.8 : f }
    t2   Andrey    { 0.9 : a,   0.81 : c,   0.718 : d,   0.72 : e }
    t3   Omer      { 0.875 : b, 0.857 : c }
    t4   Tim       { 0.9 : a,   0.72 : d,   0.718 : e }
    t5   Dan       { 0.875 : b, 0.857 : c,  0.05 : d }
    t6   Bas       { 0.875 : b, 0.1 : f }


    Pr[Wx] = (1 − p(1,a)) · p(1,d) · (1 − p(1,e)) · (1 − p(1,f)) · p(2,a) · . . . · p(6,f)
           = 0.1 · 0.72 · 0.29 · 0.2 · 0.9 · . . . · 0.1
           ≈ 0.00000021          (one of the 2^18 possible worlds)


 Mining uncertain data: introduction


    Approaches to mining frequent itemsets from uncertain data:
           U-Apriori[4] and p-Apriori[9]
           UF-growth[6]
           UFP-tree[1]
           ...

    Further focus:
           UF-growth: mining without candidate generation;
           p-Apriori: pruning using Chernoff bounds







 Expected support

    Support of an itemset X turns into a random variable:

        E[sup(X)] = Σ_(Wi ∈ W) Pr[Wi] · sup_Wi(X)




    Fortunately, enumerating all possible worlds is not needed: because of
    the independence assumptions, the expected support can be computed
    directly from the item probabilities:

        E[sup(X)] = Σ_(tj ∈ D) Π_(x ∈ X) Pr[x, tj]

    (see [7, 6])
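    A small Python sketch (the encoding of the database as dicts mapping
    item → probability is an assumption of this example) that computes the
    expected support both ways, by brute-force world enumeration and by the
    closed formula, and checks that they agree on a tiny database:

        from itertools import product
        from math import prod

        def expected_support(db, X):
            # Closed formula: sum over transactions of the product of the
            # item probabilities (missing items have probability 0).
            return sum(prod(t.get(x, 0.0) for x in X) for t in db)

        def expected_support_bruteforce(db, X):
            # Expand every (transaction, item) pair into present/absent worlds.
            pairs = [(j, x, p) for j, t in enumerate(db) for x, p in t.items()]
            total = 0.0
            for world in product([0, 1], repeat=len(pairs)):
                w_prob, contents = 1.0, [set() for _ in db]
                for (j, x, p), present in zip(pairs, world):
                    w_prob *= p if present else (1 - p)
                    if present:
                        contents[j].add(x)
                total += w_prob * sum(1 for c in contents if set(X) <= c)
            return total

        tiny = [{"a": 0.9, "d": 0.72}, {"a": 0.9, "d": 0.71875}, {"b": 0.875}]
        print(expected_support(tiny, {"a", "d"}))             # 1.294875
        print(expected_support_bruteforce(tiny, {"a", "d"}))  # same value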






 Expected support (2)
         Patient   Diagnosis (including prob.)
    t1   Cheng     { 0.9 : a,   0.72 : d,   0.718 : e,   0.8 : f }
    t2   Andrey    { 0.9 : a,   0.81 : c,   0.718 : d,   0.72 : e }
    t3   Omer      { 0.875 : b, 0.857 : c }
    t4   Tim       { 0.9 : a,   0.72 : d,   0.718 : e }
    t5   Dan       { 0.875 : b, 0.857 : c,  0.05 : d }
    t6   Bas       { 0.875 : b, 0.1 : f }

    Expected support of itemset X = {a, d} in the patient diagnosis database
    (in the example world Wx above, sup_Wx(X) = 2; the table shows the d- and
    e-probabilities 0.71875 rounded to 0.718):

        E[sup(X)] = Σ_(Wi ∈ W) Pr[Wi] · sup_Wi(X)
                  = Σ_(tj ∈ D) Π_(x ∈ X) Pr[x, tj]
                  = 0.9 · 0.72 + 0.9 · 0.71875 + 0 · 0 + 0.9 · 0.72 + 0 · 0.05 + 0 · 0
                  = 1.942875
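    Running the expected_support sketch from above on the full database
    (probabilities written at full precision) reproduces this value:

        db = [
            {"a": 0.9,   "d": 0.72,  "e": 0.71875, "f": 0.8},
            {"a": 0.9,   "c": 0.81,  "d": 0.71875, "e": 0.72},
            {"b": 0.875, "c": 0.857},
            {"a": 0.9,   "d": 0.72,  "e": 0.71875},
            {"b": 0.875, "c": 0.857, "d": 0.05},
            {"b": 0.875, "f": 0.1},
        ]
        print(expected_support(db, {"a", "d"}))   # 1.942875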


 Frequent itemsets in probabilistic databases




    An itemset X is frequent iff:

        UF-growth:   E[sup(X)] > minsup             (also used in [4, 1] and many others)
        p-Apriori:   Pr[sup(X) > minsup] ≥ minprob










 Introduction to UF-growth
   Apriori versus UF-growth:
           Apriori-like algorithms generate and test candidate itemsets;
           UF-growth[6] (based on FP-growth[5]) grows a tree based on a
           probabilistic database.

   Outline of procedure (example follows):
       1   First scan: determine expected support of all items;
       2   Second scan: create branch for each transaction (merging
           identical nodes when possible). Each node contains:
                  An item;
                  Its probability;
                  Its occurrence count.
                  Example: (a, 0.9, 2)
      An itemset X is frequent iff: E[sup(X)] > minsup. A minimal
      construction sketch follows.
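    A minimal sketch of the two-scan UF-tree construction, assuming the dict
    encoding of probabilistic transactions used earlier (class and function
    names are invented; merging is keyed on exact (item, probability) pairs,
    anticipating the remark under “UF-growth continued” below):

        from collections import defaultdict

        class UFNode:
            def __init__(self, item, prob):
                self.item, self.prob = item, prob
                self.count = 0
                self.children = {}   # keyed by (item, prob): merge only on
                                     # identical probabilities

        def build_uf_tree(db, minsup):
            # First scan: expected support of every item.
            exp_sup = defaultdict(float)
            for tx in db:
                for item, p in tx.items():
                    exp_sup[item] += p
            frequent = {i for i, s in exp_sup.items() if s > minsup}
            # Second scan: insert each transaction as a branch, items ordered
            # by descending expected support, merging shared prefixes.
            root = UFNode(None, None)
            for tx in db:
                path = sorted(((i, p) for i, p in tx.items() if i in frequent),
                              key=lambda ip: -exp_sup[ip[0]])
                node = root
                for i, p in path:
                    child = node.children.get((i, p))
                    if child is None:
                        child = node.children[(i, p)] = UFNode(i, p)
                    child.count += 1
                    node = child
            return root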


 UF-tree example (1)
         Patient   Diagnosis (including prob.)
    t1   Cheng     { 0.9 : a,   0.72 : d,   0.718 : e,   0.8 : f }
    t2   Andrey    { 0.9 : a,   0.81 : c,   0.718 : d,   0.72 : e }
    t3   Omer      { 0.875 : b, 0.857 : c }
    t4   Tim       { 0.9 : a,   0.72 : d,   0.718 : e }
    t5   Dan       { 0.875 : b, 0.857 : c,  0.05 : d }
    t6   Bas       { 0.875 : b, 0.1 : f }


    1) Determine the expected support of each item:

        E[sup({a})] = 2.7
        E[sup({b})] = 2.625
        E[sup({c})] = 2.524
        E[sup({d})] = 2.20875
        E[sup({e})] = 2.1575
        E[sup({f})] = 0.9

    2) Build the tree (UF-tree figure from [6], not reproduced here).


 UF-tree example (2)




    Extract frequent patterns from the UF-tree:

        E[sup({a, e})]    = 1 · 0.72 · 0.9 + 2 · 0.71875 · 0.9 = 1.94175
        E[sup({c, e})]    = 1 · 0.72 · 0.81 = 0.5832
        E[sup({d, e})]    = 1 · 0.72 · 0.71875 + 2 · 0.71875 · 0.72 = 1.5525
        E[sup({a, d, e})] = 1 · 0.9 · 0.71875 · 0.72 + 2 · 0.9 · 0.72 · 0.71875 = 1.39725


 UF-growth continued


   Mining larger itemsets can be done more efficiently using tree
   projections.

    Remarks:
            Nodes can only be merged when items have identical probabilities
            (otherwise, all occurrence counts equal 1);
            Suggested solution in [6]: rounding of probabilities;
            Other solution (from [1]): store a carefully constructed summary
            of probabilities in each node. This may overestimate the
            expected support.







 Introduction to p-Apriori


           Apriori has been extended to support uncertainty;
           New pruning techniques[9, 4, 3] improve efficiency;
           Note: the apriori (“downwards closure”) property still holds in the
           probabilistic case[1];
           Goal: prune candidates, saving as much time as possible.



    In p-Apriori, an itemset X is frequent iff:

                             Pr[sup(X ) > minsup] ≥ minprob







 p-Apriori: advanced frequent itemset mining

    Sun et al. [9] use a simplified approach to modelling uncertainty: each
    tuple ti is associated with an existential probability pi .

    In p-Apriori: itemset X is frequent if and only if:

                                 Pr[sup(X ) > minsup] ≥ minprob


    Let cnt(X ) denote the number of tuples containing X , then:

                         cnt(X ) < minsup ⇒ X can not be frequent


    Chernoff bounds¹ provide a strict bound on the tail distributions of
    sums of independent random variables.

        ¹ Interesting course: Probability & Computing by James Worrell




 p-Apriori: pruning using Chernoff Bounds (1)
    Each tuple ti is associated with an existential probability pi. Then:

        Yi = 1   with probability pi
             0   with probability 1 − pi

        Y = Σi Yi = sup(X)

    Furthermore:

        µ = E[sup(X)]

        δ = (minsup − µ − 1) / µ

        Pr[sup(X) ≥ minsup] = Pr[sup(X) > minsup − 1]
                            = Pr[sup(X) > (1 + δ)µ]



 p-Apriori: pruning using Chernoff Bounds (2)
    Using a Chernoff bound (see [8], theorem 4.3 and exercise 4.1):

        Pr[sup(X) ≥ minsup] <  2^(−δµ)       if δ ≥ 2e − 1
                               e^(−δ²µ/4)    otherwise

    Therefore, an itemset X cannot be frequent if:

        for δ ≥ 2e − 1:       2^(−δµ) < minprob
        for 0 < δ < 2e − 1:   e^(−δ²µ/4) < minprob

    Example with minprob = 0.4, minsup = 9 and E[sup(X)] = 3, so that
    δ = (9 − 3 − 1)/3 = 5/3:

        e^(−δ²µ/4) = e^(−(5/3)² · 3/4) = e^(−25/12) ≈ 0.125 < minprob
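    As a sketch (the function name chernoff_prune is invented; the code
    simply transcribes the two-case bound above):

        import math

        def chernoff_prune(mu, minsup, minprob):
            """True if Pr[sup(X) >= minsup] is provably below minprob,
            so the candidate itemset X can be pruned."""
            if mu <= 0:
                return True
            delta = (minsup - mu - 1) / mu
            if delta <= 0:
                return False   # bound not applicable; X may well be frequent
            if delta >= 2 * math.e - 1:
                bound = 2.0 ** (-delta * mu)
            else:
                bound = math.exp(-delta * delta * mu / 4)
            return bound < minprob

        print(chernoff_prune(3, 9, 0.4))   # True: e^(-25/12) ≈ 0.125 < 0.4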


 p-Apriori: finding frequent patterns (DP)


    The p-Apriori algorithm for finding frequent patterns resembles Apriori:
        1   Generate the set of candidate k-itemsets Ck based on the
            frequent itemsets of length k − 1;
        2   For each itemset X ∈ Ck:
              1   Try pruning using the Apriori property;
              2   Compute cnt(X), try pruning using the Chernoff bound;
        3   For each remaining itemset X ∈ Ck: compute the support pmf in
            O(n²) time and compare Pr[sup(X) > minsup] against minprob.

    (Association rules can then be mined from the frequent patterns. The
    dynamic program behind step 3 is sketched below.)
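    A minimal sketch of that dynamic program, assuming (as in [9]) one
    existential probability per tuple; the function name is invented. It
    runs in O(n · minsup) time, which is O(n²) since minsup ≤ n:

        def prob_support_at_least(probs, minsup):
            # dp[k] = Pr[support == k] for k < minsup;
            # dp[minsup] accumulates Pr[support >= minsup].
            dp = [0.0] * (minsup + 1)
            dp[0] = 1.0
            for p in probs:
                dp[minsup] += dp[minsup - 1] * p   # once >= minsup, stays there
                for k in range(minsup - 1, 0, -1):
                    dp[k] = dp[k] * (1 - p) + dp[k - 1] * p
                dp[0] *= (1 - p)
            return dp[minsup]

        # Example: 12 tuples that each contain X with probability 0.9
        print(prob_support_at_least([0.9] * 12, 9))   # ≈ 0.974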




 Summary & conclusion

            Data mining of uncertain data is a new, fast-moving field;
           Data uncertainty introduces a significant complexity layer;
           Different algorithms use different definitions and models;
           Algorithm performance greatly depends on data.




 References

    [1] C. C. Aggarwal, Y. Li, and J. Wang. Frequent pattern mining with
        uncertain data. In Proceedings of the ACM SIGKDD International
        Conference on Knowledge Discovery and Data Mining (KDD), pages
        29–37, 2009.

    [2] R. Agrawal and R. Srikant. Fast algorithms for mining association
        rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors,
        Proceedings of the 20th International Conference on Very Large
        Data Bases (VLDB), 1994.

    [3] C. K. Chui and B. Kao. A decremental approach for mining frequent
        itemsets from uncertain data. In Proceedings of the 12th
        Pacific-Asia Conference on Advances in Knowledge Discovery and
        Data Mining (PAKDD), pages 64–75, 2008.

    [4] C. K. Chui, B. Kao, and E. Hung. Mining frequent itemsets from
        uncertain data. In Advances in Knowledge Discovery and Data
        Mining (PAKDD), pages 47–58, 2007.

    [5] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns
        without candidate generation: A frequent-pattern tree approach.
        Data Mining and Knowledge Discovery, 8(1):53–87, 2004.

    [6] C. Leung, M. Mateo, and D. Brajczuk. A tree-based approach for
        frequent pattern mining from uncertain data. In Advances in
        Knowledge Discovery and Data Mining (PAKDD), pages 653–661, 2008.

    [7] C. K. S. Leung, B. Hao, and F. Jiang. Constrained frequent itemset
        mining from uncertain data streams. Pages 120–127, 2010.

    [8] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge
        University Press, 1995.

    [9] L. Sun, R. Cheng, and D. W. Cheung. Mining uncertain data with
        probabilistic guarantees. In Proceedings of the ACM SIGKDD
        International Conference on Knowledge Discovery and Data Mining
        (KDD), pages 273–282, 2010. (Recommended by Dan Olteanu; read by
        Nov 12, 4pm.)
