A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis

Norbert Fuhr
University of Duisburg-Essen

March 30, 2011
Outline

    1  Introduction
    2  Cluster Metric
    3  Optimum clustering
    4  Towards Optimum Clustering
    5  Experiments
    6  Conclusion and Outlook
Introduction

Motivation

Ad-hoc retrieval
    heuristic models:
        define a retrieval function
        evaluate to test if it yields good quality
    Probability Ranking Principle (PRP):
        theoretical foundation for optimum retrieval
        numerous probabilistic models based on the PRP

Document clustering
    classic approach:
        define a similarity function and a fusion principle
        evaluate to test if they yield good quality
    Optimum Clustering Principle?
Cluster Hypothesis

Original formulation:
"closely associated documents tend to be relevant to the same requests" (van Rijsbergen 1979)

Idea of optimum clustering:
Cluster documents in such a way that, for any request, the relevant documents occur together in one cluster
→ redefine document similarity:
documents are similar if they are relevant to the same queries
The Optimum Clustering Framework

[Figure: overview diagram of the Optimum Clustering Framework]
Cluster Metric

Defining a Metric Based on the Cluster Hypothesis

General idea:
    Evaluate a clustering wrt. a set of queries
    For each query and each cluster, regard the pairs of documents co-occurring in the cluster:
        relevant-relevant: good
        relevant-irrelevant: bad
        irrelevant-irrelevant: don't care
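A minimal sketch of this pair counting (not from the slides; the document ids and the relevant-set representation are illustrative):

```python
from itertools import combinations

def label_pairs(cluster, relevant):
    """Count co-occurring pairs in one cluster for one query.
    cluster: list of document ids; relevant: set of ids relevant to the query."""
    counts = {"good": 0, "bad": 0, "dont_care": 0}
    for d1, d2 in combinations(cluster, 2):
        rel1, rel2 = d1 in relevant, d2 in relevant
        if rel1 and rel2:
            counts["good"] += 1        # relevant-relevant
        elif rel1 or rel2:
            counts["bad"] += 1         # relevant-irrelevant
        else:
            counts["dont_care"] += 1   # irrelevant-irrelevant
    return counts

# cluster (d1 d2 d3) with d1 and d2 relevant: pair (d1,d2) is good, the other two are bad
print(label_pairs(["d1", "d2", "d3"], {"d1", "d2"}))
```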
Pairwise precision

    Q  set of queries
    D  document collection
    R  relevance judgments: R ⊂ Q × D
    C  clustering, C = {C_1, ..., C_n} s.th. \bigcup_{i=1}^n C_i = D and ∀i, j: i ≠ j → C_i ∩ C_j = ∅
    c_i = |C_i| (size of cluster C_i)
    r_{ik} = |{d_m ∈ C_i | (q_k, d_m) ∈ R}| (number of relevant documents in C_i wrt. q_k)

Pairwise precision (weighted average over all clusters):

    P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}
Pairwise precision – Example

    P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}

    Query set: disjoint classification with two classes a and b,
    three clusters: (aab|bb|aa)
    P_p = \frac{1}{7}\left(3(\tfrac{1}{3} + 0) + 2(0 + 1) + 2(1 + 0)\right) = \frac{5}{7}

    A perfect clustering for a disjoint classification would yield P_p = 1;
    for arbitrary query sets, values > 1 are possible.
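The example can be checked with a direct implementation of P_p (a sketch; encoding each class as a set of relevant document ids is an assumption, not from the slides):

```python
def pairwise_precision(clusters, rel_sets, n_docs):
    """P_p: weighted average of within-cluster relevant-pair precision.
    clusters: list of document-id lists; rel_sets: one set of relevant ids per query."""
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        if c < 2:                      # clusters of size 1 contribute nothing
            continue
        for rel in rel_sets:
            r = len(set(cluster) & rel)
            total += c * r * (r - 1) / (c * (c - 1))
    return total / n_docs

# (aab|bb|aa): docs 1..7, class a = {1, 2, 6, 7}, class b = {3, 4, 5}
clusters = [[1, 2, 3], [4, 5], [6, 7]]
rel = [{1, 2, 6, 7}, {3, 4, 5}]
print(pairwise_precision(clusters, rel, 7))  # 5/7
```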
Pairwise recall

    r_{ik} = |{d_m ∈ C_i | (q_k, d_m) ∈ R}| (number of relevant documents in C_i wrt. q_k)
    g_k = |{d ∈ D | (q_k, d) ∈ R}| (number of relevant documents for q_k)

(micro recall)

    R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k - 1)}

Example: (aab|bb|aa)
    2 a pairs (out of 6)
    1 b pair (out of 3)
    R_p = \frac{2 + 1}{6 + 3} = \frac{1}{3}
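A matching sketch for R_p (same assumption as for the precision example: classes encoded as sets of relevant document ids):

```python
def pairwise_recall(clusters, rel_sets):
    """R_p: fraction of relevant document pairs that co-occur in some cluster."""
    num = den = 0
    for rel in rel_sets:
        g = len(rel)
        if g > 1:
            den += g * (g - 1)         # all relevant (ordered) pairs for this query
        for cluster in clusters:
            r = len(set(cluster) & rel)
            num += r * (r - 1)         # relevant pairs kept together in this cluster
    return num / den

# (aab|bb|aa): docs 1..7, class a = {1, 2, 6, 7}, class b = {3, 4, 5}
print(pairwise_recall([[1, 2, 3], [4, 5], [6, 7]], [{1, 2, 6, 7}, {3, 4, 5}]))  # 1/3
```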
Perfect clustering

C is a perfect clustering iff there exists no clustering C′ s.th.
P_p(D, Q, R, C) < P_p(D, Q, R, C′) ∧ R_p(D, Q, R, C) < R_p(D, Q, R, C′)

strong Pareto optimum – more than one perfect clustering possible

Example:
    P_p({d_1, d_2, d_3}, {d_4, d_5}) = P_p({d_1, d_2}, {d_3, d_4, d_5}) = 1, R_p = 2/3
    P_p({d_1, d_2, d_3, d_4, d_5}) = 0.6, R_p = 1
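The slide does not state the underlying query set, but one consistent with all three values is (an assumption for illustration): q1 relevant to {d1, d2, d3} and q2 relevant to {d3, d4, d5}. A sketch reproducing the numbers:

```python
def pairwise_precision(clusters, rel_sets, n_docs):
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        if c < 2:
            continue
        for rel in rel_sets:
            r = len(set(cluster) & rel)
            total += c * r * (r - 1) / (c * (c - 1))
    return total / n_docs

def pairwise_recall(clusters, rel_sets):
    num = sum(len(set(cl) & rel) * (len(set(cl) & rel) - 1)
              for rel in rel_sets for cl in clusters)
    den = sum(len(rel) * (len(rel) - 1) for rel in rel_sets if len(rel) > 1)
    return num / den

rel = [{1, 2, 3}, {3, 4, 5}]                     # assumed query set
for clusters in ([[1, 2, 3], [4, 5]],            # perfect: P_p = 1,  R_p = 2/3
                 [[1, 2], [3, 4, 5]],            # perfect: P_p = 1,  R_p = 2/3
                 [[1, 2, 3, 4, 5]]):             # P_p = 0.6, R_p = 1
    print(pairwise_precision(clusters, rel, 5), pairwise_recall(clusters, rel))
```

No clustering here beats another in both measures at once, so several Pareto-optimal clusterings coexist.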
Do perfect clusterings form a hierarchy?

[Figure: P_p–R_p plot of three perfect clusterings:
    C = {{d_1, d_2, d_3, d_4}}
    C′ = {{d_1, d_2}, {d_3, d_4}}
    C′′ = {{d_1, d_2, d_3}, {d_4}}]
Optimum clustering

Optimum Clustering

    Usually, the clustering process has no knowledge of the relevance judgments
    → switch from external to internal cluster measures:
    replace the relevance judgments by estimates of the probability of relevance
    requires a probabilistic retrieval method yielding P(rel|q, d)
    → compute the expected cluster quality
Expected cluster quality

Pairwise precision:

    P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}

Expected precision:

    \pi(D, Q, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{c_i}{c_i(c_i - 1)} \sum_{q_k \in Q} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} P(rel|q_k, d_l)\,P(rel|q_k, d_m)
Expected precision

    \pi(D, Q, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{c_i}{c_i(c_i - 1)} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} \sum_{q_k \in Q} P(rel|q_k, d_l)\,P(rel|q_k, d_m)

Here \sum_{q_k \in Q} P(rel|q_k, d_l) P(rel|q_k, d_m) gives the expected number of queries for which both d_l and d_m are relevant.

Transform a document into a vector of relevance probabilities:
    \tau^T(d_m) = (P(rel|q_1, d_m), P(rel|q_2, d_m), \ldots, P(rel|q_{|Q|}, d_m))

    \pi(D, Q, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{1}{c_i - 1} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)
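A sketch of π computed from the τ vectors (the probability values below are invented for illustration, not from the slides):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def expected_precision(clusters, tau, n_docs):
    """pi: expected pairwise precision.
    tau maps doc id -> (P(rel|q_1, d), ..., P(rel|q_|Q|, d))."""
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        if c < 2:
            continue
        s = sum(dot(tau[dl], tau[dm])
                for dl in cluster for dm in cluster if dl != dm)
        total += s / (c - 1)           # c_i / (c_i (c_i - 1)) = 1 / (c_i - 1)
    return total / n_docs

# two queries; docs 1, 2 surely relevant to q1 only, docs 3, 4 to q2 only
tau = {1: [1.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 1.0], 4: [0.0, 1.0]}
print(expected_precision([[1, 2], [3, 4]], tau, 4))   # 1.0
print(expected_precision([[1, 2, 3, 4]], tau, 4))     # 1/3
```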
Expected recall

    R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k - 1)}

Direct estimation requires estimation of the denominator → biased estimates.
But: the denominator is constant for a given query set → ignore it and compute an estimate of the numerator only:

    \rho(D, Q, C) = \sum_{C_i \in C} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)

(The scalar product \tau^T(d_l) \cdot \tau(d_m) gives the expected number of queries for which both d_l and d_m are relevant.)
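A matching sketch for ρ (probabilities invented for illustration). Since ρ sums expected co-relevant pairs over all clusters, merging co-relevant documents into one cluster raises it:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rho(clusters, tau):
    """rho: the unnormalised numerator of expected pairwise recall."""
    return sum(dot(tau[dl], tau[dm])
               for cluster in clusters
               for dl in cluster for dm in cluster if dl != dm)

# three documents, all surely relevant to a single query
tau = {1: [1.0], 2: [1.0], 3: [1.0]}
print(rho([[1, 2], [3]], tau))    # 2.0 (only the ordered pairs (1,2), (2,1))
print(rho([[1, 2, 3]], tau))      # 6.0 (all six ordered pairs)
```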
Optimum clustering

C is an optimum clustering iff there exists no clustering C′ s.th.
\pi(D, Q, C) < \pi(D, Q, C′) ∧ \rho(D, Q, C) < \rho(D, Q, C′)

    Pareto optima
    The set of perfect (and optimum) clusterings does not even form a cluster hierarchy
    → no hierarchic clustering method will find all optima!
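One way to see the Pareto structure is to enumerate all partitions of a tiny collection and keep the non-dominated (π, ρ) pairs. The τ values below are invented for illustration; since ρ is maximal when all documents share one cluster, the single-cluster partition always lies on the front:

```python
def partitions(items):
    """Enumerate all partitions of a list (Bell-number many)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [p[i] + [first]] + p[i + 1:]
        yield p + [[first]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pi_rho(clusters, tau, n_docs):
    """Expected precision pi and (unnormalised) expected recall rho."""
    pi = rho = 0.0
    for cl in clusters:
        s = sum(dot(tau[l], tau[m]) for l in cl for m in cl if l != m)
        rho += s
        if len(cl) > 1:
            pi += s / (len(cl) - 1)
    return pi / n_docs, rho

tau = {1: [1.0, 0.0], 2: [1.0, 0.2], 3: [0.3, 1.0], 4: [0.0, 1.0]}
scored = [(pi_rho(p, tau, 4), p) for p in partitions([1, 2, 3, 4])]
# optimum: no other clustering is strictly better in both pi and rho
front = [(s, p) for s, p in scored
         if not any(o[0] > s[0] and o[1] > s[1] for o, _ in scored)]
for s, p in front:
    print(s, p)
```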
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   20
  Towards Optimum Clustering




       1    Introduction

       2    Cluster Metric

       3    Optimum clustering

       4    Towards Optimum Clustering

       5    Experiments

       6    Conclusion and Outlook
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   21
  Towards Optimum Clustering



      Towards Optimum Clustering
      Development of an (optimum) clustering method




          1   Set of queries,
          2   Probabilistic retrieval method,
          3   Document similarity metric, and
          4   Fusion principle.
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis                  22
  Towards Optimum Clustering



      A simple application


          1    Set of queries: all possible one-term queries
          2    Probabilistic retrieval method: tf ∗ idf
          3    Document similarity metric: τ^T(dl) · τ(dm)
          4    Fusion principle: group average clustering

                 π(D, Q, C) = 1/(c(c−1)) · Σ_{(dl,dm) ∈ C×C, dl≠dm}  τ^T(dl) · τ(dm)



              standard clustering method
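
      The τ vectors for this one-term query set can be sketched from tf·idf
      weights; in the sketch below a max-normalised tf·idf weight stands in
      for P(relevant | query, document) — an illustrative normalisation, not
      the estimator used in the paper:

```python
import math

def tau_vectors(docs):
    """docs: list of term lists. Returns one tau(d) per document over
    the query set of all one-term queries, using a tf*idf weight
    max-normalised into [0, 1] as a crude relevance probability."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in vocab}  # document frequency
    taus = []
    for d in docs:
        tf = {t: d.count(t) for t in set(d)}
        weights = [tf.get(t, 0) * math.log(1 + n / df[t]) for t in vocab]
        top = max(weights)
        taus.append([w / top if top > 0 else 0.0 for w in weights])
    return taus

docs = [["cluster", "metric"], ["cluster", "query"], ["query", "term"]]
taus = tau_vectors(docs)
print(len(taus), len(taus[0]))  # 3 documents, 4 one-term queries
```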
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   23
  Towards Optimum Clustering



      Query set


      Too few queries in real collections → artificial query set
              collection clustering: set of all possible one-term queries
                      Probability distribution over the query set: uniform /
                      proportional to doc. freq.
                      Document representation: original terms / transformations
                      of the term space
                      Semantic dimensions: focus on certain aspects only (e.g.
                      images: color, contour, texture)
              result clustering: set of all query expansions
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   24
  Towards Optimum Clustering



      Probabilistic retrieval method




              Model: In principle, any retrieval model suitable
              Transformation to probabilities: direct estimation /
              transforming the retrieval score into such a probability
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   25
  Towards Optimum Clustering



      Document similarity metric




      fixed as τ^T(dl) · τ(dm)
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   26
  Towards Optimum Clustering



      Fusion principles




      OCF only gives guidelines for good fusion principles:
      consider metrics π and/or ρ during fusion
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis             27
  Towards Optimum Clustering



      Group average clustering:



                 σ(C) = 1/(c(c−1)) · Σ_{(dl,dm) ∈ C×C, dl≠dm}  τ^T(dl) · τ(dm)



              expected precision as criterion!
              starts with singleton clusters → minimum recall
              builds larger clusters for increasing recall
              forms the cluster with the highest precision
              (which may be lower than that of the current clusters)
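
      One greedy fusion step under this criterion can be sketched as follows
      (function names are illustrative; the full method repeats the step
      until the desired number of clusters is reached):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigma(cluster, tau):
    """Group-average criterion: mean pairwise co-relevance (expected
    precision) over ordered pairs (dl != dm) within the cluster."""
    c = len(cluster)
    if c < 2:
        return 0.0
    s = sum(dot(tau[l], tau[m])
            for l in cluster for m in cluster if l != m)
    return s / (c * (c - 1))

def group_average_step(clusters, tau):
    """One greedy fusion step: merge the pair of clusters whose union
    has the highest expected precision."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            score = sigma(clusters[i] + clusters[j], tau)
            if best is None or score > best[0]:
                best = (score, i, j)
    _, i, j = best
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [clusters[i] + clusters[j]]

tau = [[1, 0], [1, 0], [0, 1]]
print(group_average_step([[0], [1], [2]], tau))  # merges the co-relevant pair
```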
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis                  28
  Towards Optimum Clustering



      Fusion principles – min cut




              starts with single cluster (maximum recall)
              searches for cut with minimum loss in recall

                               ρ(D, Q, C) = Σ_{Ci ∈ C}  Σ_{(dl,dm) ∈ Ci×Ci, dl≠dm}  τ^T(dl) · τ(dm)


              consider expected precision for breaking ties!
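
      For tiny document sets the split can be found by exhaustive search over
      bipartitions, which makes the recall-loss criterion explicit; a
      practical implementation would use a polynomial-time min-cut algorithm
      (e.g. Stoer–Wagner) on the similarity graph instead:

```python
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_cut_split(docs, tau):
    """Exhaustive search over bipartitions: return (loss, left, right)
    for the split minimising the co-relevance mass crossing the cut,
    i.e. the loss in expected recall."""
    best = None
    for r in range(1, len(docs) // 2 + 1):
        for left in combinations(docs, r):
            right = [d for d in docs if d not in left]
            loss = sum(dot(tau[l], tau[m]) for l in left for m in right)
            if best is None or loss < best[0]:
                best = (loss, list(left), right)
    return best

# Two co-relevant pairs: the best split separates them without any loss
tau = [[1, 0], [1, 0], [0, 1], [0, 1]]
loss, left, right = min_cut_split([0, 1, 2, 3], tau)
print(loss, sorted(left), sorted(right))  # 0 [0, 1] [2, 3]
```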
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   29
  Towards Optimum Clustering



      Finding optimum clusterings

      Min cut
      (assume cohesive similarity graph)
              starts with optimum clustering for maximum recall
              min cut finds split with minimum loss in recall
              consider precision for tie breaking
                   → optimum clustering for two clusters
              O(n3 ) (vs. O(2n ) for the general case)
              subsequent splits will not necessarily reach optima

      Group average
              in general, multiple fusion steps for reaching first optimum
              greedy strategy does not necessarily find this optimum!
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   30
  Experiments




       1    Introduction

       2    Cluster Metric

       3    Optimum clustering

       4    Towards Optimum Clustering

       5    Experiments

       6    Conclusion and Outlook
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   31
  Experiments



      Experiments with a Query Set


      ADI collection:
                35 queries
                70 documents (relevant to 2.4 queries on avg.)

      Experiments:
                Q35opt using the actual relevance in τ (d)
                   Q35 BM25 estimates for the 35 queries
                  1Tuni 1-term queries, uniform distribution
                   1Tdf 1-term queries, according to document frequency
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis                32
  Experiments




                          [Figure: expected precision vs. recall curves on the
                          ADI collection for the four query sets Q35opt, Q35,
                          1Tuni and 1Tdf]
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   33
  Experiments



      Using Keyphrases as Query Set
      Compare clustering results based on different query sets

          1     ‘bag-of-words’: single words as queries
          2     keyphrases automatically extracted as head-noun phrases,
                single query = all keyphrases of a document

      Test collections:
                4 test collections assembled from the RCV1 (Reuters)
                news corpus
                # documents: 600 vs. 6000
                # categories: 6 vs. 12
                frequency distribution of classes: [U]niform vs.
                [R]andom
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis            34
  Experiments



      Using Keyphrases as Query Set - Results




            [Figure: two panels comparing the query sets — Average Precision
            (left) and (external) F-measure (right)]
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   35
  Experiments



      Evaluation of the Expected F-Measure



      Correlation between the expected F-measure (internal measure) and
      the standard F-measure (computed against a reference classification)
                test collections as before
                regard the quality of 40 different clustering methods for
                each setting
                (find the optimum clustering among these 40 methods)
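
      Assuming both metrics are already normalised into [0, 1], the expected
      F-measure combines expected precision and expected recall like the
      standard F-measure; a sketch (the exact normalisation used in the
      experiments may differ):

```python
def expected_f(pi, rho, beta=1.0):
    """F_beta-style harmonic combination of expected precision (pi)
    and expected recall (rho), both assumed normalised to [0, 1]."""
    if pi + rho == 0:
        return 0.0
    return (1 + beta**2) * pi * rho / (beta**2 * pi + rho)

print(expected_f(0.8, 0.5))  # harmonic mean of the two metrics
```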
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   36
  Experiments



      Correlation results
      Pearson correlation between internal measures and the
      external F-Measure
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   37
  Conclusion and Outlook




       1    Introduction

       2    Cluster Metric

       3    Optimum clustering

       4    Towards Optimum Clustering

       5    Experiments

       6    Conclusion and Outlook
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   38
  Conclusion and Outlook



      Summary




      The Optimum Clustering Framework
              makes the Cluster Hypothesis a requirement
              forms a theoretical basis for the development of better
              clustering methods
              yields positive experimental evidence
A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis   39
  Conclusion and Outlook



      Further Research



      theoretical
              compatibility of existing clustering methods with OCF
              extension of OCF to soft clustering
              extension of OCF to hierarchical clustering

      experimental
              variation of query sets
              user experiments

K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrievalNisha Arankandath
 

Ähnlich wie The Optimum Clustering Framework: Implementing the Cluster Hypothesis (20)

Document ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspaceDocument ranking using qprp with concept of multi dimensional subspace
Document ranking using qprp with concept of multi dimensional subspace
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
A+Novel+Approach+Based+On+Prototypes+And+Rough+Sets+For+Document+And+Feature+...
 
cluster analysis
cluster analysiscluster analysis
cluster analysis
 
clustering.pptx
clustering.pptxclustering.pptx
clustering.pptx
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
Clustering ppt
Clustering pptClustering ppt
Clustering ppt
 
Inex07
Inex07Inex07
Inex07
 
Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.ppt
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Probablistic information retrieval
Probablistic information retrievalProbablistic information retrieval
Probablistic information retrieval
 
Cluster
ClusterCluster
Cluster
 

Mehr von yaevents

Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...yaevents
 
Тема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, ЯндексТема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, Яндексyaevents
 
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...yaevents
 
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндексi-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндексyaevents
 
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...yaevents
 
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...yaevents
 
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...yaevents
 
Мониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, ЯндексМониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, Яндексyaevents
 
Истории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, ЯндексИстории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, Яндексyaevents
 
Разработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, ShturmannРазработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, Shturmannyaevents
 
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...yaevents
 
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...yaevents
 
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, ЯндексСканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндексyaevents
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebookyaevents
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
 
Юнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, GoogleЮнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, Googleyaevents
 
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...yaevents
 
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...yaevents
 
В поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, НигмаВ поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, Нигмаyaevents
 
Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...yaevents
 

Mehr von yaevents (20)

Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
Как научить роботов тестировать веб-интерфейсы. Артем Ерошенко, Илья Кацев, Я...
 
Тема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, ЯндексТема для WordPress в БЭМ. Владимир Гриненко, Яндекс
Тема для WordPress в БЭМ. Владимир Гриненко, Яндекс
 
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
Построение сложносоставных блоков в шаблонизаторе bemhtml. Сергей Бережной, Я...
 
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндексi-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
i-bem.js: JavaScript в БЭМ-терминах. Елена Глухова, Варвара Степанова, Яндекс
 
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
Дом из готовых кирпичей. Библиотека блоков, тюнинг, инструменты. Елена Глухов...
 
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
Модели в профессиональной инженерии и тестировании программ. Александр Петрен...
 
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
 
Мониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, ЯндексМониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, Яндекс
 
Истории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, ЯндексИстории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, Яндекс
 
Разработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, ShturmannРазработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, Shturmann
 
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
 
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
 
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, ЯндексСканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, FacebookМасштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
Юнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, GoogleЮнит-тестирование и Google Mock. Влад Лосев, Google
Юнит-тестирование и Google Mock. Влад Лосев, Google
 
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
C++11 (formerly known as C++0x) is the new C++ language standard. Dave Abraha...
 
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
Зачем обычному программисту знать языки, на которых почти никто не пишет. Але...
 
В поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, НигмаВ поисках математики. Михаил Денисенко, Нигма
В поисках математики. Михаил Денисенко, Нигма
 
Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...Using classifiers to compute similarities between face images. Prof. Lior Wol...
Using classifiers to compute similarities between face images. Prof. Lior Wol...
 

Kürzlich hochgeladen

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

The Optimum Clustering Framework: Implementing the Cluster Hypothesis

  • 1. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis. Norbert Fuhr, University of Duisburg-Essen. March 30, 2011
  • 2. Outline: 1 Introduction; 2 Cluster Metric; 3 Optimum clustering; 4 Towards Optimum Clustering; 5 Experiments; 6 Conclusion and Outlook
  • 3. Introduction
  • 4. Motivation. Ad-hoc retrieval: heuristic models define a retrieval function and evaluate it to test whether it yields good quality; the Probability Ranking Principle (PRP) provides a theoretic foundation for optimum retrieval, and numerous probabilistic models are based on the PRP. Document clustering: the classic approach defines a similarity function and a fusion principle, then evaluates to test whether they yield good quality. Is there an Optimum Clustering Principle?
  • 7. Cluster Hypothesis. Original formulation: "closely associated documents tend to be relevant to the same requests" (Rijsbergen 1979). Idea of optimum clustering: cluster documents in such a way that, for any request, the relevant documents occur together in one cluster. This redefines document similarity: documents are similar if they are relevant to the same queries.
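The relevance-based notion of document similarity on slide 7 can be sketched in a few lines of Python. The slides only state that documents count as similar when they are relevant to the same queries; the Jaccard form used below is an illustrative assumption, and all names are mine:

```python
def relevance_similarity(rel, d1, d2):
    """Similarity of two documents under the cluster hypothesis: overlap of
    the sets of queries to which each document is relevant (Jaccard
    coefficient; the exact form is an assumption, not from the slides)."""
    q1 = {q for q, docs in rel.items() if d1 in docs}
    q2 = {q for q, docs in rel.items() if d2 in docs}
    if not (q1 | q2):               # neither document is relevant to any query
        return 0.0
    return len(q1 & q2) / len(q1 | q2)

# rel maps each query to the set of documents judged relevant to it
rel = {"q1": {"d1", "d2"}, "q2": {"d2", "d3"}}
print(relevance_similarity(rel, "d1", "d2"))  # 0.5: both relevant to q1, d2 also to q2
print(relevance_similarity(rel, "d1", "d3"))  # 0.0: no query in common
```

With relevance judgments in place of term vectors, d1 and d2 are similar because they share a relevant query, while d1 and d3 are not, regardless of their term overlap.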
  • 10. The Optimum Clustering Framework (figure)
  • 14. Cluster Metric
  • 15. Defining a Metric based on the Cluster Hypothesis. General idea: evaluate a clustering with respect to a set of queries. For each query and each cluster, regard the pairs of co-occurring documents: relevant-relevant is good, relevant-irrelevant is bad, irrelevant-irrelevant is don't care.
  • 16. Pairwise precision. Notation: $Q$ is the set of queries; $D$ is the document collection; $R \subset Q \times D$ are the relevance judgments; $C = \{C_1, \ldots, C_n\}$ is a clustering such that $\cup_{i=1}^{n} C_i = D$ and $\forall i, j: i \neq j \rightarrow C_i \cap C_j = \emptyset$; $c_i = |C_i|$ is the size of cluster $C_i$; $r_{ik} = |\{d_m \in C_i \mid (q_k, d_m) \in R\}|$ is the number of relevant documents in $C_i$ wrt. $q_k$. Pairwise precision (a weighted average over all clusters):

$$P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}$$
  • 17. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 18. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 19. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
  • 20. A Framework for Optimum Document Clustering: Implementing the Cluster Hypothesis 9 Cluster Metric Pairwise precision Q Set of queries D Document collection R relevance judgments: R ⊂ Q × D C Clustering, C = {C1 , . . . , Cn } s.th. ∪n Ci = D and i=1 ∀i, j : i = j → Ci ∩ Cj = ∅ ci = |Ci | (size of cluster Ci ), rik = |{dm ∈ Ci |(qk , dm ) ∈ R}| (number of relevant documents in Ci wrt. qk ) Pairwise precision (weighted average over all clusters) 1 rik (rik − 1) Pp (D, Q, R, C) = ci |D| ci (ci − 1) Ci ∈C qk ∈Q ci >1
• 21. Cluster Metric: Pairwise Precision – Example
  Query set: disjoint classification with two classes a and b; three clusters: (aab | bb | aa)
  $$P_p = \frac{1}{7}\left(3\left(\tfrac{1}{3} + 0\right) + 2(0 + 1) + 2(1 + 0)\right) = \frac{5}{7}$$
  A perfect clustering for a disjoint classification would yield P_p = 1; for arbitrary query sets, values > 1 are possible
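As a concrete illustration, here is a minimal Python sketch of the pairwise-precision formula, checked against the slide's (aab | bb | aa) example. The function name and the representation (clusters as lists of document ids, queries as sets of relevant document ids) are hypothetical choices, not part of the talk:

```python
from fractions import Fraction

def pairwise_precision(clusters, queries):
    """Pairwise precision: weighted average over clusters with > 1 document.

    clusters: list of lists of document ids
    queries:  dict mapping query id -> set of relevant document ids
    """
    n_docs = sum(len(c) for c in clusters)
    total = Fraction(0)
    for cluster in clusters:
        c = len(cluster)
        if c <= 1:
            continue  # clusters of size 1 contribute nothing (c_i > 1 condition)
        inner = Fraction(0)
        for rel in queries.values():
            r = sum(1 for d in cluster if d in rel)
            inner += Fraction(r * (r - 1), c * (c - 1))
        total += c * inner  # weight each cluster by its size c_i
    return total / n_docs

# The slide's example: classes a and b, clustering (aab | bb | aa)
clusters = [["a1", "a2", "b1"], ["b2", "b3"], ["a3", "a4"]]
queries = {"a": {"a1", "a2", "a3", "a4"}, "b": {"b1", "b2", "b3"}}
print(pairwise_precision(clusters, queries))  # 5/7
```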
• 23. Cluster Metric: Pairwise Recall
  r_{ik} = |{d_m ∈ C_i | (q_k, d_m) ∈ R}| (number of relevant documents in C_i w.r.t. q_k)
  g_k = |{d ∈ D | (q_k, d) ∈ R}| (number of relevant documents for q_k)
  Micro recall:
  $$R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k - 1)}$$
  Example (aab | bb | aa): 2 a-pairs (out of 6) and 1 b-pair (out of 3), so R_p = (2 + 1)/(6 + 3) = 1/3
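A matching sketch for pairwise recall, using the same hypothetical representation (clusters as lists of document ids, queries as relevant-document sets):

```python
from fractions import Fraction

def pairwise_recall(clusters, queries):
    """Micro-averaged pairwise recall: within-cluster relevant pairs over all
    relevant pairs, for queries with more than one relevant document."""
    num = 0
    den = 0
    for rel in queries.values():
        g = len(rel)
        if g > 1:
            den += g * (g - 1)  # all ordered relevant pairs for this query
        for cluster in clusters:
            r = sum(1 for d in cluster if d in rel)
            num += r * (r - 1)  # relevant pairs kept together in this cluster
    return Fraction(num, den)

# The slide's example: classes a and b, clustering (aab | bb | aa)
clusters = [["a1", "a2", "b1"], ["b2", "b3"], ["a3", "a4"]]
queries = {"a": {"a1", "a2", "a3", "a4"}, "b": {"b1", "b2", "b3"}}
print(pairwise_recall(clusters, queries))  # 1/3
```

Ordered pairs are counted in both numerator and denominator, so the result matches the slide's unordered-pair counting.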
• 25. Cluster Metric: Perfect Clustering
  C is a perfect clustering iff there exists no clustering C' such that P_p(D, Q, R, C) < P_p(D, Q, R, C') ∧ R_p(D, Q, R, C) < R_p(D, Q, R, C')
  Strong Pareto optimum – more than one perfect clustering is possible
  Example (e.g., with two queries whose relevant document sets are {d_1, d_2, d_3} and {d_3, d_4, d_5}):
    P_p({d_1, d_2, d_3}, {d_4, d_5}) = P_p({d_1, d_2}, {d_3, d_4, d_5}) = 1, with R_p = 2/3 in both cases
    P_p({d_1, d_2, d_3, d_4, d_5}) = 0.6, with R_p = 1
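The Pareto condition can be checked by brute force. The sketch below assumes a query set of two queries with relevant sets {d1, d2, d3} and {d3, d4, d5}, an assumption that reproduces the values stated on the slide; all helper names are hypothetical:

```python
from fractions import Fraction

def pp(clusters, queries):
    """Pairwise precision; clusters and query relevance sets are Python sets."""
    n = sum(len(cl) for cl in clusters)
    tot = Fraction(0)
    for cl in clusters:
        c = len(cl)
        if c > 1:
            tot += c * sum(Fraction(r * (r - 1), c * (c - 1))
                           for rel in queries for r in [len(cl & rel)])
    return tot / n

def rp(clusters, queries):
    """Micro-averaged pairwise recall."""
    num = sum(len(cl & rel) * (len(cl & rel) - 1) for rel in queries for cl in clusters)
    den = sum(g * (g - 1) for rel in queries for g in [len(rel)] if g > 1)
    return Fraction(num, den)

def dominates(x, y, queries):
    """x strictly beats y in both precision and recall."""
    return pp(x, queries) > pp(y, queries) and rp(x, queries) > rp(y, queries)

queries = [{"d1", "d2", "d3"}, {"d3", "d4", "d5"}]
split_a = [{"d1", "d2", "d3"}, {"d4", "d5"}]
split_b = [{"d1", "d2"}, {"d3", "d4", "d5"}]
single  = [{"d1", "d2", "d3", "d4", "d5"}]

print(pp(split_a, queries), rp(split_a, queries))  # 1 2/3
print(pp(split_b, queries), rp(split_b, queries))  # 1 2/3
print(pp(single, queries), rp(single, queries))    # 3/5 1

# Neither clustering dominates the other, so both are perfect (Pareto-optimal)
print(dominates(single, split_a, queries), dominates(split_a, single, queries))
```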
• 28. Cluster Metric: Do perfect clusterings form a hierarchy?
  [Figure: P_p vs. R_p plot of the three clusterings below]
  C = {{d_1, d_2, d_3, d_4}}
  C' = {{d_1, d_2}, {d_3, d_4}}
  C'' = {{d_1, d_2, d_3}, {d_4}}
• 32. Optimum clustering
• 33. Optimum Clustering
  Usually, the clustering process has no knowledge of the relevance judgments
  → switch from external to internal cluster measures:
    replace relevance judgments by estimates of the probability of relevance
    requires a probabilistic retrieval method yielding P(rel|q, d)
    compute the expected cluster quality
• 38. Optimum clustering: Expected Cluster Quality
  Pairwise precision:
  $$P_p(D, Q, R, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ c_i > 1}} c_i \sum_{q_k \in Q} \frac{r_{ik}(r_{ik} - 1)}{c_i(c_i - 1)}$$
  Expected precision:
  $$\pi(D, Q, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{c_i}{c_i(c_i - 1)} \sum_{q_k \in Q} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} P(rel|q_k, d_l)\, P(rel|q_k, d_m)$$
• 40. Optimum clustering: Expected Precision
  $$\pi(D, Q, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{c_i}{c_i(c_i - 1)} \sum_{q_k \in Q} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} P(rel|q_k, d_l)\, P(rel|q_k, d_m)$$
  Here Σ_{q_k ∈ Q} P(rel|q_k, d_l) P(rel|q_k, d_m) gives the expected number of queries for which both d_l and d_m are relevant.
  Transform each document into a vector of relevance probabilities:
  $$\tau^T(d_m) = (P(rel|q_1, d_m), P(rel|q_2, d_m), \ldots, P(rel|q_{|Q|}, d_m))$$
  $$\pi(D, Q, C) = \frac{1}{|D|} \sum_{\substack{C_i \in C \\ |C_i| > 1}} \frac{1}{c_i - 1} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)$$
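A sketch of the τ-vector form of the expected precision. The relevance probabilities below are assumed toy values for two queries and four documents, chosen only to make the computation concrete:

```python
def dot(u, v):
    """Scalar product tau^T(d_l) . tau(d_m)."""
    return sum(x * y for x, y in zip(u, v))

def expected_precision(clusters, tau):
    """pi(D, Q, C): per-cluster average (over ordered within-cluster pairs)
    of the expected number of shared relevant queries, weighted by |D|.

    clusters: list of lists of document ids
    tau:      dict doc id -> list of P(rel | q_k, d) over the query set
    """
    n_docs = sum(len(c) for c in clusters)
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        if c <= 1:
            continue  # singleton clusters contribute nothing
        pair_sum = sum(dot(tau[dl], tau[dm])
                       for dl in cluster for dm in cluster if dl != dm)
        total += pair_sum / (c - 1)
    return total / n_docs

# Assumed tau vectors: d1/d2 lean toward query 1, d3/d4 toward query 2
tau = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.2],
    "d3": [0.1, 0.9],
    "d4": [0.2, 0.8],
}
clustering = [["d1", "d2"], ["d3", "d4"]]
print(expected_precision(clustering, tau))
```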
• 42. Optimum clustering: Expected Recall
  $$R_p(D, Q, R, C) = \frac{\sum_{q_k \in Q} \sum_{C_i \in C} r_{ik}(r_{ik} - 1)}{\sum_{\substack{q_k \in Q \\ g_k > 1}} g_k(g_k - 1)}$$
  Direct estimation would require estimating the denominator → biased estimates
  But the denominator is constant for a given query set → ignore it and compute an estimate for the numerator only:
  $$\rho(D, Q, C) = \sum_{C_i \in C} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)$$
  (The scalar product τ^T(d_l) · τ(d_m) gives the expected number of queries for which both d_l and d_m are relevant.)
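The numerator-only recall estimate ρ can be sketched the same way. The τ vectors below are assumed values, and the comparison illustrates why a single all-in-one cluster maximizes ρ (no relevant pair is ever separated):

```python
def dot(u, v):
    """Scalar product tau^T(d_l) . tau(d_m)."""
    return sum(x * y for x, y in zip(u, v))

def rho(clusters, tau):
    """rho(D, Q, C): sum of tau similarities over all ordered
    within-cluster pairs with d_l != d_m (expected-recall numerator)."""
    return sum(dot(tau[dl], tau[dm])
               for cluster in clusters
               for dl in cluster for dm in cluster if dl != dm)

tau = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.2],
    "d3": [0.1, 0.9],
    "d4": [0.2, 0.8],
}
print(rho([["d1", "d2"], ["d3", "d4"]], tau))  # two-cluster split loses some pairs
print(rho([["d1", "d2", "d3", "d4"]], tau))    # single cluster maximizes rho
```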
• 46. Optimum clustering: Optimum Clustering
  C is an optimum clustering iff there exists no clustering C' such that π(D, Q, C) < π(D, Q, C') ∧ ρ(D, Q, C) < ρ(D, Q, C')
  → Pareto optima
  The set of perfect (and optimum) clusterings does not even form a cluster hierarchy
  → no hierarchic clustering method will find all optima!
• 50. Towards Optimum Clustering
• 51. Towards Optimum Clustering
  Developing an (optimum) clustering method requires choosing:
    1 a set of queries,
    2 a probabilistic retrieval method,
    3 a document similarity metric, and
    4 a fusion principle.
• 52. Towards Optimum Clustering: A Simple Application
  1 Set of queries: all possible one-term queries
  2 Probabilistic retrieval method: tf·idf
  3 Document similarity metric: τ^T(d_l) · τ(d_m)
  4 Fusion principle: group average clustering
  $$\pi(D, Q, C) = \frac{1}{c(c - 1)} \sum_{\substack{(d_l, d_m) \in C \times C \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)$$
  → a standard clustering method
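A sketch of steps 1 and 2 above: every term becomes a one-term query and a tf·idf weight stands in for P(rel | q_t, d). The talk does not fix how scores are turned into probabilities, so the max-normalization used here (so values lie in [0, 1]) is purely an illustrative assumption, and all names are hypothetical:

```python
import math
from collections import Counter

def tau_vectors(docs):
    """Build tau vectors over the one-term query set.

    docs: list of documents, each a list of term strings.
    Returns (vocab, vectors), where vectors[i][j] is the normalized
    tf*idf weight of vocab[j] in docs[i], used as a stand-in probability.
    """
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    vectors = []
    for d in docs:
        tf = Counter(d)
        w = [tf[t] * math.log(1 + n / df[t]) for t in vocab]
        m = max(w)  # assumed normalization: divide by the doc's max weight
        vectors.append([x / m for x in w])
    return vocab, vectors

docs = [["cluster", "metric", "cluster"], ["metric", "recall"], ["cluster", "recall"]]
vocab, vecs = tau_vectors(docs)
sim = sum(a * b for a, b in zip(vecs[0], vecs[1]))  # tau^T(d_0) . tau(d_1)
print(vocab, sim > 0)
```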
• 59. Towards Optimum Clustering: Query Set
  Too few queries in real collections → use artificial query sets
  Collection clustering: set of all possible one-term queries
    Probability distribution over the query set: uniform / proportional to document frequency
    Document representation: original terms / transformations of the term space
    Semantic dimensions: focus on certain aspects only (e.g., for images: color, contour, texture)
  Result clustering: set of all query expansions
• 60. Towards Optimum Clustering: Probabilistic Retrieval Method
  Model: in principle, any retrieval model is suitable
  Transformation to probabilities: direct estimation / transforming the retrieval score into a probability of relevance
• 61. Towards Optimum Clustering: Document Similarity Metric
  Fixed as τ^T(d_l) · τ(d_m)
• 62. Towards Optimum Clustering: Fusion Principles
  The OCF only gives guidelines for good fusion principles: consider the metrics π and/or ρ during fusion
• 63. Towards Optimum Clustering: Group Average Clustering
  $$\sigma(C) = \frac{1}{c(c - 1)} \sum_{\substack{(d_l, d_m) \in C \times C \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)$$
  Expected precision as criterion!
  Starts with singleton clusters (minimum recall)
  Builds larger clusters for increasing recall
  Forms the cluster with the highest precision (which may be lower than that of the current clusters)
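A greedy agglomerative sketch using σ as the merge criterion: at each step, merge the pair of clusters whose union has the highest group-average τ-similarity. The τ values are assumed toy data, and the greedy strategy is only a heuristic (it need not reach an optimum):

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sigma(cluster, tau):
    """Group-average criterion: mean tau similarity over ordered pairs."""
    c = len(cluster)
    if c < 2:
        return 0.0
    return sum(dot(tau[a], tau[b])
               for a in cluster for b in cluster if a != b) / (c * (c - 1))

def group_average(tau, n_clusters):
    """Greedy agglomeration down to n_clusters clusters."""
    clusters = [[d] for d in tau]  # start with singletons (minimum recall)
    while len(clusters) > n_clusters:
        # merge the pair whose union has the highest group-average similarity
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: sigma(clusters[ij[0]] + clusters[ij[1]], tau))
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

tau = {"d1": [0.9, 0.1], "d2": [0.8, 0.2], "d3": [0.1, 0.9], "d4": [0.2, 0.8]}
print(sorted(sorted(c) for c in group_average(tau, 2)))  # [['d1', 'd2'], ['d3', 'd4']]
```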
• 67. Towards Optimum Clustering: Fusion Principles – Min Cut
  Starts with a single cluster (maximum recall)
  Searches for the cut with the minimum loss in recall:
  $$\rho(D, Q, C) = \sum_{C_i \in C} \sum_{\substack{(d_l, d_m) \in C_i \times C_i \\ d_l \neq d_m}} \tau^T(d_l) \cdot \tau(d_m)$$
  Consider expected precision for breaking ties!
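A brute-force sketch of one min-cut step over the τ-similarity graph: among all two-way splits, pick the one that loses the least expected recall, i.e. minimizes the total τ-similarity across the cut. The enumeration here is exponential and only stands in for the O(n³) graph min cut mentioned in the talk; τ values and names are assumed:

```python
from itertools import combinations

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def min_cut_split(tau):
    """Exhaustively find the two-way split with minimum cross-cut similarity
    (i.e. minimum loss in the expected-recall numerator rho)."""
    docs = list(tau)
    best, best_loss = None, float("inf")
    for k in range(1, len(docs) // 2 + 1):
        for left in combinations(docs, k):
            right = [d for d in docs if d not in left]
            # similarity mass severed by this cut
            loss = sum(dot(tau[a], tau[b]) for a in left for b in right)
            if loss < best_loss:
                best, best_loss = (sorted(left), sorted(right)), loss
    return best

tau = {"d1": [0.9, 0.1], "d2": [0.8, 0.2], "d3": [0.1, 0.9], "d4": [0.2, 0.8]}
print(min_cut_split(tau))  # (['d1', 'd2'], ['d3', 'd4'])
```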
• 70. Towards Optimum Clustering: Finding Optimum Clusterings
  Min cut (assuming a cohesive similarity graph):
    starts with the optimum clustering for maximum recall
    min cut finds the split with the minimum loss in recall
    consider precision for tie breaking
    → optimum clustering for two clusters in O(n^3) (vs. O(2^n) for the general case)
    subsequent splits will not necessarily reach optima
  Group average:
    in general, multiple fusion steps are needed to reach the first optimum
    the greedy strategy does not necessarily find this optimum!
• 79. Experiments
• 80. Experiments with a Query Set
  ADI collection: 35 queries, 70 documents (relevant to 2.4 queries on average)
  Experiments:
    Q35opt: using the actual relevance judgments in τ(d)
    Q35: BM25 estimates for the 35 queries
    1Tuni: 1-term queries, uniform distribution
    1Tdf: 1-term queries, weighted according to document frequency
• 81. [Figure: precision–recall curves comparing Q35opt, Q35, 1Tuni, and 1Tdf on the ADI collection]
• 82. Experiments: Using Keyphrases as Query Set
  Compare clustering results based on different query sets:
    1 'bag-of-words': single words as queries
    2 keyphrases: automatically extracted as head-noun phrases; a single query = all keyphrases of one document
  Test collections: four collections assembled from the RCV1 (Reuters) news corpus
    # documents: 600 vs. 6000
    # categories: 6 vs. 12
    frequency distribution of classes: [U]niform vs. [R]andom
• 84. Experiments: Using Keyphrases as Query Set – Results
  [Figures: average precision and (external) F-measure for the two query sets]
• 85. Experiments: Evaluation of the Expected F-Measure
  Correlation between the expected F-measure (internal measure) and the standard F-measure (comparison with a reference classification)
  Test collections as before
  Regard the quality of 40 different clustering methods for each setting (find the optimum clustering among these 40 methods)
• 86. Experiments: Correlation Results
  [Table: Pearson correlation between the internal measures and the external F-measure]
• 87. Conclusion and Outlook
• 88. Conclusion and Outlook: Summary
  The Optimum Clustering Framework
    makes the Cluster Hypothesis a requirement,
    forms a theoretical basis for the development of better clustering methods, and
    yields positive experimental evidence.
• 89. Conclusion and Outlook: Further Research
    theoretical compatibility of existing clustering methods with the OCF
    extension of the OCF to soft clustering
    extension of the OCF to hierarchical clustering
    experimental variation of query sets
    user experiments