Making Interval-Based Clustering Rank-Aware




             Julia Stoyanovich (University of Pennsylvania)

                         joint work with
    Sihem Amer-Yahia (Qatar Foundation) and Tova Milo (Tel Aviv
                           University)


Яндекс         23.08.2011
Research Directions
• Representation of Large Complex Datasets
    – Symmetric relationships [VLDB 2004]
    – Faceted databases [VLDB 2005, Internet Archaeology 2007]
    – Schema polynomials [EDBT 2008]
    – Probabilistic databases [ICDE 2011]
    – Scientific workflows with provenance [CIDR 2011, ICDT 2011]

• Information Discovery in Large Complex Datasets
    – Search and ranking in social context [VLDB 2008, AAAI-SIP 2008, SIGMOD 2008]
    – Ranked data exploration in semantic context [ICDE 2010, SIGMOD 2011]
    – Rank-aware clustering [CIKM 2009, EDBT 2011]

    – Exploring repositories of scientific workflows [WANDS 2010, AMW 2011]
    – Exploring repositories of functional genomics experiments [submitted]
    – Estimating susceptibility to genetic disorders [Bioinformatics 2007]

Applications and Prototypes

    • The Faceted Query Engine applied to archaeology

    • Biological data management
          –   MutaGeneSys – estimating individual genetic disease susceptibility
          –   AnnotCompute – exploring repositories of microarray experiments
          –   SkylineSearch – semantic ranking and result visualization for PubMed
          –   myExperiment topics – exploring repositories of scientific workflows


    • “Shopping and dating”
          – Yahoo! Garçon – a collaborative tagging recommender system
          – Yahoo! FindLove – rank-aware clustering for dating data




Ranked Exploration of Structured Datasets
    Dating service user Mike

    •   Find matches
          – age: [18,40]
          – education: at least some college
          – income: > $50,000 / year

    •   Rank by income from higher to lower

    •   Problems
          – too many results
          – results are homogeneous at top ranks, due to correlations among attributes!
          – correlations may be complex, depend on the selection criteria and
            on the ranking function

    Result list
          MBA, 40 years old, makes $150K
          MBA, 40 years old, makes $150K
          MBA, 40 years old, makes $150K
          MBA, 40 years old, makes $150K
          … 999 matches
          PhD, 36 years old, makes $100K
          … 9999 matches
          BS, 27 years old, makes $80K
An Example from Yahoo! Personals

    [Chart: % of women with income > $50K and % of women with edu > BS, by age]




Observe that
    1.    % of women with income > $50K increases with age
    2.    % of women with post-graduate education increases until age 29, then plateaus
There is a clear positive correlation between
    1.    age and income, for all ages
    2.    education and income, at least until age 29
Correlations are local
Goal: Find Clusters that Correlate with Ranking




    • age: 18-25; edu: BS, MS; income: 50-75K
    • age: 26-37; edu: PhD; income: 100-130K
    • age: 33-40; income: 125-150K
    • edu: MS; income: 50-75K
    • age: 26-30; income: 75-110K
Roadmap

   • Introduction

   ➞ Rank-aware clustering
         – The formalism
         – The BARAC algorithm


   • Experimental evaluation
         – Effectiveness
         – Efficiency


   • Conclusion

What Is Subspace Clustering?




            Parsons et al., SIGKDD Explorations 6(1), 2006


Why Do We Need Subspace Clustering?




            Parsons et al., SIGKDD Explorations 6(1), 2006

How Do We Find Subspace Clusters?

    • Finds clusters in multiple, possibly overlapping, subspaces
          – Dimensionality reduction per cluster
          – Lower-dimensional clusters are easier to identify and their
            descriptions are more palatable to the users
          – Example: “age 20-25” and “edu = BS” and “income 25K-50K”


    • Two main approaches
          – Top-down: start with full dimensionality and refine
          – Bottom-up: start with dense units in 1D,
             combine to find higher-dimensional clusters

    • Issues
          – What is a cluster? – need a measure of quality
          – How do we find clusters? – need a search strategy
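
The bottom-up approach can be sketched in a few lines of Python. This is a generic density-based illustration in the spirit of CLIQUE, not BARAC itself; the toy items, the `MIN_DENSITY` threshold, and the helper `support` are assumptions made up for this example.

```python
from collections import Counter
from itertools import combinations

# Each item maps attribute -> discretized unit label (toy data, assumed).
items = [
    {"age": "20-25", "edu": "BS", "income": "25-50K"},
    {"age": "20-25", "edu": "BS", "income": "25-50K"},
    {"age": "20-25", "edu": "MS", "income": "50-75K"},
    {"age": "26-30", "edu": "MS", "income": "50-75K"},
    {"age": "26-30", "edu": "PhD", "income": "75-100K"},
]
MIN_DENSITY = 2  # a unit is dense if it holds at least this many items

def support(units):
    """Number of items matching every (attribute, value) pair in `units`."""
    return sum(all(it[a] == v for a, v in units) for it in items)

# Step 1: dense units in 1D.
counts = Counter((a, v) for it in items for a, v in it.items())
dense_1d = sorted(u for u, c in counts.items() if c >= MIN_DENSITY)

# Step 2: combine dense 1D units on different attributes into 2D candidates
# and keep only the candidates that are themselves dense.
dense_2d = [
    (u, w) for u, w in combinations(dense_1d, 2)
    if u[0] != w[0] and support((u, w)) >= MIN_DENSITY
]
print(dense_1d)
print(dense_2d)
```

Only units that are dense in 1D are ever combined, which is what keeps the search space smaller than enumerating every 2D cell.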


Problem Statement

      • User specifies a conjunction of filtering conditions, e.g.,

            Q: age ∈ [20,40] ∧ edu ≥ Bachelors

      • User specifies a ranking function, e.g., linear combination

           R: [income, desc], [age, asc]
         We do not restrict the set of ranking functions, but assume that ranking is
         derived from, or correlates with, attribute values

     Given a query Q and a ranking function R, find rank-aware clusters
   in subspaces of the dataset. Clusters are subspaces that:
                 •     have sufficient rank-aware quality
                 •     are tight
                 •     are maximal
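
To make the inputs concrete, here is a minimal Python sketch of a conjunctive query Q and a ranking R over a toy profile list. The profiles, the numeric encoding of education levels, the reading of the education condition as "at least a Bachelor's degree", and the tie-breaking by ascending age are all assumptions for illustration.

```python
# Toy profiles; values and the education encoding are made up for illustration.
EDU_LEVEL = {"HS": 0, "BS": 1, "MS": 2, "MBA": 2, "PhD": 3}

profiles = [
    {"id": "p1", "age": 40, "edu": "MBA", "income": 150},
    {"id": "p2", "age": 36, "edu": "PhD", "income": 100},
    {"id": "p3", "age": 27, "edu": "BS", "income": 80},
    {"id": "p4", "age": 45, "edu": "MS", "income": 120},
    {"id": "p5", "age": 19, "edu": "HS", "income": 30},
]

def q(p):
    """Q: age in [20, 40] AND edu at least Bachelors (a conjunction of filters)."""
    return 20 <= p["age"] <= 40 and EDU_LEVEL[p["edu"]] >= EDU_LEVEL["BS"]

def rank(matches):
    """R: income descending, ties broken by age ascending (assumed order)."""
    return sorted(matches, key=lambda p: (-p["income"], p["age"]))

ranked_ids = [p["id"] for p in rank([p for p in profiles if q(p)])]
print(ranked_ids)
```

Rank-aware clustering then looks for subspaces of the filtered set whose descriptions agree with this ordering.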

BARAC: Bottom-up Algorithm for Rank-Aware Clustering

  • BuildGrid
     – split each dimension into intervals
     – compute top-N for each interval

  • Merge
     – merge neighboring intervals using rank-aware locality (interval dominance)

          ensures tightness

  • Join
     – build K-dimensional clusters from compatible (K-1)-dimensional clusters
       using rank-aware clustering quality

          ensures maximality and rank-aware quality
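
A minimal sketch of the BuildGrid step for a single numeric dimension follows. The slides do not say how intervals are chosen, so the equal-width integer split, the function name `build_grid`, and the sample ages are assumptions.

```python
import heapq

def build_grid(items, attr, lo, hi, width, score, n):
    """Split [lo, hi] on `attr` into equal-width integer intervals and return
    {(start, end): ids of the top-n items by score} for non-empty intervals."""
    grid = {}
    start = lo
    while start <= hi:
        end = min(start + width - 1, hi)
        members = [it for it in items if start <= it[attr] <= end]
        if members:
            top = heapq.nlargest(n, members, key=score)
            grid[(start, end)] = [it["id"] for it in top]
        start = end + 1
    return grid

# Hypothetical profiles: ages are invented, incomes are in $K.
items = [
    {"id": "m1", "age": 25, "income": 99},
    {"id": "m3", "age": 27, "income": 90},
    {"id": "m7", "age": 29, "income": 75},
    {"id": "m9", "age": 26, "income": 65},
    {"id": "m6", "age": 31, "income": 125},
    {"id": "m8", "age": 33, "income": 110},
]
grid = build_grid(items, "age", 25, 34, 5, score=lambda it: it["income"], n=3)
print(grid)
```

Each grid cell carries its own top-N list, which is what the later Merge and Join steps compare.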



Avoiding Match Homogeneity at Top Ranks
 Cluster descriptions must accurately describe the top-N items

    Ranked matches
          MBA, 40 years old, makes $150K
          MBA, 40 years old, makes $150K
          MBA, 40 years old, makes $150K
          MBA, 40 years old, makes $150K
          … 999 matches
          PhD, 36 years old, makes $100K
          … 9999 matches
          BS, 27 years old, makes $80K

    Cluster descriptions
          – age: 25-40, income: 75-150K
          – age: 40, income: 150K

 Tightness will give us this property
Ranked Intervals and Interval Dominance
           • Ranked intervals: description, contents (items), top-N
                   – I1: age ∈ [25,30], I2: edu = MBA
           • Interval dominance is a rank-aware measure of locality, defined
                   – over 2 consecutive intervals on the same attribute
                   – for a ranking function R, integer N, and dominance threshold θdom ∈ (0.5, 1]

           I1 dominates I2 if

           Example: I1: age ∈ [20,24], I2: age ∈ [25,29], I1 + I2: age ∈ [20,29], top-10
                   – R1: age (asc)                    I2 <_{10, 1} I1
                   – R2: 0.3 inc + 0.7 edu (desc)     I1 <_{10, 0.8} I2
                   – R3: rel serv (asc)               I1 <>_{10, 0.5} I2
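
The dominance test can be sketched in Python as follows. The exact condition did not survive in the slide text, so the definition used here (I1 dominates I2 when I1 supplies at least a θdom share of the top-N of the merged interval) is an assumption chosen to be consistent with the age example; `dominates` and the toy items are hypothetical.

```python
def dominates(i1, i2, score, n, theta_dom):
    """True if i1 supplies at least theta_dom of the top-n of i1 + i2.

    Assumed definition; items are (id, value) pairs with unique ids.
    """
    if not 0.5 < theta_dom <= 1:
        raise ValueError("theta_dom must lie in (0.5, 1]")
    top = sorted(i1 + i2, key=score, reverse=True)[:n]
    if not top:
        return False
    ids_1 = {item_id for item_id, _ in i1}
    share = sum(1 for item_id, _ in top if item_id in ids_1) / len(top)
    return share >= theta_dom

# I1: age in [20, 24], I2: age in [25, 29], twelve toy items each.
i1 = [(f"a{k}", 20 + k % 5) for k in range(12)]
i2 = [(f"b{k}", 25 + k % 5) for k in range(12)]
by_age_asc = lambda item: -item[1]  # younger ranks higher

print(dominates(i1, i2, by_age_asc, n=10, theta_dom=1.0))
print(dominates(i2, i1, by_age_asc, n=10, theta_dom=0.6))
```

Under an ascending-age ranking every top-10 item of the merged interval comes from I1, so I1 dominates even at the strictest threshold.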
Property 1: Tightness
  R: [income, desc]

  38 years old                36 years old

      I1: age ∈ [30,34]      I2: age ∈ [35,39]      I1 + I2: age ∈ [30,39]

      age: 35-39, edu: PhD                           age: 30-39, edu: PhD

      if I1 dominates I2, then add I1 and I2 to the search space
                         else add I1, I2, and I1 + I2 to the search space
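
The merge rule can be sketched as a single pass over consecutive intervals. The dominance test is passed in as a predicate; checking dominance in both directions and merging only adjacent pairs (rather than growing longer runs) are assumptions of this sketch, and `merge_candidates` is a hypothetical name.

```python
def merge_candidates(intervals, dominates):
    """intervals: consecutive (lo, hi) pairs on one attribute, in order.
    Returns the search space: every original interval, plus the union of a
    neighboring pair whenever neither interval dominates the other."""
    space = list(intervals)
    for a, b in zip(intervals, intervals[1:]):
        if not (dominates(a, b) or dominates(b, a)):
            space.append((a[0], b[1]))
    return space

# Toy predicate: only [35, 39] dominates [30, 34], as in the tightness example.
toy_dominates = lambda a, b: (a, b) == ((35, 39), (30, 34))
space = merge_candidates([(30, 34), (35, 39), (40, 44)], toy_dominates)
print(space)
```

Because [35, 39] dominates [30, 34], the looser interval [30, 39] never enters the search space, which is how tightness is enforced.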
Choose Best from Among Comparable
  R: [income, desc]

      age: 33-40, income: 126-150K     >?     age: 33-40, income: 70-100K

      age: 33-40, income: 125-150K     ≠?     age: 26-30, income: 75-110K

  Rank-aware clustering quality will give us this property
Ranked Subspaces and Clusters

    A ranked subspace S : {I1, …, Im} is a set of ranked intervals over distinct
       attributes, e.g., S: { age ∈ [25,30], edu = MBA }
        • interpreted as a conjunction of predicates over dataset D
        • dimensionality = number of intervals


    Goal: find subspaces that have sufficient rank-aware clustering quality


    All rank-aware clustering quality measures
          – compare the top-N list of a ranked subspace to the top-N lists of its
            constituent ranked intervals
          – are defined for a ranking function R, an integer N, and a quality
            threshold θQ ∈ (0.5, 1]




Property 2: Rank-Aware Clustering Quality
 R: income (desc), N = 3, θQ = 2/3

      age: 25-29          edu: BS             age: 30-34
      m1     99K          m1     99K          m6     125K
      m3     90K          m2     95K          m8     110K
      m7     75K          m3     90K          m10    100K

      m9     65K          m4     85K          m2     95K
                                              m4     85K
                                              m5     85K

      age: 25-29, edu: BS       age: 30-34, edu: BS
      m1     99K                m2     95K
      m3     90K                m4     85K
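
The example can be checked with a short Python sketch that treats top-N lists as sets. Taking the minimum overlap across the constituent intervals is an assumed reading of the measure, and `top_ids` and `q_topn` are hypothetical names; the ranked lists are the ones shown on the slide.

```python
N = 3

# Ranked lists from the slide (income in $K, descending).
age_25_29 = [("m1", 99), ("m3", 90), ("m7", 75), ("m9", 65)]
edu_bs = [("m1", 99), ("m2", 95), ("m3", 90), ("m4", 85)]
age_30_34 = [("m6", 125), ("m8", 110), ("m10", 100),
             ("m2", 95), ("m4", 85), ("m5", 85)]

def top_ids(ranked, n=N):
    """Ids of the first n entries of a ranked list."""
    return {item_id for item_id, _ in ranked[:n]}

def q_topn(subspace, intervals, n=N):
    """Smallest fraction of any interval's top-n that also appears in the
    subspace's top-n (assumed aggregation across intervals)."""
    sub_top = top_ids(subspace, n)
    return min(len(sub_top & top_ids(iv, n)) / n for iv in intervals)

s1 = [("m1", 99), ("m3", 90)]  # age: 25-29 AND edu: BS
s2 = [("m2", 95), ("m4", 85)]  # age: 30-34 AND edu: BS
q1 = q_topn(s1, [age_25_29, edu_bs])
q2 = q_topn(s2, [age_30_34, edu_bs])
print(q1, q2)
```

The first subspace keeps two of the three top items of each of its intervals, while the second shares no item with the top-3 of age: 30-34.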
Rank-Aware Clustering Quality Measures
    • QtopN: subspace contains more than a θQ fraction of the items in the
        top-N of its intervals
          – Considers top-N lists as sets


    • QSCORE: subspace contains more than a θQ fraction of the high-scoring
        items in the top-N of its intervals
          – Based on the sums of scores of top-N items


    • QSCORE&RANK: subspace contains more than a θQ fraction of the
        high-scoring, high-ranking items in the top-N of its intervals
          – Based on NDCG, incorporates both scores and ranks



    •   Clustering quality measures must exhibit downward closure
          – Quality of a subspace is no higher than the quality of its included subspaces
          – Holds trivially for density-based measures, due to set properties
          – Also holds for our measures, details omitted here
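
The slides give only one-line descriptions of the score-based measures, so the following Python sketch is an assumption about how they might be computed, not the authors' definitions. `score_sum_quality` compares score sums; `dcg_quality` uses the standard discounted cumulative gain, where the score at rank r is divided by log2(r + 1).

```python
import math

def score_sum_quality(subspace_top, interval_top):
    """Share of the interval's top-N score mass retained by the subspace."""
    kept = {item_id for item_id, _ in subspace_top}
    total = sum(score for _, score in interval_top)
    return sum(score for item_id, score in interval_top if item_id in kept) / total

def dcg(scores):
    """Discounted cumulative gain of a ranked list of scores."""
    return sum(s / math.log2(r + 1) for r, s in enumerate(scores, start=1))

def dcg_quality(subspace_top, interval_top):
    """DCG of the interval's top-N with scores of items missing from the
    subspace set to zero, normalized by the DCG of the full top-N."""
    kept = {item_id for item_id, _ in subspace_top}
    gains = [s if item_id in kept else 0 for item_id, s in interval_top]
    return dcg(gains) / dcg([s for _, s in interval_top])

# Top-3 of "edu: BS" and of "age: 25-29 AND edu: BS" (incomes in $K).
interval_top = [("m1", 99), ("m2", 95), ("m3", 90)]
subspace_top = [("m1", 99), ("m3", 90)]
q_score = score_sum_quality(subspace_top, interval_top)
q_rank = dcg_quality(subspace_top, interval_top)
print(round(q_score, 3), round(q_rank, 3))
```

Losing the second-ranked item costs less under the DCG-based variant than under the plain score sum, because lower ranks are discounted.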



Property 3: Maximality
   Avoid producing redundant clusters




           age: 25-40
            edu: PhD

                                                        age: 25-40
                                  edu: PhD               edu: PhD
                              income: 100-130K      income: 100-130K

              age: 25-40
          income: 100-130K



              Maximality will give us this property
                    comes for free with bottom-up subspace clustering
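
As a sketch of what maximality rules out, the following Python filter drops every cluster whose description is a proper subset of another cluster's description. In a bottom-up algorithm this falls out of the search itself; the explicit filter and the name `maximal_only` are for illustration only.

```python
def maximal_only(clusters):
    """Keep clusters whose set of intervals is not a proper subset of the
    interval set of another cluster."""
    sets = [frozenset(c) for c in clusters]
    return [c for c, s in zip(clusters, sets) if not any(s < t for t in sets)]

clusters = [
    ("age: 25-40", "edu: PhD"),
    ("edu: PhD", "income: 100-130K"),
    ("age: 25-40", "income: 100-130K"),
    ("age: 25-40", "edu: PhD", "income: 100-130K"),
]
kept = maximal_only(clusters)
print(kept)
```

The three 2-dimensional clusters are redundant once the 3-dimensional cluster that contains them is reported.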

BARAC Recap

 • BuildGrid
     – split each dimension into intervals
     – compute top-N for each interval

 • Merge
     – merge neighboring intervals using rank-aware locality (interval dominance)

          ensures tightness

 • Join
     – build K-dimensional clusters from compatible (K-1)-dimensional clusters
       using rank-aware clustering quality

          ensures maximality and rank-aware quality
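
The Join step can be sketched as an Apriori-style candidate generation. The compatibility rule used here (two (K-1)-dimensional clusters share K-2 intervals and their remaining intervals are on different attributes) is an assumption, the quality test is passed in as a predicate, and the sketch omits the check that every (K-1)-dimensional subset of a candidate is itself a cluster.

```python
from itertools import combinations

def join(clusters, quality_ok):
    """Build K-dimensional clusters from pairs of (K-1)-dimensional ones.

    clusters: frozensets of (attribute, interval) pairs, all of size K-1.
    quality_ok: predicate standing in for the rank-aware quality test.
    """
    out = set()
    for a, b in combinations(list(clusters), 2):
        cand = a | b
        attrs = {attr for attr, _ in cand}
        compatible = len(cand) == len(a) + 1 and len(attrs) == len(cand)
        if compatible and quality_ok(cand):
            out.add(cand)
    return out

one_d = [
    frozenset({("age", "25-29")}),
    frozenset({("age", "30-34")}),
    frozenset({("edu", "BS")}),
]
# Toy quality test: only age 25-29 with edu BS passes.
good = frozenset({("age", "25-29"), ("edu", "BS")})
two_d = join(one_d, lambda cand: cand == good)
print(two_d)
```

Two intervals on the same attribute are never joined, so each candidate is a valid conjunction over distinct attributes.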



Complexity of BARAC

    • Polynomial in input size, exponential in the number of attributes

    • Exponential dependency is unavoidable!
          – Even counting distinct maximal frequent itemsets is #P-complete


    • Example
          –   1 item for each combination of attribute values
          –   each item has an arbitrary distinct score
          –   find rank-aware clusters with QtopN, N = 1
          –   there is 1 cluster per item, so an exponential number of clusters!

    • But lower in practice
          – correlations are local
          – clustering quality requires 50% overlap at top-N
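
The growth in the worst-case example is easy to reproduce. This Python sketch assumes 3 values per attribute, which is an arbitrary choice, and simply counts the items produced when there is one item per combination of attribute values.

```python
from itertools import product

def one_item_per_combination(domains):
    """Return one item (a tuple of attribute values) per value combination."""
    return list(product(*domains))

sizes = []
for d in range(1, 6):
    domains = [("a", "b", "c")] * d  # d attributes, 3 values each (assumed)
    # With distinct scores and N = 1, the example yields one cluster per item.
    sizes.append(len(one_item_per_combination(domains)))
print(sizes)
```

With k values per attribute and d attributes the count is k to the power d, which is the exponential dependency on the number of attributes.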


Roadmap

   • Introduction

   • Rank-aware clustering
         – The formalism
         – The BARAC algorithm


   ➞ Experimental evaluation
         – Effectiveness
         – Efficiency


   • Conclusion

Experimental Dataset: Yahoo! Personals
    • Data and users
          –   5 weeks, 454 users, 861 searches
          –   19 filtering attributes, 17 clustering attributes, 6 ranking attributes
          –   Filtering on attributes, user-specified
          –   Filtering on geo location (only for effectiveness evaluation)
          – QtopN clustering quality metric


    • Ranking function: weighted sum
          – sum of normalized per-attribute distances from best attribute value
            from among matches
          – attributes: age, height, body type, education, income, religious
            services
          – personalized by user: choice of attributes, sort order, normalization
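
A minimal Python sketch of such a ranking function follows. Normalizing each distance by the range of values among the matches is an assumption, as are the weights, the sample matches, and the name `rank_matches`; lower scores rank higher because the score is a distance from the best value.

```python
def rank_matches(matches, prefs):
    """prefs: {attribute: (weight, "asc" or "desc")}.
    Score = weighted sum of normalized distances from the best value among
    the matches; 0 means best on every attribute. Returns ids, best first."""
    scored = []
    for m in matches:
        total = 0.0
        for attr, (weight, order) in prefs.items():
            values = [x[attr] for x in matches]
            best = min(values) if order == "asc" else max(values)
            spread = (max(values) - min(values)) or 1  # avoid division by zero
            total += weight * abs(m[attr] - best) / spread
        scored.append((total, m["id"]))
    return [match_id for _, match_id in sorted(scored)]

matches = [
    {"id": "a", "age": 25, "income": 80},
    {"id": "b", "age": 30, "income": 150},
    {"id": "c", "age": 40, "income": 100},
]
order = rank_matches(matches, {"income": (0.7, "desc"), "age": (0.3, "asc")})
print(order)
```

The per-user choice of attributes, sort order, and normalization maps onto the `prefs` argument in this sketch.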



Evaluation of Effectiveness: User Study


                                      presentation
           content               list            groups
           top-100               top list        top groups
           BARAC                 BARAC list      BARAC groups
Effectiveness Metrics and Results
    • Users may fave matches and / or groups
          – When a group is faved, all matches in that group are faved


    • A productive search has at least 1 faved match/group


   treatment       % prod. searches    num. faves per search    num. faves per prod. search
   top list               17                   0.84                        5.05
   top group              14                   0.87                 7.33 / 1.17 groups
   BARAC list             15                   0.74                        4.93
   BARAC group            20                   1.55                12.38 / 1.91 groups
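
The three metrics can be computed from a per-search log of fave counts, as in this Python sketch. The log format (one integer per search, counting faved matches after group faves have been expanded to their member matches) and the name `effectiveness` are assumptions; the sample numbers are invented and are not the study's data.

```python
def effectiveness(faves_per_search_log):
    """faves_per_search_log: number of faved matches for each search."""
    productive = [f for f in faves_per_search_log if f >= 1]
    n = len(faves_per_search_log)
    return {
        "pct_productive": 100 * len(productive) / n,
        "faves_per_search": sum(faves_per_search_log) / n,
        "faves_per_productive": (
            sum(productive) / len(productive) if productive else 0.0
        ),
    }

metrics = effectiveness([0, 0, 3, 0, 5])  # five hypothetical searches
print(metrics)
```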



Evaluation of Efficiency
    • Summary of results: BARAC is scalable
          – runtimes of BuildGrid and Join dominate performance
          – runtime of Merge is negligible


    • All reported results are over the complete set of female profiles
      in Yahoo! Personals, without any location-based filtering!




Evaluation of Efficiency
    [Plot: runtime of BuildGrid (ms), 0 to 8000, against # items, 0 to 500000]
Evaluation of Efficiency
    [Plot: runtime of Join (ms), 0 to 3500, against # clustering dimensions, 2 to 17]
Performance of Join

    [Plot: runtime of Join (ms), 0 to 600, against quality threshold, 0.5 to 1;
     one series per dimensionality, 3D to 9D]

   * results for 100 Yahoo! Personals users on the full Y!P dataset.
Performance of Join

    [Plot: runtime of Join (ms), 0 to 1000, against dominance threshold, 0.5 to 1;
     one series per dimensionality, 3D to 9D]

   * results for 100 Yahoo! Personals users on the full Y!P dataset.
Roadmap

   • Introduction

   • Rank-aware clustering
         – The formalism
         – The BARAC algorithm


   • Experimental evaluation
         – Effectiveness
         – Efficiency


   ➞ Conclusion

Rank-Aware Clustering: Recap
  •   Formalized rank-aware clustering, a novel data exploration paradigm

  •   Developed a rank-aware measure of locality and a family of rank-aware
      clustering quality measures

  •   Proposed BARAC: a bottom-up algorithm for rank-aware clustering

  •   Presented an experimental evaluation on Yahoo! Personals (also
      restaurants in Yahoo! Local)
       • Effectiveness
       • Efficiency
Related Work

    • Subspace clustering
          – CLIQUE [Agrawal et al, 1998], ENCLUS [Cheng et al, 1999]
          – Improvements [Nagesh, 1999], [Liu et al, 2000], [Chang and Jin, 2002]
    • Ranking of structured data
          – Many answers, empty answer problems [Chaudhuri et al, 2004], [Agrawal et al,
            2003]
          – Rank-aware attribute selection [Das et al, 2006]
    • Integrating ranking with clustering
          – Mixture model, mutual reinforcement between ranking and clustering, for
            heterogeneous information networks, e.g., DBLP [Sun et al, 2009]
    • Diversification
          – Web search [Agichtein et al, 2007], [Anagnostopoulos et al, 2005], [Kummamuru
            et al, 2004], …
          – Database queries [Chen and Li, 2007], [Vee et al, 2008]
          – Recommendation [Boim et al, 2011], [Yu et al, 2009]


Future Work: Choosing a Clustering Quality Measure


    [Plot: score, 0 to 12, against rank, 0 to 100; series: attribute-rank, geo-rank]
Thank you!




Take 1: Density-Based Clustering



     age: 18-25         age: 26-30         age: 31-35          age: 36-40




                               min density = 2




   income: 50-75K        income: 76-100K    income: 101-125K      income: 126-150K
Take 1: Density-Based Clustering



                 age: 18-30                     age: 31-35         age: 36-40




                                   min density = 2

             age: 18-30                                             age: 36-40
          income: 50-75K                                        income: 101-150K




   income: 50-75K             income: 76-100K                income: 101-150K

Take 2: A Lower Threshold?



     age: 18-25         age: 26-30         age: 31-35          age: 36-40




                               min density = 1




     income: 50-75K      income: 76-100K    income: 101-125K      income: 126-150K
Take 2: A Lower Threshold?



                           age: 18-40




                        density > 0

                  age: 18-40; income: 50-150K




                        income: 50-150K

Performance of BARAC

    [Chart: percentage, 0% to 100%, per runtime bucket (<30 sec, <20 sec, <15 sec,
     <10 sec, <5 sec, <1 sec); series: BuildGrid, Join, Total]

  * results for 100 Yahoo! Personals users on the full Y!P dataset.

Weitere ähnliche Inhalte

Mehr von yaevents

Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
Администрирование небольших сервисов или один за всех и 100 на одного. Роман ...
yaevents
 
Мониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, ЯндексМониторинг со всех сторон. Алексей Симаков, Яндекс
Мониторинг со всех сторон. Алексей Симаков, Яндекс
yaevents
 
Истории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, ЯндексИстории про разработку сайтов. Сергей Бережной, Яндекс
Истории про разработку сайтов. Сергей Бережной, Яндекс
yaevents
 
Разработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, ShturmannРазработка приложений для Android на С++. Юрий Береза, Shturmann
Разработка приложений для Android на С++. Юрий Береза, Shturmann
yaevents
 
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
Кросс-платформенная разработка под мобильные устройства. Дмитрий Жестилевский...
yaevents
 
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
Сложнейшие техники, применяемые буткитами и полиморфными вирусами. Вячеслав З...
yaevents
 
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, ЯндексСканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
Сканирование уязвимостей со вкусом Яндекса. Тарас Иващенко, Яндекс
yaevents
 
Масштабируемость Hadoop в Facebook. Дмитрий Мольков, Facebook
Julia Stoyanovich - Making interval-based clustering rank-aware

  • 1. Making Interval-Based Clustering Rank-Aware
      Julia Stoyanovich (University of Pennsylvania), joint work with Sihem Amer-Yahia (Qatar Foundation) and Tova Milo (Tel Aviv University)
      Яндекс, 23.08.2011
  • 2. Research Directions
      • Representation of Large Complex Datasets
          – Symmetric relationships [VLDB 2004]
          – Faceted databases [VLDB 2005, Internet Archaeology 2007]
          – Schema polynomials [EDBT 2008]
          – Probabilistic databases [ICDE 2011]
          – Scientific workflows with provenance [CIDR 2011, ICDT 2011]
      • Information Discovery in Large Complex Datasets
          – Search and ranking in social context [VLDB 2008, AAAI-SIP 2008, SIGMOD 2008]
          – Ranked data exploration in semantic context [ICDE 2010, SIGMOD 2011]
          – Rank-aware clustering [CIKM 2009, EDBT 2011]
          – Exploring repositories of scientific workflows [WANDS 2010, AMW 2011]
          – Exploring repositories of functional genomics experiments [submitted]
          – Estimating susceptibility to genetic disorders [Bioinformatics 2007]
  • 3. Applications and Prototypes
      • The Faceted Query Engine applied to archaeology
      • Biological data management
          – MutaGeneSys – estimating individual genetic disease susceptibility
          – AnnotCompute – exploring repositories of microarray experiments
          – SkylineSearch – semantic ranking and result visualization for PubMed
          – myExperiment topics – exploring repositories of scientific workflows
      • “Shopping and dating”
          – Yahoo! Garçon – a collaborative tagging recommender system
          – Yahoo! FindLove – rank-aware clustering for dating data
  • 4. Ranked Exploration of Structured Datasets
      Dating service user Mike
      • Find matches
          – age: [18,40]
          – education: at least some college
          – income: > $50,000 / year
      • Rank by income from higher to lower
      • Problems
          – too many results
          – results are homogeneous at top ranks, due to correlations among attributes!
          – correlations may be complex, depend on the selection criteria and on the ranking function
      Ranked result list shown on the slide: four matches "MBA, 40 years old, makes $150K"; … 999 matches; "PhD, 36 years old, makes $100K"; … 9999 matches; "BS, 27 years old, makes $80K"
  • 5. An Example from Yahoo! Personals
      [Chart legend: income > $50K; edu > BS]
      Observe that
          1. % of women with income > $50K increases with age
          2. % of women with post-graduate education increases until age 29, then plateaus
      There is a clear positive correlation between
          1. age and income, for all ages
          2. education and income, at least until age 29
      Correlations are local
  • 6. Goal: Find Clusters that Correlate with Ranking
      Example clusters:
          – age: 26-37, edu: PhD, income: 100-130K
          – age: 18-25, edu: BS, MS, income: 50-75K
          – age: 33-40, income: 125-150K
          – edu: MS, income: 50-75K
          – age: 26-30, income: 75-110K
  • 7. Roadmap
      • Introduction
      ➞ Rank-aware clustering
          – The formalism
          – The BARAC algorithm
      • Experimental evaluation
          – Effectiveness
          – Efficiency
      • Conclusion
  • 8. What Is Subspace Clustering? (Parsons et al., SIGKDD Explorations 6(1), 2004)
  • 9. Why Do We Need Subspace Clustering? (Parsons et al., SIGKDD Explorations 6(1), 2004)
  • 10. How Do We Find Subspace Clusters?
      • Finds clusters in multiple, possibly overlapping, subspaces
          – Dimensionality reduction per cluster
          – Lower-dimensional clusters are easier to identify and their descriptions are more palatable to the users
          – Example: “age 20-25” and “edu = BS” and “income 25K-50K”
      • Two main approaches
          – Top-down: start with full dimensionality and refine
          – Bottom-up: start with dense units in 1D, combine to find higher-dimensional clusters
      • Issues
          – What is a cluster? – need a measure of quality
          – How do we find clusters? – need a search strategy
  • 11. Problem Statement
      • User specifies a conjunction of filtering conditions, e.g., Q: age ∈ [20,40] ∧ edu ≥ Bachelors
      • User specifies a ranking function, e.g., linear combination R: [income, …], [age, …]
      We do not restrict the set of ranking functions, but assume that ranking is derived from, or correlates with, attribute values.
      Given a query Q and a ranking function R, find rank-aware clusters in subspaces of the dataset. Clusters are subspaces that:
      • have sufficient rank-aware quality
      • are tight
      • are maximal
  • 12. BARAC: Bottom-up Algorithm for Rank-Aware Clustering
      • BuildGrid
          – split each dimension into intervals
          – compute top-N for each interval
      • Merge
          – merge neighboring intervals using rank-aware locality (interval dominance); ensures tightness
      • Join
          – build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality; ensures maximality and rank-aware quality
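The BuildGrid step above (split a dimension into intervals, compute the top-N for each interval) can be sketched in Python. This is a minimal illustration, not the authors' implementation: the equal-width splitting, the function name `build_grid`, the dict-based item layout, and the sample ages are all assumptions made for the example.

```python
import heapq
from collections import defaultdict

def build_grid(items, attr, width, score, n):
    """Split one numeric dimension into equal-width intervals (an assumption)
    and keep the n best-scoring items of each interval.

    items: list of dicts; attr: attribute to split on; width: interval width;
    score: function item -> number, higher is better; n: top-N list length.
    """
    buckets = defaultdict(list)
    for item in items:
        lo = (item[attr] // width) * width          # left edge of the item's interval
        buckets[(lo, lo + width - 1)].append(item)
    # For each interval, keep only its n best-scoring items, best first.
    return {iv: heapq.nlargest(n, members, key=score)
            for iv, members in sorted(buckets.items())}

# Hypothetical sample data: incomes in $K, ages invented for illustration.
people = [
    {"id": "m1", "age": 26, "income": 99},
    {"id": "m3", "age": 28, "income": 90},
    {"id": "m7", "age": 25, "income": 75},
    {"id": "m6", "age": 31, "income": 125},
    {"id": "m2", "age": 33, "income": 95},
]
grid = build_grid(people, "age", 5, lambda p: p["income"], 2)
top_ids = {iv: [p["id"] for p in top] for iv, top in grid.items()}
```

Here `top_ids` maps the interval (25, 29) to its two highest-income members and (30, 34) to its two members; a real grid would be built once per clustering attribute.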
  • 13. Avoiding Match Homogeneity at Top Ranks
      Cluster descriptions must accurately describe the top-N items.
      Ranked result list: four matches "MBA, 40 years old, makes $150K"; … 999 matches; "PhD, 36 years old, makes $100K"; … 9999 matches; "BS, 27 years old, makes $80K"
      Candidate descriptions: "age: 25-40, income: 75-150K" vs. "age: 40, income: 150K"
      Tightness will give us this property
  • 14. Ranked Intervals and Interval Dominance
      • Ranked intervals: description, contents (items), top-N
          – I1: age ∈ [25,30], I2: edu = MBA
      • Interval dominance is a rank-aware measure of locality, defined
          – over 2 consecutive intervals on the same attribute
          – for a ranking function R, integer N, and dominance threshold θdom ∈ (0.5, 1]
      I1 dominates I2 if … [formula not captured in the transcript]
      Example: I1: age ∈ [20,24], I2: age ∈ [25,29], I1 + I2: age ∈ [20,29]; top-10 under three rankings:
          – R1: age (asc): I2 <_{10,1} I1
          – R2: 0.3 inc + 0.7 edu (desc): I1 <_{10,0.8} I2
          – R3: rel serv (asc): I1 <>_{10,0.5} I2
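The transcript does not preserve the dominance formula, so the following Python sketch rests on an explicitly assumed reading: I1 dominates I2 when at least a θdom fraction of the top-N items of the merged interval I1 + I2 come from I1. The names `top_n` and `dominates` and the sample scores are hypothetical.

```python
def top_n(items, n):
    """Return the n highest-scoring (item_id, score) pairs, best first."""
    return sorted(items, key=lambda pair: pair[1], reverse=True)[:n]

def dominates(i1, i2, n, theta_dom):
    """ASSUMED definition, not taken from the slides: i1 dominates i2 if at
    least theta_dom of the top-n of the merged interval i1 + i2 comes from i1."""
    merged_top = top_n(i1 + i2, n)
    if not merged_top:
        return False
    ids_in_i1 = {item_id for item_id, _ in i1}
    from_i1 = sum(1 for item_id, _ in merged_top if item_id in ids_in_i1)
    return from_i1 / len(merged_top) >= theta_dom

# Two consecutive intervals on the same attribute, as (item_id, score) pairs.
i1 = [("a", 90), ("b", 80), ("c", 70)]
i2 = [("d", 85), ("e", 40)]
i1_dominates = dominates(i1, i2, 4, 0.7)   # 3 of the merged top-4 are from i1
i2_dominates = dominates(i2, i1, 4, 0.7)   # only 1 of the merged top-4 is from i2
```

Under this reading a higher θdom makes dominance harder to establish, which agrees with the threshold range (0.5, 1] on the slide: at most one of two intervals can dominate.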
  • 15. Property 1: Tightness
      R: [income, …]; top-ranked items: 38 years old, 36 years old
      Descriptions compared: "age: 35-39, edu: PhD" vs. "age: 30-39, edu: PhD"
      I1: age ∈ [30,34], I2: age ∈ [35,39], I1 + I2: age ∈ [30,39]
      if I1 dominates I2, then add I1 and I2 to the search space
      else add I1, I2, and I1 + I2 to the search space
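The search-space rule on this slide can be written as a small Python helper. This is a sketch that follows the rule literally as stated; the dominance test is passed in by the caller, and the name `merge_candidates` and the (low, high) tuple representation are assumptions for illustration.

```python
def merge_candidates(i1, i2, dominates):
    """Apply the slide's rule to two consecutive intervals on one attribute.

    i1, i2: (low, high) bounds, with i1 directly before i2.
    dominates: callable(i1, i2) -> bool, supplied by the caller.
    Returns the intervals to add to the search space.
    """
    candidates = [i1, i2]
    if not dominates(i1, i2):
        # No dominance: the merged interval I1 + I2 is also worth exploring.
        candidates.append((i1[0], i2[1]))
    return candidates

# With dominance, the wider interval is left out of the search space.
with_dominance = merge_candidates((30, 34), (35, 39), lambda a, b: True)
without_dominance = merge_candidates((30, 34), (35, 39), lambda a, b: False)
```

Leaving out the merged interval when one side dominates is what keeps descriptions tight: a cluster labelled "age: 30-39" whose top-ranked items all fall in one half would describe them loosely.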
  • 16. Choose Best from Among Comparable
      R: [income, …]
          – (age: 33-40, income: 126-150K) > ? (age: 33-40, income: 70-100K)
          – (age: 33-40, income: 125-150K) ≠ ? (age: 26-30, income: 75-110K)
      Rank-aware clustering quality will give us this property
  • 17. Ranked Subspaces and Clusters
      A ranked subspace S: {I1, …, Im} is a set of ranked intervals over distinct attributes, e.g., S: { age ∈ [25,30], edu = MBA }
      • interpreted as a conjunction of predicates over dataset D
      • dimensionality = number of intervals
      Goal: find subspaces that have sufficient rank-aware clustering quality
      All rank-aware clustering quality measures
          – compare the top-N list of a ranked subspace to the top-N lists of its constituent ranked intervals
          – are defined for a ranking function R, an integer N, and a quality threshold θQ ∈ (0.5, 1]
  • 18. Property 2: Rank-Aware Clustering Quality
      R: income, N = 3, θQ = 2/3
      Ranked intervals (items in rank order):
          age: 25-29    edu: BS     age: 30-34
          m1 99K        m1 99K      m6 125K
          m3 90K        m2 95K      m8 110K
          m7 75K        m3 90K      m10 100K
          m9 65K        m4 85K      m2 95K
                        m5 85K      m4 85K
      Ranked subspaces:
          age: 25-29, edu: BS    age: 30-34, edu: BS
          m1 99K                 m2 95K
          m3 90K                 m4 85K
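The example on this slide can be replayed in Python. The sketch assumes one plausible set-based reading of the quality measure: the fraction of an interval's top-N that also appears in the subspace's top-N, minimised over the constituent intervals. That formula, the name `q_top_n`, and the interval memberships (read off the flattened slide table) are assumptions, not the authors' definition.

```python
from fractions import Fraction

# Incomes in $K, as listed on the slide.
INCOME = {"m1": 99, "m2": 95, "m3": 90, "m4": 85, "m5": 85,
          "m6": 125, "m7": 75, "m8": 110, "m9": 65, "m10": 100}
AGE_25_29 = {"m1", "m3", "m7", "m9"}
AGE_30_34 = {"m6", "m8", "m10", "m2", "m4"}
EDU_BS = {"m1", "m2", "m3", "m4", "m5"}

def top_n(items, n):
    """Top-n item ids by income, highest first (ties broken by id)."""
    return set(sorted(items, key=lambda m: (-INCOME[m], m))[:n])

def q_top_n(intervals, n):
    """ASSUMED measure: smallest overlap, as a fraction of n, between the
    subspace's top-n and the top-n of each constituent interval."""
    subspace = set.intersection(*intervals)      # conjunction of the predicates
    sub_top = top_n(subspace, n)
    return min(Fraction(len(sub_top & top_n(iv, n)), n) for iv in intervals)

young_bs = q_top_n([AGE_25_29, EDU_BS], 3)   # m1 and m3 are in both top-3 lists
older_bs = q_top_n([AGE_30_34, EDU_BS], 3)   # m2, m4 miss the top-3 of age 30-34
```

Under this reading the subspace "age: 25-29, edu: BS" scores 2/3 while "age: 30-34, edu: BS" scores 0, so only the first is a candidate cluster; whether 2/3 passes a threshold of 2/3 depends on whether the comparison is strict.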
  • 19. Rank-Aware Clustering Quality Measures
      • QtopN: subspace contains > θQ items from the top-N of its intervals
          – Considers top-N lists as sets
      • QSCORE: subspace contains > θQ high-scoring items from the top-N of its intervals
          – Based on the sums of scores of top-N items
      • QSCORE & RANK: subspace contains > θQ high-scoring, high-ranking items from the top-N of its intervals
          – Based on NDCG, incorporates both scores and ranks
      • Clustering quality measures must exhibit downward closure
          – Quality of a subspace is no higher than the quality of its included subspaces
          – Holds trivially for density-based measures, due to set properties
          – Also holds for our measures, details omitted here
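For readers unfamiliar with NDCG, which the QSCORE & RANK measure is based on, here is the standard formula as a short Python sketch. How the measure applies NDCG to top-N lists is not specified on the slide, so only plain NDCG is shown; the function names are illustrative.

```python
import math

def dcg(scores):
    """Discounted cumulative gain: each score is discounted by log2(rank + 1)."""
    return sum(s / math.log2(rank + 1) for rank, s in enumerate(scores, start=1))

def ndcg(scores):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1])    # already in the ideal order
reversed_ = ndcg([1, 2, 3])  # best item ranked last, so the value drops below 1
```

Because the discount grows with rank, NDCG rewards lists that place high-scoring items early, which is why it captures both scores and ranks.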
  • 20. Property 3: Maximality
      Avoid producing redundant clusters
          – 2-dimensional: (age: 25-40, edu: PhD), (edu: PhD, income: 100-130K), (age: 25-40, income: 100-130K)
          – 3-dimensional: (age: 25-40, edu: PhD, income: 100-130K)
      Maximality will give us this property; it comes for free with bottom-up subspace clustering
  • 21. BARAC Recap
      • BuildGrid
          – split each dimension into intervals
          – compute top-N for each interval
      • Merge
          – merge neighboring intervals using rank-aware locality (interval dominance); ensures tightness
      • Join
          – build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality; ensures maximality and rank-aware quality
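The Join step resembles candidate generation in bottom-up subspace clustering. The Python sketch below assumes that "compatible" means two (K-1)-dimensional clusters that share K-2 intervals and differ on distinct attributes, and that downward closure is used for pruning; the slides do not spell out either detail. The quality test is supplied by the caller, and the name `join` is illustrative.

```python
from itertools import combinations

def join(clusters, has_quality):
    """Build K-dimensional clusters from (K-1)-dimensional ones.

    clusters: set of frozensets of (attribute, interval) pairs, all of size K-1.
    has_quality: callable(candidate) -> bool, the rank-aware quality test.
    """
    result = set()
    for a, b in combinations(clusters, 2):
        candidate = a | b
        if len(candidate) != len(a) + 1:
            continue                    # must share exactly K-2 intervals
        attrs = [attr for attr, _ in candidate]
        if len(set(attrs)) != len(attrs):
            continue                    # two intervals on one attribute: incompatible
        # Downward closure: every (K-1)-dimensional projection must be a cluster.
        closed = all(candidate - {part} in clusters for part in candidate)
        if closed and has_quality(candidate):
            result.add(candidate)
    return result

one_d = {frozenset({("age", "25-40")}),
         frozenset({("edu", "PhD")}),
         frozenset({("income", "100-130K")})}
two_d = join(one_d, lambda c: True)     # three 2-dimensional candidates
three_d = join(two_d, lambda c: True)   # one 3-dimensional candidate
```

Reporting only clusters that were not extended at the next level is one way maximality falls out of the bottom-up search.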
  • 22. Complexity of BARAC
      • Polynomial in input size, exponential in the number of attributes
      • Exponential dependency is unavoidable!
          – Even counting distinct maximal frequent itemsets is #P-complete
      • Example
          – 1 item for each combination of attribute values
          – each item has an arbitrary distinct score
          – find rank-aware clusters with QtopN, N = 1
          – there is 1 cluster per item, so an exponential number of clusters!
      • But lower in practice
          – correlations are local
          – clustering quality requires 50% overlap at top-N
  • 23. Roadmap
      • Introduction
      • Rank-aware clustering
          – The formalism
          – The BARAC algorithm
      ➞ Experimental evaluation
          – Effectiveness
          – Efficiency
      • Conclusion
  • 24. Experimental Dataset: Yahoo! Personals
      • Data and users
          – 5 weeks, 454 users, 861 searches
          – 19 filtering attributes, 17 clustering attributes, 6 ranking attributes
          – Filtering on attributes, user-specified
          – Filtering on geo location (only for effectiveness evaluation)
          – QtopN clustering quality metric
      • Ranking function: weighted sum
          – sum of normalized per-attribute distances from best attribute value from among matches
          – attributes: age, height, body type, education, income, religious services
          – personalized by user: choice of attributes, sort order, normalization
  • 25. Evaluation of Effectiveness: User Study
      Treatments, by content and presentation:
          content     presentation: list    presentation: groups
          top-100     top list              top groups
          BARAC       BARAC list            BARAC groups
  • 28. Effectiveness Metrics and Results
      • Users may fave matches and / or groups
          – When a group is faved, all matches in that group are faved
      • A productive search has at least 1 faved match/group
          treatment      % prod. searches   num. faves per search   num. faves per prod. search
          top list       17                 0.84                    5.05
          top group      14                 0.87                    7.33 / 1.17 groups
          BARAC list     15                 0.74                    4.93
          BARAC group    20                 1.55                    12.38 / 1.91 groups
  • 29. Evaluation of Efficiency
      • Summary of results: BARAC is scalable
          – runtimes of BuildGrid and Join dominate performance
          – runtime of Merge is negligible
      • All reported results are over the complete set of female profiles in Yahoo! Personals, without any location-based filtering!
  • 30. Evaluation of Efficiency
      [Plot: runtime of BuildGrid (ms), 0 to 8000, vs. # items, 0 to 500,000]
  • 31. Evaluation of Efficiency
      [Plot: runtime of Join (ms), 0 to 3500, vs. # clustering dimensions, 2 to 17]
  • 32. Performance of Join
      [Plot: runtime of Join (ms), 0 to 600, vs. quality threshold, 0.5 to 1; one series per dimensionality, 3D to 9D]
      * results for 100 Yahoo! Personals users on the full Y!P dataset.
  • 33. Performance of Join
      [Plot: runtime of Join (ms), 0 to 1000, vs. dominance threshold, 0.5 to 1; one series per dimensionality, 3D to 9D]
      * results for 100 Yahoo! Personals users on the full Y!P dataset.
  • 34. Roadmap
      • Introduction
      • Rank-aware clustering
          – The formalism
          – The BARAC algorithm
      • Experimental evaluation
          – Effectiveness
          – Efficiency
      ➞ Conclusion
  • 35. Rank-Aware Clustering: Recap
      • Formalized rank-aware clustering, a novel data exploration paradigm
      • Developed a rank-aware measure of locality and a family of rank-aware clustering quality measures
      • Proposed BARAC: a bottom-up algorithm for rank-aware clustering
      • Presented an experimental evaluation on Yahoo! Personals (also restaurants in Yahoo! Local)
          • Effectiveness
          • Efficiency
      [Illustrations: example clusters (age: 18-25, edu: BS, MS, inc: 50-75K), (age: 33-40, inc: 126-150K), (age: 26-30, inc: 75-110K); plot of BuildGrid runtime (ms) vs. # items]
  • 36. Related Work
      • Subspace clustering
          – CLIQUE [Agrawal et al, 1998], ENCLUS [Cheng et al, 1999]
          – Improvements [Nagesh, 1999], [Liu et al, 2000], [Chang and Jin, 2002]
      • Ranking of structured data
          – Many answers, empty answer problems [Chaudhuri et al, 2004], [Agrawal et al, 2003]
          – Rank-aware attribute selection [Das et al, 2006]
      • Integrating ranking with clustering
          – Mixture model, mutual reinforcement between ranking and clustering, for heterogeneous information networks, e.g., DBLP [Sun et al, 2009]
      • Diversification
          – Web search [Agichtein et al, 2007], [Anagnostopoulos et al, 2005], [Kummamuru et al, 2004], …
          – Database queries [Chen and Li, 2007], [Vee et al, 2008]
          – Recommendation [Boim et al, 2011], [Yu et al, 2009]
  • 37. Future Work: Choosing a Clustering Quality Measure
      [Plot: score, 0 to 12, vs. rank, 0 to 100; series: attribute-rank, geo-rank]
  • 39. Take 1: Density-Based Clustering
      min density = 2
      Grid intervals:
          – age: 18-25, 26-30, 31-35, 36-40
          – income: 50-75K, 76-100K, 101-125K, 126-150K
  • 40. Take 1: Density-Based Clustering
      min density = 2
      Merged intervals:
          – age: 18-30, 31-35, 36-40
          – income: 50-75K, 76-100K, 101-150K
      Clusters: (age: 18-30, income: 50-75K), (age: 36-40, income: 101-150K)
  • 41. Take 2: A Lower Threshold?
      min density = 1
      Grid intervals:
          – age: 18-25, 26-30, 31-35, 36-40
          – income: 50-75K, 76-100K, 101-125K, 126-150K
  • 42. Take 2: A Lower Threshold?
      density > 0
      Merged intervals: age: 18-40; income: 50-150K
      Cluster: (age: 18-40; income: 50-150K)
  • 43. Performance of BARAC
      [Chart: y-axis 0% to 100%; x-axis runtime bounds <30 sec, <20 sec, <15 sec, <10 sec, <5 sec, <1 sec; series: BuildGrid, Join, Total]
      * results for 100 Yahoo! Personals users on the full Y!P dataset.