Julia Stoyanovich - Making interval-based clustering rank-aware

Making Interval-Based Clustering Rank-Aware

Julia Stoyanovich (University of Pennsylvania)

joint work with
Sihem Amer-Yahia (Qatar Foundation) and Tova Milo (Tel Aviv
University)

Yandex, 23.08.2011

Research Directions
• Representation of Large Complex Datasets
– Symmetric relationships [VLDB 2004]
– Faceted databases [VLDB 2005, Internet Archaeology 2007]
– Schema polynomials [EDBT 2008]
– Probabilistic databases [ICDE 2011]
– Scientific workflows with provenance [CIDR 2011, ICDT 2011]

• Information Discovery in Large Complex Datasets
– Search and ranking in social context [VLDB 2008, AAAI-SIP 2008, SIGMOD 2008]
– Ranked data exploration in semantic context [ICDE 2010, SIGMOD 2011]
– Rank-aware clustering [CIKM 2009, EDBT 2011]

– Exploring repositories of scientific workflows [WANDS 2010, AMW 2011]
– Exploring repositories of functional genomics experiments [submitted]
– Estimating susceptibility to genetic disorders [Bioinformatics 2007]


Applications and Prototypes

• The Faceted Query Engine applied to archaeology

• Biological data management
– MutaGeneSys – estimating individual genetic disease susceptibility
– AnnotCompute – exploring repositories of microarray experiments
– SkylineSearch – semantic ranking and result visualization for PubMed
– myExperiment topics – exploring repositories of scientific workflows

• “Shopping and dating”
– Yahoo! Garçon – a collaborative tagging recommender system
– Yahoo! FindLove – rank-aware clustering for dating data


Ranked Exploration of Structured Datasets

Dating service user Mike

• Find matches
– age: [18,40]
– education: at least some college
– income: > $50,000 / year

• Rank by income from higher to lower

• Problems
– too many results
– results are homogeneous at top ranks, due to correlations among attributes!
– correlations may be complex, depend on the selection criteria and on the ranking function

[Figure: ranked result list. Four matches "MBA, 40 years old, makes $150K"; … 999 matches; "PhD, 36 years old, makes $100K"; … 9999 matches; "BS, 27 years old, makes $80K"]

An Example from Yahoo! Personals

[Figure: plot with two series, "income > $50K" and "edu > BS"]

Observe that
1. % of women with income > $50K increases with age
2. % of women with post-graduate education increases until age 29, then plateaus
There is a clear positive correlation between
1. age and income, for all ages
2. education and income, at least until age 29

Correlations are local

Goal: Find Clusters that Correlate with Ranking

[Figure: example clusters]
– age: 18-25, edu: BS, MS, income: 50-75K
– age: 26-37, edu: PhD, income: 100-130K
– age: 33-40, income: 125-150K
– edu: MS, income: 50-75K
– age: 26-30, income: 75-110K

Roadmap

• Introduction

➞ Rank-aware clustering
– The formalism
– The BARAC algorithm

• Experimental evaluation
– Effectiveness
– Efficiency

• Conclusion


What Is Subspace Clustering?

Parsons et al., SIGKDD Explorations 6(1), 2004


Why Do We Need Subspace Clustering?

Parsons et al., SIGKDD Explorations 6(1), 2004


How Do We Find Subspace Clusters?

• Finds clusters in multiple, possibly overlapping, subspaces
– Dimensionality reduction per cluster
– Lower-dimensional clusters are easier to identify and their
descriptions are more palatable to the users
– Example: “age 20-25” and “edu = BS” and “income 25K-50K”

• Two main approaches
– Top-down: start with full dimensionality and refine
– Bottom-up: start with dense units in 1D,
combine to find higher-dimensional clusters

• Issues
– What is a cluster? – need a measure of quality
– How do we find clusters? – need a search strategy
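
A minimal sketch of the bottom-up approach, assuming a plain density measure (a unit is "dense" when it holds at least min_density items) rather than any rank-aware quality. The function names, the sample profiles, and the income encoding in $K are hypothetical; the interval grid reuses labels that appear in the deck's density-based example.

```python
from collections import Counter
from itertools import combinations

def in_unit(item, unit):
    """True if the item's value for the unit's attribute lies in its interval."""
    attr, (lo, hi) = unit
    return lo <= item[attr] <= hi

def dense_units_1d(items, grid, min_density):
    """Return the 1-D units (attribute, interval) that hold >= min_density items."""
    counts = Counter(
        (attr, interval)
        for item in items
        for attr, intervals in grid.items()
        for interval in intervals
        if in_unit(item, (attr, interval))
    )
    return {unit for unit, count in counts.items() if count >= min_density}

def dense_units_2d(items, units_1d, min_density):
    """Pair dense 1-D units on different attributes and keep the dense pairs."""
    dense = set()
    for u, v in combinations(sorted(units_1d), 2):
        if u[0] == v[0]:
            continue  # two intervals on the same attribute never combine
        support = sum(1 for item in items if in_unit(item, u) and in_unit(item, v))
        if support >= min_density:
            dense.add((u, v))
    return dense

# Hypothetical profiles; income is in $K.
profiles = [
    {"age": 22, "income": 60},
    {"age": 24, "income": 70},
    {"age": 28, "income": 90},
    {"age": 33, "income": 110},
    {"age": 38, "income": 140},
    {"age": 39, "income": 130},
]
grid = {
    "age": [(18, 25), (26, 30), (31, 35), (36, 40)],
    "income": [(50, 75), (76, 100), (101, 125), (126, 150)],
}
units = dense_units_1d(profiles, grid, min_density=2)
clusters = dense_units_2d(profiles, units, min_density=2)
```

Only units that are already dense in 1D are combined, which is what keeps the search space small in the bottom-up strategy.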


Problem Statement

• User specifies a conjunction of filtering conditions, e.g.,

Q: age ∈ [20,40] ∧ edu ≥ Bachelors

• User specifies a ranking function, e.g., linear combination

R: [income, ↓], [age, …]
We do not restrict the set of ranking functions, but assume that ranking is
derived from, or correlates with, attribute values

⇒ Given a query Q and a ranking function R, find rank-aware clusters
in subspaces of the dataset. Clusters are subspaces that:
• have sufficient rank-aware quality
• are tight
• are maximal
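
The inputs to the problem can be sketched as two plain functions: a filter for Q and a score for R. This is an illustration only. It assumes Q reads "age in [20, 40] and education at least Bachelors" and that R ranks by income from higher to lower; the ordinal encoding EDU_RANK, the profile records, and all function names are hypothetical.

```python
# Hypothetical ordinal encoding of education, lowest to highest.
EDU_RANK = {"HS": 0, "BS": 1, "MBA": 2, "MS": 2, "PhD": 3}

def q(profile):
    """Q: age in [20, 40] AND edu at least Bachelors (assumed reading of Q)."""
    return 20 <= profile["age"] <= 40 and EDU_RANK[profile["edu"]] >= EDU_RANK["BS"]

def r(profile):
    """R: score by income; a higher income ranks first."""
    return profile["income"]

def ranked_matches(profiles, query, score):
    """Filter with the query, then order the matches by descending score."""
    return sorted(filter(query, profiles), key=score, reverse=True)

# Hypothetical profiles; income is in $K.
profiles = [
    {"id": "p1", "edu": "BS", "age": 27, "income": 80},
    {"id": "p2", "edu": "MBA", "age": 40, "income": 150},
    {"id": "p3", "edu": "PhD", "age": 36, "income": 100},
    {"id": "p4", "edu": "HS", "age": 30, "income": 120},  # fails the edu condition
    {"id": "p5", "edu": "MS", "age": 45, "income": 200},  # fails the age condition
]
result = [p["id"] for p in ranked_matches(profiles, q, r)]
```

Rank-aware clustering then operates on the ranked matches, not on the unfiltered dataset.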


BARAC: Bottom-up Algorithm for Rank-Aware Clustering

• BuildGrid
– split each dimension into intervals
– compute top-N for each interval

• Merge
– merge neighboring intervals using rank-aware locality (interval dominance)

ensures tightness

• Join
– build K-dimensional clusters from compatible (K-1)-dimensional clusters
using rank-aware clustering quality

ensures maximality and rank-aware quality
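
The BuildGrid step might look roughly as follows. This is a sketch under assumptions, not the authors' implementation: intervals are taken as half-open [lo, hi) ranges over numeric attributes, the boundaries are supplied by the caller, and the names build_grid and edges are hypothetical.

```python
import heapq

def build_grid(items, edges, score, n):
    """Split each attribute at the given edges and keep the top-n items per interval.

    Intervals are half-open [lo, hi); edges maps attribute -> sorted boundary list.
    """
    grid = {}
    for attr, bounds in edges.items():
        for lo, hi in zip(bounds, bounds[1:]):
            members = [it for it in items if lo <= it[attr] < hi]
            grid[(attr, lo, hi)] = {
                "items": members,
                "top_n": heapq.nlargest(n, members, key=score),
            }
    return grid

# Hypothetical profiles; income is in $K.
profiles = [
    {"id": "a", "age": 20, "income": 60},
    {"id": "b", "age": 23, "income": 70},
    {"id": "c", "age": 24, "income": 55},
    {"id": "d", "age": 27, "income": 90},
    {"id": "e", "age": 29, "income": 80},
]
grid = build_grid(profiles, {"age": [18, 25, 30]}, score=lambda p: p["income"], n=2)
top_ids = {key: [p["id"] for p in cell["top_n"]] for key, cell in grid.items()}
```

Each grid cell keeps both its full contents and its top-N list, since the later Merge and Join steps compare top-N lists.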


Avoiding Match Homogeneity at Top Ranks
Cluster descriptions must accurately describe the top-N items

[Figure: the ranked result list (four matches "MBA, 40 years old, makes $150K"; … 999 matches; "PhD, 36 years old, makes $100K"; … 9999 matches; "BS, 27 years old, makes $80K") shown with two cluster descriptions: "age: 25-40, income: 75-150K" and "age: 40, income: 150K"]

Tightness will give us this property

Ranked Intervals and Interval Dominance
• Ranked intervals: description, contents (items), top-N
– I1: age ∈ [25,30], I2: edu = MBA
• Interval dominance is a rank-aware measure of locality, defined
– over 2 consecutive intervals on the same attribute
– for a ranking function R, integer N, and dominance threshold θdom ∈ (0.5, 1]

I1 dominates I2 if [formula not recoverable]

Example (top-10): I1: age ∈ [20,24], I2: age ∈ [25,29], I1 + I2: age ∈ [20,29]

ranking function               dominance
R1: age (asc)                  I2 <_{10,1} I1
R2: 0.3inc + 0.7edu (desc)     I1 <_{10,0.8} I2
R3: rel serv (asc)             I1 <>_{10,0.5} I2
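
The slide's formula for dominance is not preserved, so the sketch below rests on an assumption that is consistent with the three examples: an interval dominates its neighbor when at least a θdom share of the top-N of the merged interval comes from it. The function name, the (id, item) representation, and the sample data are hypothetical.

```python
from fractions import Fraction

def dominance(i1, i2, score, n, theta):
    """Compare two neighboring intervals, given as lists of (item_id, item) pairs.

    Assumed rule: an interval dominates its neighbor when at least a theta
    share of the top-n of the merged interval comes from it.
    Returns "I1", "I2", or None when neither dominates.
    """
    merged = sorted(i1 + i2, key=lambda pair: score(pair[1]), reverse=True)[:n]
    if not merged:
        return None
    ids_1 = {item_id for item_id, _ in i1}
    from_i1 = sum(1 for item_id, _ in merged if item_id in ids_1)
    share_1 = Fraction(from_i1, len(merged))  # exact arithmetic, no float rounding
    if share_1 >= theta:
        return "I1"
    if 1 - share_1 >= theta:
        return "I2"
    return None

# Hypothetical data: 10 profiles aged 20-24 (I1) and 10 aged 25-29 (I2).
i1 = [(k, {"age": 20 + k // 2}) for k in range(10)]
i2 = [(k, {"age": 25 + (k - 10) // 2}) for k in range(10, 20)]

youngest_first = dominance(i1, i2, lambda p: -p["age"], n=10, theta=Fraction(1))
oldest_first = dominance(i1, i2, lambda p: p["age"], n=10, theta=Fraction(1))
mixed = dominance(i1, i2, lambda p: -(p["age"] % 5), n=8, theta=Fraction(4, 5))
```

Passing the threshold as a Fraction keeps boundary cases such as a share of exactly 4/5 from being decided by floating-point rounding.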

Property 1: Tightness
R: [income, ↓]

[Figure: two matches, 38 years old and 36 years old; cluster description "age: 35-39, edu: PhD" ⇒ "age: 30-39, edu: PhD"]

I1: age ∈ [30,34]   I2: age ∈ [35,39]   I1 + I2: age ∈ [30,39]

if I1 dominates I2, then add I1 and I2 to the search space
else add I1, I2, and I1 + I2 to the search space

Choose Best from Among Comparable
R: [income, ↓]

[Figure: pairwise comparison of clusters. "age: 33-40" > ? "age: 33-40"; "age: 33-40" ≠ ? "age: 26-30"]

Rank-aware clustering quality will give us this property

Ranked Subspaces and Clusters

A ranked subspace S: {I1, …, Im} is a set of ranked intervals over distinct
attributes, e.g., S: { age ∈ [25,30], edu = MBA }
• interpreted as a conjunction of predicates over dataset D
• dimensionality = number of intervals

Goal: find subspaces that have sufficient rank-aware clustering quality

All rank-aware clustering quality measures
– compare the top-N list of a ranked subspace to the top-N lists of its
constituent ranked intervals
– are defined for a ranking function R, an integer N, and a quality
threshold θQ ∈ (0.5, 1]


Property 2: Rank-Aware Clustering Quality
R: income ↓, N = 3, θQ = 2/3

Ranked intervals (top-3 above the line):
age: 25-29    edu: BS     age: 30-34
m1 99K        m1 99K      m6 125K
m3 90K        m2 95K      m8 110K
m7 75K        m3 90K      m10 100K
---------------------------------
m9 65K        m4 85K      m2 95K
                          m4 85K
                          m5 85K

⇒ Ranked subspaces:
age: 25-29, edu: BS    age: 30-34, edu: BS
m1 99K                 m2 95K
m3 90K                 m4 85K
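
The example above can be replayed in a few lines. The sketch assumes one plausible reading of the set-based measure: the quality of a subspace is the smallest share, over its intervals, of the interval's top-N that also appears in the subspace's top-N. That formula, the function names, and the placement of m5 in the age: 30-34 column are assumptions made for illustration.

```python
from fractions import Fraction

def top_n(scores, n):
    """Ids of the n highest-scoring items; scores maps item id -> score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def conjunction(*intervals):
    """Items that belong to every interval (the subspace's contents)."""
    common = set.intersection(*(set(iv) for iv in intervals))
    return {item: intervals[0][item] for item in common}

def q_top_n(intervals, n):
    """Assumed measure: smallest share of an interval's top-n found in the subspace's top-n."""
    sub_top = top_n(conjunction(*intervals), n)
    return min(Fraction(len(sub_top & top_n(iv, n)), n) for iv in intervals)

# Scores are incomes in $K, as read from the slide (m5's column is an assumption).
age_25_29 = {"m1": 99, "m3": 90, "m7": 75, "m9": 65}
edu_bs = {"m1": 99, "m2": 95, "m3": 90, "m4": 85}
age_30_34 = {"m6": 125, "m8": 110, "m10": 100, "m2": 95, "m4": 85, "m5": 85}

young_bs = q_top_n([age_25_29, edu_bs], n=3)
older_bs = q_top_n([age_30_34, edu_bs], n=3)
```

Under this reading, "age: 25-29, edu: BS" keeps two of the three top items of each of its intervals, while "age: 30-34, edu: BS" keeps none of the top-3 of "age: 30-34".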


Rank-Aware Clustering Quality Measures
• QtopN: subspace contains > θQ items from the top-N of its intervals
– Considers top-N lists as sets

• QSCORE: subspace contains > θQ high-scoring items from the top-N of
its intervals
– Based on the sums of scores of top-N items

• QSCORE&RANK: subspace contains > θQ high-scoring, high-ranking items
from the top-N of its intervals
– Based on NDCG, incorporates both scores and ranks

• Clustering quality measures must exhibit downward closure
– Quality of a subspace is no higher than the quality of its included subspaces
– Holds trivially for density-based measures, due to set properties
– Also holds for our measures, details omitted here


Property 3: Maximality
Avoid producing redundant clusters

[Figure: cluster descriptions "age: 25-40, edu: PhD" (shown twice), "edu: PhD", and "age: 25-40, income: 100-130K"]

Maximality will give us this property; it
comes for free with bottom-up subspace clustering

BARAC Recap

• BuildGrid
– split each dimension into intervals
– compute top-N for each interval

• Merge
– merge neighboring intervals using rank-aware locality (interval dominance)

ensures tightness

• Join
– build K-dimensional clusters from compatible (K-1)-dimensional clusters
using rank-aware clustering quality

ensures maximality and rank-aware quality
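
Candidate generation in the Join step can be sketched in the style of Apriori. The sketch assumes that two (K-1)-dimensional clusters are "compatible" when they share K-2 intervals and their remaining intervals are on different attributes, and that a candidate survives only if all of its (K-1)-dimensional subsets are clusters (downward closure). The rank-aware quality check that would follow is omitted; join_candidates and the sample intervals are hypothetical.

```python
from itertools import combinations

def join_candidates(clusters):
    """Build K-dimensional candidates from (K-1)-dimensional clusters.

    Each cluster is a frozenset of (attribute, interval) pairs of the same size.
    """
    clusters = set(clusters)
    candidates = set()
    for a, b in combinations(clusters, 2):
        cand = a | b
        if len(cand) != len(a) + 1:
            continue  # the two clusters do not share K-2 intervals
        attrs = [attr for attr, _ in cand]
        if len(set(attrs)) != len(attrs):
            continue  # two intervals on the same attribute
        # Downward closure: every (K-1)-dimensional subset must be a cluster.
        if all(cand - {unit} in clusters for unit in cand):
            candidates.add(cand)
    return candidates

age_a = ("age", (25, 29))
age_b = ("age", (30, 34))
edu = ("edu", "BS")
inc = ("income", (75, 110))

one_d = {frozenset({u}) for u in (age_a, age_b, edu, inc)}
two_d = join_candidates(one_d)

kept = {frozenset({age_a, edu}), frozenset({age_a, inc}), frozenset({edu, inc})}
three_d = join_candidates(kept)
pruned = join_candidates({frozenset({age_a, edu}), frozenset({age_a, inc})})
```

In the last call the candidate {age, edu, income} is dropped because {edu, income} is not a cluster, which is the pruning that downward closure allows.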


Complexity of BARAC

• Polynomial in input size, exponential in the number of attributes

• Exponential dependency is unavoidable!
– Even counting distinct maximal frequent itemsets is #P-complete

• Example
– 1 item for each combination of attribute values
– each item has an arbitrary distinct score
– find rank-aware clusters with QtopN, N = 1
– there is 1 cluster per item, so an exponential number of clusters!
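
To make the size of this example concrete, the sketch below only builds the worst-case dataset and counts its items; it does not run the clustering itself. The function name is hypothetical, and using the enumeration index as the score is just one way to get arbitrary distinct scores.

```python
from itertools import product

def one_item_per_combination(domains):
    """Build the worst-case dataset: one item for every combination of attribute values."""
    attrs = sorted(domains)
    combos = product(*(domains[a] for a in attrs))
    # Any distinct scores would do; the enumeration index is one such choice.
    return [dict(zip(attrs, combo), score=i) for i, combo in enumerate(combos)]

# With d binary attributes the dataset, and hence the number of clusters in the
# slide's argument, has 2**d entries.
sizes = [
    len(one_item_per_combination({f"a{k}": [0, 1] for k in range(d)}))
    for d in range(1, 6)
]
```

The count doubles with every added binary attribute, which is the exponential dependency on the number of attributes that the slide refers to.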

• But lower in practice
– correlations are local
– clustering quality requires 50% overlap at top-N


Roadmap

• Introduction

• Rank-aware clustering
– The formalism

➞ Experimental evaluation
– Effectiveness
– Efficiency

• Conclusion


Experimental Dataset: Yahoo! Personals
• Data and users
– 5 weeks, 454 users, 861 searches
– 19 filtering attributes, 17 clustering attributes, 6 ranking attributes
– Filtering on attributes, user-specified
– Filtering on geo location (only for effectiveness evaluation)
– QtopN clustering quality metric

• Ranking function: weighted sum
– sum of normalized per-attribute distances from best attribute value
from among matches
– attributes: age, height, body type, education, income, religious
services
– personalized by user: choice of attributes, sort order, normalization
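
One possible reading of this ranking function, as a sketch: for every chosen attribute, take the best value among the matches (the minimum for ascending order, the maximum for descending), measure each match's distance from it, normalize by the range of values among the matches, and add up the weighted distances. The range normalization, the weights, and all names below are assumptions; a lower total means a better match.

```python
def distance_scores(matches, prefs):
    """Weighted sum of normalized distances from the best value among the matches.

    prefs maps attribute -> (weight, "asc" | "desc"). Lower totals rank higher.
    """
    totals = [0.0] * len(matches)
    for attr, (weight, order) in prefs.items():
        values = [m[attr] for m in matches]
        best = min(values) if order == "asc" else max(values)
        spread = (max(values) - min(values)) or 1  # avoid division by zero
        for i, value in enumerate(values):
            totals[i] += weight * abs(value - best) / spread
    return totals

# Hypothetical matches (income in $K) and hypothetical user preferences.
matches = [
    {"id": "m1", "age": 40, "income": 150},
    {"id": "m2", "age": 36, "income": 100},
    {"id": "m3", "age": 27, "income": 80},
]
prefs = {"income": (0.7, "desc"), "age": (0.3, "asc")}
totals = distance_scores(matches, prefs)
ranking = [m["id"] for _, m in sorted(zip(totals, matches), key=lambda pair: pair[0])]
```

Because the best value and the range are taken from the matches, the same profile can score differently under different filtering conditions.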


Evaluation of Effectiveness: User Study

                 presentation
content          list            groups
top-100          top list        top groups
BARAC            BARAC list      BARAC groups

Effectiveness Metrics and Results
• Users may fave matches and / or groups
– When a group is faved, all matches in that group are faved

• A productive search has at least 1 faved match/group

treatment      % prod. searches   num. faves per search   num. faves per prod. search
top list       17                 0.84                    5.05
top group      14                 0.87                    7.33 / 1.17 groups
BARAC list     15                 0.74                    4.93
BARAC group    20                 1.55                    12.38 / 1.91 groups

Evaluation of Efficiency
• Summary of results: BARAC is scalable
– runtimes of BuildGrid and Join dominate performance
– runtime of Merge is negligible

• All reported results are over the complete set of female profiles
in Yahoo! Personals, without any location-based filtering!



[Figure: runtime of BuildGrid (ms) vs. # items; x-axis 0 to 500000, y-axis 0 to 8000]


[Figure: runtime of Join (ms) vs. # clustering dimensions; x-axis 2 to 17, y-axis 0 to 3500]

Performance of Join

[Figure: Join performance vs. quality threshold (0.5 to 1), one series per dimensionality from 3D to 9D; y-axis 0 to 600, unlabeled]

* results for 100 Yahoo! Personals users on the full Y!P dataset.

Performance of Join

[Figure: Join performance vs. dominance threshold (0.5 to 1), one series per dimensionality from 3D to 9D; y-axis 0 to 1000, unlabeled]

Roadmap

• Introduction

• Rank-aware clustering
– The formalism

• Experimental evaluation
– Effectiveness
– Efficiency

➞ Conclusion


Rank-Aware Clustering: Recap
• Formalized rank-aware clustering, a novel data exploration paradigm

• Developed a rank-aware measure of locality and a family of rank-aware clustering quality measures

• Proposed BARAC: a bottom-up algorithm for rank-aware clustering

• Presented an experimental evaluation on Yahoo! Personals (also restaurants in Yahoo! Local)
– Effectiveness
– Efficiency

[Figures: example clusters (age: 18-25, edu: BS, MS, inc: 50-75K; age: 33-40, inc: 126-150K; age: 26-30, inc: 75-110K) and the plot of BuildGrid runtime (ms) vs. # items]

Related Work

• Subspace clustering
– CLIQUE [Agrawal et al, 1998], ENCLUS [Cheng et al, 1999]
– Improvements [Nagesh, 1999], [Liu et al, 2000], [Chang and Jin, 2002]
• Ranking of structured data
– Many answers, empty answer problems [Chaudhuri et al, 2004], [Agrawal et al,
2003]
– Rank-aware attribute selection [Das et al, 2006]
• Integrating ranking with clustering
– Mixture model, mutual reinforcement between ranking and clustering, for
heterogeneous information networks, e.g., DBLP [Sun et al, 2009]
• Diversification
– Web search [Agichtein et al, 2007], [Anagnostopoulos et al, 2005], [Kummamuru
et al, 2004], …
– Database queries [Chen and Li, 2007], [Vee et al, 2008]
– Recommendation [Boim et al, 2011], [Yu et al, 2009]


Future Work: Choosing a Clustering Quality Measure

[Figure: score (0 to 12) vs. rank (0 to 100) for two series, attribute-rank and geo-rank]

Thank you!


Take 1: Density-Based Clustering

age: 18-25 age: 26-30 age: 31-35 age: 36-40

min density = 2

income: 50-75K income: 76-100K income: 101-125K income: 126-150K


Take 1: Density-Based Clustering

age: 18-30 age: 31-35 age: 36-40

min density = 2

age: 18-30 age: 36-40
income: 50-75K income: 101-150K

income: 50-75K income: 76-100K income: 101-150K


Take 2: A Lower Threshold?

age: 18-25 age: 26-30 age: 31-35 age: 36-40

min density = 1

income: 50-75K income: 76-100K income: 101-125K income: 126-150K


Take 2: A Lower Threshold?

age: 18-40

density > 0

age: 18-40; income: 50-150K

income: 50-150K


Performance of BARAC

[Figure: series BuildGrid, Join, and Total; y-axis 0% to 100%; x-axis runtime buckets <30 sec, <20 sec, <15 sec, <10 sec, <5 sec, <1 sec]
