Julia Stoyanovich - Making interval-based clustering rank-aware
1. Making Interval-Based Clustering Rank-Aware
Julia Stoyanovich (University of Pennsylvania)
joint work with
Sihem Amer-Yahia (Qatar Foundation) and Tova Milo (Tel Aviv
University)
Яндекс 23.08.2011
2. Research Directions
• Representation of Large Complex Datasets
– Symmetric relationships [VLDB 2004]
– Faceted databases [VLDB 2005, Internet Archaeology 2007]
– Schema polynomials [EDBT 2008]
– Probabilistic databases [ICDE 2011]
– Scientific workflows with provenance [CIDR 2011, ICDT 2011]
• Information Discovery in Large Complex Datasets
– Search and ranking in social context [VLDB 2008, AAAI-SIP 2008, SIGMOD 2008]
– Ranked data exploration in semantic context [ICDE 2010, SIGMOD 2011]
– Rank-aware clustering [CIKM 2009, EDBT 2011]
– Exploring repositories of scientific workflows [WANDS 2010, AMW 2011]
– Exploring repositories of functional genomics experiments [submitted]
– Estimating susceptibility to genetic disorders [Bioinformatics 2007]
3. Applications and Prototypes
• The Faceted Query Engine applied to archaeology
• Biological data management
– MutaGeneSys – estimating individual genetic disease susceptibility
– AnnotCompute – exploring repositories of microarray experiments
– SkylineSearch – semantic ranking and result visualization for PubMed
– myExperiment topics – exploring repositories of scientific workflows
• “Shopping and dating”
– Yahoo! Garçon – a collaborative tagging recommender system
– Yahoo! FindLove – rank-aware clustering for dating data
4. Ranked Exploration of Structured Datasets
Dating service user Mike
• Find matches
– age: [18,40]
– education: at least some college
– income: > $50,000 / year
• Rank by income from higher to lower
• Problems
– too many results
– results are homogeneous at top ranks, due to correlations among attributes!
– correlations may be complex, depend on the selection criteria and on the ranking function
[Figure: ranked result list: four matches "MBA, 40 years old, makes $150K" … 999 matches, "PhD, 36 years old, makes $100K" … 9999 matches, "BS, 27 years old, makes $80K"]
5. An Example from Yahoo! Personals
[Figure: % of women with income > $50K and % of women with edu > BS, by age]
Observe that
1. % of women with income > $50K increases with age
2. % of women with post-graduate education increases until age 29, then plateaus
There is a clear positive correlation between
1. age and income, for all ages
2. education and income, at least until age 29
Correlations are local
6. Goal: Find Clusters that Correlate with Ranking
[Figure: example clusters]
• age: 26-37, edu: PhD, income: 100-130K
• age: 18-25, edu: BS, MS, income: 50-75K
• age: 33-40, income: 125-150K
• edu: MS, income: 50-75K
• age: 26-30, income: 75-110K
8. What Is Subspace Clustering?
Parsons et al., SIGKDD Explorations 6(1), 2004
9. Why Do We Need Subspace Clustering?
Parsons et al., SIGKDD Explorations 6(1), 2004
10. How Do We Find Subspace Clusters?
• Finds clusters in multiple, possibly overlapping, subspaces
– Dimensionality reduction per cluster
– Lower-dimensional clusters are easier to identify and their
descriptions are more palatable to the users
– Example: “age 20-25” and “edu = BS” and “income 25K-50K”
• Two main approaches
– Top-down: start with full dimensionality and refine
– Bottom-up: start with dense units in 1D,
combine to find higher-dimensional clusters
• Issues
– What is a cluster? – need a measure of quality
– How do we find clusters? – need a search strategy
11. Problem Statement
• User specifies a conjunction of filtering conditions, e.g.,
Q: age ∈ [20, 40] ∧ edu Bachelors
• User specifies a ranking function, e.g., linear combination
R :[income,],[age,]
We do not restrict the set of ranking functions, but assume that ranking is
derived from, or correlates with, attribute values
Given a query Q and a ranking function R, find rank-aware clusters
in subspaces of the dataset. Clusters are subspaces that:
• have sufficient rank-aware quality
• are tight
• are maximal
12. BARAC: Bottom-up Algorithm for Rank-Aware Clustering
• BuildGrid
– split each dimension into intervals
– compute top-N for each interval
• Merge
– merge neighboring intervals using rank-aware locality (interval dominance); this ensures tightness
• Join
– build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality; this ensures maximality and rank-aware quality
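The BuildGrid step above can be sketched in Python. This is a minimal illustration, not the talk's implementation: it assumes numeric attributes and equal-width intervals (the slides do not say how intervals are chosen), and the names `build_grid`, `num_intervals`, `top_n`, and the sample data are hypothetical.

```python
from collections import defaultdict
import heapq

def build_grid(items, attributes, score, num_intervals=4, top_n=3):
    """Split each numeric attribute into equal-width intervals and keep
    the top_n highest-scoring items of every non-empty interval.

    Returns {(attribute, lo, hi): [items sorted by descending score]}.
    Equal-width splitting is an assumption made for this sketch.
    """
    grid = {}
    for attr in attributes:
        values = [item[attr] for item in items]
        lo, hi = min(values), max(values)
        width = (hi - lo) / num_intervals or 1  # guard against zero width
        buckets = defaultdict(list)
        for item in items:
            # Clamp so the maximum value falls into the last interval.
            idx = min(int((item[attr] - lo) / width), num_intervals - 1)
            buckets[idx].append(item)
        for idx, members in buckets.items():
            key = (attr, lo + idx * width, lo + (idx + 1) * width)
            grid[key] = heapq.nlargest(top_n, members, key=score)
    return grid

# Hypothetical sample data: rank by income, from higher to lower.
people = [
    {"age": 20, "income": 50},
    {"age": 24, "income": 60},
    {"age": 30, "income": 90},
    {"age": 40, "income": 150},
]
grid = build_grid(people, ["age"], score=lambda p: p["income"],
                  num_intervals=2, top_n=1)
```

Each grid cell keeps only its own top-N list, which is what the later Merge and Join steps compare.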
13. Avoiding Match Homogeneity at Top Ranks
Cluster descriptions must accurately describe the top-N items
[Figure: ranked result list: four matches "MBA, 40 years old, makes $150K" … 999 matches, "PhD, 36 years old, makes $100K" … 9999 matches, "BS, 27 years old, makes $80K"; candidate descriptions "age: 25-40, income: 75-150K" and "age: 40, income: 150K"]
Tightness will give us this property
14. Ranked Intervals and Interval Dominance
• Ranked intervals: description, contents (items), top-N
– I1: age ∈ [25,30], I2: edu = MBA
• Interval dominance is a rank-aware measure of locality, defined
– over 2 consecutive intervals on the same attribute
– for a ranking function R, integer N, and dominance threshold θdom ∈ (0.5, 1]
I1 dominates I2 if
Example (top-10): I1: age ∈ [20,24], I2: age ∈ [25,29], I1 + I2: age ∈ [20,29]
– R1: age (asc): I2 <_{10,1} I1
– R2: 0.3inc + 0.7edu (desc): I1 <_{10,0.8} I2
– R3: rel serv (asc): I1 <>_{10,0.5} I2
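The defining formula for dominance did not survive on this slide. One reading that is consistent with the three examples is: I1 dominates I2 when at least a θdom fraction of the top-N of the merged interval I1 + I2 comes from I1. The Python sketch below implements that reading only; the reading itself, and the names `dominance_fraction` and `dominates`, are assumptions.

```python
import heapq

def top_n(items, score, n):
    """Return the n highest-scoring items (ties keep input order)."""
    return heapq.nlargest(n, items, key=score)

def dominance_fraction(i1, i2, score, n):
    """Fraction of the top-n of the merged interval I1 + I2 that comes from I1."""
    merged = [(item, 1) for item in i1] + [(item, 2) for item in i2]
    best = top_n(merged, lambda pair: score(pair[0]), n)
    from_i1 = sum(1 for _, source in best if source == 1)
    return from_i1 / len(best) if best else 0.0

def dominates(i1, i2, score, n, theta_dom):
    """Assumed reading: I1 dominates I2 when at least theta_dom of the
    merged top-n lies in I1."""
    if not 0.5 < theta_dom <= 1:
        raise ValueError("theta_dom must lie in (0.5, 1]")
    return dominance_fraction(i1, i2, score, n) >= theta_dom

# Items are ages; ranking by age ascending means a lower age scores higher.
i1 = [20, 21, 22, 23, 24]
i2 = [25, 26, 27, 28, 29]
by_age_asc = lambda age: -age
fraction = dominance_fraction(i1, i2, by_age_asc, 3)
```

Because θdom is greater than 0.5, at most one of the two intervals can dominate the other under this reading.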
15. Property 1: Tightness
R: [income, ]
I1: age ∈ [30,34], I2: age ∈ [35,39], I1 + I2: age ∈ [30,39]
[Figure: top matches are 38 and 36 years old; candidate descriptions "age: 35-39, edu: PhD" and "age: 30-39, edu: PhD"]
if I1 dominates I2, then add I1 and I2 to the search space
else add I1, I2, and I1 + I2 to the search space
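The rule on this slide can be sketched as a small Python function. It takes the dominance test as a parameter, so nothing about the dominance definition is assumed here. Two things are assumptions: the rule is applied in both directions for each consecutive pair, and only one round of pairwise merging is performed. The name `merge_step` and the sample predicate are hypothetical.

```python
def merge_step(intervals, dominates):
    """Apply the slide's rule to each pair of consecutive intervals.

    intervals: list of (lo, hi) tuples on one attribute, sorted by lo.
    dominates: callable(a, b) -> bool, a caller-supplied dominance test.
    Returns the intervals to place in the search space.
    """
    search_space = list(intervals)  # I1 and I2 are always kept
    for left, right in zip(intervals, intervals[1:]):
        # Assumption: the rule is checked in both directions.
        if dominates(left, right) or dominates(right, left):
            continue  # dominance holds: keep only I1 and I2
        search_space.append((left[0], right[1]))  # also add I1 + I2
    return search_space

# Hypothetical predicate: age [35,39] dominates age [30,34], nothing else does.
only_35_39 = lambda a, b: (a, b) == ((35, 39), (30, 34))
result = merge_step([(30, 34), (35, 39), (40, 44)], only_35_39)
```

Here the merged interval [30,39] is left out, which keeps the tighter description "age: 35-39" from being diluted, while [35,44] is added because neither side dominates.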
16. Choose Best from Among Comparable
R: [income, ]
– {age: 33-40, income: 126-150K} > {age: 33-40, income: 70-100K} ?
– {age: 33-40, income: 125-150K} ≠ {age: 26-30, income: 75-110K} ?
Rank-aware clustering quality will give us this property
17. Ranked Subspaces and Clusters
A ranked subspace S: {I1, …, Im} is a set of ranked intervals over distinct
attributes, e.g., S: { age ∈ [25,30], edu = MBA }
• interpreted as a conjunction of predicates over dataset D
• dimensionality = number of intervals
Goal: find subspaces that have sufficient rank-aware clustering quality
All rank-aware clustering quality measures
– compare the top-N list of a ranked subspace to the top-N lists of its
constituent ranked intervals
– are defined for a ranking function R, an integer N, and a quality
threshold θQ ∈ (0.5, 1]
19. Rank-Aware Clustering Quality Measures
• QtopN: subspace contains > θQ items from the top-N of its intervals
– Considers top-N lists as sets
• QSCORE: subspace contains > θQ high-scoring items from the top-N of
its intervals
– Based on the sums of scores of top-N items
• QSCORE&RANK: subspace contains > θQ high-scoring, high-ranking items
from the top-N of its intervals
– Based on NDCG, incorporates both scores and ranks
• Clustering quality measures must exhibit downward closure
– Quality of a subspace is no higher than the quality of its included subspaces
– Holds trivially for density-based measures, due to set properties
– Also holds for our measures, details omitted here
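A set-based measure in the spirit of QtopN can be sketched in Python as follows. The slides do not give the exact formula, so this is one possible reading: for every constituent interval, compute the fraction of the subspace's top-N that also appears in that interval's top-N, and take the minimum. The aggregation by minimum, the function names, and the sample data are all assumptions.

```python
import heapq

def top_n_ids(items, score, n):
    """Ids of the n highest-scoring items."""
    return {item["id"] for item in heapq.nlargest(n, items, key=score)}

def q_top_n(items, intervals, score, n):
    """intervals: {attribute: predicate on the attribute value}, non-empty.

    Returns the smallest overlap |topN(S) & topN(I)| / n over the
    constituent intervals I of the subspace S (assumed aggregation).
    """
    subspace = [it for it in items
                if all(pred(it[attr]) for attr, pred in intervals.items())]
    sub_top = top_n_ids(subspace, score, n)
    overlaps = []
    for attr, pred in intervals.items():
        interval_items = [it for it in items if pred(it[attr])]
        overlaps.append(len(sub_top & top_n_ids(interval_items, score, n)) / n)
    return min(overlaps)

# Hypothetical data, ranked by income from higher to lower.
people = [
    {"id": 1, "age": 27, "edu": "PhD", "income": 120},
    {"id": 2, "age": 28, "edu": "PhD", "income": 110},
    {"id": 3, "age": 29, "edu": "BS", "income": 130},
    {"id": 4, "age": 45, "edu": "PhD", "income": 200},
]
by_income = lambda p: p["income"]
age_25_30 = lambda age: 25 <= age <= 30
quality_2d = q_top_n(people, {"age": age_25_30, "edu": lambda e: e == "PhD"},
                     by_income, n=2)
quality_1d = q_top_n(people, {"age": age_25_30}, by_income, n=2)
```

In this sample the two-dimensional subspace scores 0.5, so it would not exceed any threshold in (0.5, 1], while the one-dimensional subspace trivially scores 1.0.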
20. Property 3: Maximality
Avoid producing redundant clusters
[Figure: the cluster {age: 25-40, edu: PhD, income: 100-130K} shown with its lower-dimensional projections {age: 25-40, edu: PhD}, {edu: PhD, income: 100-130K}, and {age: 25-40, income: 100-130K}]
Maximality will give us this property
comes for free with bottom-up subspace clustering
21. BARAC Recap
• BuildGrid
– split each dimension into intervals
– compute top-N for each interval
• Merge
– merge neighboring intervals using rank-aware locality (interval dominance); this ensures tightness
• Join
– build K-dimensional clusters from compatible (K-1)-dimensional clusters using rank-aware clustering quality; this ensures maximality and rank-aware quality
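The Join step can be sketched in Python in the style of Apriori candidate generation. The sketch assumes that two (K-1)-dimensional clusters are "compatible" when they share all but one interval and their remaining intervals lie on different attributes; the slides do not define compatibility. The quality test is passed in as a hypothetical callable `has_quality`, and `maximal` shows one way to drop clusters that a higher-dimensional cluster already covers.

```python
from itertools import combinations

def join_step(clusters, has_quality):
    """clusters: set of frozensets of (attribute, interval) pairs, all of size K-1.
    has_quality: callable(frozenset) -> bool, a caller-supplied quality test.
    Returns the K-dimensional candidates that pass has_quality.
    """
    joined = set()
    for a, b in combinations(clusters, 2):
        candidate = a | b
        if len(candidate) != len(a) + 1:
            continue  # the two clusters must share all but one interval
        attrs = [attr for attr, _ in candidate]
        if len(set(attrs)) != len(attrs):
            continue  # two intervals on the same attribute: not a subspace
        if has_quality(candidate):
            joined.add(candidate)
    return joined

def maximal(lower, higher):
    """Keep only lower-dimensional clusters not contained in a higher one."""
    return {c for c in lower if not any(c < h for h in higher)}

# Hypothetical 1-dimensional clusters and a quality test that rejects age 31-40.
one_d = {
    frozenset({("age", (25, 30))}),
    frozenset({("age", (31, 40))}),
    frozenset({("edu", "PhD")}),
}
two_d = join_step(one_d, lambda s: ("age", (31, 40)) not in s)
kept_1d = maximal(one_d, two_d)
```

Only {age 25-30, edu PhD} survives as a two-dimensional cluster, and the only one-dimensional cluster still reported is age 31-40, since the other two are covered.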
22. Complexity of BARAC
• Polynomial in input size, exponential in the number of attributes
• Exponential dependency is unavoidable!
– Even counting distinct maximal frequent itemsets is #P-complete
• Example
– 1 item for each combination of attribute values
– each item has an arbitrary distinct score
– find rank-aware clusters with QtopN, N = 1
– there is 1 cluster per item, so an exponential number of clusters!
• But lower in practice
– correlations are local
– clustering quality requires 50% overlap at top-N
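The size of the dataset in the example above can be checked with a short Python sketch. It only counts items (one per combination of attribute values) and does not run BARAC; the function name and the choice of three values per attribute are assumptions made for illustration.

```python
from itertools import product

def one_item_per_combination(domains):
    """Build one item for each combination of attribute values."""
    attrs = sorted(domains)
    combos = product(*(domains[a] for a in attrs))
    return [dict(zip(attrs, combo)) for combo in combos]

# Three values per attribute, for 1 to 5 attributes.
counts = []
for k in range(1, 6):
    domains = {f"a{i}": [0, 1, 2] for i in range(k)}
    counts.append(len(one_item_per_combination(domains)))
```

With three values per attribute the item count is 3^k, so if each item forms its own cluster, as the example states for N = 1, the number of clusters grows exponentially in the number of attributes.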
24. Experimental Dataset: Yahoo! Personals
• Data and users
– 5 weeks, 454 users, 861 searches
– 19 filtering attributes, 17 clustering attributes, 6 ranking attributes
– Filtering on attributes, user-specified
– Filtering on geo location (only for effectiveness evaluation)
– QtopN clustering quality metric
• Ranking function: weighted sum
– sum of normalized per-attribute distances from best attribute value
from among matches
– attributes: age, height, body type, education, income, religious
services
– personalized by user: choice of attributes, sort order, normalization
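The ranking function described above can be sketched in Python. This is an illustration under stated assumptions: attributes are numeric (categorical ones such as education would need an ordinal encoding), each distance is normalized by the range of values among the matches, and a smaller total means a better rank. The name `rank_matches`, the weights, and the sample data are hypothetical.

```python
def rank_matches(matches, prefs):
    """prefs: {attribute: (weight, 'asc' | 'desc')}.

    Scores each match by the weighted sum of normalized distances from the
    best value among the matches, and returns matches from best to worst.
    """
    def distance_sum(match):
        total = 0.0
        for attr, (weight, order) in prefs.items():
            values = [m[attr] for m in matches]
            lo, hi = min(values), max(values)
            best = lo if order == "asc" else hi
            span = (hi - lo) or 1  # all values equal: every distance is 0
            total += weight * abs(match[attr] - best) / span
        return total
    return sorted(matches, key=distance_sum)

# Hypothetical matches: prefer higher income (weight 0.7) and lower age (0.3).
matches = [
    {"name": "A", "age": 25, "income": 60},
    {"name": "B", "age": 30, "income": 100},
    {"name": "C", "age": 40, "income": 150},
]
ranked = rank_matches(matches, {"income": (0.7, "desc"), "age": (0.3, "asc")})
names = [m["name"] for m in ranked]
```

Because the best value and the normalization are taken from the current matches, the same profile can rank differently under different filtering conditions.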
25. Evaluation of Effectiveness: User Study
content \ presentation   list          groups
top-100                  top list      top groups
BARAC                    BARAC list    BARAC groups
28. Effectiveness Metrics and Results
• Users may fave matches and / or groups
– When a group is faved, all matches in that group are faved
• A productive search has at least 1 faved match/group
treatment     % prod. searches   num. faves per search   num. faves per prod. search
top list      17                 0.84                    5.05
top group     14                 0.87                    7.33 / 1.17 groups
BARAC list    15                 0.74                    4.93
BARAC group   20                 1.55                    12.38 / 1.91 groups
29. Evaluation of Efficiency
• Summary of results: BARAC is scalable
– runtimes of BuildGrid and Join dominate performance
– runtime of Merge is negligible
• All reported results are over the complete set of female profiles
in Yahoo! Personals, without any location-based filtering!
30. Evaluation of Efficiency
[Figure: runtime of BuildGrid (ms) vs. # items; x-axis 0 to 500,000 items, y-axis 0 to 8,000 ms]
31. Evaluation of Efficiency
[Figure: runtime of Join (ms) vs. # clustering dimensions; x-axis 2 to 17 dimensions, y-axis 0 to 3,500 ms]
32. Performance of Join
[Figure: runtime of Join (ms) vs. quality threshold (0.5 to 1), one curve per dimensionality from 3D to 9D; y-axis 0 to 600 ms]
* results for 100 Yahoo! Personals users on the full Y!P dataset.
33. Performance of Join
[Figure: runtime of Join (ms) vs. dominance threshold (0.5 to 1), one curve per dimensionality from 3D to 9D; y-axis 0 to 1,000 ms]
* results for 100 Yahoo! Personals users on the full Y!P dataset.
35. Rank-Aware Clustering: Recap
• Formalized rank-aware clustering, a novel data exploration paradigm
• Developed a rank-aware measure of locality and a family of rank-aware clustering quality measures
• Proposed BARAC: a bottom-up algorithm for rank-aware clustering
• Presented an experimental evaluation on Yahoo! Personals (also restaurants in Yahoo! Local)
• Effectiveness
• Efficiency
[Figures: example clusters (age: 18-25, edu: BS, MS, inc: 50-75K; age: 33-40, inc: 126-150K; age: 26-30, inc: 75-110K) and runtime of BuildGrid (ms) vs. # items]
36. Related Work
• Subspace clustering
– CLIQUE [Agrawal et al, 1998], ENCLUS [Cheng et al, 1999]
– Improvements [Nagesh, 1999], [Liu et al, 2000], [Chang and Jin, 2002]
• Ranking of structured data
– Many answers, empty answer problems [Chaudhuri et al, 2004], [Agrawal et al,
2003]
– Rank-aware attribute selection [Das et al, 2006]
• Integrating ranking with clustering
– Mixture model, mutual reinforcement between ranking and clustering, for
heterogeneous information networks, e.g., DBLP [Sun et al, 2009]
• Diversification
– Web search [Agichtein et al, 2007], [Anagnostopoulos et al, 2005], [Kummamuru
et al, 2004], …
– Database queries [Chen and Li, 2007], [Vee et al, 2008]
– Recommendation [Boim et al, 2011], [Yu et al, 2009]