This document discusses techniques for improving query recommendations, especially for rare or long-tail queries, using query templates. It presents an approach that first generates candidate query templates for each query by replacing entities in the query with entity types from a hierarchy. It then infers transitions between templates and uses these to infer recommendations for rare queries not seen before by filling in templates with specific entities.
Aris Gionis, “Methods for analyzing user behavior and their application to web search and content recommendation”
1. usage mining techniques
with applications to web search
and content recommendation
Aristides Gionis
Yahoo! Research, Barcelona
yandex aug 31, 2012
2. yahoo! research, barcelona
web mining
social media and multimedia
large-scale distributed systems
user engagement
semantic web
3. web mining in yahoo! research
themes
usage mining and query-log mining
social network analysis and graph mining
influence propagation
other data mining problems
data sources
- query logs (search) and toolbar (browsing)
- social networks (flickr, messenger, email, ...)
- question-answering (answers)
- micro-blogging (twitter)
5. overview of the talk
query-log mining
query graphs
query recommendations
yahoo! tips
news recommendations using real-time web
7. query-log mining
search engines collect a large amount of query logs
lots of interesting information
analyzing users’ behavior
creating user profiles and personalization
creating knowledge bases and folksonomies
finding similar concepts
building systems for query recommendations
using statistics for improving systems’ performance
...
9. the click graph
[Craswell and Szummer, 2007]
10. applications of the click graph
[Craswell and Szummer, 2007]
query-to-document search
query-to-query suggestion
document-to-query annotation
document-to-document relevance feedback
11. the query-flow graph
[Boldi et al., 2008]
take into account temporal information
captures the “flow” of how users submit queries
definition:
nodes V = Q ∪ {s, t} the distinct set of queries Q, plus
a starting state s and a terminal state t
edges E ⊆ V × V
weights w(q, q′) representing the probability
that q and q′ are part of the same chain
12. building the query-flow graph
an edge (q, q′) if q and q′ are consecutive in
at least one session
weights w(q, q′) learned by machine learning
features used
textual features: cosine similarity, Jaccard coefficient,
size of intersection, etc.
session features: the number of sessions, the average
session length, the average number of clicks in the
sessions, the average position of the queries in the
sessions, etc. and
time-related features: average time difference, etc.
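The edge construction above can be sketched in a few lines. This is a minimal illustration, not the method of [Boldi et al., 2008]: the learned chain probability is replaced by a plain relative transition frequency, and the state names `<s>` and `<t>`, the function name, and the sample sessions are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_query_flow_graph(sessions):
    """Return {q: {q_next: weight}} from a list of sessions (lists of queries)."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Add the artificial starting state "<s>" and terminal state "<t>".
        path = ["<s>"] + list(session) + ["<t>"]
        for q, q_next in zip(path, path[1:]):
            counts[q][q_next] += 1
    graph = {}
    for q, successors in counts.items():
        total = sum(successors.values())
        # Stand-in weight: relative transition frequency, not the learned
        # chain probability described on the slide.
        graph[q] = {q2: c / total for q2, c in successors.items()}
    return graph

# Hypothetical sessions, for illustration only.
sessions = [
    ["barcelona hotels", "cheap barcelona hotels"],
    ["barcelona hotels", "barcelona weather"],
    ["barcelona hotels", "cheap barcelona hotels"],
]
qfg = build_query_flow_graph(sessions)
```

A learned model would replace the frequency with a classifier score computed from the textual, session, and time-related features listed above.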
13. query-flow graph
[figure: excerpt of the query-flow graph around “barcelona” queries, such as barcelona fc website, barcelona fc fixtures, barcelona fc, real madrid, barcelona hotels, cheap barcelona hotels, luxury barcelona hotels, barcelona weather, barcelona weather online, and the terminal node <T>; edges carry weights between 0.011 and 0.523]
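14. query-flow graph
[figure: query-flow graph example with the nodes ^ and $ (presumably the starting and terminal states) and the queries: picture of a funny cat and dog, picture of a cat, funny dog, cat, funny cat, picture of a dog, dog, dog for sale, breed of dog]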
15. query recommendations
the general theme:
given an input query q
identify similar queries q′
rank them and present them to the user
most query graphs can be used for both tasks:
similarity and ranking
17. recommendations using the query-flow graph
[Boldi et al., 2008]
perform a random walk on the query-flow graph
teleportation to the submitted query
teleportation to previous queries to take into account
the user history
normalize the PageRank scores to remove the bias
toward very popular queries
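A random walk with teleportation back to the submitted query can be sketched with a short power iteration. This is a sketch under assumptions: the restart probability of 0.15, the iteration count, the function name, and the toy graph are illustrative choices and do not come from the talk.

```python
def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Score nodes by a random walk that teleports back to `start`.

    graph: {node: {neighbor: weight}}; weights are normalized per node here.
    """
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    score = {n: 0.0 for n in nodes}
    score[start] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            out = graph.get(u, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: send all of its mass back to the start query.
                nxt[start] += score[u]
                continue
            nxt[start] += restart * score[u]
            for v, w in out.items():
                nxt[v] += (1 - restart) * score[u] * w / total
        score = nxt
    return score

# Hypothetical weights, for illustration only.
toy_graph = {
    "apple": {"apple ipod": 0.6, "apple store": 0.3, "apple fruit": 0.1},
    "apple ipod": {"apple store": 1.0},
}
scores = random_walk_with_restart(toy_graph, "apple")
```

Teleporting to previous queries of the session would amount to spreading the restart mass over several start nodes. The normalization step from the slide is left out of this sketch.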
18. example : apple
Max. weight      s_q              ŝ_q              s̄_q
t                t                apple            apple
apple ipod       apple            apple fruit      apple ipod
apple store      apple ipod       apple ipod       apple trailers
apple trailers   apple store      apple belgium    apple store
amazon           apple trailers   eating apple     apple mac
apple mac        google           apple.nl         apple fruit
itunes           amazon           apple monitor    apple usa
pc world         argos            apple usa        apple ipod nano
argos            itunes           apple jobs       apple.com/ipod...
19. example : banana → apple
banana → apple               banana
banana                       banana
apple                        eating bugs
usb no                       banana holiday
banana cs                    opening a banana
giant chocolate bar          banana shoe
where is the seed in anut    fruit banana
banana shoe                  recipe 22 feb 08
fruit banana                 banana jules oliver
banana cloths                banana cs
eating bugs                  banana cloths
20. example : beatles → apple
beatles → apple            beatles
beatles                    beatles
apple                      scarring
apple ipod                 paul mcartney
scarring                   yarns from ireland
srg peppers artwork        statutory instrument A55
ill get you                silver beatles tribute band
bashles                    beatles mp3
dundee folk songs          GHOST’S
the beatles love album     ill get you
place lyrics beatles       fugees triger finger remix
26. the recommendation problem
model user behavior as a random walk on qfg
a user starts at query q0 and follows a path p of
reformulations on qfg before terminating
consider a reward function w (q) on the nodes of qfg
goal: “nudge” users in order to maximize their reward
objectives:
1. collect a large reward along the way
2. end the session at a high-reward node
applications: a general problem formulation for suggesting
shortcuts (web graph, social networks, etc.)
27. probabilistic model
we can only suggest, not order the user
we do not know how the user will act
random walk on qfg is modeled by stochastic matrix P
recommendations R modify P to P′ = P + R
28. utility functions
reward function w (q) on queries
- quality of search results, user satisfaction, dwell time,
monetization, etc.
utility function U(p) on paths p = q_0 . . . q_{k−1} T
U(p) = Σ_{q ∈ p} w(q) (Cavafy: “road to Ithaca”)
U(p) = w(q_{k−1}) (Machiavelli: “the end justifies the means”)
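The two utility functions differ only in which nodes of the path they count. A minimal sketch, assuming a path is a list of queries without the terminal state and rewards are a plain dictionary; the function names and reward values are hypothetical:

```python
def utility_sum(path, reward):
    """'Road to Ithaca': collect the reward of every query on the path."""
    return sum(reward[q] for q in path)

def utility_last(path, reward):
    """'End justifies the means': only the last query before terminating counts."""
    return reward[path[-1]]

# Hypothetical rewards, for illustration only.
reward = {"hotels": 0.25, "barcelona hotels": 0.5, "cheap barcelona hotels": 1.0}
path = ["hotels", "barcelona hotels", "cheap barcelona hotels"]
total_reward = utility_sum(path, reward)
final_reward = utility_last(path, reward)
```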
29. utility
[figure: bar chart of the sum of expected values (y-axis, 0.0 to 1.2); x-axis labels: w, ρ, ρw, 1−step heuristic]
30. qfg projections for diverse recommendations
[Bordino et al., 2010]
31. diverse recommendations
[Bordino et al., 2010]
we want not only relevant and high-quality
recommendations, but also a diverse set
we want recommendations that lead in different
“directions” in the qfg
need notions of distance of queries in the qfg
use spectral embeddings
project a graph in a low dimensional space, so that
embedding minimizes total edge distortion
finding diverse recommendations reduces to a geometric
problem
32. example: time
Spectral projection on 2-hop neighborhood
query: time
                        (1)      (2)      (3)      (4)      (5)      (6)      (7)
(1) time magazine        -      0.9953   0.0162   0.1422   0.1049  -0.6071  -0.6056
(2) new york times     0.9953    -      -0.0051   0.1248   0.0893  -0.6478  -0.6462
(3) time zone          0.0162  -0.0051    -       0.9903   0.9891  -0.5234  -0.5254
(4) world time         0.1422   0.1248   0.9903    -       0.9970  -0.6263  -0.6282
(5) what time is it    0.1049   0.0893   0.9891   0.9970    -      -0.6244  -0.6263
(6) time warner       -0.6071  -0.6478  -0.5234  -0.6263  -0.6244    -       0.9999
(7) time warner cable -0.6056  -0.6462  -0.5254  -0.6282  -0.6263   0.9999    -
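Once queries are embedded, picking a diverse set becomes a geometric selection problem. The sketch below uses the values of the table above, treats them as pairwise similarities (an assumption), fills the missing diagonal with 1.0 (also an assumption), and applies a simple greedy max-min rule. The greedy rule is an illustrative stand-in, not necessarily the procedure of [Bordino et al., 2010].

```python
QUERIES = ["time magazine", "new york times", "time zone", "world time",
           "what time is it", "time warner", "time warner cable"]

# Values from the table; the diagonal (1.0) is an assumption.
SIM = [
    [1.0, 0.9953, 0.0162, 0.1422, 0.1049, -0.6071, -0.6056],
    [0.9953, 1.0, -0.0051, 0.1248, 0.0893, -0.6478, -0.6462],
    [0.0162, -0.0051, 1.0, 0.9903, 0.9891, -0.5234, -0.5254],
    [0.1422, 0.1248, 0.9903, 1.0, 0.9970, -0.6263, -0.6282],
    [0.1049, 0.0893, 0.9891, 0.9970, 1.0, -0.6244, -0.6263],
    [-0.6071, -0.6478, -0.5234, -0.6263, -0.6244, 1.0, 0.9999],
    [-0.6056, -0.6462, -0.5254, -0.6282, -0.6263, 0.9999, 1.0],
]

def pick_diverse(sim, k, first=0):
    """Greedily pick k indices so that each new pick is far from all chosen ones."""
    chosen = [first]
    while len(chosen) < k:
        # Pick the candidate whose closest already-chosen item is least similar.
        best = min(
            (i for i in range(len(sim)) if i not in chosen),
            key=lambda i: max(sim[i][j] for j in chosen),
        )
        chosen.append(best)
    return chosen

diverse = [QUERIES[i] for i in pick_diverse(SIM, 3)]
```

Starting from "time magazine", the selection skips its near-duplicate "new york times" and picks one query from each of the other two clusters.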
33. improving recommendation
for long-tail queries via templates
[Szpektor et al., 2011]
34. motivation
goal: improve coverage of query-recommendation systems
observation: in a typical query log 50 % of query volume
are unique queries [Baeza-Yates et al., 2007]
most query-recommendation systems are based on finding
queries that co-occur frequently
inherent limitation on using co-occurrences
need methods that can reason about rare,
and even previously unseen, queries
35. overview of the approach
1 generate candidate query-templates for each query
Paris hotels → <city> hotels
Paris hotels → <district> hotels
Moscow hotels → <city> hotels
2 infer transitions between templates
<city> hotels → <city> restaurants
3 infer recommendations for rare queries
Yancheng hotels → Yancheng restaurants
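The three steps can be sketched end to end on the examples above. This is a toy illustration: the entity dictionary stands in for the entity-type hierarchy, the single template transition is written by hand instead of being inferred from a log, and every name in the code is hypothetical.

```python
# Toy entity-type lookup; a stand-in for the real hierarchy.
ENTITY_TYPES = {
    "paris": ["city", "district"],
    "moscow": ["city"],
    "yancheng": ["city"],
}
# Hand-written template transition; in the approach these are inferred.
TEMPLATE_TRANSITIONS = {"<city> hotels": ["<city> restaurants"]}

def templates_for(query):
    """Step 1: replace each known entity token by each of its types."""
    tokens = query.lower().split()
    result = []
    for i, token in enumerate(tokens):
        for entity_type in ENTITY_TYPES.get(token, []):
            slot = f"<{entity_type}>"
            template = " ".join(tokens[:i] + [slot] + tokens[i + 1:])
            result.append((template, slot, token))
    return result

def recommend(query):
    """Step 3: follow template transitions and fill the slot back in."""
    recommendations = []
    for template, slot, entity in templates_for(query):
        for target in TEMPLATE_TRANSITIONS.get(template, []):
            recommendations.append(target.replace(slot, entity))
    return recommendations

rare_query_recs = recommend("Yancheng hotels")
```

The rare query never needs to appear in the log: it only has to map to a template whose transitions were learned from more frequent queries.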
41. query templates
defined over a hierarchy of entity types
define a global set of templates over the whole query log
not restricted to specific domains
(such as travel, weather, or movies)
examples:
jaguar spare parts → <car> spare parts
name for salt → name for <compound>
a thousand miles notes → <song> notes
46. ranking candidate templates
ambiguity
Jaguar spare parts → <car> spare parts
Jaguar spare parts → <animal> spare parts
focus
name for salt → name for <compound>
name for salt → <description> for salt
right generalization level
Paris hotels → <capital> hotels
Paris hotels → <city> hotels
Paris hotels → <location> hotels
49. construction of query templates – details
hierarchy used: WordNet 3.0 hierarchy and Wikipedia
category hierarchy, connected via the YAGO mapping
queries are tokenized, and n-grams are looked up and
mapped to entities in the hierarchy
enriched with heuristic generalizations for <email>,
<url>, numbers, and noun-phrases not in the taxonomy
50. query-to-template edges
mapping from a query q to its set of templates T (q)
viewed as query-to-template edges
associated edge scores
s_qt(q, t) = α^d
when t obtained by generalizing q at distance d in H
parameter α set experimentally to 0.9
set s_qt(q, q′) = 1, if (q, q′) edge in query-flow graph
normalize so that all s_qt(q, ·) sum to 1
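The template part of the score is a decay by hierarchy distance followed by a normalization. A minimal sketch, assuming the distances are already known; the distances in the example are hypothetical, and the query-flow-graph rule from the slide is left out.

```python
def query_to_template_scores(distances, alpha=0.9):
    """distances: {template: d}, the generalization distance in the hierarchy.

    Returns alpha**d per template, normalized to sum to 1.
    """
    raw = {template: alpha ** d for template, d in distances.items()}
    total = sum(raw.values())
    return {template: score / total for template, score in raw.items()}

# Hypothetical distances for "Paris hotels", for illustration only.
paris_scores = query_to_template_scores(
    {"<capital> hotels": 1, "<city> hotels": 2, "<location> hotels": 3}
)
```

With α below 1, more specific templates (smaller d) receive a larger share of the score.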
51. template-to-templates edges
reasoning about transitions between templates
<food> recipe → healthy <food> recipe
for templates (t1, t2) define the support set of query pairs
Sup(t1, t2) = {(q1, q2)}, s.t.
t1 ∈ T(q1) and t2 ∈ T(q2)
t1 and t2 substitute the same token in q1 and q2
(e.g., dosa recipe and healthy dosa recipe)
define template-to-template edge score as
s_tt(t1, t2) = Σ_{(q1, q2) ∈ Sup(t1, t2)} s_qq(q1, q2)
normalize so that all s_tt(t, ·) sum to 1
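The aggregation over the support set can be sketched as a double loop over the templates of each query pair. The data layout (each query maps its templates to the token they replace), the function name, and the scores are assumptions made for this example.

```python
from collections import defaultdict

def template_transition_scores(query_pairs, templates_of):
    """query_pairs: {(q1, q2): s_qq}; templates_of: {q: {template: replaced token}}."""
    raw = defaultdict(lambda: defaultdict(float))
    for (q1, q2), s_qq in query_pairs.items():
        for t1, token1 in templates_of.get(q1, {}).items():
            for t2, token2 in templates_of.get(q2, {}).items():
                if token1 == token2:  # both templates substitute the same token
                    raw[t1][t2] += s_qq
    # Normalize so that the scores leaving each template sum to 1.
    return {
        t1: {t2: s / sum(row.values()) for t2, s in row.items()}
        for t1, row in raw.items()
    }

# Hypothetical query-to-query scores, for illustration only.
pairs = {
    ("dosa recipe", "healthy dosa recipe"): 0.4,
    ("pasta recipe", "healthy pasta recipe"): 0.2,
    ("pasta recipe", "pasta sauce"): 0.2,
}
templates = {
    "dosa recipe": {"<food> recipe": "dosa"},
    "healthy dosa recipe": {"healthy <food> recipe": "dosa"},
    "pasta recipe": {"<food> recipe": "pasta"},
    "healthy pasta recipe": {"healthy <food> recipe": "pasta"},
    "pasta sauce": {"<food> sauce": "pasta"},
}
s_tt = template_transition_scores(pairs, templates)
```

Two query pairs support the transition to `healthy <food> recipe`, so it outweighs the transition supported by a single pair.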
52. example – ambiguity
consider query transition:
jaguar transmission → jaguar spare parts
template transition
<car> transmission → <car> spare parts
supported by
bmw transmission → bmw spare parts
audi transmission → audi spare parts
...
template transition
<animal> transmission → <animal> spare parts
will not be supported by
lion transmission → lion spare parts
tiger transmission → tiger spare parts
...
54. the query-template flow graph
extension of the query-flow graph
superposition of all the concepts we have seen so far:
set of nodes consists of queries and templates
set of edges consists of
query to query edges
query to template edges
template to template edges
associated weights
55. generating recommendations
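[figure: paths from the query q to candidate queries through the templates t1, t2, t3, t4; edges are labeled with the scores s1 to s7, and some edges are dashed]
r(q, q′) = s1·s4 + s2·s5 + s3·s6 + s3·s7
interpretation: probability of a feasible path
dashed lines do not really exist, but discovered on-the-fly
queries q and q′ may not have been seen before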
transitions in the query-flow graph ranked first
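The score r is a sum over paths of the product of the edge scores along each path. A minimal numeric sketch, with hypothetical values for s1 to s7 and a hypothetical function name:

```python
from math import prod

def path_score_sum(paths):
    """Sum, over all paths, of the product of the edge scores on the path."""
    return sum(prod(edge_scores) for edge_scores in paths)

# Hypothetical edge scores, for illustration only.
s1, s2, s3, s4, s5, s6, s7 = 0.5, 0.25, 0.25, 0.5, 0.5, 0.25, 0.25
r = path_score_sum([[s1, s4], [s2, s5], [s3, s6], [s3, s7]])
```

When the scores on the edges leaving each node are normalized to sum to 1, this sum can be read as the probability of reaching the candidate by some feasible path, as stated above.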
56. methodology
methods:
query-template flow graph
query-flow graph
evaluation:
inspection of a sample of the results
editorial evaluation
automated evaluation
58. anecdotal evidence
{“guangzhou flights”, “guangzhou map”}
<capital> flights → <capital> map
{“a thousand miles notes”, “a thousand miles piano notes”}
<single> notes → <single> piano notes
{“8 week old weimaraner”, “8 week old weimaraner puppy”}
8 week old <breed> → 8 week old <breed> puppy
{“aaa office twin falls idaho”, “aaa twin falls idaho”}
aaa office <city> → aaa <city>
{“air force titles”, “air force ranks”}
<military service> titles → <military service> ranks
{“name for salt”, “chemical name for salt”}
name for <compound> → chemical name for <compound>
59. editorial evaluation
set-A: 300 pairs from each configuration,
recommendation in the top-10
set-B: 100 pairs, same queries in each configuration,
same position
set-C: 100 pairs for which query-flow graph has no
recommendation
editors labeled query-recommendation pairs as:
relevant, not relevant, cannot tell
two editors, 100 common queries, kappa-statistic 0.37
        qfg      qtfg
set-A   98.48%   97.84%
set-B   97.65%   98.86%
set-C   —        94.38%
60. automated evaluation – guiding principle
extract query pairs {q_i, q_{i+1}} from a testing dataset, such
that user submitted q_{i+1} after q_i in the same session
measure if q_{i+1} is predicted by our methods, and in which
position
assumption: q_{i+1} should be relevant and useful for q_i
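The guiding principle translates into a small evaluation loop. This sketch makes two assumptions that the slides do not spell out: coverage is taken to be the fraction of test pairs for which the system returns any recommendation, and "top-k" counts pairs whose second query appears within the first k recommendations. The function and variable names are hypothetical.

```python
def evaluate(test_pairs, recommend, ks=(1, 10, 100)):
    """test_pairs: list of (q_i, q_next); recommend: query -> ranked list."""
    covered = 0
    hits = {k: 0 for k in ks}
    for q, q_next in test_pairs:
        ranked = recommend(q)
        if ranked:
            covered += 1  # assumed definition of coverage
        if q_next in ranked:
            position = ranked.index(q_next) + 1
            for k in ks:
                if position <= k:
                    hits[k] += 1
    n = len(test_pairs)
    metrics = {"coverage": covered / n}
    for k in ks:
        metrics[f"top-{k}"] = hits[k] / n
    return metrics

# Hypothetical recommender and test pairs, for illustration only.
toy_recs = {"a": ["b", "c"], "d": ["e"]}
toy_metrics = evaluate(
    [("a", "c"), ("d", "x"), ("z", "y")],
    lambda q: toy_recs.get(q, []),
    ks=(1, 10),
)
```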
61. results
                  qfg        qtfg       relative increase
pair occurrences
  total pairs     3134388    3134388
  coverage        22.65 %    28.17 %    24.37 %
  # in top-100    16.97 %    25.49 %    50.23 %
  # in top-10     9.49 %     20.74 %    118.49 %
  # in top-1      2.86 %     10.01 %    249.5 %
  MAP             0.050      0.137
  avg. position   18.35      8.3
unique pairs
  total pairs     2755922    2755922
  coverage        13.28 %    19.38 %    45.87 %
  # in top-100    12.06 %    17.25 %    42.96 %
  # in top-10     8.41 %     13.52 %    60.68 %
  # in top-1      2.86 %     6.5 %      127.32 %
  MAP             0.047      0.089
  avg. position   12.33      9.43
63. conclusions
improve coverage of query recommendation systems
recommendations for rare or previously unseen queries
well suited for tail queries
complements rather than replaces existing methods
future work: improve quality of extracted templates
64. yahoo! tips
[Weber et al., 2011]
65. motivation
provide answers, not links
identify “how to” queries and provide tips
tip: piece of advice that is
1 short
2 concrete
3 self-contained
4 non-obvious
70. extract tips from yahoo! answers
tip: To tell if your eggs are fresh : place eggs in a bowl/glass
of water.....if it floats it’s bad. if it sinks it’s good.
71. system diagram
tip collection:
rule-based extraction → 250k candidate tips
obtain quality labels for 20k candidate tips using CrowdFlower
machine learning → 22k high quality tips
query handling, e.g., for “zest lime without zester”:
does the query have how-to intent? no: show normal search results
yes: are there relevant high quality tips? no: show normal search results
yes: rank the matching tips and display the highest ranking one
TIP: To zest a lime if you don’t have a zester : use a cheese grater
72. mining tips from yahoo! answers
consider tips of a specific structure: “X : Y ”
X : goal of the tip
Y : action of the tip
examples
To get the mildew smell out of your towels : try soaking
it in a salt water solution, then washing with soap and
cold water, that tends to get rid of smells
To style your hair without heat, gel or straighteners : try
coconut oil mark k
73. mining tips from yahoo! answers
english
only literal “how to” queries
answer should start with a verb
consider only best answers
replace I, my, me, myself, etc.
with you, your, you, yourself, etc.
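The extraction rules above can be approximated with two regular expressions. This is a rough sketch, not the extraction rules of [Weber et al., 2011]: the check that the answer starts with a verb, the language filter, and the best-answer filter are omitted, and the function names are hypothetical.

```python
import re

PRONOUNS = {"i": "you", "my": "your", "me": "you", "myself": "yourself"}

def to_second_person(text):
    """Replace first-person pronouns with second-person ones."""
    def swap(match):
        word = match.group(0)
        replacement = PRONOUNS[word.lower()]
        # Keep the capitalization of words like "My"; "I" becomes plain "you".
        if word[0].isupper() and word.lower() != "i":
            return replacement.capitalize()
        return replacement
    return re.sub(r"\b(I|my|me|myself)\b", swap, text, flags=re.IGNORECASE)

def make_tip(question, best_answer):
    """Build a tip 'To X : Y' from a literal 'how to X' question, else None."""
    match = re.match(r"how to\s+(.*?)\??$", question.strip(), flags=re.IGNORECASE)
    if not match:
        return None
    goal = to_second_person(match.group(1))
    action = to_second_person(best_answer.strip())
    return f"To {goal} : {action}"

tip = make_tip("How to zest a lime without a zester?", "use a cheese grater")
```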
74. quality filtering
generated 249 675 tips
manually label 20 000 using CrowdFlower
classes: very good (25%), ok (48%), bad (27%)
algorithms
svm (rbf)
decision trees
k-nn (Euclidean, k = 21 . . . 50)
feature families:
18 handcrafted features: e.g., style (Flesch-Kincaid
reading level), sentiment, # urls, emoticons, etc.
content: SVD on the tip×term matrix
77. quality filtering — machine learning results
       Method          handcrafted features   content features   both
Hard   SVM             0.63/0.13              0.60/0.09          0.63/0.16
Hard   Decision Tree   0.67/0.07              0.61/0.06          0.66/0.13
Hard   k-NN            0.62/0.23              0.56/0.11          0.63/0.11
Soft   SVM             0.95/0.11              0.93/0.05          0.95/0.08
Soft   Decision Tree   0.95/0.03              0.92/0.03          0.94/0.06
Soft   k-NN            0.94/0.11              0.91/0.05          0.94/0.05
78. quality filtering — machine learning results
Category P,R VG size
Beauty & Style 0.53,0.08 0.16 0.08
Business & Finance 0.57,0.20 0.20 0.03
Cars & Transportation 0.64,0.12 0.23 0.03
Computers & Internet 0.69,0.33 0.45 0.15
Consumer Electronics 0.70,0.23 0.38 0.06
Entertainment & Music 0.60,0.39 0.15 0.05
Family & Relationships 0.35,0.05 0.06 0.14
Games & Recreation 0.61,0.31 0.24 0.04
Health 0.62,0.07 0.15 0.09
Home & Garden 0.43,0.06 0.27 0.04
Society & Culture 0.50,0.19 0.09 0.03
Sports 0.68,0.24 0.19 0.03
Yahoo! Products 0.73,0.43 0.45 0.07
79. detecting “how to” queries
how many? 2-3% of volume, 3-4% of distinct queries
start with “how to” “how do i” or “how can i”
how do you fix keys on a laptop
P: 96-99%, cover: 1.0%
queries start with an action verb
play my music on tool bar raido
P: 7-14%, cover: 3.2%
if exists “how to X” then “X”
craft ideas for boys
P: 87-94%, cover: 1.1%
incoming queries to “how to” web sites
fixing a wet cell phone
P: 61-75%, cover: 0.08%
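Two of the four rules above are easy to sketch: the literal prefix rule, and the rule that a query X is treated as how-to when “how to X” also exists in the log. The function name, the returned labels, and the sample log are assumptions for the example; only the three prefixes listed above are used.

```python
HOW_TO_PREFIXES = ("how to ", "how do i ", "how can i ")

def detect_how_to(query, logged_queries):
    """Return which rule fired ('prefix' or 'implied'), or None."""
    q = " ".join(query.lower().split())  # normalize case and whitespace
    if q.startswith(HOW_TO_PREFIXES):
        return "prefix"
    if f"how to {q}" in logged_queries:
        return "implied"
    return None

# Hypothetical log contents, for illustration only.
log = {"how to craft ideas for boys"}
rule = detect_how_to("craft ideas for boys", log)
```

The action-verb rule and the rule based on incoming queries to “how to” web sites would need a part-of-speech tagger and click data respectively, so they are not sketched here.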
84. matching queries to tips
precision–recall trade-off
index only the “goal” or also “action”
use AND or OR mode for query
require minimum “span” for the goal
ranking
rank by number of query tokens in goal, then tf·idf
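The AND/OR modes and the minimum span can be sketched as set operations on tokens. Two points are assumptions, since the slides do not define them: “span” is taken to be the fraction of distinct goal tokens that the query covers, and ranking uses only the number of matching tokens (the tf·idf tie-break is omitted). Only the goal is indexed.

```python
def match_tips(query, tips, mode="AND", min_span=0.5):
    """tips: list of (goal, action). Returns matching tips, best first."""
    q_tokens = set(query.lower().split())
    results = []
    for goal, action in tips:
        g_tokens = set(goal.lower().split())
        if not g_tokens:
            continue
        overlap = q_tokens & g_tokens
        if mode == "AND" and overlap != q_tokens:
            continue  # every query token must occur in the goal
        if mode == "OR" and not overlap:
            continue  # at least one query token must occur in the goal
        span = len(overlap) / len(g_tokens)  # assumed definition of span
        if span >= min_span:
            results.append((len(overlap), goal, action))
    results.sort(key=lambda r: -r[0])  # stable: ties keep input order
    return [(goal, action) for _, goal, action in results]

# Hypothetical tips, for illustration only.
tips = [
    ("zest a lime without a zester", "use a cheese grater"),
    ("peel a lime", "roll it first"),
]
and_matches = match_tips("zest lime", tips, mode="AND", min_span=0.3)
or_matches = match_tips("zest lime", tips, mode="OR", min_span=0.3)
```

AND mode with a high minimum span trades coverage for precision, which is the trade-off listed above.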
85. matching queries to tips — evaluation
mode min span vol. dist. P@1 median
AND .50 8.7% 2.7% .428/.680 1
AND .66 6.8% 1.8% .557/.770 1
AND 1.0 4.4% 0.8% .625/.835 1
OR .50 87.4% 88.4% .048/.110 18
OR .66 36.8% 36.3% .092/.200 2
OR 1.0 13.5% 10.3% .160/.300 1
86. future work
mine tips from other resources
twitter
wikitravel
improve quality of existing system
incorporating more features
improving rule extraction
classification
88. the information dissemination spectrum
news sites, content-provider sites: editorially curated; users browse; no specific info need
web search: url, images, music, ...; clear intent
social media (twitter, facebook)
recommendations (content- or context- or geo-aware)
user-generated content (blogs, images, q/a)
93. social media and user-generated content
paradigm shift from a broadcast one-to-many mechanism
to a many-to-many model
users in the role of information producers
94. benefits and opportunities
wealth of information of extreme volume and diversity
wisdom of crowd phenomena
accurate profiling and personalization
(toolbar, search, clicks)
content- and context- information available
social and geo information available
95. challenges
heterogeneous sources
high variability in quality
needle-in-the-haystack problems
we want to:
support users to seek, filter, and disseminate information
build efficient platforms that support social-media
functionalities
98. overview
a news recommendation system based on real-time web,
e.g., twitter
suggest news articles to twitter users
infer user preferences from twitter activity
102. sources characteristics
news stream
+ high coverage
− sparse and noisy data for user profiling
− latency on collecting user feedback
twitter stream
+ much more accurate personalization
+ news spread very fast
103. motivation
[figure: normalized number of occurrences over time (x-axis labeled by day in May and hour) for three series: news, twitter, clicks]
104. news-click delay distribution
[figure: news-click delay distribution; x-axis: minutes (ticks from 1 to 10000); y-axis: number of occurrences]
106. challenges
scale to large volumes of news and tweets
high dynamicity of news and tweets
news have short life-cycle
twitter users use jargon language
find the right degree of personalization
cope with inactive twitter users
108. T.rex architecture
[figure: T.Rex architecture; inputs: user tweets, followee tweets, and news articles; a user model is built from the twitter streams; output: a personalized ranked list of news articles]
builds user profiles from twitter
109. recommendation model
Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n)
social model Σ(i, j): social relevance of news j to user i
content model Γ(i, j): content relevance of news j to user i
popularity model Π(j): popularity of news article j
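The recommendation score is a weighted sum of the three model outputs, so ranking reduces to sorting by that sum. A minimal sketch, assuming the three models are available as lookup tables; the function name, the table layout, and all numbers are hypothetical.

```python
def rank_news(user, news_ids, social, content, popularity, alpha, beta, gamma):
    """Rank news by alpha * social + beta * content + gamma * popularity.

    social, content: {(user, news): score}; popularity: {news: score}.
    Missing entries count as 0.
    """
    def score(news):
        return (alpha * social.get((user, news), 0.0)
                + beta * content.get((user, news), 0.0)
                + gamma * popularity.get(news, 0.0))
    return sorted(news_ids, key=score, reverse=True)

# Hypothetical model outputs, for illustration only.
social = {("u", "n1"): 0.2, ("u", "n2"): 0.8}
content = {("u", "n1"): 0.9}
popularity = {"n1": 0.1, "n2": 0.1, "n3": 0.5}
ranking = rank_news("u", ["n1", "n2", "n3"], social, content, popularity,
                    alpha=1.0, beta=1.0, gamma=1.0)
```

Setting two of the three weights to 0 yields a ranking by a single model.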
113. popularity update rule
news become stale after two days
track mentions in news and tweets with exponential decay
Z_τ = λ · Z_{τ−1} + w_T · H_T + w_N · H_N
[background figures from the poster “Personalized News Recommendation”: news-click delay distribution (x-axis: minutes); normalized number of occurrences over time for news, twitter, and clicks; T.Rex architecture]
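The exponential-decay update Z_τ = λ · Z_{τ−1} + w_T · H_T + w_N · H_N can be sketched per tracked item. The reading of H_T and H_N as mention counts in tweets and news during the current time step is an assumption, as are the function name and the values of λ, w_T, and w_N.

```python
def update_popularity(z_prev, tweet_counts, news_counts, lam=0.5, w_t=1.0, w_n=1.0):
    """One step of Z = lam * Z_prev + w_t * H_T + w_n * H_N, per tracked item.

    z_prev: {item: previous popularity}
    tweet_counts, news_counts: {item: mentions in the current step} (assumed meaning)
    """
    items = set(z_prev) | set(tweet_counts) | set(news_counts)
    return {
        item: lam * z_prev.get(item, 0.0)
        + w_t * tweet_counts.get(item, 0)
        + w_n * news_counts.get(item, 0)
        for item in items
    }

# Hypothetical counts, for illustration only.
z1 = update_popularity({}, {"x": 4}, {"x": 2})
z2 = update_popularity(z1, {}, {})
z3 = update_popularity(z2, {}, {})
```

Without new mentions the popularity is multiplied by λ at every step, which is how older items fade.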
114. model learning and evaluation
Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n)
Yahoo! toolbar data
the recommendation model should rank high
news articles that users click
learn the model using SVM
use clicks and twitter profiles of 3K users
to train and test the system
115. systems evaluated
T.rex: basic model using only user profiles
Rτ (u, n) = α · Στ (u, n) + β · Γτ (u, n) + γ · Πτ (n)
T.rex+: additional features
entity hotness
news click count
news article age
116. results
Table 5.2: MRR, precision and coverage.
Algorithm    MRR     P@1     P@5     P@10    Coverage
RECENCY      0.020   0.002   0.018   0.036   1.000
CLICKCOUNT   0.059   0.024   0.086   0.135   1.000
SOCIAL       0.017   0.002   0.018   0.036   0.606
CONTENT      0.107   0.029   0.171   0.286   0.158
POPULARITY   0.008   0.003   0.005   0.012   1.000
T.REX        0.107   0.073   0.130   0.168   1.000
T.REX+       0.109   0.062   0.146   0.189   1.000
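MRR (mean reciprocal rank) averages, over users, the reciprocal of the position of the first relevant item. A minimal sketch of the standard definition, assuming that a clicked article counts as relevant and that users without any hit contribute 0; the exact protocol behind the table is not given on the slides, and the names below are hypothetical.

```python
def mean_reciprocal_rank(rankings, clicked):
    """rankings: {user: ranked list of news}; clicked: {user: set of clicked news}."""
    total = 0.0
    for user, ranked in rankings.items():
        for position, item in enumerate(ranked, start=1):
            if item in clicked.get(user, set()):
                total += 1.0 / position  # first clicked item only
                break
    return total / len(rankings)

# Hypothetical rankings and clicks, for illustration only.
rankings = {"u1": ["a", "b"], "u2": ["c", "d"], "u3": ["e"]}
clicked = {"u1": {"b"}, "u2": {"c"}}
mrr = mean_reciprocal_rank(rankings, clicked)
```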
117. results
RECENCY: it ranks news articles by time of publication (most recent first);
CLICKCOUNT: it ranks news articles by click count (highest count first);
SOCIAL: it ranks news articles by using T.REX with β = γ = 0;
CONTENT: it ranks news articles by using T.REX with α = γ = 0;
POPULARITY: it ranks news articles by using T.REX with α = β = 0.
5.6.5 Results
We report MRR, precision and coverage results in Table 5.6.3. The two variants of our system, T.REX and T.REX+, have the best results overall. T.REX+ has the highest MRR of all the alternatives. This result means that our model has a good overall performance across the dataset. CONTENT has also a very high MRR. Unfortunately, the coverage level achieved by the CONTENT strategy is very low. This issue is mainly caused by the sparsity of the user profiles. It is well known that most of twitter users belong to the “silent majority,” and do not tweet very much. The SOCIAL strategy is affected by the same problem, albeit to a much [...]
[figure: average discounted cumulative gain (y-axis, 0 to 14) by rank (x-axis, 1 to 20) for T.Rex+, T.Rex, Popularity, Content, Social, Recency, and Click count]
118. conclusions
real-time web information can be leveraged to deliver
relevant information
future directions
LSI analysis on entities
models for different user clusters
geographic information
120. summary
review concepts on query-log mining
answering queries directly with useful tips
challenges and opportunities in information dissemination
news recommendations using real-time web
many nice problems and research opportunities