Modern web search engines are making increasing use of signals other than mere textual statistics. While documents used to be matched to keyword queries based on term counting alone, modern information retrieval systems incorporate and learn from a large number of features pertaining to the query, user, documents, entities, sessions, etc. In particular, a document ranking generated by a web search engine involves combining signals from rich representations of users (including their location, browser, device, profile, history, etc.), semantics (ranging from simple spell-checking to recognizing entities), popularity, social networking, and more. All of these features need to be computed at an increasingly large scale and call for Big Data storage and analytics methods. In this talk I will give some examples of current IR research being done at the University of Amsterdam, leaning heavily on MapReduce and related programming paradigms.
See http://www.nlhug.org/events/56584462/.
2. Joint work with Amit Bronner, Hendrike Peetz, Wouter Weerkamp, Anne Schuth, Maarten de Rijke
Large-scale Data Processing for IR 2
3.–7. [Topic map of research areas at the University of Amsterdam, with different areas highlighted on each slide: information retrieval, big data, multi-lingual information access, machine translation, theory and models, evaluation methodology, text mining, intelligent information services, political information, storytelling, human-computer information retrieval, knowledge representation & reasoning, information integration, exploratory search, foundations of XML, semantic search, real-time analytics, social signal analysis, synchronized content, multi-modal summaries, open data]
8. Me
¢ Information retrieval (~ search engines)
¢ Semantic search/annotations
¢ Use knowledge bases (Wikipedia, Freebase, etc.)
£ as a primary information source for search, or
£ as a complement to traditional retrieval
10. Search engines – a bird’s eye view
¢ Main ingredient: Counting words
£ Query ~ distribution over words
£ Document ~ distribution over words
£ Ranking ~ comparing distributions
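The "comparing distributions" view can be sketched as a toy unigram query-likelihood ranker; the documents, query, and smoothing weight below are my own illustration, not from the talk.

```python
import math
from collections import Counter

def query_likelihood(query, doc_tokens, collection, lam=0.5):
    """Score a document by log P(query | document) under a unigram
    language model, linearly smoothed against the whole collection."""
    doc = Counter(doc_tokens)
    coll = Counter(w for d in collection for w in d)
    coll_total = sum(coll.values())
    score = 0.0
    for w in query.split():
        p_doc = doc[w] / len(doc_tokens) if doc_tokens else 0.0
        p_coll = coll[w] / coll_total
        # interpolate document and collection distributions
        score += math.log((1 - lam) * p_doc + lam * p_coll + 1e-12)
    return score

docs = [
    "hurricane threats to the southern united states".split(),
    "recipe for a good apple pie".split(),
]
ranked = sorted(docs, key=lambda d: query_likelihood("hurricane states", d, docs),
                reverse=True)
# ranked[0] is the hurricane document
```

Ranking then amounts to sorting documents by how well their word distribution explains the query's.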
12. [Example: a news snippet, “Forecasters are watching tropical storms that could pose hurricane threats to the southern United States. One is a downgraded …”, shown with its word distribution: hurricane, tropical, forecast, wind, weather, fun, home]
13. Search engines – a bit of history
¢ Anno 1995
£ Counting words (only)...
£ Stopwords
£ Linguistic normalization
14. Search engines – a bit of history
¢ Anno 2000: 2nd generation
£ Link structure
˜ Anchor text
˜ PageRank
£ Document structure
˜ title, top/bottom, etc.
˜ boilerplate
£ Click-through data
15. Search engines – a bit of history
¢ Anno now
£ Real-time indexing/search
£ Increasingly personalized
£ Increasingly social
£ Apply “observations” of human behavior to improve and to evaluate
˜ Search behavior, click behavior, dwell time, reading time, …, other things that are happening in the world
£ Rich signals
16. Signals
¢ Users/Personalisation
£ group: country, region, language, device, browser, etc.
£ individual: profile, history, sessions, etc.
¢ Linguistics (e.g., spell-checking)
¢ Semantics (e.g., entities)
¢ Popularity (e.g., PageRank)
¢ Social (e.g., G+)
¢ And more...
£ readability, relevance assessments, clicks, etc.
[Inset slide “Why ‘learning to rank’?” (KH&MdR, U. Amsterdam, Advanced Information Retrieval): more and more features are found to be useful for ranking documents; how should we combine these? Image: http://www.flickr.com/photos/sameli/540933604/]
17. Applying signals
¢ Typically at query time...
£ Leaning heavily on machine learning
¢ Not the focus here...
[Same “Why ‘learning to rank’?” inset as on slide 16]
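The combination question on the inset (“how should we combine these?”) is what learning to rank answers. A minimal pointwise sketch with invented features and labels, not any production ranker:

```python
def train_linear_ranker(examples, lr=0.1, epochs=200):
    """examples: list of (feature_vector, relevance_label) pairs.
    Fits a linear scoring function with plain SGD on squared error."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for feats, label in examples:
            pred = sum(wi * fi for wi, fi in zip(w, feats))
            err = pred - label
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
    return w

# invented features: [term-match score, popularity, click-through rate]
examples = [
    ([0.9, 0.2, 0.8], 1.0),  # relevant
    ([0.8, 0.1, 0.7], 1.0),  # relevant
    ([0.2, 0.9, 0.1], 0.0),  # popular but not relevant
    ([0.1, 0.3, 0.2], 0.0),  # not relevant
]
w = train_linear_ranker(examples)
score = lambda feats: sum(wi * fi for wi, fi in zip(w, feats))
```

The learned weights combine the signals into a single ranking score; real systems use far richer models, but the idea is the same.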
19. What generates (non-monetary) value?
¢ What is value?
£ Better/Richer UX
˜ Clever term/phrase suggestions
˜ Clever, rich snippets
£ Finding what you need faster/better/...
˜ Homing in on what you want to find
˜ Task/Problem solving
£ and more...
20. For instance...
[Screenshot: query “good camera under 300 euro”]
27. So, where else do you get value from?
¢ Improving signals...
£ Richer/Better/More focused signals
˜ Richer data/better extraction/...
˜ "Google acquires Freebase"
¢ ... or the application thereof
£ Algorithmic innovations
£ Training data
˜ Logs (queries, clicks, ...) – from toolbars, redirects, etc.
˜ Relevance assessments – manual, professionals, mechanical turk, etc.
¢ "More intelligent systems"
28. Intelligence?
¢ Need analysis of (large quantities of) data
£ Typically, "transformations"
˜ graphs (PageRank, FriendRank)
˜ text => structure
˜ aggregations
˜ etc.
¢ Then, aggregate analyses to obtain "value"
£ count/sum/min/max/avg/etc.
¢ Hadoop!
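The transform-then-aggregate pattern that Hadoop supports can be sketched in plain Python; this is a toy word count standing in for the count/sum/min/max/avg aggregations above, not actual cluster code.

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents, mapper):
    """Apply the transformation (mapper) to each record independently."""
    return chain.from_iterable(mapper(doc) for doc in documents)

def reduce_phase(pairs, reducer):
    """Group (key, value) pairs by key, then aggregate each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# mapper: text => (word, 1) pairs; reducer: sum
docs = ["big data big value", "value from data"]
counts = reduce_phase(map_phase(docs, lambda d: ((w, 1) for w in d.split())), sum)
# counts["big"] == 2, counts["data"] == 2
```

Because each mapper call is independent and each reducer sees only one key's values, both phases parallelize trivially, which is exactly what the cluster exploits.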
29. Use-cases
30. Use-case 1: Search and analysis on tweets
¢ Even getting them is not quite trivial
¢ Example: TREC Microblog track
£ 16M tweets
˜ Published as IDs only
˜ Default HTML download option lacks metadata (geo data, original tweet when retweeted, reply-to, etc.)
˜ JSON format has all the beautiful stuff
£ HTML crawling vs getting the JSON objects
˜ JSON download limited to 150 tweets per hour per IP address
™ On a single machine: more than 12 years
™ 884 nodes running for close to a week
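The crawl-time figures above follow from simple arithmetic; the 150-per-hour rate limit and the 884-node count are from the slide.

```python
tweets = 16_000_000
rate_per_hour = 150                      # JSON download limit per IP
hours_single = tweets / rate_per_hour    # ~106,667 hours
years_single = hours_single / (24 * 365)
print(round(years_single, 1))            # prints 12.2 ("more than 12 years")

nodes = 884
days_cluster = hours_single / nodes / 24
print(round(days_cluster, 1))            # prints 5.0 ("close to a week")
```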
31. And once you have millions of tweets…
¢ Text analytics on twitter streams
£ Information extraction, sentiment analysis, …
£ Given an entity (company, product, …), what is being said
about it?
[Example tweet: “Obama almost 15mins late... wonder if he's watching college hoops. Less than 2mins left in Texas Oakland game #NCAA #Marc ...”]
32. And once you have millions of tweets…
¢ Text analytics on twitter streams
£ Information extraction, sentiment analysis, …
£ Given an entity (company, product, …), what is being said
about it?
Which aspects?
Which attitudes?
£ Extract triples: X–R–Y
£ Dependency parsing
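A minimal sketch of X–R–Y triple extraction from a dependency parse; the parse tuples are hand-written stand-ins for real parser output, with `nsubj`/`dobj` labels as in common dependency schemes.

```python
def extract_triples(parse):
    """parse: list of (token, dep_label, head_index) tuples, one per token.
    Returns (subject, relation, object) triples for simple SVO clauses."""
    triples = []
    for token, dep, head in parse:
        if dep == "nsubj":
            verb = parse[head][0]
            # pair the subject with every direct object of the same verb
            objs = [t for t, d, h in parse if d == "dobj" and h == head]
            for obj in objs:
                triples.append((token, verb, obj))
    return triples

# "Google acquires Freebase": hand-built parse, verb at index 1
parse = [("Google", "nsubj", 1), ("acquires", "ROOT", 1), ("Freebase", "dobj", 1)]
# → [("Google", "acquires", "Freebase")]
```

Real tweets need much more (normalization, passives, negation, sentiment on the relation), but the triple shape is the same.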
34. Some numbers
¢ Data
£ ~10% public English tweets in 2010
£ ~250M tweets
¢ Performance
£ Single machine (1 dual-core, 2.2GHz, 3GB RAM)
˜ ~2 years
£ SARA Hadoop cluster (20 nodes × dual-core, 2.6GHz, 16GB RAM)
˜ ~30 days
£ DAS4 Hadoop cluster (36 nodes × dual quad-core, 2.4GHz, 24GB RAM)
˜ ~1 day
35. Intermezzo: The-Web-as-a-corpus
¢ Web retrieval
£ TREC Web track – ClueWeb09
˜ 1,040,809,705 web pages, in 10 languages
˜ 25TB uncompressed
¢ Parse TBs of web data
£ SARA Hadoop
£ cloud9/Ivory(/Elasticsearch/SOLR/Lucene)
£ POS, DEP, entities
£ easy peasy
36. Use-case 2: Temporal patterns for IR
¢ Temporal relevance?
¢ Relevant documents
£ query: ‘grammys’
£ time (in days) along the x-axis
£ nr. of judged relevant documents along the y-axis
¢ Value: detect “temporal” queries
[Figure: “(a) Relevant documents”; temporal distributions for the query, from work on “Using Bursts for Query Modeling”]
37. Use-case 2: Temporal patterns for IR
¢ “Term lifespan” plot
£ time (in days) along the x-axis
£ terms along the y-axis
£ every dot represents an occurrence on that day
£ terms are ordered by their first occurrence
£ webpages on allrecipes.com
[Background: excerpt from a study of web-page content change (“knot point” analysis of how quickly page content stabilizes); Figure 5: term lifespan plots for several pages]
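The term-lifespan plot boils down to a small aggregation: for each term, the days on which it occurs, with terms ordered by first occurrence. A sketch with invented daily token lists:

```python
def term_lifespans(daily_tokens):
    """daily_tokens: dict mapping day number -> list of terms seen that day.
    Returns terms ordered by first occurrence, each with its occurrence
    days (one dot per day in the plot)."""
    first_seen, days = {}, {}
    for day in sorted(daily_tokens):
        for term in daily_tokens[day]:
            first_seen.setdefault(term, day)
            days.setdefault(term, set()).add(day)
    return [(t, sorted(days[t])) for t in sorted(first_seen, key=first_seen.get)]

daily = {0: ["recipe", "apple"], 1: ["recipe"], 2: ["recipe", "pumpkin"]}
# → [("recipe", [0, 1, 2]), ("apple", [0]), ("pumpkin", [2])]
```

Each (term, days) row becomes one horizontal line of dots in the plot; per-page aggregations like this parallelize naturally across pages.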
38. Or from Wikipedia access logs...
¢ 1 year = ~555GB of raw Wikipedia logs
£ filter
£ aggregate
£ link
£ visualize
¢ Inherently parallelizable
39. Or from Wikipedia access logs...
¢ 1 year = ~555GB of raw Wikipedia logs
£ filter
£ aggregate
£ link
£ visualize
¢ Inherently parallelizable
[Background: raw pagecounts log lines (language code, URL-encoded article title, view count), here for various Christmas-related article titles]
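The filter and aggregate steps over raw pagecounts lines can be sketched as follows; the sample lines are invented, though they follow the pagecounts format (project code, URL-encoded title, view count, bytes transferred).

```python
from collections import Counter
from urllib.parse import unquote

def aggregate_pagecounts(lines, language="en"):
    """Filter one language out of raw pagecounts lines (the filter step)
    and sum views per decoded title (the aggregate step)."""
    views = Counter()
    for line in lines:
        lang, title, count, _size = line.split()
        if lang == language:
            views[unquote(title)] += int(count)
    return views

sample = [
    "en Christmas%20Carol 3 7130",
    "en Christmas%20Carol 2 7130",
    "de Weihnachten 5 900",
]
views = aggregate_pagecounts(sample)
# views["Christmas Carol"] == 5; the German line is filtered out
```

Since lines are independent, this runs unchanged as a mapper over file splits, with the Counter merge as the reduce.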
40. Use-case 3: Mining user edits on Wikipedia
¢ As a social signal …
¢ As a language resource …
£ Target: User edits, i.e., textual differences between revisions of the same document
£ Objective: Distinguish between factual edits (alter the meaning) and fluency edits (address style or readability)
£ Dataset: Full revision history of the English Wikipedia
41. The data
¢ Average of 3.5 to 4 million revisions per month
£ English Wikipedia, August 2006 to August 2011
£ Each revision may contain multiple edits (many are
irrelevant)
£ 342GB of compressed text (snapshot of 15/01/2011)
42. What to do?
¢ A lot of pre-processing
£ Filtering out irrelevant revisions
£ Parsing wiki markup
£ Word tokenization
£ Sentence splitting
£ Computing textual diff between revisions
£ Indexing user edits at sentence level and across sentence boundaries
£ Computing classification features per user edit
¢ And then
£ Execution: 15 nodes, each processing a data stream
£ Average of 2-3 days per node
¢ Outcome: 6.3 million textual diff segments, 4.3 million user edits
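The textual-diff step can be sketched with Python's standard difflib; the revision sentences are invented, and classifying each edit as factual vs. fluency would be a separate step on top of these segments.

```python
import difflib

def revision_diff(old_tokens, new_tokens):
    """Return user edits as (operation, old_span, new_span) segments,
    skipping the stretches where the two revisions agree."""
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    return [(op, old_tokens[i1:i2], new_tokens[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

old = "the film was released in 1994".split()
new = "the film was first released in 1995".split()
edits = revision_diff(old, new)
# → [("insert", [], ["first"]), ("replace", ["1994"], ["1995"])]
```

Here the year change would be a factual edit and the inserted "first" a fluency edit, which is exactly the distinction the classifier has to learn.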
44. What’s next?
45. Real-time semantic analysis
¢ Example: reputation management
¢ Follow twitter stream
£ Am I being mentioned?
£ What are they saying about me?
£ Is this potentially damaging?
¢ Why a challenge?
£ Ambiguity
£ Noise
£ “I need to know now!”
¢ Big data
46. Extreme personalisation
¢ “Zero click”, “zero query”
¢ Tell me what I should know
£ Summarize a few million documents
£ Show a semantically meaningful result on my screen
¢ Big data
47. Social search
¢ Socially improved search
£ General search, personalized search
£ Thousands of users of social networks actively share content, attitudes, opinions, and experiences
£ Use this to “push content”
£ Return results that you care about, with a broad “subjective context”
¢ Big data
48. Thanks!
¢ Edgar Meij
£ http://edgar.meij.pro
£ edgar.meij@uva.nl
£ @edgarmeij