Modern web search engines are making increasing use of signals other than mere textual statistics. While documents used to be matched to keyword queries based on term counting alone, modern information retrieval systems incorporate and learn from a large number of features pertaining to the query, user, documents, entities, sessions, etc. In particular, a document ranking generated by a web search engine involves combining signals from rich representations of users (including their location, browser, device, profile, history, etc.), semantics (ranging from simple spell-checking to recognizing entities), popularity, social networking, and more. All of these features need to be computed at an increasingly large scale and call for Big Data storage and analytics methods. In this talk I will give some examples of current IR research being done at the University of Amsterdam, leaning heavily on MapReduce and related programming paradigms.
See http://www.nlhug.org/events/56584462/.
2. Joint work with Amit Bronner, Hendrike Peetz, Wouter Weerkamp, Anne Schuth, Maarten de Rijke
Large-scale Data Processing for IR 2
3.–7. [Topic map of research areas at the University of Amsterdam, with different areas highlighted on each slide: information retrieval, big data, multi-lingual information access, machine translation, theory and models, evaluation methodology, text mining, intelligent information services, political information, storytelling, human-computer information retrieval, knowledge representation & reasoning, information integration, exploratory search, foundations of XML, semantic search, real-time analytics, social signal analysis, synchronized content, multi-modal summaries, open data]
8. Me
¢ Information retrieval (~ search engines)
¢ Semantic search/annotations
¢ Use knowledge bases (Wikipedia, Freebase, etc.)
£ as a primary information source for search, or
£ as a complement to traditional retrieval
10. Search engines – a bird’s eye view
¢ Main ingredient: Counting words
£ Query ~ distribution over words
£ Document ~ distribution over words
£ Ranking ~ comparing distributions
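The "comparing distributions" view can be sketched as a toy unigram query-likelihood ranker; the documents, query, and smoothing weight below are my own illustration, not from the talk.

```python
import math
from collections import Counter

def query_likelihood(query, doc_tokens, collection, lam=0.5):
    """Score a document by log P(query | document) under a unigram
    language model, linearly smoothed against the whole collection."""
    doc = Counter(doc_tokens)
    coll = Counter(w for d in collection for w in d)
    coll_total = sum(coll.values())
    score = 0.0
    for w in query.split():
        p_doc = doc[w] / len(doc_tokens) if doc_tokens else 0.0
        p_coll = coll[w] / coll_total
        # interpolate document and collection distributions
        score += math.log((1 - lam) * p_doc + lam * p_coll + 1e-12)
    return score

docs = [
    "hurricane threats to the southern united states".split(),
    "recipe for a good apple pie".split(),
]
ranked = sorted(docs, key=lambda d: query_likelihood("hurricane states", d, docs),
                reverse=True)
# ranked[0] is the hurricane document
```

Ranking then amounts to sorting documents by how well their word distribution explains the query's.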
12. [Example: a news snippet, “Forecasters are watching tropical storms that could pose hurricane threats to the southern United States. One is a downgraded …”, shown with its word distribution: hurricane, tropical, forecast, wind, weather, fun, home]
13. Search engines – a bit of history
¢ Anno 1995
£ Counting words (only)...
£ Stopwords
£ Linguistic normalization
14. Search engines – a bit of history
¢ Anno 2000: 2nd generation
£ Link structure
˜ Anchor text
˜ PageRank
£ Document structure
˜ title, top/bottom, etc.
˜ boilerplate
£ Click-through data
15. Search engines – a bit of history
¢ Anno now
£ Real-time indexing/search
£ Increasingly personalized
£ Increasingly social
£ Apply “observations” of human behavior to improve and to evaluate
˜ Search behavior, click behavior, dwell time, reading time, …, other things that are happening in the world
£ Rich signals
16. Signals
¢ Users/Personalisation
£ group: country, region, language, device, browser, etc.
£ individual: profile, history, sessions, etc.
¢ Linguistics (e.g., spell-checking)
¢ Semantics (e.g., entities)
¢ Popularity (e.g., PageRank)
¢ Social (e.g., G+)
¢ And more...
£ readability, relevance assessments, clicks, etc.
[Inset slide “Why ‘learning to rank’?” (KH&MdR, U. Amsterdam, Advanced Information Retrieval): more and more features are found to be useful for ranking documents; how should we combine these? Image: http://www.flickr.com/photos/sameli/540933604/]
17. Applying signals
¢ Typically at query time...
£ Leaning heavily on machine learning
¢ Not the focus here...
[Same “Why ‘learning to rank’?” inset as on slide 16]
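The combination question on the inset (“how should we combine these?”) is what learning to rank answers. A minimal pointwise sketch with invented features and labels, not any production ranker:

```python
def train_linear_ranker(examples, lr=0.1, epochs=200):
    """examples: list of (feature_vector, relevance_label) pairs.
    Fits a linear scoring function with plain SGD on squared error."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for feats, label in examples:
            pred = sum(wi * fi for wi, fi in zip(w, feats))
            err = pred - label
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
    return w

# invented features: [term-match score, popularity, click-through rate]
examples = [
    ([0.9, 0.2, 0.8], 1.0),  # relevant
    ([0.8, 0.1, 0.7], 1.0),  # relevant
    ([0.2, 0.9, 0.1], 0.0),  # popular but not relevant
    ([0.1, 0.3, 0.2], 0.0),  # not relevant
]
w = train_linear_ranker(examples)
score = lambda feats: sum(wi * fi for wi, fi in zip(w, feats))
```

The learned weights combine the signals into a single ranking score; real systems use far richer models, but the idea is the same.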
19. What generates (non-monetary) value?
¢ What is value?
£ Better/Richer UX
˜ Clever term/phrase suggestions
˜ Clever, rich snippets
£ Finding what you need faster/better/...
˜ Homing in on what you want to find
˜ Task/Problem solving
£ and more...
20. For instance...
[Screenshot: query “good camera under 300 euro”]
27. So, where else do you get value from?
¢ Improving signals...
£ Richer/Better/More focused signals
˜ Richer data/better extraction/...
˜ "Google acquires Freebase"
¢ ... or the application thereof
£ Algorithmic innovations
£ Training data
˜ Logs (queries, clicks, ...) – from toolbars, redirects, etc.
˜ Relevance assessments – manual, professionals, mechanical turk, etc.
¢ "More intelligent systems"
28. Intelligence?
¢ Need analysis of (large quantities of) data
£ Typically, "transformations"
˜ graphs (PageRank, FriendRank)
˜ text => structure
˜ aggregations
˜ etc.
¢ Then, aggregate analyses to obtain "value"
£ count/sum/min/max/avg/etc.
¢ Hadoop!
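The transform-then-aggregate pattern that Hadoop supports can be sketched in plain Python; this is a toy word count standing in for the count/sum/min/max/avg aggregations above, not actual cluster code.

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents, mapper):
    """Apply the transformation (mapper) to each record independently."""
    return chain.from_iterable(mapper(doc) for doc in documents)

def reduce_phase(pairs, reducer):
    """Group (key, value) pairs by key, then aggregate each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# mapper: text => (word, 1) pairs; reducer: sum
docs = ["big data big value", "value from data"]
counts = reduce_phase(map_phase(docs, lambda d: ((w, 1) for w in d.split())), sum)
# counts["big"] == 2, counts["data"] == 2
```

Because each mapper call is independent and each reducer sees only one key's values, both phases parallelize trivially, which is exactly what the cluster exploits.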
29. Use-cases
30. Use-case 1: Search and analysis on tweets
¢ Even getting them is not quite trivial
¢ Example: TREC Microblog track
£ 16M tweets
˜ Published as IDs only
˜ Default HTML download option lacks metadata (geo data, original tweet when retweeted, reply-to, etc.)
˜ JSON format has all the beautiful stuff
£ HTML crawling vs getting the JSON objects
˜ JSON download limited to 150 tweets per hour per IP address
™ On a single machine: more than 12 years
™ 884 nodes running for close to a week
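The crawl-time figures above follow from simple arithmetic; the 150-per-hour rate limit and the 884-node count are from the slide.

```python
tweets = 16_000_000
rate_per_hour = 150                      # JSON download limit per IP
hours_single = tweets / rate_per_hour    # ~106,667 hours
years_single = hours_single / (24 * 365)
print(round(years_single, 1))            # prints 12.2 ("more than 12 years")

nodes = 884
days_cluster = hours_single / nodes / 24
print(round(days_cluster, 1))            # prints 5.0 ("close to a week")
```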
31. And once you have millions of tweets…
¢ Text analytics on twitter streams
£ Information extraction, sentiment analysis, …
£ Given an entity (company, product, …), what is being said
about it?
[Example tweet: “Obama almost 15mins late... wonder if he's watching college hoops. Less than 2mins left in Texas Oakland game #NCAA #Marc ...”]
32. And once you have millions of tweets…
¢ Text analytics on twitter streams
£ Information extraction, sentiment analysis, …
£ Given an entity (company, product, …), what is being said
about it?
Which aspects?
Which attitudes?
£ Extract triples: X–R–Y
£ Dependency parsing
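A minimal sketch of X–R–Y triple extraction from a dependency parse; the parse tuples are hand-written stand-ins for real parser output, with `nsubj`/`dobj` labels as in common dependency schemes.

```python
def extract_triples(parse):
    """parse: list of (token, dep_label, head_index) tuples, one per token.
    Returns (subject, relation, object) triples for simple SVO clauses."""
    triples = []
    for token, dep, head in parse:
        if dep == "nsubj":
            verb = parse[head][0]
            # pair the subject with every direct object of the same verb
            objs = [t for t, d, h in parse if d == "dobj" and h == head]
            for obj in objs:
                triples.append((token, verb, obj))
    return triples

# "Google acquires Freebase": hand-built parse, verb at index 1
parse = [("Google", "nsubj", 1), ("acquires", "ROOT", 1), ("Freebase", "dobj", 1)]
# → [("Google", "acquires", "Freebase")]
```

Real tweets need much more (normalization, passives, negation, sentiment on the relation), but the triple shape is the same.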
34. Some numbers
¢ Data
£ ~10% public English tweets in 2010
£ ~250M tweets
¢ Performance
£ Single machine (1 dual-core, 2.2GHz, 3GB RAM)
˜ ~2 years
£ SARA Hadoop cluster (20 nodes × dual-core, 2.6GHz, 16GB RAM)
˜ ~30 days
£ DAS4 Hadoop cluster (36 nodes × dual quad-core, 2.4GHz, 24GB RAM)
˜ ~1 day
35. Intermezzo: The-Web-as-a-corpus
¢ Web retrieval
£ TREC Web track – ClueWeb09
˜ 1,040,809,705 web pages, in 10 languages
˜ 25TB uncompressed
¢ Parse TBs of web data
£ SARA Hadoop
£ cloud9/Ivory(/Elasticsearch/SOLR/Lucene)
£ POS, DEP, entities
£ easy peasy
36. Use-case 2: Temporal patterns for IR
¢ Temporal relevance?
¢ Relevant documents
£ query: ‘grammys’
£ time (in days) along the x-axis
£ nr. of judged relevant documents along the y-axis
¢ Value: detect “temporal” queries
[Figure: “(a) Relevant documents”; temporal distributions for the query, from work on “Using Bursts for Query Modeling”]
37. Use-case 2: Temporal patterns for IR
¢ “Term lifespan” plot
£ time (in days) along the x-axis
£ terms along the y-axis
£ every dot represents an occurrence on that day
£ terms are ordered by their first occurrence
£ webpages on allrecipes.com
[Background: excerpt from a study of web-page content change (“knot point” analysis of how quickly page content stabilizes); Figure 5: term lifespan plots for several pages]
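The term-lifespan plot boils down to a small aggregation: for each term, the days on which it occurs, with terms ordered by first occurrence. A sketch with invented daily token lists:

```python
def term_lifespans(daily_tokens):
    """daily_tokens: dict mapping day number -> list of terms seen that day.
    Returns terms ordered by first occurrence, each with its occurrence
    days (one dot per day in the plot)."""
    first_seen, days = {}, {}
    for day in sorted(daily_tokens):
        for term in daily_tokens[day]:
            first_seen.setdefault(term, day)
            days.setdefault(term, set()).add(day)
    return [(t, sorted(days[t])) for t in sorted(first_seen, key=first_seen.get)]

daily = {0: ["recipe", "apple"], 1: ["recipe"], 2: ["recipe", "pumpkin"]}
# → [("recipe", [0, 1, 2]), ("apple", [0]), ("pumpkin", [2])]
```

Each (term, days) row becomes one horizontal line of dots in the plot; per-page aggregations like this parallelize naturally across pages.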
38. Or from Wikipedia access logs...
¢ 1 year = ~555GB of raw Wikipedia logs
£ filter
£ aggregate
£ link
£ visualize
¢ Inherently parallelizable
39. Or from Wikipedia access logs...
¢ 1 year = ~555GB of raw Wikipedia logs
£ filter
£ aggregate
£ link
£ visualize
¢ Inherently parallelizable
[Background: raw pagecounts log lines (language code, URL-encoded article title, view count), here for various Christmas-related article titles]
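The filter and aggregate steps over raw pagecounts lines can be sketched as follows; the sample lines are invented, though they follow the pagecounts format (project code, URL-encoded title, view count, bytes transferred).

```python
from collections import Counter
from urllib.parse import unquote

def aggregate_pagecounts(lines, language="en"):
    """Filter one language out of raw pagecounts lines (the filter step)
    and sum views per decoded title (the aggregate step)."""
    views = Counter()
    for line in lines:
        lang, title, count, _size = line.split()
        if lang == language:
            views[unquote(title)] += int(count)
    return views

sample = [
    "en Christmas%20Carol 3 7130",
    "en Christmas%20Carol 2 7130",
    "de Weihnachten 5 900",
]
views = aggregate_pagecounts(sample)
# views["Christmas Carol"] == 5; the German line is filtered out
```

Since lines are independent, this runs unchanged as a mapper over file splits, with the Counter merge as the reduce.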
40. Use-case 3: Mining user edits on Wikipedia
¢ As a social signal …
¢ As a language resource …
£ Target: User edits, i.e., textual differences between revisions of the same document
£ Objective: Distinguish between factual edits (alter the meaning) and fluency edits (address style or readability)
£ Dataset: Full revision history of the English Wikipedia
41. The data
¢ Average of 3.5 to 4 million revisions per month
£ English Wikipedia, August 2006 to August 2011
£ Each revision may contain multiple edits (many are
irrelevant)
£ 342GB of compressed text (snapshot of 15/01/2011)
42. What to do?
¢ A lot of pre-processing
£ Filtering out irrelevant revisions
£ Parsing wiki markup
£ Word tokenization
£ Sentence splitting
£ Computing textual diff between revisions
£ Indexing user edits at sentence level and across sentence boundaries
£ Computing classification features per user edit
¢ And then
£ Execution: 15 nodes, each processing a data stream
£ Average of 2-3 days per node
¢ Outcome: 6.3 million textual diff segments, 4.3 million user edits
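The textual-diff step can be sketched with Python's standard difflib; the revision sentences are invented, and classifying each edit as factual vs. fluency would be a separate step on top of these segments.

```python
import difflib

def revision_diff(old_tokens, new_tokens):
    """Return user edits as (operation, old_span, new_span) segments,
    skipping the stretches where the two revisions agree."""
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    return [(op, old_tokens[i1:i2], new_tokens[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

old = "the film was released in 1994".split()
new = "the film was first released in 1995".split()
edits = revision_diff(old, new)
# → [("insert", [], ["first"]), ("replace", ["1994"], ["1995"])]
```

Here the year change would be a factual edit and the inserted "first" a fluency edit, which is exactly the distinction the classifier has to learn.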
44. What’s next?
45. Real-time semantic analysis
¢ Example: reputation management
¢ Follow twitter stream
£ Am I being mentioned?
£ What are they saying about me?
£ Is this potentially damaging?
¢ Why a challenge?
£ Ambiguity
£ Noise
£ “I need to know now!”
¢ Big data
46. Extreme personalisation
¢ “Zero click”, “zero query”
¢ Tell me what I should know
£ Summarize a few million documents
£ Show a semantically meaningful result on my screen
¢ Big data
47. Social search
¢ Socially improved search
£ General search, personalized search
£ Thousands of users of social networks actively share content, attitudes, opinions, and experiences
£ Use this to “push content”
£ Return results that you care about, with a broad “subjective context”
¢ Big data
48. Thanks!
¢ Edgar Meij
£ http://edgar.meij.pro
£ edgar.meij@uva.nl
£ @edgarmeij