Beyond document retrieval using semantic annotations
Roi Blanco (roi@yahoo-inc.com) 
http://labs.yahoo.com/Yahoo_Labs_Barcelona
Yahoo! Research Barcelona 
• Established January, 2006 
• Led by Ricardo Baeza-Yates 
• Research areas 
• Web Mining 
• Social Media 
• Distributed Web retrieval 
• Semantic Search 
• NLP and Semantics
Contributions 
• Hugo Zaragoza
• Michael Matthews
• Jordi Atserias
• Roi Blanco
• Peter Mika
• Sebastiano Vigna (U. Milan), Paolo Boldi: indexing (MG4J)
• “Every time I fire a linguist my performance goes up…” (Fred Jelinek)
Great strategy until you’ve fired them all… but what then?
Agenda 
• Search: this was then, this is now 
• Natural Language processing and search 
• Semantic Search 
• Search over annotated documents 
• Time Explorer
Natural Language Retrieval 
• How to exploit the structure and meaning of 
natural language text to improve search 
• Current search engines perform only limited NLP 
(tokenization, stemming) 
• Automated tools exist for deeper analysis 
• Applications to diversity-aware search 
• Source, Location, Time, Language, Opinion, 
Ranking… 
• Search over semi-structured data, semantic 
search 
• Roll-out user experiences that use higher layers 
of the NLP stack
WEB SEARCH
Structured data - Web search 
Top-1 entity with 
structured data 
Related entities 
Structured data 
extracted from HTML
New devices 
• Different interaction (e.g. voice) 
• Different capabilities (e.g. display) 
• More Information (geo-localization) 
• More personalized
Yahoo! Axis 
• Smarter, Faster Search: instant answers, visual previews, infinite browsing
• Connected Experience: across devices (iPhone, iPad) and browsers (Firefox, Safari, Internet Explorer, Chrome)
• Personalized Home Page: sign in with Yahoo!, Google, or Facebook; direct access to favorite sites, saved articles and bookmarks
SEMANTIC SEARCH
Semantic Search 
• What different kinds of search and 
applications beyond string matching or 
returning 10 blue links? 
• Can we have a better understanding of 
documents and queries? 
• New devices open new possibilities, new 
experiences 
• Is current technology in natural language 
understanding mature enough?
Semantic Search (II) 
• Matching the user’s query with the Web’s content at a 
conceptual level, often with the help of world knowledge 
– Natural Language Search 
• Exploiting the (implicit) structure and semantics of natural language 
• Intersection of IR and NLP 
– Semantic Web Search 
• Exploiting the (explicit) meaning of data 
• Intersection of IR and Semantic Web 
• As a field 
– ISWC/ESWC/ASWC, WWW, SIGIR, VLDB, CIKM 
– Exploring Semantic Annotations in Information Retrieval 
(ECIR08, WSDM09) 
– Semantic Search Workshop (ESWC08, WWW09, WWW10) 
– Future of Web Search: Semantic Search (FoWS09)
State of search 
• “We are at the beginning of search.” (Marissa Mayer)
• Old battles are won 
– Marginal returns on investments in crawling, indexing, 
ranking 
– Solved large classes of queries (e.g. navigational) 
– Lots of tempting, but high-hanging fruit
• Currently, the biggest bottlenecks in IR are not
computational, but in modeling user cognition
– If only we could find a computationally expensive way to 
solve the problem… 
• In particular, solving queries that require a deeper understanding of the 
query, the content and/or the world at large 
– Corollary : go beyond string matching!
Some examples… 
• Ambiguous searches 
– paris hilton 
• Multimedia search 
– paris hilton sexy 
• Imprecise or overly precise searches 
– jim hendler 
– pictures of strong adventures people 
• Searches for descriptions 
– 33 year old computer scientist living in barcelona 
– reliable digital camera under 300 dollars 
• Searches that require aggregation 
– height eiffel tower 
– harry potter movie review 
– world temperature 2020
Is NLU that complex? 
“A child of five would understand this.
Send someone to fetch a child of five.”
Groucho Marx
Language is Ambiguous 
The man saw the girl with the telescope
Paraphrases 
• ‘This parrot is dead’
• ‘This parrot has kicked the bucket’
• ‘This parrot has passed away’
• ‘This parrot is no more’
• ‘He's expired and gone to meet his maker’
• ‘His metabolic processes are now history’
Not just search…
Semantics at every step of the IR process
[Figure: a user query (“bla bla bla?”) and the Web feed the IR engine, with semantics applied at each step]
• Query interpretation (q=“bla” * 3)
• Document processing
• Indexing
• Ranking, θ(q,d)
• Result presentation
Understanding Queries 
• Query logs are a big source of information & 
knowledge 
– To rank the results better (what you click)
– To understand queries better
– E.g. “Paris” → “Paris Flights”, “Paris” → “Paris Hilton”
“Understand” Documents 
NLU: still an open issue
NLP for IR 
• Full NLU is AI-complete and does not scale to the size of
the Web (parsing the Web is really hard).
• BUT … what about other shallow NLP techniques? 
• Hypothesis/Requirements: 
• Linear extraction/parsing time 
• Error-prone output (e.g. 60-90%) 
• Highly redundant information 
• Explore new ways of browsing 
• Support your answers
Usability 
We also fail at using the technology sometimes
Support your answers 
Errors happen: choose the right ones! 
• Humans need to “verify” unknown facts 
• Multiple sources of evidence 
• Common sense vs. Contradictions 
• Are you sure? Is this spam? Interesting!
• Tolerance to errors greatly increases if users can 
verify things fast 
• Importance of snippets, image search 
• Often the context is as important as the fact 
• E.g. “S discovered the transistor in X” 
• There are different kinds of errors 
• Ridiculous result (decreases overall confidence in system) 
• Reasonably wrong result (makes us feel good)
SEARCH OVER ANNOTATED DOCUMENTS
Annotated documents 
Barack Obama visited Tokyo this Monday as part of an extended Asian trip.
He is expected to deliver a speech at the ASEAN conference next Tuesday
Extracted dates: 20 May 2009; 28 May 2009
How does it work? 
• Query: “Monty Python”
• Inverted Index (sentence/doc level)
• Forward Index (entity level)
• Returned entities: Flying Circus, John Cleese, Brian
Efficient element retrieval 
• Goal 
– Given an ad-hoc query, return a list of documents and 
annotations ranked according to their relevance to the query 
• Simple Solution 
– For each document that matches the query, retrieve its 
annotations and return the ones with the highest counts 
• Problems 
– If there are many documents in the result set this will take too 
long - too many disk seeks, too much data to search through 
– What if counting isn’t the best method for ranking elements? 
• Solution 
– Special compressed data structures designed specifically for 
annotation retrieval
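The "simple solution" above can be sketched in a few lines. This is only an illustration over a tiny in-memory collection: the `DOCS` data and the function names are assumptions made for the example, not part of the system described in the slides.

```python
from collections import Counter

# Hypothetical toy collection: doc id -> (text, annotations found in that doc).
DOCS = {
    1: ("monty python flying circus sketch", ["Flying Circus", "John Cleese"]),
    2: ("monty python life of brian", ["Brian", "John Cleese"]),
    3: ("python programming language", ["Software"]),
}

def matching_docs(query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    return [doc_id for doc_id, (text, _) in DOCS.items()
            if all(t in text.split() for t in terms)]

def top_annotations(query, k=3):
    """Count annotations over all matching documents; return the k most frequent."""
    counts = Counter()
    for doc_id in matching_docs(query):
        counts.update(DOCS[doc_id][1])   # one lookup per matching document
    return counts.most_common(k)

print(top_annotations("monty python"))
```

Every matching document costs one annotation lookup, which is where the "too many disk seeks" problem comes from once the result set is large.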
Forward Index 
• Access metadata and document contents 
– Length, terms, annotations 
• Compressed (in memory) forward indexes 
– Gamma, Delta, Nibble, Zeta codes (power laws) 
• Retrieving and scoring annotations 
– Sort terms by frequency 
• Random access using an extra compressed 
pointer list (Elias-Fano)
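As an illustration of one of the codes listed above, here is a minimal sketch of Elias gamma coding over bit strings. It assumes that identifiers are numbered by decreasing frequency, so that frequent ones get small numbers and short codes; that reading of "sort terms by frequency" is an assumption. A real index would pack bits rather than use Python strings.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]                       # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then binary

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # count the unary prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

# Hypothetical annotation ids for one document (small id = frequent annotation).
ids = [1, 1, 2, 1, 5, 3]
encoded = "".join(gamma_encode(x) for x in ids)
assert gamma_decode_all(encoded) == ids
print(encoded, len(encoded), "bits")
```

Because the codes have variable length, jumping straight to a given document's record needs a separate list of bit offsets, which is the role the slide assigns to the Elias-Fano pointer list.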
Parallel Indexes 
• Standard index contains only tokens 
• Parallel indices contain annotations on the tokens – the 
annotation indices must be aligned with main token index 
• For example: given the sentence “New York has great 
pizza” where New York has been annotated as a LOCATION 
– The token index has five entries
(“new”, “york”, “has”, “great”, “pizza”)
– The annotation index has five entries
(“LOC”, “LOC”, “O”, “O”, “O”)
– Can optionally encode BIO format (e.g. LOC-B, LOC-I)
• To search for the New York location entity, we search for: 
“token:New ^ entity:LOC token:York ^ entity:LOC”
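The alignment between the two indices can be shown with two parallel lists. This sketch only demonstrates the idea for a single sentence; the function name `find_entity_phrase` is hypothetical and this is not the MG4J query syntax.

```python
# Token layer and annotation layer, aligned position by position.
tokens = ["new", "york", "has", "great", "pizza"]
entities = ["LOC", "LOC", "O", "O", "O"]

def find_entity_phrase(tokens, entities, phrase, label):
    """Return start positions where `phrase` occurs and every token carries `label`."""
    n = len(phrase)
    hits = []
    for start in range(len(tokens) - n + 1):
        same_tokens = tokens[start:start + n] == phrase
        same_labels = all(e == label for e in entities[start:start + n])
        if same_tokens and same_labels:
            hits.append(start)
    return hits

# "New York" as a LOCATION: both layers must agree at the same positions.
print(find_entity_phrase(tokens, entities, ["new", "york"], "LOC"))
```

The match depends on both layers sharing the same position numbering, which is why the annotation indices must stay aligned with the main token index.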
Parallel Indices (II) 
Doc #3: The last time Peter exercised was in the XXth century. 
Doc #5: Hope claims that in 1994 she run to Peter Town. 
Token index:
Peter → D3:4, D5:9
Town → D5:10
Hope → D5:1
1994 → D5:5
…
Annotation index:
WSJ:PERSON → D3:4, D5:1
WSJ:CITY → D5:9
WNS:V_DATE → D5:5
Possible Queries:
“Peter AND run”
“Peter AND WNS:N_DATE”
“(WSJ:CITY ^ *) AND run”
“(WSJ:PERSON ^ Hope) AND run”
(Bracketing can also be dealt with)
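A minimal sketch of how such queries could be evaluated over positional postings, using the two example documents. Positions are 1-based and computed from the sentences themselves; the annotation positions and the helper names (`docs_with`, `aligned`) are assumptions for illustration, and `aligned` is only a simplified stand-in for the `^` operator.

```python
from collections import defaultdict

TOKENS = {
    3: "The last time Peter exercised was in the XXth century".split(),
    5: "Hope claims that in 1994 she run to Peter Town".split(),
}
# Hypothetical annotation layer: doc id -> [(position, label)].
ANNOTATIONS = {
    3: [(4, "WSJ:PERSON")],
    5: [(1, "WSJ:PERSON"), (5, "WNS:V_DATE"), (9, "WSJ:CITY")],
}

# One posting structure for both layers: key -> {doc id -> set of positions}.
index = defaultdict(lambda: defaultdict(set))
for doc_id, words in TOKENS.items():
    for pos, word in enumerate(words, start=1):
        index[word][doc_id].add(pos)
for doc_id, labels in ANNOTATIONS.items():
    for pos, label in labels:
        index[label][doc_id].add(pos)

def docs_with(key):
    """Documents in which a token or an annotation label occurs."""
    return set(index.get(key, {}))

def aligned(label, token=None):
    """Documents where `label` occurs; with `token`, both must share a position."""
    result = set()
    for doc_id, positions in index.get(label, {}).items():
        if token is None or positions & index.get(token, {}).get(doc_id, set()):
            result.add(doc_id)
    return result

print(docs_with("Peter") & docs_with("run"))              # Peter AND run
print(aligned("WSJ:CITY") & docs_with("run"))             # (WSJ:CITY ^ *) AND run
print(aligned("WSJ:PERSON", "Hope") & docs_with("run"))   # (WSJ:PERSON ^ Hope) AND run
print(aligned("WSJ:PERSON", "Peter"))                     # Peter as a person only
```

The last query shows what the alignment buys: "Peter" occurs in both documents, but only in Doc #3 is it annotated as a person.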
Pipelined Architecture
TIME EXPLORER
Time(ly) opportunities 
Can we create new user experiences based on a deeper analysis and 
exploration of the time dimension? 
• Goals: 
– Build an application that helps users to explore, 
interact and ultimately understand existing 
information about the past and the future. 
– Help the user cope with the information overload 
and eventually find/learn about what she’s looking 
for.
Original Idea 
• R. Baeza-Yates, Searching the Future, MF/IR 2005 
– On December 1st 2003, on Google News, there were more than 100K 
references to 2004 and beyond. 
– E.g. 2034: 
• The ownership of Dolphin Square in London must revert to an insurance company. 
• Voyager 2 should run out of fuel. 
• Long-term care facilities may have to house 2.1 million people in the USA. 
• A human base on the Moon would be in operation.
Time Explorer 
• Public demo since August 2010 
• Winner of HCIR NYT Challenge 
• Goal: explore news through time and into 
the future 
• Using a customized Web crawl from news 
and blog feeds 
• http://fbmya01.barcelonamedia.org:8080/future/
Time Explorer
Time Explorer - Motivation
• Time is important to search
• Recency, particularly in news, is highly related to relevance
• But what about evolution over time?
– How has a topic evolved over time?
– How did the entities (people, places, etc.) evolve with respect to the topic over time?
– How will this topic continue to evolve in the future?
– How do bias and sentiment in blogs and news change over time?
• Google Trends, Yahoo! Clues, RecordedFuture…
• Great research playground
• Open source!
Time Explorer
Analysis Pipeline
• Tokenization, sentence splitting, part-of-speech tagging, and chunking with OpenNLP
• Entity extraction with the SuperSense tagger
• Time expressions extracted with TimeML
– Explicit dates (August 23rd, 2008)
– Relative dates (next year, resolved with the publication date)
• Sentiment analysis with LivingKnowledge
• Ontology matching with Yago
• Image analysis – sentiment and face detection
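Resolving a relative date against the publication date can be sketched as follows. This covers only three year-level expressions and uses a hypothetical helper, `resolve_relative_year`; it is not the TimeML tooling used in the pipeline.

```python
from datetime import date

def resolve_relative_year(expression, pub_date):
    """Resolve a relative year expression against the article's publication date."""
    offsets = {"last year": -1, "this year": 0, "next year": 1}
    key = expression.strip().lower()
    if key not in offsets:
        return None                      # expression not handled by this sketch
    return pub_date.year + offsets[key]

# An article published on August 23rd, 2008 that mentions "next year".
print(resolve_relative_year("Next year", date(2008, 8, 23)))  # 2009
```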
Indexing/Search 
• Lucene/Solr search platform to index and search 
– Sentence level 
– Document level 
• Facets for annotations (multiple fields for faster 
entity-type access) 
• Index publication date and content date: extracted
dates if they exist, otherwise the publication date
• Solr faceting supports ranking the entities for a query
and aggregating counts over time
• Content date enables search into the future
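The content-date rule and the counts over time can be illustrated without a search server. This sketch uses plain Python dictionaries with made-up sentence records; the field names are assumptions, and it does not show the Solr faceting API.

```python
from collections import Counter
from datetime import date

def content_dates(sentence):
    """Extracted dates if any exist, otherwise the publication date."""
    return sentence["extracted_dates"] or [sentence["pub_date"]]

def counts_per_year(sentences):
    """Aggregate sentence counts by content-date year (a timeline facet)."""
    counts = Counter()
    for s in sentences:
        for d in content_dates(s):
            counts[d.year] += 1
    return dict(sorted(counts.items()))

def future_mentions(sentences):
    """Sentences whose content date lies after their publication date."""
    return [s for s in sentences
            if any(d > s["pub_date"] for d in content_dates(s))]

# Hypothetical records for three indexed sentences.
sentences = [
    {"pub_date": date(2010, 8, 1), "extracted_dates": [date(2034, 1, 1)]},
    {"pub_date": date(2010, 8, 1), "extracted_dates": []},
    {"pub_date": date(2010, 9, 5),
     "extracted_dates": [date(2034, 6, 1), date(2012, 1, 1)]},
]
print(counts_per_year(sentences))
print(len(future_mentions(sentences)), "sentences refer to the future")
```

Indexing the content date separately from the publication date is what lets a timeline extend past the day the article was written.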
Timeline
Timeline - Document
Facets
Timeline – Facet Trend
Timeline – Future
Opinions
Quotes
Other challenges 
– Large scale processing 
• Distributed computing 
• Shift from batch (Hadoop) to online (S4) 
– Efficient extraction/retrieval, algorithmic/data 
structures 
• Critical for interactive exploration 
– Connection with the user experience 
• Measures! User engagement? 
– Personalization 
– Integration with Knowledge Bases (Semantic Web) 
– Multilingual

More Related Content

What's hot

Making things findable
Making things findableMaking things findable
Making things findablePeter Mika
 
From the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upFrom the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upDavide Palmisano
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsPeter Mika
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the RisePeter Mika
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingOntotext
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in PracticePeter Mika
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialPeter Mika
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the WebPeter Mika
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at YahooPeter Mika
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Roi Blanco
 
Publishing and Using Linked Open Data - Day 1
Publishing and Using Linked Open Data - Day 1 Publishing and Using Linked Open Data - Day 1
Publishing and Using Linked Open Data - Day 1 Richard Urban
 
Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through EntitiesPeter Mika
 
A Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsA Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsArmin Haller
 
An Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic SearchAn Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic SearchDavid Amerland
 
Semantic Web applications for mobility and social interaction
Semantic Web applications for mobility and social interactionSemantic Web applications for mobility and social interaction
Semantic Web applications for mobility and social interactionAna Roxin
 
Linkator: enriching web pages by automatically adding dereferenceable semanti...
Linkator: enriching web pages by automatically adding dereferenceable semanti...Linkator: enriching web pages by automatically adding dereferenceable semanti...
Linkator: enriching web pages by automatically adding dereferenceable semanti...Samur Araujo
 
Efficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining ProcessEfficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining ProcessOntotext
 

What's hot (18)

Making things findable
Making things findableMaking things findable
Making things findable
 
From the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upFrom the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking up
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistants
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining Processing
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in Practice
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the Web
 
Semantic Search at Yahoo
Semantic Search at YahooSemantic Search at Yahoo
Semantic Search at Yahoo
 
Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement Influence of Timeline and Named-entity Components on User Engagement
Influence of Timeline and Named-entity Components on User Engagement
 
Publishing and Using Linked Open Data - Day 1
Publishing and Using Linked Open Data - Day 1 Publishing and Using Linked Open Data - Day 1
Publishing and Using Linked Open Data - Day 1
 
Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through Entities
 
A Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsA Semantic Data Model for Web Applications
A Semantic Data Model for Web Applications
 
An Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic SearchAn Introduction to Entities in Semantic Search
An Introduction to Entities in Semantic Search
 
Semantic Web applications for mobility and social interaction
Semantic Web applications for mobility and social interactionSemantic Web applications for mobility and social interaction
Semantic Web applications for mobility and social interaction
 
Linkator: enriching web pages by automatically adding dereferenceable semanti...
Linkator: enriching web pages by automatically adding dereferenceable semanti...Linkator: enriching web pages by automatically adding dereferenceable semanti...
Linkator: enriching web pages by automatically adding dereferenceable semanti...
 
Efficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining ProcessEfficient Practices for Large Scale Text Mining Process
Efficient Practices for Large Scale Text Mining Process
 

Viewers also liked

Samsung Monitor
Samsung  MonitorSamsung  Monitor
Samsung MonitorPaul T
 
What is effective technology integration for 21st century
What is effective technology integration for 21st centuryWhat is effective technology integration for 21st century
What is effective technology integration for 21st centurycaptainpiller
 
Intense imbroglios experiences in cameroon
Intense imbroglios experiences  in cameroonIntense imbroglios experiences  in cameroon
Intense imbroglios experiences in cameroonVerina Ingram
 
Help, my security officer doesn’t trust me v0.4
Help, my security officer doesn’t trust me v0.4Help, my security officer doesn’t trust me v0.4
Help, my security officer doesn’t trust me v0.4Frank Breedijk
 
Earth Science Chapter 1
Earth Science Chapter 1Earth Science Chapter 1
Earth Science Chapter 1mshenry
 
Powerpoint guide
Powerpoint guidePowerpoint guide
Powerpoint guideshare2010
 
D:\Ben\G48 53011810075
D:\Ben\G48 53011810075D:\Ben\G48 53011810075
D:\Ben\G48 53011810075BenjamasS
 
AABM - September 2012 Newsletter
AABM - September 2012 NewsletterAABM - September 2012 Newsletter
AABM - September 2012 NewsletterFelix Ortiz
 
Diabetes For Dummies, 3rd Edition by Alan L. Rubin, MD Index
Diabetes For Dummies, 3rd Edition by Alan L. Rubin, MD IndexDiabetes For Dummies, 3rd Edition by Alan L. Rubin, MD Index
Diabetes For Dummies, 3rd Edition by Alan L. Rubin, MD IndexAlanLRubinMD
 
Physical Science Ch2: sec1
Physical Science Ch2: sec1Physical Science Ch2: sec1
Physical Science Ch2: sec1mshenry
 
Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2mshenry
 
Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3mshenry
 
Delivery Of Future Content
Delivery Of Future ContentDelivery Of Future Content
Delivery Of Future ContentPeter Lancaster
 
Back to school
Back to schoolBack to school
Back to schoolMatt
 

Viewers also liked (20)

Samsung Monitor
Samsung  MonitorSamsung  Monitor
Samsung Monitor
 
What is effective technology integration for 21st century
What is effective technology integration for 21st centuryWhat is effective technology integration for 21st century
What is effective technology integration for 21st century
 
Intense imbroglios experiences in cameroon
Intense imbroglios experiences  in cameroonIntense imbroglios experiences  in cameroon
Intense imbroglios experiences in cameroon
 
Help, my security officer doesn’t trust me v0.4
Help, my security officer doesn’t trust me v0.4Help, my security officer doesn’t trust me v0.4
Help, my security officer doesn’t trust me v0.4
 
Windows 8
Windows 8Windows 8
Windows 8
 
Earth Science Chapter 1
Earth Science Chapter 1Earth Science Chapter 1
Earth Science Chapter 1
 
Powerpoint guide
Powerpoint guidePowerpoint guide
Powerpoint guide
 
D:\Ben\G48 53011810075
D:\Ben\G48 53011810075D:\Ben\G48 53011810075
D:\Ben\G48 53011810075
 
AABM - September 2012 Newsletter
AABM - September 2012 NewsletterAABM - September 2012 Newsletter
AABM - September 2012 Newsletter
 
Diabetes For Dummies, 3rd Edition by Alan L. Rubin, MD Index
Diabetes For Dummies, 3rd Edition by Alan L. Rubin, MD IndexDiabetes For Dummies, 3rd Edition by Alan L. Rubin, MD Index
Diabetes For Dummies, 3rd Edition by Alan L. Rubin, MD Index
 
Physical Science Ch2: sec1
Physical Science Ch2: sec1Physical Science Ch2: sec1
Physical Science Ch2: sec1
 
Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2Physical Science: Chapter 4, sec 2
Physical Science: Chapter 4, sec 2
 
Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3Physical Science: Chapter 3 sec3
Physical Science: Chapter 3 sec3
 
Dentist appointment
Dentist appointmentDentist appointment
Dentist appointment
 
Redes Sociales
Redes SocialesRedes Sociales
Redes Sociales
 
Cloud webinar final
Cloud webinar finalCloud webinar final
Cloud webinar final
 
Delivery Of Future Content
Delivery Of Future ContentDelivery Of Future Content
Delivery Of Future Content
 
Jupiter1
Jupiter1Jupiter1
Jupiter1
 
Back to school
Back to schoolBack to school
Back to school
 
Gic2011 aula7-ingles-theory
Gic2011 aula7-ingles-theoryGic2011 aula7-ingles-theory
Gic2011 aula7-ingles-theory
 

Similar to Beyond document retrieval using semantic annotations

Introduction
IntroductionIntroduction
Introductionsriniefs
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014Ian Milligan
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsJon Voss
 
Introduction to NLP.pptx
Introduction to NLP.pptxIntroduction to NLP.pptx
Introduction to NLP.pptxbuivantan_uneti
 
Education institute2012
Education institute2012Education institute2012
Education institute2012Stephen Abram
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Brave new search world
Brave new search worldBrave new search world
Brave new search worldvoginip
 
When Search becomes Research and Research becomes Search
When Search becomes Research and Research becomes SearchWhen Search becomes Research and Research becomes Search
When Search becomes Research and Research becomes SearchJaap Kamps
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptxGambari Amosa Isiaka
 
Modern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesModern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesZOLLHOF - Tech Incubator
 
Melissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLabMelissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLabUniversity of Edinburgh
 
Searching of Web and Electronic Resources
Searching of Web and Electronic Resources Searching of Web and Electronic Resources
Searching of Web and Electronic Resources Bramesha B
 
Semantic engagement
Semantic engagementSemantic engagement
Semantic engagementSTIinnsbruck
 
Knowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and ChallengesKnowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and ChallengesFariz Darari
 
6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptxShowravDuttaAnkur
 

Similar to Beyond document retrieval using semantic annotations (20)

Introduction
IntroductionIntroduction
Introduction
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014Historical Research Breakout Session Notes, WIRE 2014
Historical Research Breakout Session Notes, WIRE 2014
 
Searching Online
Searching OnlineSearching Online
Searching Online
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
 
Introduction to NLP.pptx
Introduction to NLP.pptxIntroduction to NLP.pptx
Introduction to NLP.pptx
 
Education institute2012
Education institute2012Education institute2012
Education institute2012
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Brave new search world
Brave new search worldBrave new search world
Brave new search world
 
When Search becomes Research and Research becomes Search
When Search becomes Research and Research becomes SearchWhen Search becomes Research and Research becomes Search
When Search becomes Research and Research becomes Search
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
 
Ir1
Ir1Ir1
Ir1
 
Modern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesModern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutes
 
Melissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLabMelissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLab
 
Searching of Web and Electronic Resources
Searching of Web and Electronic Resources Searching of Web and Electronic Resources
Searching of Web and Electronic Resources
 
Ted Talk
Ted TalkTed Talk
Ted Talk
 
Semantic engagement
Semantic engagementSemantic engagement
Semantic engagement
 
Knowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and ChallengesKnowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and Challenges
 
6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx
 

More from Roi Blanco

Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationRoi Blanco
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF GraphsRoi Blanco
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operatorsRoi Blanco
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search EnginesRoi Blanco
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataRoi Blanco
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesRoi Blanco
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entitiesRoi Blanco
 

More from Roi Blanco (7)

Entity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance MinimizationEntity Linking via Graph-Distance Minimization
Entity Linking via Graph-Distance Minimization
 
Keyword Search over RDF Graphs
Keyword Search over RDF GraphsKeyword Search over RDF Graphs
Keyword Search over RDF Graphs
 
Extending BM25 with multiple query operators
Extending BM25 with multiple query operatorsExtending BM25 with multiple query operators
Extending BM25 with multiple query operators
 
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
Energy-Price-Driven Query Processing in Multi-center WebSearch EnginesEnergy-Price-Driven Query Processing in Multi-center WebSearch Engines
Energy-Price-Driven Query Processing in Multi-center Web Search Engines
 
Effective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF dataEffective and Efficient Entity Search in RDF data
Effective and Efficient Entity Search in RDF data
 
Caching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental IndicesCaching Search Engine Results over Incremental Indices
Caching Search Engine Results over Incremental Indices
 
Finding support sentences for entities
Finding support sentences for entitiesFinding support sentences for entities
Finding support sentences for entities
 

Beyond document retrieval using semantic annotations

  • 1. Beyond document retrieval using semantic annotations Roi Blanco (roi@yahoo-inc.com) http://labs.yahoo.com/Yahoo_Labs_Barcelona
  • 2. Yahoo! Research Barcelona • Established January, 2006 • Led by Ricardo Baeza-Yates • Research areas • Web Mining • Social Media • Distributed Web retrieval • Semantic Search • NLP and Semantics
  • 3. Contributions Hugo Zaragoza • “Every time I fire a linguist my performance goes up…” (Fred Jelinek) Great strategy until you’ve fired them all… but what then? Michael Matthews Jordi Atserias Roi Blanco Sebastiano Vigna (U. Milan) Paolo Boldi - Indexing (MG4J) Peter Mika
  • 4. Agenda • Search: this was then, this is now • Natural Language processing and search • Semantic Search • Search over annotated documents • Time Explorer
  • 5. Natural Language Retrieval • How to exploit the structure and meaning of natural language text to improve search • Current search engines perform only limited NLP (tokenization, stemming) • Automated tools exist for deeper analysis • Applications to diversity-aware search • Source, Location, Time, Language, Opinion, Ranking… • Search over semi-structured data, semantic search • Roll-out user experiences that use higher layers of the NLP stack
  • 10. Structured data - Web search Top-1 entity with structured data Related entities Structured data extracted from HTML
  • 11. New devices • Different interaction (e.g. voice) • Different capabilities (e.g. display) • More Information (geo-localization) • More personalized
  • 12. Yahoo! Axis • Smarter, Faster Search: instant answers, visual previews, infinite browsing • Connected Experience: across devices, iPhone, iPad, Firefox, Safari, Internet Explorer, Chrome • Personalized Home Page: sign-in with Yahoo!, Google, Facebook; direct access to favorite sites, saved articles and bookmarks
  • 14. Semantic Search • What different kinds of search and applications are there beyond string matching or returning 10 blue links? • Can we have a better understanding of documents and queries? • New devices open new possibilities, new experiences • Is current technology in natural language understanding mature enough?
  • 15. Semantic Search (II) • Matching the user’s query with the Web’s content at a conceptual level, often with the help of world knowledge – Natural Language Search • Exploiting the (implicit) structure and semantics of natural language • Intersection of IR and NLP – Semantic Web Search • Exploiting the (explicit) meaning of data • Intersection of IR and Semantic Web • As a field – ISWC/ESWC/ASWC, WWW, SIGIR, VLDB, CIKM – Exploring Semantic Annotations in Information Retrieval (ECIR08, WSDM09) – Semantic Search Workshop (ESWC08, WWW09, WWW10) – Future of Web Search: Semantic Search (FoWS09)
  • 16. State of search • “We are at the beginning of search.” (Marissa Mayer) • Old battles are won – Marginal returns on investments in crawling, indexing, ranking – Solved large classes of queries (e.g. navigational) – Lots of tempting, but high-hanging fruit • Currently, the biggest bottlenecks in IR are not computational, but in modeling user cognition – If only we could find a computationally expensive way to solve the problem… • In particular, solving queries that require a deeper understanding of the query, the content and/or the world at large – Corollary: go beyond string matching!
  • 17. Some examples… • Ambiguous searches – paris hilton • Multimedia search – paris hilton sexy • Imprecise or overly precise searches – jim hendler – pictures of strong adventures people • Searches for descriptions – 33 year old computer scientist living in barcelona – reliable digital camera under 300 dollars • Searches that require aggregation – height eiffel tower – harry potter movie review – world temperature 2020
  • 18. Is NLU that complex? “A child of five would understand this. Send someone to fetch a child of five.” (Groucho Marx)
  • 19. Language is Ambiguous The man saw the girl with the telescope
  • 20. Paraphrases • ‘This parrot is dead’ • ‘This parrot has kicked the bucket’ • ‘This parrot has passed away’ • ‘This parrot is no more’ • ‘He's expired and gone to meet his maker’ • ‘His metabolic processes are now history’
  • 22. Semantics at every step of the IR process [Diagram: the IR engine and the Web, with stages: query interpretation, document processing, indexing, ranking θ(q,d), result presentation]
  • 23. Understanding Queries • Query logs are a big source of information & knowledge – To rank results better (what you click) – To understand queries better • Example queries shown on the slide: Paris, Paris Flights; Paris, Paris Hilton
  • 24. “Understand” Documents • NLU: still an open issue
  • 25. NLP for IR • Full NLU is AI-complete and does not scale to the size of the Web (parsing the Web is really hard). • BUT… what about other, shallow NLP techniques? • Hypothesis/Requirements: • Linear extraction/parsing time • Error-prone output (e.g. 60-90%) • Highly redundant information • Explore new ways of browsing • Support your answers
  • 26. Usability • We also fail at using the technology, sometimes
  • 27. Support your answers Errors happen: choose the right ones! • Humans need to “verify” unknown facts • Multiple sources of evidence • Common sense vs. Contradictions • are you sure? is this spam? Interesting! • Tolerance to errors greatly increases if users can verify things fast • Importance of snippets, image search • Often the context is as important as the fact • E.g. “S discovered the transistor in X” • There are different kinds of errors • Ridiculous result (decreases overall confidence in system) • Reasonably wrong result (makes us feel good)
  • 29. Annotated documents • “Barack Obama visited Tokyo this Monday as part of an extended Asian trip. He is expected to deliver a speech at the ASEAN conference next Tuesday” • Dates shown on the slide: 20 May 2009, 28 May 2009 [the sentence appears twice on the slide, plain and annotated]
  • 31. How does it work? [Diagram: query “Monty Python” → Inverted Index (sentence/doc level) → Forward Index (entity level) → Flying Circus, John Cleese, Brian]
  • 32. Efficient element retrieval • Goal – Given an ad-hoc query, return a list of documents and annotations ranked according to their relevance to the query • Simple Solution – For each document that matches the query, retrieve its annotations and return the ones with the highest counts • Problems – If there are many documents in the result set this will take too long - too many disk seeks, too much data to search through – What if counting isn’t the best method for ranking elements? • Solution – Special compressed data structures designed specifically for annotation retrieval
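
The "simple solution" on slide 32 can be sketched in a few lines. This is only an illustration under stated assumptions: the toy documents, the dictionary-based indexes, and the names `build_inverted_index` and `top_annotations` are hypothetical and are not taken from the talk or from MG4J. It intersects the posting sets of the query terms, then counts the annotations of every matching document, which is exactly the step the slide says becomes too slow on large result sets.

```python
from collections import Counter

# Hypothetical toy collection: doc id -> tokens, and doc id -> annotations.
DOC_TOKENS = {
    1: ["monty", "python", "flying", "circus"],
    2: ["john", "cleese", "joined", "monty", "python"],
    3: ["python", "is", "a", "programming", "language"],
}
DOC_ANNOTATIONS = {
    1: ["Flying Circus", "Monty Python"],
    2: ["John Cleese", "Monty Python"],
    3: ["Python (language)"],
}


def build_inverted_index(doc_tokens):
    """Map each token to the set of documents that contain it."""
    index = {}
    for doc_id, tokens in doc_tokens.items():
        for token in set(tokens):
            index.setdefault(token, set()).add(doc_id)
    return index


def top_annotations(query, inverted, annotations, k=3):
    """Return the k most frequent annotations among documents matching all query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    matching = set.intersection(*(inverted.get(t, set()) for t in terms))
    counts = Counter()
    for doc_id in matching:  # one annotation lookup per matching document
        counts.update(annotations.get(doc_id, []))
    return counts.most_common(k)


inverted = build_inverted_index(DOC_TOKENS)
print(top_annotations("monty python", inverted, DOC_ANNOTATIONS))
```

The loop over `matching` touches every result document, so its cost grows with the size of the result set; that is the motivation the slide gives for specialised compressed structures.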
  • 33. Forward Index • Access metadata and document contents – Length, terms, annotations • Compressed (in memory) forward indexes – Gamma, Delta, Nibble, Zeta codes (power laws) • Retrieving and scoring annotations – Sort terms by frequency • Random access using an extra compressed pointer list (Elias-Fano)
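
To make the compression bullet on slide 33 concrete, here is a minimal sketch of one of the listed codes, assuming "Gamma" refers to the Elias gamma code. It works on bit strings for readability, not on packed bits, and the function names are hypothetical. Small integers get short codewords, which is one plausible reason to sort terms by frequency so that frequent terms receive small identifiers.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]                       # binary digits, most significant first
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then the number


def gamma_decode_all(bits):
    """Decode a concatenation of well-formed gamma codewords into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # count the unary prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


ids = [1, 2, 5, 9]                            # hypothetical frequency-ranked term ids
stream = "".join(gamma_encode(x) for x in ids)
print(stream, gamma_decode_all(stream))
```

Because codewords have variable length, jumping straight to the i-th document's record needs a separate pointer list, which is the role the slide assigns to Elias-Fano.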
  • 34. Parallel Indexes • Standard index contains only tokens • Parallel indices contain annotations on the tokens – the annotation indices must be aligned with main token index • For example: given the sentence “New York has great pizza” where New York has been annotated as a LOCATION – Token index has five entries (“new”, “york”, “has”, “great”, “pizza”) – The annotation index has five entries (“LOC”, “LOC”, “O”, “O”, “O”) – Can optionally encode BIO format (e.g. LOC-B, LOC-I) • To search for the New York location entity, we search for: “token:New ^ entity:LOC token:York ^ entity:LOC”
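
The alignment idea on slide 34 can be illustrated with a small sketch. It assumes a single sentence, positions counted from 0, and plain dictionaries in place of a real index; `build_parallel_index` and `find_annotated_phrase` are hypothetical names. The query “token:New ^ entity:LOC token:York ^ entity:LOC” becomes: find a position where "new" carries LOC and the next position holds "york", also carrying LOC.

```python
def build_parallel_index(tokens, labels):
    """Build a token index and an annotation index that share the same positions."""
    if len(tokens) != len(labels):
        raise ValueError("annotation index must be aligned with the token index")
    token_index, entity_index = {}, {}
    for pos, (token, label) in enumerate(zip(tokens, labels)):
        token_index.setdefault(token.lower(), set()).add(pos)
        entity_index.setdefault(label, set()).add(pos)
    return token_index, entity_index


def find_annotated_phrase(token_index, entity_index, words, label):
    """Start positions where the words occur consecutively, each carrying the label."""
    if not words:
        return []
    labelled = entity_index.get(label, set())
    # For each word: positions where the token and the annotation coincide (the ^ operator).
    aligned = [token_index.get(w.lower(), set()) & labelled for w in words]
    return sorted(
        start for start in aligned[0]
        if all(start + i in aligned[i] for i in range(1, len(words)))
    )


tokens = ["New", "York", "has", "great", "pizza"]
labels = ["LOC", "LOC", "O", "O", "O"]
token_index, entity_index = build_parallel_index(tokens, labels)
print(find_annotated_phrase(token_index, entity_index, ["New", "York"], "LOC"))
```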
  • 35. Parallel Indices (II) • Doc #3: The last time Peter exercised was in the XXth century. • Doc #5: Hope claims that in 1994 she run to Peter Town. • Token index: Peter → D3:1, D5:9; Town → D5:10; Hope → D5:1; 1994 → D5:5; … • Annotation index: WSJ:PERSON → D3:1, D5:1; WSJ:CITY → D5:9; WNS:V_DATE → D5:5 • Possible queries: “Peter AND run”, “Peter AND WNS:N_DATE”, “(WSJ:CITY ^ *) AND run”, “(WSJ:PERSON ^ Hope) AND run” (Bracketing can also be dealt with)
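
The queries on slide 35 can be traced by hand with a small sketch over the postings shown there. Everything here is illustrative: the postings are copied from the slide as (document, position) pairs, the entry for "run" is an assumption (the slide elides it, and its position is counted from the Doc #5 sentence), and `aligned`, `docs_of` and `and_docs` are hypothetical helper names.

```python
# Postings as sets of (document id, position), copied from the slide.
POSTINGS = {
    "Peter": {(3, 1), (5, 9)},
    "Town": {(5, 10)},
    "Hope": {(5, 1)},
    "1994": {(5, 5)},
    "run": {(5, 7)},  # assumption: not listed on the slide
    "WSJ:PERSON": {(3, 1), (5, 1)},
    "WSJ:CITY": {(5, 9)},
    "WNS:V_DATE": {(5, 5)},
}


def aligned(annotation, token=None):
    """The ^ operator: positions where annotation and token coincide (None means *)."""
    annotated = POSTINGS.get(annotation, set())
    if token is None:
        return set(annotated)
    return annotated & POSTINGS.get(token, set())


def docs_of(postings):
    """Project (document, position) pairs onto document ids."""
    return {doc for doc, _ in postings}


def and_docs(*posting_sets):
    """Boolean AND at document level over one or more posting sets."""
    return set.intersection(*(docs_of(p) for p in posting_sets))


print(and_docs(POSTINGS["Peter"], POSTINGS["run"]))            # "Peter AND run"
print(and_docs(aligned("WSJ:CITY"), POSTINGS["run"]))          # "(WSJ:CITY ^ *) AND run"
print(and_docs(aligned("WSJ:PERSON", "Hope"), POSTINGS["run"]))   # "(WSJ:PERSON ^ Hope) AND run"
print(and_docs(aligned("WSJ:PERSON", "Peter"), POSTINGS["run"]))  # Peter as a person, with run
```

With these postings the last query matches nothing: "Peter" in Doc #5 is annotated as a city, so restricting it to WSJ:PERSON leaves only Doc #3, which does not contain "run".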
  • 38. Time(ly) opportunities Can we create new user experiences based on a deeper analysis and exploration of the time dimension? • Goals: – Build an application that helps users to explore, interact and ultimately understand existing information about the past and the future. – Help the user cope with the information overload and eventually find/learn about what she’s looking for.
  • 39. Original Idea • R. Baeza-Yates, Searching the Future, MF/IR 2005 – On December 1st 2003, on Google News, there were more than 100K references to 2004 and beyond. – E.g. 2034: • The ownership of Dolphin Square in London must revert to an insurance company. • Voyager 2 should run out of fuel. • Long-term care facilities may have to house 2.1 million people in the USA. • A human base on the moon would be in operation.
  • 40. Time Explorer • Public demo since August 2010 • Winner of HCIR NYT Challenge • Goal: explore news through time and into the future • Using a customized Web crawl from news and blog feeds • http://fbmya01.barcelonamedia.org:8080/future/
  • 42. Time Explorer - Motivation • Time is important to search • Recency, particularly in news, is highly related to relevancy • But, what about evolution over time? – How has a topic evolved over time? – How did the entities (people, places, etc.) evolve with respect to the topic over time? – How will this topic continue to evolve over the future? – How do bias and sentiment in blogs and news change over time? • Google Trends, Yahoo! Clues, RecordedFuture… • Great research playground • Open source!
  • 44. Analysis Pipeline • Tokenization, sentence splitting, part-of-speech tagging, chunking with OpenNLP • Entity extraction with SuperSense tagger • Time expressions extracted with TimeML – Explicit dates (August 23rd, 2008) – Relative dates (Next year, resolved with Pub Date) • Sentiment analysis with LivingKnowledge • Ontology matching with Yago • Image analysis – sentiment and face detection
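
The "relative dates resolved with Pub Date" step on slide 44 can be illustrated with a toy resolver. This is a sketch under strong assumptions, not the TimeML-based extractor used in the pipeline: the handful of hard-coded expressions and the function name `resolve_relative` are hypothetical. It maps an expression plus a publication date to an inclusive date range.

```python
from datetime import date, timedelta


def resolve_relative(expression, pub_date):
    """Return an inclusive (start, end) date range, or None if the expression is unknown."""
    expr = " ".join(expression.lower().split())   # normalise case and whitespace
    one_day = timedelta(days=1)
    if expr == "today":
        return (pub_date, pub_date)
    if expr == "tomorrow":
        return (pub_date + one_day, pub_date + one_day)
    if expr == "yesterday":
        return (pub_date - one_day, pub_date - one_day)
    year_offsets = {"last year": -1, "this year": 0, "next year": 1}
    if expr in year_offsets:
        year = pub_date.year + year_offsets[expr]
        return (date(year, 1, 1), date(year, 12, 31))
    return None                                    # anything else is left unresolved


pub = date(2008, 8, 23)
print(resolve_relative("Next year", pub))
print(resolve_relative("tomorrow", pub))
```

Returning a range instead of a single day keeps the granularity of the expression, so "next year" can later be counted against a whole year on a timeline.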
  • 45. Indexing/Search • Lucene/Solr search platform to index and search – Sentence level – Document level • Facets for annotations (multiple fields for faster entity-type access) • Index publication date and content date – extracted dates if they exist, otherwise the publication date • Solr faceting supports ranking entities for a query and aggregating counts over time • Content date enables search into the future
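
The content-date rule and the "counts over time" aggregation on slide 45 can be sketched without Solr. This is plain Python standing in for what a facet query would return, under assumed field names (`pub_date`, `extracted_dates`) and hypothetical helpers `content_dates` and `counts_by_year`; it does not use or imitate the Solr API.

```python
from collections import Counter
from datetime import date


def content_dates(doc):
    """Extracted dates if any exist, otherwise fall back to the publication date."""
    return doc["extracted_dates"] or [doc["pub_date"]]


def counts_by_year(docs):
    """Number of documents per content year, each document counted once per year."""
    counts = Counter()
    for doc in docs:
        for year in {d.year for d in content_dates(doc)}:
            counts[year] += 1
    return dict(sorted(counts.items()))


# Hypothetical documents that matched some query.
results = [
    {"pub_date": date(2010, 8, 1), "extracted_dates": [date(2034, 1, 1)]},
    {"pub_date": date(2010, 8, 2), "extracted_dates": []},
    {"pub_date": date(2009, 5, 20), "extracted_dates": [date(2034, 6, 1), date(2010, 3, 1)]},
]
print(counts_by_year(results))
```

Because the buckets are keyed on content dates, documents published in 2009 and 2010 can contribute to a 2034 bucket, which is how a timeline can extend into the future.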
  • 54. Other challenges – Large-scale processing • Distributed computing • Shift from batch (Hadoop) to online (S4) – Efficient extraction/retrieval, algorithms/data structures • Critical for interactive exploration – Connection with the user experience • Measures! User engagement? – Personalization – Integration with Knowledge Bases (Semantic Web) – Multilingual