Lucene/SOLR Revolution 2013 1
From Text to Truth: Real World Facets for
Multilingual Search
Benson Margulies
Executive Vice President and Chief Technical Officer
Motivation
Your job is to analyze reciprocal antagonism between Christian and Islamic extremists across the globe.
You want to find information on the Internet on Christian extremist reaction to the killing of the U.S. Ambassador to Libya.

[Screenshots of search results, each marked ✓ or ✗; nearly all are marked ✗.]

Help?
That was a lot of work.
Can text analytics help?

Filter?
Filter out pages with the wrong guy?

Filter?
Add some filters (a/k/a facets)…
Filter results by…
  People
    <choice 1>
    <choice 2>
    <choice 3>
    …

Filter?
But what can we use as choices?

Entity Extraction (Name Tagging)
Find the names of people, places, and organizations in a document.
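
A toy illustration of name tagging, not the extractor used in the talk. It assumes that a run of two or more capitalized tokens is a candidate name; real extractors use trained models and also assign a type (person, place, organization). The pattern and function name are made up for this sketch.

```python
import re

# Assumption: a "name" is two or more capitalized tokens (or initials like "J.").
# This misses single-word names and will also match capitalized non-names.
NAME_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+|[A-Z]\.)(?:\s+(?:[A-Z][a-z]+|[A-Z]\.))+"
)

def extract_candidate_names(text):
    """Return candidate name strings in order of appearance."""
    return [match.group(0) for match in NAME_PATTERN.finditer(text)]

sample = ("The pastor Chris Stephens spoke after J. Christopher Stevens "
          "was killed in Benghazi.")
print(extract_candidate_names(sample))
```

Note that the single-token place name in the sample is not found, which is one reason a regex is only a stand-in here.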
	
  	
  
In-document Coreference Resolution
Group names referring to the same person within a document.
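
A minimal sketch of in-document grouping, assuming (purely for illustration) that two mentions in the same document corefer when they end in the same surname. The talk does not say how its coreference step works; the heuristic and the function name are hypothetical.

```python
def group_mentions(mentions):
    """Group name mentions from ONE document by shared final token (surname).

    Returns a list of groups; each group keeps its mentions in document order.
    """
    groups = []
    for mention in mentions:
        surname = mention.split()[-1].lower()
        for group in groups:
            # Compare against the first mention of each existing group.
            if group[0].split()[-1].lower() == surname:
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups

mentions = ["J. Christopher Stevens", "Dan Cathy", "Stevens",
            "Ambassador Stevens", "Cathy"]
groups = group_mentions(mentions)
print(groups)
```

The first mention in each group is the string the later slides consider using as a facet choice.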
Filter choices?
But what can we use as choices?

Filter choices?
Choices: first way that each person was mentioned in each document?
Filter results by…
  Persons named
    Kris Stephens
    Chris Stephens
    Dan Cathy
    George Little
    …

Filter?
Choices: first name string for each person in each document?
Add filters…
  Persons named
    Dan Cathy
    George Little
    …
Filtered by…
  Persons named
    Chris Stephens
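
A sketch of how such a string-valued facet could be requested from Solr. The parameters `facet`, `facet.field`, `facet.mincount`, and `fq` are standard Solr request parameters; the field name `person_first_mention` and the helper function are assumptions made up for this example, and the value is not escaped beyond simple quoting.

```python
from urllib.parse import urlencode

def facet_query_params(user_query, facet_field, selected_value=None):
    """Build a Solr query string that asks for facet counts on one field.

    If selected_value is given, add a filter query (fq) that restricts
    results to documents carrying that facet value.
    """
    params = [
        ("q", user_query),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", "1"),
    ]
    if selected_value is not None:
        # Assumption: the value contains no double quotes that need escaping.
        params.append(("fq", f'{facet_field}:"{selected_value}"'))
    return urlencode(params)

unfiltered = facet_query_params("Stephens Libya", "person_first_mention")
filtered = facet_query_params("Stephens Libya", "person_first_mention",
                              "Chris Stephens")
print(unfiltered)
print(filtered)
```

Because the facet value here is just a name string, the filter inherits the ambiguity and variety problems that the next slides describe.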
  
Filter?
Problem: Ambiguity (one name, many entities)

Filter?
Problem: Variety (one person, many names)
Add filters…
  Persons named
    Dan Cathy
    George Little
    …
    Chris Stevens
    J. Christopher Stevens
    …
Filtered by…
  Persons named
    Chris Stephens

Deal with ambiguity and variety?
Magically group names by person across documents.
Filter results by…
  People
    <choice 1>
    <choice 2>
    <choice 3>
    …
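
The "magic" is not spelled out in the slides. The following is only a toy sketch of one way cross-document grouping could work: merge two mentions when their names are similar AND their surrounding words overlap. The thresholds, the data, and the use of `difflib.SequenceMatcher` and Jaccard overlap are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def context_overlap(a, b):
    """Jaccard overlap of two sets of context words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_mentions(mentions, name_threshold=0.6, context_threshold=0.2):
    """Greedy clustering of (doc_id, name, context_words) across documents."""
    clusters = []
    for doc_id, name, context in mentions:
        for cluster in clusters:
            if (name_similarity(name, cluster["name"]) >= name_threshold
                    and context_overlap(context, cluster["context"]) >= context_threshold):
                cluster["docs"].append(doc_id)
                cluster["context"] = cluster["context"] | context
                break
        else:
            clusters.append({"name": name, "context": set(context), "docs": [doc_id]})
    return clusters

mentions = [
    ("d1", "Chris Stevens", {"ambassador", "libya", "benghazi"}),
    ("d2", "J. Christopher Stevens", {"ambassador", "libya", "embassy"}),
    ("d3", "Chris Stephens", {"pastor", "church", "sermon"}),
    ("d4", "Chris Stephens", {"pastor", "church", "florida"}),
]
clusters = cluster_mentions(mentions)
print([c["docs"] for c in clusters])
```

Name similarity alone would merge "Chris Stevens" with "Chris Stephens"; the context check is what keeps the ambassador and the pastor apart in this example.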
  
Labels for choices?
But there’s still the problem of choices…

Labels for choices?
Use person’s name from highest ranked doc?
Still some ambiguity.
Filter results by…
  People
    Kris Stephens
    Chris Stephens 1
    Chris Stephens 2
    …

Labels for choices?
Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia).
Filter results by…
  People (labels from the highest-ranked document)
    Kris Stephens
    Chris Stephens 1
    Chris Stephens 2
    …
  People (after linking to known entities)
    Kris Stephens
    J. Christopher Stevens
    Chris Stephens
    …
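
A minimal sketch of the linking step, assuming a tiny hand-made table of known entities with aliases and typical context words. The table contents, the identifier, and the scoring rule are hypothetical; a real system would link against a large knowledge base.

```python
# Hypothetical known-entity table; the id and keywords are illustrative only.
KNOWN_ENTITIES = {
    "J._Christopher_Stevens": {
        "label": "J. Christopher Stevens",
        "aliases": {"chris stevens", "christopher stevens", "j. christopher stevens"},
        "keywords": {"ambassador", "libya", "benghazi"},
    },
}

def link_to_known_entity(name, context_words, known=KNOWN_ENTITIES):
    """Return the id of the best-matching known entity, or None.

    A candidate must list the name as an alias and share at least one
    context word; the candidate with the most shared words wins.
    """
    best_id, best_score = None, 0
    for entity_id, entry in known.items():
        if name.lower() not in entry["aliases"]:
            continue
        score = len(context_words & entry["keywords"])
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id

print(link_to_known_entity("Chris Stevens", {"ambassador", "killed", "libya"}))
print(link_to_known_entity("Chris Stephens", {"pastor", "church"}))
```

Mentions that link to an entry can take its canonical label; mentions that return None are the "items not in the database" of the next slide.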
  
Labels for choices?
For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).
Filter results by…
  People
    Kris Stephens
    J. Christopher Stevens
    Chris Stephens
    …

Filter?
For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).
Filter results by…
  People
    Kris Stephens (pastor)
    J. Christopher Stevens
    Chris Stephens (pastor)
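
One way to picture label inference, as a sketch only: take the most frequent descriptive word found near an unlinked person's mentions and append it in parentheses, Wikipedia-style. How the descriptor words are obtained is assumed away here, and the numeric suffix for remaining collisions is an invention of this example.

```python
from collections import Counter

def infer_labels(unlinked):
    """Build display labels for entities that are not in the known-entity database.

    unlinked: list of (name, descriptor_words) pairs, one per entity.
    """
    labels = []
    seen = Counter()
    for name, descriptor_words in unlinked:
        if descriptor_words:
            descriptor = Counter(descriptor_words).most_common(1)[0][0]
            label = f"{name} ({descriptor})"
        else:
            label = name
        seen[label] += 1
        if seen[label] > 1:
            # Still not unique: fall back to a running number.
            label = f"{label} #{seen[label]}"
        labels.append(label)
    return labels

labels = infer_labels([
    ("Kris Stephens", ["pastor", "church", "pastor"]),
    ("Chris Stephens", ["pastor", "sermon", "pastor"]),
])
print(labels)
```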
  	
  
	
  
	
  	
  
	
  	
  
Filter.
Let’s give it a try…
Filter results by…
  People
    Kris Stephens (pastor)
    J. Christopher Stevens
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

Filter.
Let’s give it a try…
Add filters…
  People
    Kris Stephens (pastor)
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …
Filtered by…
  People
    J. Christopher Stevens

Filter.
On a cross-lingual index, real-world entity facets can open results up across languages, unlike search strings.
Add filters…
  People
    Kris Stephens (pastor)
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …
Filtered by…
  People
    J. Christopher Stevens
Language
  English
  Chinese
  Arabic
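
A small sketch of why an entity facet crosses languages while a search string does not. The documents, the entity ids `E1` and `E2`, and the non-English snippets (illustrative, unverified renderings of the name) are all made up for this example.

```python
# Each document carries the ids of the real-world entities it mentions.
DOCS = [
    {"id": "en-1", "lang": "English",
     "text": "Ambassador Chris Stevens was killed in Benghazi.",
     "entity_ids": ["E1"]},
    {"id": "ar-1", "lang": "Arabic",
     "text": "السفير كريس ستيفنز",
     "entity_ids": ["E1"]},
    {"id": "zh-1", "lang": "Chinese",
     "text": "大使克里斯·史蒂文斯",
     "entity_ids": ["E1"]},
    {"id": "en-2", "lang": "English",
     "text": "Pastor Chris Stephens preached on Sunday.",
     "entity_ids": ["E2"]},
]

def filter_by_string(docs, query):
    """Plain substring match: only finds documents written with that spelling."""
    return [d["id"] for d in docs if query.lower() in d["text"].lower()]

def filter_by_entity(docs, entity_id):
    """Entity facet: matches on the resolved entity, whatever the language."""
    return [d["id"] for d in docs if entity_id in d["entity_ids"]]

print(filter_by_string(DOCS, "Chris Stevens"))
print(filter_by_entity(DOCS, "E1"))
```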
  
Trading off Errors
Let’s pretend you’re researching the pastors instead.
Filter results by…
  People
    Kris Stephens (pastor)
    J. Christopher Stevens
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …

Trading off Errors
What if you think there are too many (or too few)?
Add a slider to make the filter finer (or coarser).
Make the filter finer.
Add filters…
  People
    J. Christopher Stevens
    Chris Stephens (pastor)
    Dan Cathy
    George Little
    …
Filtered by…
  People
    Kris Stephens (pastor)
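
The slider can be thought of as a similarity threshold for grouping: a low threshold merges more mentions into one person (coarse), a high one splits them (fine). The sketch below is an assumption about how such a control might behave, using only context-word overlap and made-up data; it is not the demo's implementation.

```python
def jaccard(a, b):
    """Overlap of two word sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_mentions_by_context(mentions, threshold):
    """Greedy grouping; a mention joins the first group whose first member
    has context overlap >= threshold, otherwise it starts a new group."""
    groups = []
    for name, context in mentions:
        for group in groups:
            if jaccard(context, group[0][1]) >= threshold:
                group.append((name, context))
                break
        else:
            groups.append([(name, context)])
    return groups

mentions = [
    ("Kris Stephens", {"pastor", "church", "texas"}),
    ("Chris Stephens", {"pastor", "church", "florida"}),
    ("Kris Stephens", {"pastor", "church", "texas", "sermon"}),
]
for threshold in (0.4, 0.6, 0.9):
    print(threshold, len(group_mentions_by_context(mentions, threshold)))
```

At the coarse setting the two pastors collapse into one choice (a merge error); at the finest setting the same pastor appears twice (a split error). The slider lets the user pick which error to tolerate.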
  	
  
Demo
RNI Similarity Matching “Tamerlan Tsarnaev”
And the problem only gets worse with multiple languages.

Fuzzy name search in Solr
• Facets are one way to navigate names
  o assume that you've found some interesting data with an ordinary query
  o what if you are having trouble getting started?
• Name-specific comparison search is another
• More complex algorithm than Levenshtein distance on names
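
For reference, plain Levenshtein distance is sketched below, as a baseline that the name-specific comparison goes beyond. The implementation is the standard dynamic-programming one; the example names are taken from the earlier slides, plus a Cyrillic spelling added here as an illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("stevens", "stephens"))
print(levenshtein("chris", "kris"))
# Across scripts no characters match, so edit distance carries no signal:
print(levenshtein("Tsarnaev", "Царнаев"))
```

Edit distance treats "Stevens" and "Stephens" as equally far apart whether or not they are plausible variants of one name, and it has nothing to say about a name written in another script.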
  
Plugging in more complex search
• Open up the 'search component pipeline'
• First component preprocesses query
  o Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
• Second component rescores results
  o detailed comparison of pairs of names to derive final score.
• Sad limitation (so far): scores not normalized to ordinary Lucene values
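
The two stages can be mimicked in a few lines of Python. This is a stand-in, not Solr component code and not the matching algorithm from the talk: stage one uses made-up three-letter keys on accent-folded tokens to pull candidates cheaply, and stage two rescored them with `difflib.SequenceMatcher`. The index contents are illustrative.

```python
import unicodedata
from difflib import SequenceMatcher

def fold(text):
    """Lowercase and strip accents (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def token_keys(name):
    """Stage-one keys: first three letters of each folded token (an assumption)."""
    return {token[:3] for token in fold(name).split()}

def find_candidates(query, indexed_names):
    """Stage one: cheap, recall-oriented lookup of possible matches."""
    keys = token_keys(query)
    return [name for name in indexed_names if keys & token_keys(name)]

def rescore(query, candidates):
    """Stage two: detailed pairwise comparison, best score first."""
    folded_query = fold(query)
    scored = [(SequenceMatcher(None, folded_query, fold(name)).ratio(), name)
              for name in candidates]
    return sorted(scored, reverse=True)

INDEX = ["Frédéric Chopin", "Fryderyk Chopin", "Fred Astaire",
         "Kate Chopin", "Franz Liszt"]
results = rescore("Fred Chopin", find_candidates("Fred Chopin", INDEX))
for score, name in results:
    print(f"{score:.2f}  {name}")
```

As on the slide, the stage-two scores live on their own scale and are not comparable to ordinary Lucene relevance scores.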
  
And it does SolrCloud, too ...
• Preprocessor runs before fan-out to shards
• Rescoring runs out on the shards
• So the work of checking candidate matches is divided up amongst the shards.
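
The division of labor can be sketched as follows: preprocess once, let each shard rescore only its own names, then merge the per-shard top hits. This is a single-process simulation under assumed names and data; in SolrCloud the per-shard step would run on separate nodes.

```python
import heapq
from difflib import SequenceMatcher

def preprocess(query):
    """Runs once, before the request is fanned out to the shards."""
    return query.lower()

def rescore_on_shard(shard_names, prepared_query, k):
    """Runs on each shard: score local names and keep the local top k."""
    scored = [(SequenceMatcher(None, prepared_query, name.lower()).ratio(), name)
              for name in shard_names]
    return heapq.nlargest(k, scored)

def distributed_search(shards, query, k=2):
    """Coordinator: preprocess, fan out, merge per-shard results."""
    prepared = preprocess(query)
    partials = [rescore_on_shard(names, prepared, k) for names in shards]
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))

SHARDS = [
    ["Chris Stevens", "Dan Cathy"],
    ["Kris Stephens", "J. Christopher Stevens"],
    ["Chris Stephens", "George Little"],
]
top_hits = distributed_search(SHARDS, "Chris Stevens", k=2)
print(top_hits)
```

Because every shard returns its own top k, the merged top k equals what a single node holding all the names would return, while the expensive pairwise comparisons are spread across the shards.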
  
Questions
• Suggested questions:
  – Doesn’t Google already do this?
  – Speed? Scale?
  – Multi-lingual?
  – What other uses are there for entity resolution beyond faceted search?

Doesn’t Google already do this?
Some, when searching for famous entities.

Speed/Scale
• Future plans include scaling experiments
• Research version:
  – Tested up to 1M docs
  – Sub-second per document
  – Incremental updates (i.e., you see documents published minutes ago)

Other uses for entity resolution?
• Supporting relationship resolution by resolving the entities participating in them
• Knowledge base population
• Integrating disparate data sets
• Alerting
• Improving relevance of search results
• Predictive analytics

For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090
Thank you!
CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets
you in the door
TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30
CONTACT
Benson Margulies
benson@basistech.com
