Presented by Benson Margulies, Executive Vice President and Chief Technology Officer, Basis Technology
Solr's ability to facet search results gives end-users a valuable way to drill down to what they want. But for unstructured documents, deriving facets such as the persons mentioned requires advanced analytics. Even if names can be extracted from documents, the user doesn't want a "George Bush" facet that intermingles documents mentioning either the 41st or the 43rd U.S. President, nor does she want separate facets for "George W. Bush" and "乔治·沃克·布什" (a Chinese translation), each limited to just one string. We'll explore the benefits and challenges of empowering Solr users with real-world facets.
1. Lucene/SOLR Revolution 2013 1
From Text to Truth: Real World Facets for
Multilingual Search
Benson Margulies
Executive Vice President and Chief Technical Officer
2. Lucene/SOLR Revolution 2013 2
Your job is to analyze reciprocal antagonism
between Christian and Islamic extremists across the
globe.
You want to find information on the Internet on
Christian extremist reaction to the killing of the U.S.
Ambassador to Libya.
Motivation
19. Lucene/SOLR Revolution 2013 19
But what can we use as choices?
Filter?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: <choice 1>, <choice 2>, <choice 3>, …]
20. Lucene/SOLR Revolution 2013 20
Find names of persons, places, and organizations in a document.
Entity Extraction (Name Tagging)
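As an illustration only, name tagging can be sketched with a lookup list. This is a toy under stated assumptions: the `GAZETTEER` contents and the `tag_names` helper are hypothetical, and production extractors rely on statistical models rather than a fixed list.

```python
import re

# Hypothetical gazetteer (name -> entity type); real taggers learn this from data.
GAZETTEER = {
    "J. Christopher Stevens": "PERSON",
    "Chris Stevens": "PERSON",
    "Libya": "LOCATION",
    "Basis Technology": "ORGANIZATION",
}

def tag_names(text):
    """Return sorted (start, end, surface, type) tuples for gazetteer hits.

    Longer names are tried first so that a long name is not split into
    smaller overlapping matches.
    """
    spans = []
    covered = set()
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for match in re.finditer(re.escape(name), text):
            positions = set(range(match.start(), match.end()))
            if positions & covered:
                continue  # overlaps a longer name that was already tagged
            covered |= positions
            spans.append((match.start(), match.end(), name, GAZETTEER[name]))
    return sorted(spans)

text = "J. Christopher Stevens was the U.S. Ambassador to Libya."
for start, end, surface, kind in tag_names(text):
    print(kind, surface)
```

Each hit keeps its character offsets, so later steps can look at the words around the mention.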
21. Lucene/SOLR Revolution 2013 21
Group names referring to the same person, within a document.
In-document Coreference Resolution
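A minimal sketch of in-document grouping, assuming a deliberately crude heuristic: mentions in one document that share a final token (treated as the surname) are taken to refer to the same person. The `group_mentions` helper is hypothetical; real coreference resolution also uses pronouns, titles, and context.

```python
def group_mentions(mentions):
    """Group person mentions from ONE document into chains.

    Assumption: the last token of a mention is the surname, and two
    mentions with the same surname in the same document corefer.
    """
    chains = {}
    for mention in mentions:
        surname = mention.split()[-1].lower()
        chains.setdefault(surname, []).append(mention)
    return list(chains.values())

mentions = ["J. Christopher Stevens", "Stevens", "Chris Stevens", "Dan Cathy"]
for chain in group_mentions(mentions):
    print(chain)
```

The heuristic is only plausible inside a single document; across documents the same surname often belongs to different people.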
22. Lucene/SOLR Revolution 2013 22
But what can we use as choices?
Filter choices?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: <choice 1>, <choice 2>, <choice 3>, …]
23. Lucene/SOLR Revolution 2013 23
Choices: first way that each person was mentioned in each document?
Filter choices?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, Persons named: Kris Stephens, Chris Stephens, Dan Cathy, George Little, …]
24. Lucene/SOLR Revolution 2013 24
Choices: first name string for each person in each document?
Filter?
[Mockup: results marked ✓ ✗ ✗; "Add filters…" panel, Persons named: Dan Cathy, George Little, …; "Filtered by…" panel, Persons named: Chris Stephens]
25. Lucene/SOLR Revolution 2013 25
Choices: first name string for each person in each document?
Filter?
[Mockup: results marked ✓ ✗; "Add filters…" panel, Persons named: Dan Cathy, George Little, …; "Filtered by…" panel, Persons named: Chris Stephens]
26. Lucene/SOLR Revolution 2013 26
Problem: Ambiguity – one name, many entities
Filter?
[Mockup: results marked ✓ ✗; "Add filters…" panel, Persons named: Dan Cathy, George Little, …; "Filtered by…" panel, Persons named: Chris Stephens]
27. Lucene/SOLR Revolution 2013 27
Problem: Variety – one person, many names
Filter?
[Mockup: results marked ✓ ✗; "Add filters…" panel, Persons named: Dan Cathy, George Little, …; "Filtered by…" panel, Persons named: Chris Stephens]
28. Lucene/SOLR Revolution 2013 28
Problem: Variety – one person, many names
Filter?
[Mockup: results marked ✓ ✗; "Add filters…" panel, Persons named: Dan Cathy, George Little, …, Chris Stevens, J. Christopher Stevens, …; "Filtered by…" panel, Persons named: Chris Stephens]
29. Lucene/SOLR Revolution 2013 29
Magically group names by person across documents.
Deal with ambiguity and variety?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: <choice 1>, <choice 2>, <choice 3>, …]
30. Lucene/SOLR Revolution 2013 30
But there’s still the problem of choices…
Labels for choices?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: <choice 1>, <choice 2>, <choice 3>, …]
31. Lucene/SOLR Revolution 2013 31
Use person’s name from highest ranked doc?
Still some ambiguity.
Labels for choices?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …]
32. Lucene/SOLR Revolution 2013 32
Entity Resolution: group and also link to a database of known entities (e.g., Wikipedia).
Labels for choices?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: Kris Stephens, Chris Stephens 1, Chris Stephens 2, …; second list: Kris Stephens, J. Christopher Stevens, Chris Stephens, …]
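One way to picture the linking step is to compare the words around a mention with words associated with each candidate database entry. The sketch below is only an illustration: the `KNOWN_ENTITIES` table, its IDs, and the `resolve` helper are hypothetical, and it assumes candidate entries were already selected by name matching.

```python
# Hypothetical knowledge base: entity ID -> (label, context words).
KNOWN_ENTITIES = {
    "kb:george_h_w_bush": ("George H. W. Bush", {"41st", "president"}),
    "kb:george_w_bush": ("George W. Bush", {"43rd", "president"}),
    "kb:j_christopher_stevens": ("J. Christopher Stevens", {"ambassador", "libya"}),
}

def resolve(context, candidates, kb=KNOWN_ENTITIES):
    """Pick the candidate whose context words overlap the text around the
    mention the most; return None when no candidate overlaps at all."""
    words = set(context.lower().split())
    best_id, best_overlap = None, 0
    for entity_id in candidates:
        overlap = len(words & kb[entity_id][1])
        if overlap > best_overlap:  # ties keep the earlier candidate
            best_id, best_overlap = entity_id, overlap
    return best_id

bushes = ["kb:george_h_w_bush", "kb:george_w_bush"]
print(resolve("the 43rd president George Bush", bushes))
```

A mention that resolves to `None` is a candidate for an entity that is not in the database.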
33. Lucene/SOLR Revolution 2013 33
For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).
Labels for choices?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: Kris Stephens, J. Christopher Stevens, Chris Stephens, …]
34. Lucene/SOLR Revolution 2013 34
For items not in the database, infer a unique label (e.g., for hypothetical Wikipedia page).
Filter?
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor)]
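Label inference could look roughly like the following, assuming a descriptor such as "pastor" has already been derived from the documents. The `infer_label` helper and the numeric suffix for collisions are assumptions made for illustration, not a description of the actual system.

```python
def infer_label(name, descriptor, used_labels):
    """Build a Wikipedia-style label such as "Chris Stephens (pastor)".

    Assumption: if the label is already taken, a numeric suffix keeps
    it unique. used_labels is updated in place.
    """
    base = f"{name} ({descriptor})"
    label, suffix = base, 2
    while label in used_labels:
        label = f"{base} {suffix}"
        suffix += 1
    used_labels.add(label)
    return label

used = set()
print(infer_label("Kris Stephens", "pastor", used))
print(infer_label("Chris Stephens", "pastor", used))
```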
35. Lucene/SOLR Revolution 2013 35
Let’s give it a try…
Filter.
[Mockup: results marked ✓ ✗ ✗; "Filter results by…" panel, People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …]
36. Lucene/SOLR Revolution 2013 36
Let’s give it a try…
Filter.
[Mockup: results marked ✓ ✗ ✗; "Add filters…" panel, People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …; "Filtered by…" panel, People: J. Christopher Stevens]
37. Lucene/SOLR Revolution 2013 37
Let’s give it a try…
Filter.
[Mockup: results marked ✓; "Add filters…" panel, People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …; "Filtered by…" panel, People: J. Christopher Stevens]
38. Lucene/SOLR Revolution 2013 38
Let’s give it a try…
Filter.
[Mockup: results marked ✓; "Add filters…" panel, People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …; "Filtered by…" panel, People: J. Christopher Stevens]
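In Solr terms, a filter on a resolved entity might be issued roughly as follows. This is a sketch: `q`, `fq`, `facet`, `facet.field`, and `facet.mincount` are standard Solr request parameters, while the field name `person_entity_id`, the entity ID, and the base URL are assumptions made for illustration.

```python
from urllib.parse import urlencode

def facet_query_url(base_url, user_query, entity_field, selected_entity_id=None):
    """Build a Solr select URL that facets on an entity-ID field and,
    optionally, filters to one selected entity."""
    params = [
        ("q", user_query),
        ("facet", "true"),
        ("facet.field", entity_field),   # one facet value per resolved entity
        ("facet.mincount", "1"),         # hide entities with no matching docs
    ]
    if selected_entity_id is not None:
        # fq narrows the result set without affecting relevance scoring
        params.append(("fq", f'{entity_field}:"{selected_entity_id}"'))
    return base_url + "?" + urlencode(params)

print(facet_query_url(
    "http://localhost:8983/solr/collection1/select",  # hypothetical endpoint
    "libya ambassador",
    "person_entity_id",                               # hypothetical field
    "kb:j_christopher_stevens",                       # hypothetical entity ID
))
```

Faceting on an ID field rather than on the name string is what lets one facet value stand for one person, whatever the spelling in each document.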
39. Lucene/SOLR Revolution 2013 39
On a cross lingual index, real-world entity facets can open results up across languages, unlike search strings
Filter.
[Mockup: results marked ✓ ✓ ✓; "Add filters…" panel, People: Kris Stephens (pastor), Chris Stephens (pastor), Dan Cathy, George Little, …; "Filtered by…" panel, People: J. Christopher Stevens; Language: English, Chinese, Arabic]
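The difference between filtering by string and filtering by entity can be shown with a toy index. The documents, the `person_ids` field, and the entity IDs below are made up for illustration; the Chinese name is the one quoted in the abstract.

```python
from collections import Counter

# Toy index: each document carries the IDs of the people it mentions.
DOCS = [
    {"id": "en-1", "lang": "English", "text": "George W. Bush spoke today.",
     "person_ids": ["kb:george_w_bush"]},
    {"id": "zh-1", "lang": "Chinese", "text": "乔治·沃克·布什今天发表讲话。",
     "person_ids": ["kb:george_w_bush"]},
    {"id": "en-2", "lang": "English", "text": "George H. W. Bush was the 41st president.",
     "person_ids": ["kb:george_h_w_bush"]},
]

def string_search(docs, query):
    """Plain substring match: tied to one spelling in one script."""
    return [d["id"] for d in docs if query in d["text"]]

def entity_filter(docs, entity_id):
    """Entity facet: matches every document that mentions the person."""
    return [d["id"] for d in docs if entity_id in d["person_ids"]]

def language_facet(docs, entity_id):
    """Count matching documents per language for a selected entity."""
    return Counter(d["lang"] for d in docs if entity_id in d["person_ids"])

print(string_search(DOCS, "George W. Bush"))
print(entity_filter(DOCS, "kb:george_w_bush"))
print(language_facet(DOCS, "kb:george_w_bush"))
```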
40. Lucene/SOLR Revolution 2013 40
Let’s pretend you’re researching the pastors instead.
Trading off Errors
[Mockup: "Filter results by…" panel, People: Kris Stephens (pastor), J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …]
41. Lucene/SOLR Revolution 2013 41
What if you think there are too many (or too few)?
Add a slider for making filter more fine (or coarse).
Trading off Errors
[Mockup: "Add filters…" panel, People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …; "Filtered by…" panel, People: Kris Stephens (pastor)]
42. Lucene/SOLR Revolution 2013 42
Make the filter more fine.
Trading off Errors
[Mockup: "Add filters…" panel, People: J. Christopher Stevens, Chris Stephens (pastor), Dan Cathy, George Little, …; "Filtered by…" panel, People: Kris Stephens (pastor)]
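One possible reading of the slider is a similarity threshold on how readily names are merged into one entity. The sketch below assumes that reading; it uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, and the `cluster_names` helper and threshold values are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a real name matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_names(names, threshold):
    """Greedy clustering: a name joins the first cluster containing a member
    at least `threshold` similar to it, otherwise it starts a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similarity(name, member) >= threshold for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

names = ["Kris Stephens", "Chris Stephens", "Dan Cathy"]
print(cluster_names(names, 0.80))  # coarse: the two Stephens spellings merge
print(cluster_names(names, 0.95))  # fine: every spelling stays separate
```

A coarse setting risks merging different people; a fine setting risks splitting one person into several facets. The slider lets the user choose which error to tolerate.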
44. Lucene/SOLR Revolution 2013 44
RNI Similarity Matching “Tamerlan Tsarnaev”
And the problem only gets worse with Multiple Languages
45. Lucene/SOLR Revolution 2013 45
Fuzzy name search in Solr
• Facets are one way to navigate names
– assume that you've found some interesting data with an ordinary query
– what if you are having trouble getting started?
• Name-specific comparison search is another
• More complex algorithm than Levenshtein distance on names
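To see why plain edit distance falls short on names, the sketch below compares a standard Levenshtein distance with a token-aware score that ignores word order and lets an initial match a full given name. This is not the RNI algorithm; `name_score` and its rules are assumptions chosen only to illustrate the gap.

```python
def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def tokens_match(x, y):
    """An initial matches any token with the same first letter;
    full tokens may differ by at most one edit."""
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return levenshtein(x, y) <= 1

def name_score(query, candidate):
    """Fraction of query tokens that match some candidate token, in any order."""
    q = [t for t in (w.strip(".,").lower() for w in query.split()) if t]
    c = [t for t in (w.strip(".,").lower() for w in candidate.split()) if t]
    if not q or not c:
        return 0.0
    hits = sum(any(tokens_match(t, u) for u in c) for t in q)
    return hits / len(q)

print(levenshtein("stephens", "stevens"))
print(name_score("Stevens, J. Christopher", "J. Christopher Stevens"))
```

Reordered names are far apart as whole strings but score as a full match token by token.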
46. Lucene/SOLR Revolution 2013 46
Plugging in more complex search
• Open up the 'search component pipeline'
• First component preprocesses query
– Maps from "Fred Chopin" to a complex Lucene query that looks for possible matches across languages and scripts
• Second component rescores results
– detailed comparison of pairs of names to derive final score.
• Sad limitation (so far): scores not normalized to ordinary Lucene values
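The two stages can be simulated outside Solr to show the data flow. This is a sketch, not the Solr SearchComponent API (which is Java): the `VARIANTS` table, the `person_name` field, and both helper functions are hypothetical, and `difflib` stands in for the detailed name comparison.

```python
from difflib import SequenceMatcher

# Hypothetical variant table; a real system would generate variants
# (nicknames, transliterations, other scripts) rather than look them up.
VARIANTS = {
    "fred chopin": ["Fred Chopin", "Frédéric Chopin", "Fryderyk Chopin"],
}

def preprocess(query, field="person_name"):
    """Stage 1: expand a name into a Lucene-syntax OR query over its variants."""
    variants = VARIANTS.get(query.lower(), [query])
    clauses = " OR ".join(f'"{v}"' for v in variants)
    return f"{field}:({clauses})"

def rescore(query, hits):
    """Stage 2: re-rank (doc_id, name) hits by a pairwise name comparison."""
    def score(name):
        return SequenceMatcher(None, query.lower(), name.lower()).ratio()
    return sorted(hits, key=lambda hit: score(hit[1]), reverse=True)

print(preprocess("Fred Chopin"))
print(rescore("Fred Chopin", [("d1", "Fryderyk Chopin"), ("d2", "Fred Chopin")]))
```

The broad first-stage query favors recall; the second stage spends the expensive comparison only on the candidates it returns.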
47. Lucene/SOLR Revolution 2013 47
And it does SolrCloud, too ...
• Preprocessor runs before fan-out to shards
• Rescoring runs out on the shards
• So the work of checking candidate matches is divided up amongst the shards.
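The division of labor can be simulated in a few lines: the query is prepared once, each shard scores only its own candidates, and the coordinator merges the per-shard top results. Everything here is an illustrative assumption (in-memory lists standing in for shards, `difflib` standing in for the name comparison); it does not use any SolrCloud API.

```python
import heapq
from difflib import SequenceMatcher

def preprocess(query):
    """Runs once on the coordinating node, before fan-out."""
    return query.strip().lower()

def shard_search(shard_docs, prepared_query, top_k):
    """Runs on each shard: score local (doc_id, name) pairs, keep the best."""
    scored = [
        (SequenceMatcher(None, prepared_query, name.lower()).ratio(), doc_id)
        for doc_id, name in shard_docs
    ]
    return heapq.nlargest(top_k, scored)

def distributed_search(shards, query, top_k=2):
    """Coordinator: fan out the prepared query, then merge per-shard results."""
    prepared = preprocess(query)
    partials = [shard_search(docs, prepared, top_k) for docs in shards]
    merged = heapq.nlargest(top_k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]

shards = [
    [("a", "Chris Stevens"), ("b", "Dan Cathy")],
    [("c", "Chris Stephens"), ("d", "George Little")],
]
print(distributed_search(shards, "Chris Stevens"))
```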
48. Lucene/SOLR Revolution 2013 48
Questions
• Suggested questions:
– Doesn’t Google already do this?
– Speed? Scale?
– Multi-lingual?
– What other uses are there for entity resolution
beyond faceted search?
50. Lucene/SOLR Revolution 2013 50
Speed/Scale
• Future Plans include scaling experiments
• Research version:
– Tested on up to 1 million documents
– Sub-second per document
– Incremental updates (i.e., you see documents
published minutes ago)
51. Lucene/SOLR Revolution 2013 51
Other uses for entity resolution?
• Supporting relationship resolution by resolving
the entities participating in them.
• Knowledge base population
• Integrating disparate data sets
• Alerting
• Improving relevance of search results
• Predictive Analytics
52. Lucene/SOLR Revolution 2013 52
For more information:
Visit www.basistech.com
Write to conference@basistech.com
Call 617-386-2090
Thank you!
53. Lucene/SOLR Revolution 2013 53
CONTACT
Benson Margulies
benson@basistech.com