SEARCH LIKE %SQL%
INFIX SEARCH IN LUCENE / SOLR / ELASTIC
SEP 15, 2017
2
Mikhail Khludnev
EPAM
3
•  working in search for 6 years
•  Apache Lucene/Solr committer for 2 years
•  speaker at LuceneRevolution, BerlinBuzzwords
•  chief search engineer at EPAM
ABOUT ME
4
WE ARE
5
ESTABLISHED & EXPANDING GLOBAL VERTICALS
Award-winning Wealth Management Platform
Deep Expertise in Current and
Emerging FinTech
Working with 5 of the 10 Largest
Investment Banks
Leading Digital Transformation for
Global Retailers
Working with the largest online travel agency (OTA)
& the largest global hospitality company
Recognized M&E Leader by
Independent Research Analysts
Working with 4 out of the 4 Top Broadcast Networks
and 14 out of the top 30 TV Networks to transform
consumer-driven media
R&D Domain Experts with 700+ Complex
Solutions & Services Supporting the Entire
Drug Discovery Workflow
Working with 9 of the 10 Top
Pharma Companies
24-Year History of Leading
Product Development
Working with 30+ of the top 100 ISVs
FINANCIAL SERVICES
TRAVEL & CONSUMER
SOFTWARE & HI-TECH
LIFE SCIENCES AND HEALTHCARE
MEDIA & ENTERTAINMENT
EMERGING
Deep Expertise Offers
Innovative Solutions
Working with industries ranging from
Energy and Utilities to Telecom and Automotive
6
•  Term and boolean query
•  Prefix* query
•  *suffix query
•  *infix* query
•  Approaching Suggester
•  Derivative Terms
AGENDA
7
•  Endeca
•  MarkLogic
•  FAST, Google Search Appliance
•  Sphinx
•  Apache Lucene
•  Apache Solr
•  Elastic
SEARCH ENGINES
10
CUSTOMER PROFILE
Any comprehensive text
search service
•  Patent
•  Legal
•  Chemistry
•  Bioinformatics
•  SQL legacy
11
WHERE LIKE %infix% *infix*
12
Business Problem/Opportunity
•  Ill searches for *infix*
CHALLENGE
Bank of England
14
…at all?
Or what is fast in comparison to it?
WHY IS IT A PROBLEM?
16
text:foo OR text:bar
text:foo AND text:bar
O(r) << O(D_all)
r – results
D_all – all docs
THESE SEARCHES ARE (CONSIDERED AS) FAST
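A minimal sketch of why term and boolean queries touch only the postings of the query terms rather than every document. The toy index, its doc ids, and the function names are assumptions made for illustration; this is not Lucene code.

```python
from typing import Dict, List

# Hypothetical toy inverted index: term -> sorted list of matching doc ids.
POSTINGS: Dict[str, List[int]] = {
    "foo": [1, 4, 7, 9],
    "bar": [2, 4, 9, 12],
}


def search_or(postings: Dict[str, List[int]], a: str, b: str) -> List[int]:
    """Union of two sorted postings lists via a single linear merge."""
    pa, pb = postings.get(a, []), postings.get(b, [])
    i = j = 0
    out: List[int] = []
    while i < len(pa) and j < len(pb):
        if pa[i] == pb[j]:
            out.append(pa[i])
            i += 1
            j += 1
        elif pa[i] < pb[j]:
            out.append(pa[i])
            i += 1
        else:
            out.append(pb[j])
            j += 1
    out.extend(pa[i:])
    out.extend(pb[j:])
    return out


def search_and(postings: Dict[str, List[int]], a: str, b: str) -> List[int]:
    """Intersection of two sorted postings lists via a single linear merge."""
    pa, pb = postings.get(a, []), postings.get(b, [])
    i = j = 0
    out: List[int] = []
    while i < len(pa) and j < len(pb):
        if pa[i] == pb[j]:
            out.append(pa[i])
            i += 1
            j += 1
        elif pa[i] < pb[j]:
            i += 1
        else:
            j += 1
    return out


print(search_or(POSTINGS, "foo", "bar"))
print(search_and(POSTINGS, "foo", "bar"))
```

Both functions read only the two postings lists involved, so the work grows with the number of postings for the query terms and not with the size of the collection.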
17
•  text:[sci TO scj]
•  text:sci*
WHY ARE THESE STILL FAST?
18
TERM EXPANSION
•  discipline
•  luscious
•  science
•  scilla
•  scissors
text:[sci TO scj]
text:sci*
text:(science OR scilla OR scissors)
O(t)+O(r)
t – query terms
r – results
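The expansion step can be sketched as a binary search into a sorted term dictionary followed by a short forward scan. The list below reuses the slide's five terms; `expand_prefix` is a hypothetical helper, not a Lucene API.

```python
from bisect import bisect_left
from typing import List

# Term dictionary from the slide, kept in sorted order.
TERMS: List[str] = ["discipline", "luscious", "science", "scilla", "scissors"]


def expand_prefix(sorted_terms: List[str], prefix: str) -> List[str]:
    """Return all terms starting with `prefix`.

    Binary search finds the first candidate; the scan stops at the first
    term that no longer shares the prefix, so only matching terms are read.
    """
    start = bisect_left(sorted_terms, prefix)
    matches: List[str] = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


expanded = expand_prefix(TERMS, "sci")
print("text:(" + " OR ".join(expanded) + ")")
```

Because matching terms sit next to each other in sorted order, the expansion cost depends on the number of expanded terms (t), after which the rewritten boolean query pays the usual O(r) for results.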
20
PREFIX* SEARCH
24 ms
sci*
23
WHAT THEN?
text:*sci
O(T_all)+O(r)
•  asci
•  disci
•  discipline
•  lemnisci
•  luscious
•  menisci
T_all – all terms
r – results
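A small sketch of why a leading wildcard is expensive: terms that share a suffix are scattered across the sorted dictionary, so a naive expansion has to test every term. The helper name and the visit counter are assumptions added for illustration.

```python
from typing import List, Tuple

# Sorted term dictionary from the slide.
TERMS: List[str] = ["asci", "disci", "discipline", "lemnisci", "luscious", "menisci"]


def expand_suffix_by_scan(sorted_terms: List[str], suffix: str) -> Tuple[List[str], int]:
    """Return (matching terms, number of terms visited).

    Sorting by leading characters gives no help for a suffix match,
    so every term in the dictionary is inspected.
    """
    visited = 0
    matches: List[str] = []
    for term in sorted_terms:
        visited += 1
        if term.endswith(suffix):
            matches.append(term)
    return matches, visited


matches, visited = expand_suffix_by_scan(TERMS, "sci")
print(matches, "visited", visited, "of", len(TERMS))
```

The number of visited terms always equals the dictionary size, which is the O(T_all) part of the cost on the slide.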
24
*SUFFIX SEARCH
4948 ms
*sci
26
[Chart] RESPONSE TIME, ms (axis 0 to 6000): prefix*, *suffix
27
text:*sci
ReversedWildcardFilterFactory
0enilpicsid
0icsa
0icsid
0icsinem
0icsinmel
0suoicsul
asci
disci
discipline
lemnisci
luscious
menisci
text:0ics*
WHAT THEN? – REVERSE!
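The reversal trick can be sketched in a few lines: index each term a second time, reversed and tagged with a marker, and rewrite `*sci` into a prefix query over the reversed forms. The marker `0` follows the slide's notation; the function names are hypothetical and only mimic what the filter factory achieves.

```python
from bisect import bisect_left
from typing import List

MARKER = "0"  # stand-in for the marker character, as written on the slide

TERMS: List[str] = ["asci", "disci", "discipline", "lemnisci", "luscious", "menisci"]


def build_reversed(terms: List[str]) -> List[str]:
    """Sorted dictionary of marker + reversed term."""
    return sorted(MARKER + term[::-1] for term in terms)


def expand_suffix(reversed_terms: List[str], suffix: str) -> List[str]:
    """Answer *suffix as a prefix lookup over the reversed dictionary."""
    prefix = MARKER + suffix[::-1]  # "*sci" becomes "0ics*"
    start = bisect_left(reversed_terms, prefix)
    matches: List[str] = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(prefix):
            break
        # Strip the marker and reverse back to the original term.
        matches.append(reversed_term[len(MARKER):][::-1])
    return matches


REVERSED = build_reversed(TERMS)
print(REVERSED)
print(expand_suffix(REVERSED, "sci"))
```

The suffix query now costs the same as a prefix query, at the price of a second set of terms in the index.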
28
ReversedWildcardFilterFactory
31
WHAT THEN? – REVERSE!
text:*sci
ReversedWildcardFilterFactory
0enilpicsid/0
0icsa/10
0icsid/20
0icsinem/30
0icsinmel/40
0suoicsul/50
asci/60
disci/70
discipline/80
lemnisci/90
luscious/100
menisci/110
32
ReversedWildcardFilterFactory
33
Well.. Postings
asci/0 8, 9, 10, 14, 18, 23, 24, 26, 31, 35
disci/10 8, 11, 14, 18, 18, 18, 21, 23, 25, 27
discipline/20 4, 5, 6, 6, 9, 13, 13, 14, 18, 22
lemnisci/30 3, 4, 7, 9, 9, 9, 12, 13, 17, 20
luscious/40 3, 3, 5, 9, 9, 12, 14, 19, 23, 28
menisci/50 0, 2, 5, 6, 11, 13, 17, 22, 27
34
Well.. Postings .. ah yeah..
0enilpicsid
0icsa
0icsid
0icsinem
0icsinmel
0suoicsul
asci/0 8, 9, 10, 14, 18, 23, 24, 26, 31, 35
disci/10 8, 11, 14, 18, 18, 18, 21, 23, 25, 27
discipline/20 4, 5, 6, 6, 9, 13, 13, 14, 18, 22
lemnisci/30 3, 4, 7, 9, 9, 9, 12, 13, 17, 20
luscious/40 3, 3, 5, 9, 9, 12, 14, 19, 23, 28
menisci/50 0, 2, 5, 6, 11, 13, 17, 22, 27
35
Well.. Postings .. ah yeah.. (and positions!)
0enilpicsid/0 4, 5, 6, 6, 9, 13, 13, 14, 18, 22
0icsa/10 8, 9, 10, 14, 18, 23, 24, 26, 31, 35
0icsid/20 8, 11, 14, 18, 18, 18, 21, 23, 25, 27
0icsinem/30 0, 2, 5, 6, 11, 13, 17, 22, 27
0icsinmel/40 3, 4, 7, 9, 9, 9, 12, 13, 17, 20
0suoicsul/50 3, 3, 5, 9, 9, 12, 14, 19, 23, 28
asci/60 8, 9, 10, 14, 18, 23, 24, 26, 31, 35
disci/70 8, 11, 14, 18, 18, 18, 21, 23, 25, 27
discipline/80 4, 5, 6, 6, 9, 13, 13, 14, 18, 22
lemnisci/90 3, 4, 7, 9, 9, 9, 12, 13, 17, 20
luscious/100 3, 3, 5, 9, 9, 12, 14, 19, 23, 28
menisci/110 0, 2, 5, 6, 11, 13, 17, 22, 27
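A rough illustration of the cost shown above: every reversed term carries its own copy of the postings, so the number of stored postings entries doubles. The dictionary reuses three of the slide's postings lists; `add_reversed_terms` and `total_postings` are hypothetical helpers, and real index sizes also depend on positions and compression, which this sketch ignores.

```python
from typing import Dict, List

# Three postings lists copied from the slide (term -> doc ids).
POSTINGS: Dict[str, List[int]] = {
    "asci": [8, 9, 10, 14, 18, 23, 24, 26, 31, 35],
    "disci": [8, 11, 14, 18, 18, 18, 21, 23, 25, 27],
    "menisci": [0, 2, 5, 6, 11, 13, 17, 22, 27],
}


def add_reversed_terms(postings: Dict[str, List[int]], marker: str = "0") -> Dict[str, List[int]]:
    """Index every term twice: as-is and reversed, each with its own postings copy."""
    out: Dict[str, List[int]] = {}
    for term, docs in postings.items():
        out[term] = list(docs)
        out[marker + term[::-1]] = list(docs)  # duplicated postings
    return out


def total_postings(postings: Dict[str, List[int]]) -> int:
    """Number of stored postings entries across all terms."""
    return sum(len(docs) for docs in postings.values())


with_reversed = add_reversed_terms(POSTINGS)
print(total_postings(POSTINGS), "->", total_postings(with_reversed))
```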
36
benchmark khludnevm$ ant run-task -Dtask.alg=conf/index-5m.alg -Dtask.mem=1000m
…
     [java] --> Round 0-->1:   solr.server:org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient-->org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient
     [java]
     [java] ------------> starting task: StopSolrServer
     [java]
     [java] ------------> Report sum by Prefix (AddDocs) and Round (1 about 1 out of 13)
     [java] Operation   round   recsPerRun   elapsedSec    avgUsedMem    avgTotalMem
     [java] AddDocs         0      5000001     1,100.41   102,215,792    257,425,408
     [java]
     [java] Reopen Times:
     [java]  1166
     [java] ####################
     [java] ###  D O N E !!! ###
     [java] ####################

BUILD SUCCESSFUL
Total time: 19 minutes 28 seconds

$ ant run-task -Dtask.alg=conf/index-5m-reverse.alg -Dtask.mem=1000m
     [java] ------------> starting task: Rounds
 …....

     [java] ------------> Report sum by Prefix (AddDocs) and Round (1 about 1 out of 13)
     [java] Operation   round   recsPerRun      rec/s   elapsedSec    avgUsedMem    avgTotalMem
     [java] AddDocs         0      5000001   3,556.96     1,405.69    75,075,400    257,425,408
     [java]
     [java] Reopen Times:
     [java]  3114
     [java] ####################
     [java] ###  D O N E !!! ###
     [java] ####################
BUILD SUCCESSFUL
Total time: 26 minutes 50 seconds

$ du -hs ../example/schemaless/solr/gettingstarted/data/*
 28G ../example/schemaless/solr/gettingstarted/data/index-reverse
 13G ../example/schemaless/solr/gettingstarted/data/index-simple
37
[Chart] INDEX SIZE, GB (Main Index): Baseline (5M en wiki) 13G; Reversed 28G
40
text:*sci*
AND THEN
41
discipline
EdgeNGramFilter + ReversedWildcardFilter
EdgeNGram     Sort
discipline    cipline
iscipline     discipline
scipline      e
cipline       ine
ipline        ipline
pline         iscipline
line          line
ine           ne
ne            pline
e             scipline
*sci* -> sci*
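The table can be reproduced with a short sketch: emit every suffix of a term, sort the suffixes, and answer `*sci*` as a prefix lookup over them. The pairing of each suffix with its source term and the helper names are assumptions made for illustration, not the behaviour of the Lucene filters themselves.

```python
from bisect import bisect_left
from typing import List, Tuple


def suffixes(term: str) -> List[str]:
    """All suffixes of a term, longest first (the EdgeNGram column)."""
    return [term[i:] for i in range(len(term))]


def build_suffix_index(terms: List[str]) -> List[Tuple[str, str]]:
    """Sorted (suffix, original term) pairs."""
    return sorted((suffix, term) for term in terms for suffix in suffixes(term))


def expand_infix(index: List[Tuple[str, str]], infix: str) -> List[str]:
    """Answer *infix* as a prefix lookup over the sorted suffixes."""
    # (infix,) sorts before every (infix..., term) pair, so this is the first candidate.
    start = bisect_left(index, (infix,))
    found = set()
    for suffix, term in index[start:]:
        if not suffix.startswith(infix):
            break
        found.add(term)
    return sorted(found)


TERMS = ["asci", "disci", "discipline", "lemnisci", "luscious", "menisci"]
INDEX = build_suffix_index(TERMS)

print(suffixes("discipline"))
print(sorted(suffixes("discipline")))
print(expand_infix(INDEX, "sci"))
```

Every one of the six sample terms contains "sci", so the last line prints all of them; a rarer infix such as "cip" returns only "discipline".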
42
asci
ci
cious
cipline
disci
discipline
e
emnisci
enisci
i
ine
ious
ipline
isci
iscipline
lemnisci
line
luscious
menisci
mnisci
ne
nisci
ous
pline
s
sci
scious
scipline
us
uscious
43
[Chart] INDEX SIZE, GB (Main Index): Baseline (5M en wiki) 13G; Reversed 28G; EdgeNGram ~60G
44
https://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays
https://issues.apache.org/jira/browse/SOLR-9974
http://labs.carrotsearch.com/jsuffixarrays.html
SUFFIX ARRAY
45
SUGGESTER
46
AnalyzingInfixSuggester
LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji
SOLR-4945: Japanese Autocomplete and Highlighter broken
4945
autocomplete
broken
highlighter
japanese
solr
47
AnalyzingInfixSuggester TO THE RESCUE!
49
AnalyzingInfixSuggester FOR infix SEARCH
• feed AnalyzingInfixSuggester with the main index’s terms
• enable EdgeNGramFilter for AnalyzingInfixSuggester
discipline
iscipline
scipline
cipline
ipline
pline
line
ine
ne
e
50
• 14 M terms -> 79 M EdgeNGrams
• 10 min
• 3.3 GB (about 25% of the 13 GB main index)
BUILDING SUGGESTER INDEX
discipline
iscipline
scipline
cipline
ipline
pline
line
ine
ne
e
52
[Chart] INDEX SIZE, GB (Main Index + Suggester Index): Baseline (5M en wiki) 13G; Reversed 28G; EdgeNGram ~60G; Suggester 13G+3.3G
53
http://localhost:8901/solr/gettingstarted/suggest?suggest.dictionary=body_txt_en&suggest.q=sci
<response>
<lst name="responseHeader">
  <int name="QTime">4</int>
</lst>
<lst name="suggest">
  <lst name="body_txt_en">
    <lst name="sci">
      <int name="numFound">1000</int>
      <arr name="suggestions">
          <str>scienc</str>
          <str>scientif</str>
          <str>scientist</str>
          <str>disciplin</str>
          <str>sci</str>
          <str>conscious</str>
          <str>category:sci</str>
          <str>fascin</str>
          <str>discipl</str>
          <str>consciou</str>
          <str>unconsci</str>
          <str>conscienc</str>
          <str>oscil</str>
          <str>neurosci</str>
          <str>interdisciplinari</str>
          <str>disciplinari</str>
          <str>scissor</str>
          <str>ascii</str>
          <str>scientolog</str>
          <str>scimitar</str>
          <str>conscienti</str>
          <str>pseudosci</str>
          <str>rescind</str>
          <str>priscilla</str>
          <str>subconsci</str>
          <str>brescia</str>
          <str>scion</str>
          <str>category:scientif</str>
          <str>infobox_scientist</str>
          <str>www.newscientist.com</str>
          <str>resuscit</str>
          <str>plebiscit</str>
          <str>user:scimitar</str>
          <str>multidisciplinari</str>
          <str>fascia</str>
          <str>scifi</str>
          <str>geoscienc</str>
          <str>www.scifi.com</str>
          <str>www.sciencemag.org</str>
          <str>omnisci</str>
          <str>scipio</str>
          <str>neuroscientist</str>
          <str>scientologist</str>
….
54
AnalyzingInfixSuggester FOR infix SEARCH
• feed AnalyzingInfixSuggester with the main index’s terms
• enable EdgeNGramFilter for AnalyzingInfixSuggester
• override wildcard expansion by calling AnalyzingInfixSuggester
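The third step can be sketched as a query rewrite: instead of letting the engine expand `*sci*` by scanning the term dictionary, ask an external expander for the matching terms and build a plain boolean OR query from them. Everything here is hypothetical: `rewrite_infix_query`, the `expand` callback, and the substring-based stand-in for the suggester lookup are illustrations, not Solr or Lucene APIs.

```python
from typing import Callable, Iterable, List, Optional

# Stand-in for terms fed into the suggester from the main index (assumed sample data).
SUGGESTER_TERMS: List[str] = ["asci", "discipline", "science", "solr"]


def fake_suggester_lookup(infix: str) -> List[str]:
    """Placeholder for the real suggester call; a simple substring filter."""
    return [term for term in SUGGESTER_TERMS if infix in term]


def rewrite_infix_query(
    field: str,
    pattern: str,
    expand: Callable[[str], Iterable[str]],
) -> Optional[str]:
    """Rewrite field:*infix* into field:(t1 OR t2 ...).

    Patterns that are not of the *infix* form pass through unchanged.
    Returns None when the expander finds no terms, i.e. nothing can match.
    """
    is_infix = len(pattern) > 2 and pattern.startswith("*") and pattern.endswith("*")
    if not is_infix:
        return f"{field}:{pattern}"
    terms = sorted(set(expand(pattern[1:-1])))
    if not terms:
        return None
    return f"{field}:(" + " OR ".join(terms) + ")"


print(rewrite_infix_query("text", "*sci*", fake_suggester_lookup))
print(rewrite_infix_query("text", "foo", fake_suggester_lookup))
```

The main index then runs an ordinary boolean query, so its own term dictionary and postings stay unchanged.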
55
*infix* SEARCH
3834 ms
*sci*
142 ms
*sci*
59
[Chart] RESPONSE TIME, ms (axis 0 to 6000): prefix*, *suffix, *substr*, *suggester*
60
AnalyzingInfixSuggester FOR infix SEARCH
• existing scalable algorithm
• minor customization
• no postings explosion
• potentially supports NRT
61
[Diagram] Suggester and Main Index
Suggester (n-grams): asci, ci, cious, cipline, .., sci, scious, scipline, us, uscious -> discipline
Main Index (terms): asci, disci, discipline, lemnisci, luscious, menisci -> discipline
Document: School discipline .. is a required set of actions by a teacher towards a student …
62
[Diagram] Derivative Terms
Derivative Terms (n-grams): asci, ci, cious, cipline, .., sci, scious, scipline, us, uscious -> discipline
Terms: asci, disci, discipline, lemnisci, luscious, menisci -> discipline
Document: School discipline .. is a required set of actions by a teacher towards a student …
63
• A slight index format change
• many terms refer to the same postings list
• API:
•  indexWriter.deriveTerms("name", "name_edge", new EdgeNGramTokenFilter());
•  search: name_edge:sci*
• Hijacking and injecting codecs: LUCENE-7863
• Promising for deep taxonomies.
Derivative terms
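The core idea, many derived terms pointing at one postings list, can be sketched with plain Python references. This is only an illustration of the sharing: `derive_terms` and `search_prefix` are hypothetical helpers, not the `deriveTerms` call from the slide, and the final lookup scans the derived keys linearly for brevity where a real term dictionary would be sorted.

```python
from typing import Callable, Dict, List

# Assumed sample postings for the source field (term -> doc ids).
POSTINGS: Dict[str, List[int]] = {
    "asci": [8, 9],
    "discipline": [4, 5],
    "solr": [1],
}


def edge_suffixes(term: str) -> List[str]:
    """Derived terms for one source term: all of its suffixes."""
    return [term[i:] for i in range(len(term))]


def derive_terms(
    postings: Dict[str, List[int]],
    derive: Callable[[str], List[str]],
) -> Dict[str, List[List[int]]]:
    """Map each derived term to references to the source postings lists (no copies)."""
    derived: Dict[str, List[List[int]]] = {}
    for term, plist in postings.items():
        for derived_term in derive(term):
            derived.setdefault(derived_term, []).append(plist)
    return derived


def search_prefix(derived: Dict[str, List[List[int]]], prefix: str) -> List[int]:
    """name_edge:prefix* over the derived terms; returns sorted doc ids."""
    docs = set()
    for derived_term, plists in derived.items():
        if derived_term.startswith(prefix):
            for plist in plists:
                docs.update(plist)
    return sorted(docs)


DERIVED = derive_terms(POSTINGS, edge_suffixes)
print(search_prefix(DERIVED, "sci"))
print(DERIVED["sci"][0] is POSTINGS["asci"])
```

Only the term dictionary grows; the postings entries are stored once, which fits the slide's point that there is no postings explosion.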
64
*INFIX* SEARCH WITH DERIVATIVE TERMS
127 ms
*sci*
66
[Chart] RESPONSE TIME, ms (axis 0 to 6000): prefix*, *suffix, *substr*, *suggester*, *derived*
67
[Chart] INDEX SIZE, GB (Main Index + Suggester Index): Baseline (5M en wiki) 13G; Reversed 28G; EdgeNGram ~60G; Suggester 13G+3.3G; Derived Terms 17G
68
REFERENCES
•  What is in a Lucene index? Adrien Grand
https://www.youtube.com/watch?v=T5RmMNDR5XI
•  Automata Invasion. Robert Muir, Michael McCandless
https://www.youtube.com/watch?v=pd2jvy2IbJE
•  Lucene Search Essentials: Scorers, Collectors and Custom Queries. Mikhail Khludnev
https://www.youtube.com/watch?v=X9YovpYj6uo
•  In Search of Tommy Hilfiger (В поисках Tommy Hilfiger). Mikhail Khludnev
https://www.youtube.com/watch?v=Azf4oUL-Dqc
•  A new Lucene suggester based on infix matches
http://blog.mikemccandless.com/2013/06/a-new-lucene-suggester-based-on-infix.html
70
ANYWAY
text:*a*
71
CONTACTS
Mikhail_Khludnev@EPAM.COM
mkhl@apache.org
https://plus.google.com/+MikhailKhludnev
72
Thank You

Search LIKE %SQL% - Mikhail Khludnev, EPAM

  • 1. SEARCH LIKE %SQL% INFIX SEARCH IN LUCENE / SOLR / ELASTIC SEP 15, 2017
  • 2. 2 Talk Title Speaker Name Company SEARCH LIKE %SQL% Mikhail Khludnev EPAM
  • 3. 3 •  work in Search for 6 years •  Apache Lucene/Solr committer for 2 years •  speak at LuceneRevolution, BerlinBuzzwords •  chief search engineer in EPAM ABOUT ME
  • 5. 5 ESTABLISHED & EXPANDING GLOBAL VERTICALS Award-winning Wealth Management Platform Deep Expertise in Current and Emerging FinTech Working with 5 of the 10 Largest Investment Banks Leading Digital Transformation for Global Retailers Working with largest online travel association (OTA) & largest global hospitality company Recognized M&E Leader by Independent Research Analysts Working with 4 out of the 4 Top Broadcast Networks and 14 out of the top 30 TV Networks to transform consumer-driven media R&D Domain Experts with 700+ Complex Solutions & Services Supporting the Entire Drug Discovery Workflow Working with 9 of the 10 Top Pharma Companies 24-Year History of Leading Product Development Working with 30+ of the top 100 ISVs FINANCIAL SERVICES TRAVEL & CONSUMER SOFTWARE & HI-TECHLIFE SCIENCES AND HEALTHCARE MEDIA & ENTERTAINMENT EMERGING Deep Expertise Offers Innovative Solutions Working with industries ranging from Energy and Utilities to Telecom and Automotive
  • 6. 6 •  Term and boolean query •  Prefix*  query •  *suffix  query •  *infix*  query •  Approaching  Suggester   •  Derivative  Terms     AGENDA
  • 7. 7 •  Endeca •  MarkLogic •  FAST, Google Search Appliance •  Sphinx •  Apache Lucene •  Apache Solr •  Elastic SEARCH ENGINES
  • 8. 8
  • 9. 9
  • 10. 10 CUSTOMER PROFILE Any comprehensive text search service •  Patent •  Legal •  Chemistry •  Bioinformatics •  SQL legacy
  • 12. 12 Business Problem/Opportunity •  Ill searches for *infix* CHALLENGE Bank of England
  • 13. 13 Business Problem/Opportunity •  Ill searches for *infix* CHALLENGE
  • 14. 14 …at all? Or what’s fast at comparison to it? WHY IT’S A PROBLEM?
  • 15. 15 text:foo      OR    text:bar                 text:foo    AND    text:bar     THESE SEARCHES ARE (CONSIDERED AS) FAST
  • 16. 16 text:foo  OR    text:bar   text:foo  AND  text:bar     O(r)  <<  O(Dall)     r  –  results   Dall  –  all  docs     THESE SEARCHES ARE (CONSIDERED AS) FAST
  • 17. 17 • text:[sci  TO  scj]   • text:sci*   WHY THESE ARE STILL FAST?
  • 18. 18 TERM EXPANSION •  discipline •  luscious •  science •  scilla •  scissors text:[sci  TO  scj]   text:sci*   text:(science  OR  scilla  OR  scissors)     O(t)+O(r)   t – query terms r - results
  • 19. 19 TERM EXPANSION •  discipline •  luscious •  science •  scilla •  scissors text:[sci  TO  scj]   text:sci*   text:(science  OR  scilla  OR  scissors)   O(t)+O(r)  
  • 21. 21 ms
  • 22. 22 WHAT’S THEN? •  asci •  disci •  discipline •  lemnisci •  luscious •  menisci text:*sci  
  • 23. 23 WHAT’S THEN? text:*sci       O(Tall)+O(r)   •  asci •  disci •  discipline •  lemnisci •  luscious •  menisci Tall – all terms r - results
  • 25. 25 ms
  • 26. 26 0 1000 2000 3000 4000 5000 6000 prefix* *suffix RESPONSE TIME, ms
  • 29. 29
  • 30. 30
  • 31. 31 WHAT’S THEN? – REVERSE! text:*sci     ReversedWildcardFilterFactory   0enilpicsid/0 0icsa/10 0icsid/20 0icsinem/30 0icsinmel/40 0suoicsul/50 asci/60 disci/70 discipline/80 lemnisci/90 luscious/100 menisci/110
  • 33. 33 Well.. Postings asci/0 8, 9, 10, 14, 18, 23, 24, 26, 31, 35 disci/10 8, 11, 14, 18, 18, 18, 21, 23, 25, 27 discipline/20 4, 5, 6, 6, 9, 13, 13, 14, 18, 22 lemnisci/30 3, 4, 7, 9, 9, 9, 12, 13, 17, 20 luscious/40 3, 3, 5, 9, 9, 12, 14, 19, 23, 28 menisci/50 0, 2, 5, 6, 11, 13, 17, 22, 27
  • 34. 34 Well.. Postings .. ah yeah.. 0enilpicsid 0icsa 0icsid 0icsinem 0icsinmel 0suoicsul asci/0 8, 9, 10, 14, 18, 23, 24, 26, 31, 35 disci/10 8, 11, 14, 18, 18, 18, 21, 23, 25, 27 discipline/20 4, 5, 6, 6, 9, 13, 13, 14, 18, 22 lemnisci/30 3, 4, 7, 9, 9, 9, 12, 13, 17, 20 luscious/40 3, 3, 5, 9, 9, 12, 14, 19, 23, 28 menisci/50 0, 2, 5, 6, 11, 13, 17, 22, 27
  • 35. 35 Well.. Postings .. ah yeah.. (and positions!) 0enilpicsid/0 4, 5, 6, 6, 9, 13, 13, 14, 18, 22 0icsa/10 8, 9, 10, 14, 18, 23, 24, 26, 31, 35 0icsid/20 8, 11, 14, 18, 18, 18, 21, 23, 25, 27 0icsinem/30 0, 2, 5, 6, 11, 13, 17, 22, 27 0icsinmel/40 3, 4, 7, 9, 9, 9, 12, 13, 17, 20 0suoicsul/50 3, 3, 5, 9, 9, 12, 14, 19, 23, 28 asci/60 8, 9, 10, 14, 18, 23, 24, 26, 31, 35 disci/70 8, 11, 14, 18, 18, 18, 21, 23, 25, 27 discipline/80 4, 5, 6, 6, 9, 13, 13, 14, 18, 22 lemnisci/90 3, 4, 7, 9, 9, 9, 12, 13, 17, 20 luscious/100 3, 3, 5, 9, 9, 12, 14, 19, 23, 28 menisci/110 0, 2, 5, 6, 11, 13, 17, 22, 27
  • 36.
      benchmark khludnevm$ ant run-task -Dtask.alg=conf/index-5m.alg -Dtask.mem=1000m
      …
      [java] --> Round 0-->1: solr.server:org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient-->org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient
      [java] ------------> starting task: StopSolrServer
      [java] ------------> Report sum by Prefix (AddDocs) and Round (1 about 1 out of 13)
      [java] Operation  round  recsPerRun  elapsedSec   avgUsedMem  avgTotalMem
      [java] AddDocs        0     5000001    1,100.41  102,215,792  257,425,408
      [java] Reopen Times:
      [java]  1166
      [java] ####################
      [java] ###  D O N E !!! ###
      [java] ####################
      BUILD SUCCESSFUL
      Total time: 19 minutes 28 seconds

      $ ant run-task -Dtask.alg=conf/index-5m-reverse.alg -Dtask.mem=1000m
      [java] ------------> starting task: Rounds
      …....
      [java] ------------> Report sum by Prefix (AddDocs) and Round (1 about 1 out of 13)
      [java] Operation  round  recsPerRun     rec/s  elapsedSec  avgUsedMem  avgTotalMem
      [java] AddDocs        0     5000001  3,556.96    1,405.69  75,075,400  257,425,408
      [java] Reopen Times:
      [java]  3114
      [java] ####################
      [java] ###  D O N E !!! ###
      [java] ####################
      BUILD SUCCESSFUL
      Total time: 26 minutes 50 seconds

      $ du -hs ../example/schemaless/solr/gettingstarted/data/*
      28G ../example/schemaless/solr/gettingstarted/data/index-reverse
      13G ../example/schemaless/solr/gettingstarted/data/index-simple
  • 41. EdgeNGramFilter + ReversedWildcardFilter, shown for the term discipline. EdgeNGram output: discipline, iscipline, scipline, cipline, ipline, pline, line, ine, ne, e. Sorted: cipline, discipline, e, ine, ipline, iscipline, line, ne, pline, scipline. The infix query *sci* becomes the prefix query sci*.
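The two columns on this slide can be reproduced in a few lines. This is only a sketch of the idea: `suffix_ngrams` and `has_infix` are hypothetical names, and a real analysis chain would emit these n-grams as index terms.

```python
from bisect import bisect_left

def suffix_ngrams(term):
    """All suffixes of a term, i.e. edge n-grams anchored at the term's end."""
    return [term[i:] for i in range(len(term))]

def has_infix(sorted_grams, infix):
    """An infix of the term is a prefix of one of its suffixes."""
    i = bisect_left(sorted_grams, infix)
    return i < len(sorted_grams) and sorted_grams[i].startswith(infix)

grams = suffix_ngrams("discipline")
sorted_grams = sorted(grams)
print(grams)
print(sorted_grams)
print(has_infix(sorted_grams, "sci"))  # *sci* -> sci*
```

Each term of length n contributes n index terms, which is where the index growth on the next slide comes from.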
  • 43. [Chart: INDEX SIZE, GB (Main Index). Baseline (5M en wiki): 13G; Reversed: 28G; EdgeNGram: ~60G]
  • 46. AnalyzingInfixSuggester. Example suggestions: "LUCENE-3922: Add Japanese Kanji number normalization to Kuromoji" and "SOLR-4945: Japanese Autocomplete and Highlighter broken". Indexed terms for the latter: 4945, autocomplete, broken, highlighter, japanese, solr
  • 49. AnalyzingInfixSuggester FOR infix SEARCH: feed AnalyzingInfixSuggester with the main index's terms; enable EdgeNGramFilter for AnalyzingInfixSuggester. Example n-grams for discipline: discipline, iscipline, scipline, cipline, ipline, pline, line, ine, ne, e
  • 50. BUILDING SUGGESTER INDEX: 14 M terms -> 79 M edge n-grams; 10 min; 3.3 G (25% of the 13G main index). Example n-grams for discipline: discipline, iscipline, scipline, cipline, ipline, pline, line, ine, ne, e
  • 52. [Chart: INDEX SIZE, GB (Main Index and Suggester Index). Baseline (5M en wiki): 13G; Reversed: 28G; EdgeNGram: ~60G; Suggester: 13G + 3.3G]
  • 53. http://localhost:8901/solr/gettingstarted/suggest?suggest.dictionary=body_txt_en&suggest.q=sci
      <response>
      <lst name="responseHeader"><int name="QTime">4</int></lst>
      <lst name="suggest">
        <lst name="body_txt_en">
          <lst name="sci">
            <int name="numFound">1000</int>
            <arr name="suggestions">
              <str>scienc</str>
              <str>scientif</str>
              <str>scientist</str>
              <str>disciplin</str>
              <str>sci</str>
              <str>conscious</str>
              <str>category:sci</str>
              <str>fascin</str>
              <str>discipl</str>
              <str>consciou</str>
              <str>unconsci</str>
              <str>conscienc</str>
              <str>oscil</str>
              <str>neurosci</str>
              <str>interdisciplinari</str>
              <str>disciplinari</str>
              <str>scissor</str>
              <str>ascii</str>
              <str>scientolog</str>
              <str>scimitar</str>
              <str>conscienti</str>
              <str>pseudosci</str>
              <str>rescind</str>
              <str>priscilla</str>
              <str>subconsci</str>
              <str>brescia</str>
              <str>scion</str>
              <str>category:scientif</str>
              <str>infobox_scientist</str>
              <str>www.newscientist.com</str>
              <str>resuscit</str>
              <str>plebiscit</str>
              <str>user:scimitar</str>
              <str>multidisciplinari</str>
              <str>fascia</str>
              <str>scifi</str>
              <str>geoscienc</str>
              <str>www.scifi.com</str>
              <str>www.sciencemag.org</str>
              <str>omnisci</str>
              <str>scipio</str>
              <str>neuroscientist</str>
              <str>scientologist</str>
              ….
      Highlighted on the slide: <int name="QTime">4</int>
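A client-side sketch of this request and response, using only the Python standard library. The host, handler path and parameters are the ones on the slide; the XML sample is a shortened excerpt of the slide's response, and its closing tags are my completion of the truncated output, so treat the exact shape as an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Request URL as shown on the slide.
base = "http://localhost:8901/solr/gettingstarted/suggest"
url = base + "?" + urlencode({"suggest.dictionary": "body_txt_en", "suggest.q": "sci"})

# Shortened excerpt of the response; closing tags are assumed.
sample = """
<response>
  <lst name="responseHeader"><int name="QTime">4</int></lst>
  <lst name="suggest"><lst name="body_txt_en"><lst name="sci">
    <int name="numFound">1000</int>
    <arr name="suggestions">
      <str>scienc</str><str>scientif</str><str>disciplin</str>
    </arr>
  </lst></lst></lst>
</response>
"""

root = ET.fromstring(sample)
qtime = int(root.find("./lst[@name='responseHeader']/int[@name='QTime']").text)
suggestions = [s.text for s in root.iter("str")]
print(url)
print(qtime, suggestions)
```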
  • 54. AnalyzingInfixSuggester FOR infix SEARCH: feed AnalyzingInfixSuggester with the main index's terms; enable EdgeNGramFilter for AnalyzingInfixSuggester; override wildcard expansion by calling AnalyzingInfixSuggester
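The three steps on this slide amount to a two-phase lookup: a small side index maps n-grams back to the main index's terms, and the wildcard is expanded through it instead of by scanning the term dictionary. The sketch below models that flow with plain dictionaries. All names and the toy postings are made up for illustration, and it does not use the suggester's real API.

```python
from bisect import bisect_left
from collections import defaultdict

# Toy main index: term -> postings (document ids). Data is illustrative only.
main_index = {
    "science": [1, 4],
    "discipline": [2, 4],
    "luscious": [3],
    "menisci": [5],
    "lucene": [6],
}

def build_side_index(terms):
    """Side index: every suffix n-gram points back to its source terms."""
    gram_to_terms = defaultdict(set)
    for term in terms:
        for i in range(len(term)):
            gram_to_terms[term[i:]].add(term)
    return sorted(gram_to_terms), gram_to_terms

def expand_infix(side, infix):
    """Phase 1: prefix lookup over n-grams yields the terms containing infix."""
    grams, gram_to_terms = side
    upper = infix[:-1] + chr(ord(infix[-1]) + 1)
    lo, hi = bisect_left(grams, infix), bisect_left(grams, upper)
    found = set()
    for gram in grams[lo:hi]:
        found |= gram_to_terms[gram]
    return sorted(found)

def search_infix(index, side, infix):
    """Phase 2: OR the expanded terms against the unchanged main postings."""
    docs = set()
    for term in expand_infix(side, infix):
        docs.update(index[term])
    return sorted(docs)

side = build_side_index(main_index)
expanded_terms = expand_infix(side, "sci")
hits = search_infix(main_index, side, "sci")
print(expanded_terms, hits)
```

The main index keeps one postings list per original term; only the side index grows with the n-grams.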
  • 59. [Chart: RESPONSE TIME, ms, for prefix*, *suffix, *substr* and *suggester* queries; axis from 0 to 6000]
  • 60. AnalyzingInfixSuggester FOR infix SEARCH: existing scalable algorithm; minor customization; no postings explosion; potentially supports NRT (near-real-time) search
  • 63. DERIVATIVE TERMS: a slight index format change; many terms refer to the same postings list; the API is indexWriter.deriveTerms("name", "name_edge", new EdgeNgrammTokenFilter()); search: name_edge:sci*; hijacking and injecting codecs (LUCENE-7863); promising for deep taxonomies.
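The core idea, many derived terms referring to one postings list, can be modelled with object references. This is a conceptual sketch only: `derive_terms` here is a hypothetical Python function, not the API from the slide, and keeping a list of references per derived term (so that a derived term shared by several source terms unions their postings) is my assumption about how collisions could be handled.

```python
from bisect import bisect_left

def suffixes(term):
    return [term[i:] for i in range(len(term))]

def derive_terms(source_field, derive):
    """Build a derived field whose terms point at the source postings lists."""
    derived = {}
    for term, postings in source_field.items():
        for d in derive(term):
            # Store a reference to the existing list, never a copy.
            derived.setdefault(d, []).append(postings)
    return derived

def prefix_search(derived, prefix):
    """Prefix query over the derived field, e.g. name_edge:sci*."""
    keys = sorted(derived)
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo, hi = bisect_left(keys, prefix), bisect_left(keys, upper)
    docs = set()
    for key in keys[lo:hi]:
        for postings in derived[key]:
            docs.update(postings)
    return sorted(docs)

# Toy source field: term -> postings. Data is illustrative only.
name = {"science": [1, 4], "discipline": [2, 4], "lucene": [6]}
name_edge = derive_terms(name, suffixes)
derived_hits = prefix_search(name_edge, "sci")
print(derived_hits)
```

The term dictionary grows, but the postings are stored once per source term.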
  • 64. *INFIX* SEARCH WITH DERIVATIVE TERMS: *sci* takes 127 ms
  • 66. [Chart: RESPONSE TIME, ms, for prefix*, *suffix, *substr*, *suggester* and *derived* queries; axis from 0 to 6000]
  • 67. [Chart: INDEX SIZE, GB (Main Index and Suggester Index). Baseline (5M en wiki): 13G; Reversed: 28G; EdgeNGram: ~60G; Suggester: 13G + 3.3G; Derived Terms: 17G]
  • 68. REFERENCES. What is in a Lucene index? Adrien Grand. https://www.youtube.com/watch?v=T5RmMNDR5XI • Automata Invasion. Robert Muir, Michael McCandless. https://www.youtube.com/watch?v=pd2jvy2IbJE • Lucene Search Essentials: Scorers, Collectors and Custom Queries. Mikhail Khludnev. https://www.youtube.com/watch?v=X9YovpYj6uo • A new Lucene suggester based on infix matches. http://blog.mikemccandless.com/2013/06/a-new-lucene-suggester-based-on-infix.html
  • 69. REFERENCES (continued). In Search of Tommy Hilfiger (talk in Russian: В поисках Tommy Hilfiger). Mikhail Khludnev. https://www.youtube.com/watch?v=Azf4oUL-Dqc