SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
OCTOBER 11-14, 2016 • BOSTON, MA
Autocomplete Multi-Language Search Using Ngram
and EDismax Phrase Queries
Ivan Provalov
Sr Software Engineer, Netflix
• Use Case
• Configuration, scoring
• Language challenges
• Character mapper
• Query testing framework
Overview
• Netflix launched globally in January 2016
• 190 countries
• Currently support 23 languages
Going Global at Netflix
Use Case
• Video titles, person's names, genre names
• Shorter documents should be ranked higher
• Autocomplete
• Recall over precision for lexical matches (click
signal corrects this)
Configuration
• Solr 4.6.1
• Edismax: boosting, simple syntax, max field
field score
• Phrase: prevents from cross field search
• Ngram: character ngram search
“Breaking bad”
b - 0
br - 0
bre - 0
brea - 0
break - 0
breaki - 0
breakin - 0
breaking - 0
b - 1
ba - 1
bad - 1
Character Ngram Search
Scoring
• Skewed data distribution (e.g. one field
sparsely populated)
• Doc length normalization
• Unigram language model
• Term Frequency / Terms in Doc
• Log to avoid underflow errors
• Negative score (5.5.2 Dismax Scorer breaks)
Language Challenges
• Multiple Scripts
– Japanese: Kanji, Hiragana, Katakana, Romaji
• No token delimiters: Japanese, Chinese
• Korean character composition
• Stopwords and autocomplete
• Stemming
Korean: Character Composition
• input jamo ㄱ ㅗㅏ ㅇ
• decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ
• fully composed hangul 광
Japanese: Multiple Scripts
• ‘南極物語’ (‘Antarctic Story’)
• Tokenizer: 南極 物語
• Reading form: ナンキョク モノガタリ
• Query in Katakana: ナンキョク
• Query in Hiragana: なんきょく
• Transliteration required
• Char Filter: pre-processes input characters
• Tokenizer: breaks data into tokens
• Filters: transform, remove, create new tokens
Tokenization Pipelines
Simple Pipeline Example: index
• CharFilters: PatternReplaceCharFilterFactory
– pattern: ([a-z]+)ing
• Tokenizer: StandardTokenizerFactory
• Filters: LowerCaseFilterFactory,
EdgeNGramFilterFactory
• CharFilters: PatternReplaceCharFilterFactory
– pattern: ([a-z]+)ing
• Tokenizer: StandardTokenizerFactory
• Filters: LowerCaseFilterFactory
Simple Pipeline Example: query
Simple Pipeline Example
• Prefix Removal
– Arabic ‫ال‬ (alef lam)
• Suffix folding
– Japanese ァ (katakana small a) => ア (a)
• Character decomposition
– Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ
(e)
Character Mapping Filter Cases
Character Mapping Filter Cases
• Stemmer implementation, or extension
– Character mapper reference implementation of
the Russian stemmer
• Patch to Lucene
– LUCENE-7321
Query Testing Framework
• Open source project
• Google Spreadsheets based UI
• Unit tests for languages queries
• Regression testing after changes, upgrades
• 20K queries
• 7K titles
Google Spreadsheets as Input
Google Spreadsheets as Detail
Report
Diff
Google Spreadsheets as Summary
Report
Diff
Summary
• Use case: short fields, autocomplete, P/R
• Configuration, scoring
• Language challenges
• Character Mapper patch (LUCENE-7321)
• Query testing framework
https://github.com/Netflix/q
Query testing framework
Chris Manning IR Book, LM Chapter
Trey Grainger’s presentation on Semantic & Multilingual
Strategies in Lucene/Solr
Character Mapping Patch and Documentation
Java Internationalization, March 25, 2001, by David Czarnecki,
Andy Deitsch
References

Weitere ähnliche Inhalte

Was ist angesagt?

High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...Lucidworks
 
The ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsThe ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsItamar
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesItamar
 
PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)
PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)
PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)Ontico
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
A Survey of Elasticsearch Usage
A Survey of Elasticsearch UsageA Survey of Elasticsearch Usage
A Survey of Elasticsearch UsageGreg Brown
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and SparkLucidworks
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastUwe Korn
 
Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Shalin Shekhar Mangar
 
Использование Elasticsearch для организации поиска по сайту
Использование Elasticsearch для организации поиска по сайтуИспользование Elasticsearch для организации поиска по сайту
Использование Elasticsearch для организации поиска по сайтуOlga Lavrentieva
 
Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!Philips Kokoh Prasetyo
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From SolrRamzi Alqrainy
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch BasicsShifa Khan
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with ElasticsearchSamantha Quiñones
 
Денис Головняк - Продвинутый поиск с помощью Search API
Денис Головняк - Продвинутый поиск с помощью Search APIДенис Головняк - Продвинутый поиск с помощью Search API
Денис Головняк - Продвинутый поиск с помощью Search APILEDC 2016
 

Was ist angesagt? (20)

High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuest’s Search Ah...
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
The ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch pluginsThe ultimate guide for Elasticsearch plugins
The ultimate guide for Elasticsearch plugins
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
 
PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)
PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)
PostgreSQL @Alibaba Cloud / Xianming Dou (Alibaba Cloud)
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
A Survey of Elasticsearch Usage
A Survey of Elasticsearch UsageA Survey of Elasticsearch Usage
A Survey of Elasticsearch Usage
 
Data Science with Solr and Spark
Data Science with Solr and SparkData Science with Solr and Spark
Data Science with Solr and Spark
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
Big data elasticsearch practical
Big data  elasticsearch practicalBig data  elasticsearch practical
Big data elasticsearch practical
 
pandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fastpandas.(to/from)_sql is simple but not fast
pandas.(to/from)_sql is simple but not fast
 
Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6
 
Использование Elasticsearch для организации поиска по сайту
Использование Elasticsearch для организации поиска по сайтуИспользование Elasticsearch для организации поиска по сайту
Использование Elasticsearch для организации поиска по сайту
 
Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!Elasticsearch: You know, for search! and more!
Elasticsearch: You know, for search! and more!
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with Elasticsearch
 
Денис Головняк - Продвинутый поиск с помощью Search API
Денис Головняк - Продвинутый поиск с помощью Search APIДенис Головняк - Продвинутый поиск с помощью Search API
Денис Головняк - Продвинутый поиск с помощью Search API
 

Ähnlich wie Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Presented by Ivan Provalov, Netflix

Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web CorpusPart-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web CorpusAtsushi Keyaki
 
Optimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freaksOptimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freakskarupanerura
 
Apache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day NetherlandsApache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day NetherlandsIngo Renner
 
nGram full text search (by 이성욱)
nGram full text search (by 이성욱)nGram full text search (by 이성욱)
nGram full text search (by 이성욱)I Goo Lee.
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSSujit Pal
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic RIILP
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler DesignKuppusamy P
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupOleksii Holub
 
Lazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerLazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerSho Fola Soboyejo
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday PeopleRebecca Bilbro
 
Json - ideal for data interchange
Json - ideal for data interchangeJson - ideal for data interchange
Json - ideal for data interchangeChristoph Santschi
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Fwdays
 
Message:Passing - lpw 2012
Message:Passing - lpw 2012Message:Passing - lpw 2012
Message:Passing - lpw 2012Tomas Doran
 
萌典與零時政府
萌典與零時政府萌典與零時政府
萌典與零時政府Au Tang
 
Formal machines for Streaming XML Querying
Formal machines for Streaming XML QueryingFormal machines for Streaming XML Querying
Formal machines for Streaming XML QueryingJustAnotherAbstraction
 
General guide in nlp
General guide in nlpGeneral guide in nlp
General guide in nlpHeng-Xiu Xu
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkTomas Doran
 

Ähnlich wie Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Presented by Ivan Provalov, Netflix (20)

Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web CorpusPart-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
 
Perl5 VS JSON
Perl5 VS JSONPerl5 VS JSON
Perl5 VS JSON
 
Optimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freaksOptimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freaks
 
Apache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day NetherlandsApache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
 
nGram full text search (by 이성욱)
nGram full text search (by 이성욱)nGram full text search (by 이성욱)
nGram full text search (by 이성욱)
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler Design
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape Meetup
 
Lazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerLazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text Summarizer
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday People
 
Json - ideal for data interchange
Json - ideal for data interchangeJson - ideal for data interchange
Json - ideal for data interchange
 
Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)
 
Lexing and parsing
Lexing and parsingLexing and parsing
Lexing and parsing
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
 
Message:Passing - lpw 2012
Message:Passing - lpw 2012Message:Passing - lpw 2012
Message:Passing - lpw 2012
 
萌典與零時政府
萌典與零時政府萌典與零時政府
萌典與零時政府
 
Formal machines for Streaming XML Querying
Formal machines for Streaming XML QueryingFormal machines for Streaming XML Querying
Formal machines for Streaming XML Querying
 
General guide in nlp
General guide in nlpGeneral guide in nlp
General guide in nlp
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 

Mehr von Lucidworks

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceLucidworks
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesLucidworks
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchLucidworks
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondLucidworks
 

Mehr von Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 

Kürzlich hochgeladen

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Kürzlich hochgeladen (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Presented by Ivan Provalov, Netflix

  • 1. OCTOBER 11-14, 2016 • BOSTON, MA
  • 2. Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries Ivan Provalov Sr Software Engineer, Netflix
  • 3. • Use Case • Configuration, scoring • Language challenges • Character mapper • Query testing framework Overview
  • 4. • Netflix launched globally in January 2016 • 190 countries • Currently support 23 languages Going Global at Netflix
  • 5.
  • 6. Use Case • Video titles, person's names, genre names • Shorter documents should be ranked higher • Autocomplete • Recall over precision for lexical matches (click signal corrects this)
  • 7. Configuration • Solr 4.6.1 • Edismax: boosting, simple syntax, max field field score • Phrase: prevents from cross field search • Ngram: character ngram search
  • 8. “Breaking bad” b - 0 br - 0 bre - 0 brea - 0 break - 0 breaki - 0 breakin - 0 breaking - 0 b - 1 ba - 1 bad - 1 Character Ngram Search
  • 9. Scoring • Skewed data distribution (e.g. one field sparsely populated) • Doc length normalization • Unigram language model • Term Frequency / Terms in Doc • Log to avoid underflow errors • Negative score (5.5.2 Dismax Scorer breaks)
  • 10. Language Challenges • Multiple Scripts – Japanese: Kanji, Hiragana, Katakana, Romaji • No token delimiters: Japanese, Chinese • Korean character composition • Stopwords and autocomplete • Stemming
  • 11. Korean: Character Composition • input jamo ㄱ ㅗㅏ ㅇ • decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ • fully composed hangul 광
  • 12. Japanese: Multiple Scripts • ‘南極物語’ (‘Antarctic Story’) • Tokenizer: 南極 物語 • Reading form: ナンキョク モノガタリ • Query in Katakana: ナンキョク • Query in Hiragana: なんきょく • Transliteration required
  • 13. • Char Filter: pre-processes input characters • Tokenizer: breaks data into tokens • Filters: transform, remove, create new tokens Tokenization Pipelines
  • 14. Simple Pipeline Example: index • CharFilters: PatternReplaceCharFilterFactory – pattern: ([a-z]+)ing • Tokenizer: StandardTokenizerFactory • Filters: LowerCaseFilterFactory, EdgeNGramFilterFactory
  • 15. • CharFilters: PatternReplaceCharFilterFactory – pattern: ([a-z]+)ing • Tokenizer: StandardTokenizerFactory • Filters: LowerCaseFilterFactory Simple Pipeline Example: query
  • 17. • Prefix Removal – Arabic ‫ال‬ (alef lam) • Suffix folding – Japanese ァ (katakana small a) => ア (a) • Character decomposition – Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ (e) Character Mapping Filter Cases
  • 18. Character Mapping Filter Cases • Stemmer implementation, or extension – Character mapper reference implementation of the Russian stemmer • Patch to Lucene – LUCENE-7321
  • 19. Query Testing Framework • Open source project • Google Spreadsheets based UI • Unit tests for languages queries • Regression testing after changes, upgrades • 20K queries • 7K titles
  • 21. Google Spreadsheets as Detail Report Diff
  • 22. Google Spreadsheets as Summary Report Diff
  • 23. Summary • Use case: short fields, autocomplete, P/R • Configuration, scoring • Language challenges • Character Mapper patch (LUCENE-7321) • Query testing framework https://github.com/Netflix/q
  • 24. Query testing framework Chris Manning IR Book, LM Chapter Trey Grainger’s presentation on Semantic & Multilingual Strategies in Lucene/Solr Character Mapping Patch and Documentation Java Internationalization, March 25, 2001, by David Czarnecki, Andy Deitsch References