SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
OCTOBER 11-14, 2016 • BOSTON, MA
Autocomplete Multi-Language Search Using Ngram
and EDismax Phrase Queries
Ivan Provalov
Sr Software Engineer, Netflix
• Use Case
• Configuration, scoring
• Language challenges
• Character mapper
• Query testing framework
Overview
• Netflix launched globally in January 2016
• 190 countries
• Currently support 23 languages
Going Global at Netflix
Use Case
• Video titles, person's names, genre names
• Shorter documents should be ranked higher
• Autocomplete
• Recall over precision for lexical matches (click
signal corrects this)
Configuration
• Solr 4.6.1
• Edismax: boosting, simple syntax, max field
field score
• Phrase: prevents from cross field search
• Ngram: character ngram search
“Breaking bad”
b - 0
br - 0
bre - 0
brea - 0
break - 0
breaki - 0
breakin - 0
breaking - 0
b - 1
ba - 1
bad - 1
Character Ngram Search
Scoring
• Skewed data distribution (e.g. one field
sparsely populated)
• Doc length normalization
• Unigram language model
• Term Frequency / Terms in Doc
• Log to avoid underflow errors
• Negative score (5.5.2 Dismax Scorer breaks)
Language Challenges
• Multiple Scripts
– Japanese: Kanji, Hiragana, Katakana, Romaji
• No token delimiters: Japanese, Chinese
• Korean character composition
• Stopwords and autocomplete
• Stemming
Korean: Character Composition
• input jamo ㄱ ㅗㅏ ㅇ
• decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ
• fully composed hangul 광
Japanese: Multiple Scripts
• ‘南極物語’ (‘Antarctic Story’)
• Tokenizer: 南極 物語
• Reading form: ナンキョク モノガタリ
• Query in Katakana: ナンキョク
• Query in Hiragana: なんきょく
• Transliteration required
• Char Filter: pre-processes input characters
• Tokenizer: breaks data into tokens
• Filters: transform, remove, create new tokens
Tokenization Pipelines
Simple Pipeline Example: index
• CharFilters: PatternReplaceCharFilterFactory
– pattern: ([a-z]+)ing
• Tokenizer: StandardTokenizerFactory
• Filters: LowerCaseFilterFactory,
EdgeNGramFilterFactory
• CharFilters: PatternReplaceCharFilterFactory
– pattern: ([a-z]+)ing
• Tokenizer: StandardTokenizerFactory
• Filters: LowerCaseFilterFactory
Simple Pipeline Example: query
Simple Pipeline Example
• Prefix Removal
– Arabic ‫ال‬ (alef lam)
• Suffix folding
– Japanese ァ (katakana small a) => ア (a)
• Character decomposition
– Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ
(e)
Character Mapping Filter Cases
Character Mapping Filter Cases
• Stemmer implementation, or extension
– Character mapper reference implementation of
the Russian stemmer
• Patch to Lucene
– LUCENE-7321
Query Testing Framework
• Open source project
• Google Spreadsheets based UI
• Unit tests for languages queries
• Regression testing after changes, upgrades
• 20K queries
• 7K titles
Google Spreadsheets as Input
Google Spreadsheets as Detail
Report
Diff
Google Spreadsheets as Summary
Report
Diff
Summary
• Use case: short fields, autocomplete, P/R
• Configuration, scoring
• Language challenges
• Character Mapper patch (LUCENE-7321)
• Query testing framework
https://github.com/Netflix/q
Query testing framework
Chris Manning IR Book, LM Chapter
Trey Grainger’s presentation on Semantic & Multilingual
Strategies in Lucene/Solr
Character Mapping Patch and Documentation
Java Internationalization, March 25, 2001, by David Czarnecki,
Andy Deitsch
References

Weitere ähnliche Inhalte

Was ist angesagt?

TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewPoo Kuan Hoong
 
Introduction to Artificial Intelligence
Introduction to Artificial IntelligenceIntroduction to Artificial Intelligence
Introduction to Artificial Intelligenceananth
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Tutorial: Context-awareness In Information Retrieval and Recommender Systems
Tutorial: Context-awareness In Information Retrieval and Recommender SystemsTutorial: Context-awareness In Information Retrieval and Recommender Systems
Tutorial: Context-awareness In Information Retrieval and Recommender SystemsYONG ZHENG
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Pythonshanbady
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 
Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data scienceAkira Shibata
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Anoop Deoras
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix ScaleJustin Basilico
 
Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandasPiyush rai
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 
Context Aware Recommendations at Netflix
Context Aware Recommendations at NetflixContext Aware Recommendations at Netflix
Context Aware Recommendations at NetflixLinas Baltrunas
 
[系列活動] 一天搞懂對話機器人
[系列活動] 一天搞懂對話機器人[系列活動] 一天搞懂對話機器人
[系列活動] 一天搞懂對話機器人台灣資料科學年會
 
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...Databricks
 
Rasa AI: Building clever chatbots
Rasa AI: Building clever chatbotsRasa AI: Building clever chatbots
Rasa AI: Building clever chatbotsTom Bocklisch
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Krishnaram Kenthapadi
 

Was ist angesagt? (20)

TensorFlow and Keras: An Overview
TensorFlow and Keras: An OverviewTensorFlow and Keras: An Overview
TensorFlow and Keras: An Overview
 
Introduction to Artificial Intelligence
Introduction to Artificial IntelligenceIntroduction to Artificial Intelligence
Introduction to Artificial Intelligence
 
NLP PPT.pptx
NLP PPT.pptxNLP PPT.pptx
NLP PPT.pptx
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Nltk
NltkNltk
Nltk
 
Tutorial: Context-awareness In Information Retrieval and Recommender Systems
Tutorial: Context-awareness In Information Retrieval and Recommender SystemsTutorial: Context-awareness In Information Retrieval and Recommender Systems
Tutorial: Context-awareness In Information Retrieval and Recommender Systems
 
NLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in PythonNLTK - Natural Language Processing in Python
NLTK - Natural Language Processing in Python
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Kaggle and data science
Kaggle and data scienceKaggle and data science
Kaggle and data science
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix Scale
 
Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandas
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
Context Aware Recommendations at Netflix
Context Aware Recommendations at NetflixContext Aware Recommendations at Netflix
Context Aware Recommendations at Netflix
 
[系列活動] 一天搞懂對話機器人
[系列活動] 一天搞懂對話機器人[系列活動] 一天搞懂對話機器人
[系列活動] 一天搞懂對話機器人
 
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...
Deep Learning for Domain-Specific Entity Extraction from Unstructured Text wi...
 
Rasa AI: Building clever chatbots
Rasa AI: Building clever chatbotsRasa AI: Building clever chatbots
Rasa AI: Building clever chatbots
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
 

Ähnlich wie Netflix Global Search - Lucene Revolution

Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web CorpusPart-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web CorpusAtsushi Keyaki
 
Optimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freaksOptimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freakskarupanerura
 
Apache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day NetherlandsApache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day NetherlandsIngo Renner
 
nGram full text search (by 이성욱)
nGram full text search (by 이성욱)nGram full text search (by 이성욱)
nGram full text search (by 이성욱)I Goo Lee.
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSSujit Pal
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic RIILP
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler DesignKuppusamy P
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupOleksii Holub
 
Lazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerLazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerSho Fola Soboyejo
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday PeopleRebecca Bilbro
 
Json - ideal for data interchange
Json - ideal for data interchangeJson - ideal for data interchange
Json - ideal for data interchangeChristoph Santschi
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Fwdays
 
Message:Passing - lpw 2012
Message:Passing - lpw 2012Message:Passing - lpw 2012
Message:Passing - lpw 2012Tomas Doran
 
萌典與零時政府
萌典與零時政府萌典與零時政府
萌典與零時政府Au Tang
 
Formal machines for Streaming XML Querying
Formal machines for Streaming XML QueryingFormal machines for Streaming XML Querying
Formal machines for Streaming XML QueryingJustAnotherAbstraction
 
General guide in nlp
General guide in nlpGeneral guide in nlp
General guide in nlpHeng-Xiu Xu
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkTomas Doran
 

Ähnlich wie Netflix Global Search - Lucene Revolution (20)

Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web CorpusPart-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
Part-of-speech Tagging for Web Search Queries Using a Large-scale Web Corpus
 
Perl5 VS JSON
Perl5 VS JSONPerl5 VS JSON
Perl5 VS JSON
 
Optimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freaksOptimize perl5 code for perfomance freaks
Optimize perl5 code for perfomance freaks
 
Apache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day NetherlandsApache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
Apache Solr for TYPO3 at TYPO3 Usergroup Day Netherlands
 
nGram full text search (by 이성욱)
nGram full text search (by 이성욱)nGram full text search (by 이성욱)
nGram full text search (by 이성욱)
 
Building Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDSBuilding Named Entity Recognition Models Efficiently using NERDS
Building Named Entity Recognition Models Efficiently using NERDS
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler Design
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape Meetup
 
Lazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text SummarizerLazy man's learning: How To Build Your Own Text Summarizer
Lazy man's learning: How To Build Your Own Text Summarizer
 
NLP for Everyday People
NLP for Everyday PeopleNLP for Everyday People
NLP for Everyday People
 
Json - ideal for data interchange
Json - ideal for data interchangeJson - ideal for data interchange
Json - ideal for data interchange
 
Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)Deep Learning Summit (DLS01-4)
Deep Learning Summit (DLS01-4)
 
Lexing and parsing
Lexing and parsingLexing and parsing
Lexing and parsing
 
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
Евгений Бобров "Powered by OSS. Масштабируемая потоковая обработка и анализ б...
 
Message:Passing - lpw 2012
Message:Passing - lpw 2012Message:Passing - lpw 2012
Message:Passing - lpw 2012
 
萌典與零時政府
萌典與零時政府萌典與零時政府
萌典與零時政府
 
Formal machines for Streaming XML Querying
Formal machines for Streaming XML QueryingFormal machines for Streaming XML Querying
Formal machines for Streaming XML Querying
 
General guide in nlp
General guide in nlpGeneral guide in nlp
General guide in nlp
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 

Kürzlich hochgeladen

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Kürzlich hochgeladen (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Netflix Global Search - Lucene Revolution

  • 1. OCTOBER 11-14, 2016 • BOSTON, MA
  • 2. Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries Ivan Provalov Sr Software Engineer, Netflix
  • 3. • Use Case • Configuration, scoring • Language challenges • Character mapper • Query testing framework Overview
  • 4. • Netflix launched globally in January 2016 • 190 countries • Currently support 23 languages Going Global at Netflix
  • 5.
  • 6. Use Case • Video titles, person's names, genre names • Shorter documents should be ranked higher • Autocomplete • Recall over precision for lexical matches (click signal corrects this)
  • 7. Configuration • Solr 4.6.1 • Edismax: boosting, simple syntax, max field field score • Phrase: prevents from cross field search • Ngram: character ngram search
  • 8. “Breaking bad” b - 0 br - 0 bre - 0 brea - 0 break - 0 breaki - 0 breakin - 0 breaking - 0 b - 1 ba - 1 bad - 1 Character Ngram Search
  • 9. Scoring • Skewed data distribution (e.g. one field sparsely populated) • Doc length normalization • Unigram language model • Term Frequency / Terms in Doc • Log to avoid underflow errors • Negative score (5.5.2 Dismax Scorer breaks)
  • 10. Language Challenges • Multiple Scripts – Japanese: Kanji, Hiragana, Katakana, Romaji • No token delimiters: Japanese, Chinese • Korean character composition • Stopwords and autocomplete • Stemming
  • 11. Korean: Character Composition • input jamo ㄱ ㅗㅏ ㅇ • decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ • fully composed hangul 광
  • 12. Japanese: Multiple Scripts • ‘南極物語’ (‘Antarctic Story’) • Tokenizer: 南極 物語 • Reading form: ナンキョク モノガタリ • Query in Katakana: ナンキョク • Query in Hiragana: なんきょく • Transliteration required
  • 13. • Char Filter: pre-processes input characters • Tokenizer: breaks data into tokens • Filters: transform, remove, create new tokens Tokenization Pipelines
  • 14. Simple Pipeline Example: index • CharFilters: PatternReplaceCharFilterFactory – pattern: ([a-z]+)ing • Tokenizer: StandardTokenizerFactory • Filters: LowerCaseFilterFactory, EdgeNGramFilterFactory
  • 15. • CharFilters: PatternReplaceCharFilterFactory – pattern: ([a-z]+)ing • Tokenizer: StandardTokenizerFactory • Filters: LowerCaseFilterFactory Simple Pipeline Example: query
  • 17. • Prefix Removal – Arabic ‫ال‬ (alef lam) • Suffix folding – Japanese ァ (katakana small a) => ア (a) • Character decomposition – Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ (e) Character Mapping Filter Cases
  • 18. Character Mapping Filter Cases • Stemmer implementation, or extension – Character mapper reference implementation of the Russian stemmer • Patch to Lucene – LUCENE-7321
  • 19. Query Testing Framework • Open source project • Google Spreadsheets based UI • Unit tests for languages queries • Regression testing after changes, upgrades • 20K queries • 7K titles
  • 21. Google Spreadsheets as Detail Report Diff
  • 22. Google Spreadsheets as Summary Report Diff
  • 23. Summary • Use case: short fields, autocomplete, P/R • Configuration, scoring • Language challenges • Character Mapper patch (LUCENE-7321) • Query testing framework https://github.com/Netflix/q
  • 24. Query testing framework Chris Manning IR Book, LM Chapter Trey Grainger’s presentation on Semantic & Multilingual Strategies in Lucene/Solr Character Mapping Patch and Documentation Java Internationalization, March 25, 2001, by David Czarnecki, Andy Deitsch References