Netflix Global Search - Lucene Revolution


A talk on the challenges of supporting autocomplete (instant) search in different languages. It covers search configuration in Solr, scoring, tokenization, custom components, and testing issues.

Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries
Ivan Provalov
Sr Software Engineer, Netflix

Overview
• Use Case
• Configuration, scoring
• Language challenges
• Character mapper
• Query testing framework

Going Global at Netflix
• Netflix launched globally in January 2016
• 190 countries
• Currently support 23 languages

Use Case
• Video titles, person's names, genre names
• Shorter documents should be ranked higher
• Autocomplete
• Recall over precision for lexical matches (click signal corrects this)

Configuration
• Solr 4.6.1
• Edismax: boosting, simple syntax, max field score
• Phrase: prevents cross-field matches
• Ngram: character ngram search
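
As a rough illustration of how these three pieces might fit together, the sketch below builds request parameters for an edismax phrase query. The field names (`title_ngram`, `person_ngram`, `genre_ngram`), the boost values, and the helper `build_autocomplete_params` are hypothetical and not taken from the talk; `defType`, `q`, `qf`, and `tie` are standard edismax parameters. Setting `tie` to 0.0 is one way to score a document by its best-matching field only.

```python
from urllib.parse import urlencode


def build_autocomplete_params(user_input):
    """Build edismax parameters for a phrase query (hypothetical field names)."""
    # Quoting the input turns it into a phrase, so all of its terms
    # have to match inside a single field rather than across fields.
    phrase = '"' + user_input.replace('"', " ").strip() + '"'
    return {
        "defType": "edismax",
        "q": phrase,
        # Assumed ngram fields with assumed boosts.
        "qf": "title_ngram^10 person_ngram^5 genre_ngram^2",
        # tie=0.0 keeps only the maximum field score.
        "tie": "0.0",
    }


params = build_autocomplete_params("breaking b")
query_string = urlencode(params)
print(query_string)
```

The resulting query string would be appended to a Solr select URL; the host and collection are left out because the talk does not name them.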

Character Ngram Search
“Breaking bad”
b - 0
br - 0
bre - 0
brea - 0
break - 0
breaki - 0
breakin - 0
breaking - 0
b - 1
ba - 1
bad - 1
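
The listing for “Breaking bad” pairs each prefix with the position of the word it came from. A minimal sketch that reproduces it, assuming whitespace tokenization and lowercasing (the function name is illustrative, not a Solr API):

```python
def edge_ngrams_with_positions(text, min_gram=1):
    """Return (ngram, position) pairs: every prefix of each whitespace token."""
    pairs = []
    for position, token in enumerate(text.lower().split()):
        for size in range(min_gram, len(token) + 1):
            pairs.append((token[:size], position))
    return pairs


pairs = edge_ngrams_with_positions("Breaking bad")
for gram, position in pairs:
    print(f"{gram} - {position}")
```

Keeping the position with each ngram is what lets a phrase query such as "breaking b" require that the second prefix follows the first.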

Scoring
• Skewed data distribution (e.g. one field sparsely populated)
• Doc length normalization
• Unigram language model
• Term Frequency / Terms in Doc
• Log to avoid underflow errors
• Negative score (5.5.2 Dismax Scorer breaks)
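
One way to read the scoring bullets: each matched term contributes log(term frequency / terms in doc), and the contributions are summed. The sketch below follows that reading under stated assumptions; it applies no smoothing, and the function name and sample titles are illustrative only, not the production scorer.

```python
import math


def unigram_log_score(query_terms, doc_terms):
    """Sum of log(term frequency / terms in doc) over the query terms.

    Returns None when a query term is absent, because log(0) is undefined.
    A real scorer would smooth instead (assumption: no smoothing here).
    """
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            return None
        score += math.log(tf / len(doc_terms))
    return score


short_doc = "breaking bad".split()
long_doc = "bad boys for life".split()
short_score = unigram_log_score(["bad"], short_doc)
long_score = unigram_log_score(["bad"], long_doc)
print(short_score, long_score)
```

Summing logs avoids the underflow that multiplying many small probabilities would cause. Because each probability is at most 1, every score is zero or negative, and the shorter document ends up with the higher (less negative) score.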

Language Challenges
• Multiple Scripts
– Japanese: Kanji, Hiragana, Katakana, Romaji
• No token delimiters: Japanese, Chinese
• Korean character composition
• Stopwords and autocomplete
• Stemming

Korean: Character Composition
• input jamo ㄱ ㅗㅏ ㅇ
• decomposed jamo ᄀ ᅟᅪ ᅟᅠᆼ
• fully composed hangul 광
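
The relationship between the composed syllable and its decomposed jamo can be checked with Unicode normalization. This is a small sketch using Python's standard `unicodedata` module, not the component described in the talk. It does not handle the compatibility jamo a keyboard produces (the "input jamo" line), which would need an additional mapping.

```python
import unicodedata

composed = "\uad11"  # 광, fully composed hangul syllable
decomposed = unicodedata.normalize("NFD", composed)
print([f"U+{ord(ch):04X}" for ch in decomposed])

# A user who has typed only the first two jamo sees the syllable 과.
partial_composed = "\uacfc"  # 과
partial_decomposed = unicodedata.normalize("NFD", partial_composed)

# Prefix matching fails on composed text but works on decomposed jamo.
print(composed.startswith(partial_composed))
print(decomposed.startswith(partial_decomposed))
```

This suggests why decomposition matters for autocomplete: a partially typed syllable is a different code point from the finished one, so prefixes only line up at the jamo level.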

Japanese: Multiple Scripts
• ‘南極物語’ (‘Antarctic Story’)
• Tokenizer: 南極物語
• Reading form: ナンキョクモノガタリ
• Query in Katakana: ナンキョク
• Query in Hiragana: なんきょく
• Transliteration required
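
The Hiragana-to-Katakana step can be sketched with a fixed code point offset, since the two Unicode blocks are laid out in parallel. The sketch assumes the Katakana reading form is already available; deriving it from Kanji needs a morphological analyzer, which is outside this example. The function name is illustrative.

```python
def hiragana_to_katakana(text):
    """Shift hiragana (U+3041..U+3096) to the matching katakana code points."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


reading_form = "ナンキョクモノガタリ"  # reading form from the slide
hiragana_query = "なんきょく"
katakana_query = hiragana_to_katakana(hiragana_query)

print(katakana_query)
print(reading_form.startswith(katakana_query))
```

With both the indexed reading and the query normalized to Katakana, a prefix typed in either kana script can match the same title.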

Tokenization Pipelines
• Char Filter: pre-processes input characters
• Tokenizer: breaks data into tokens
• Filters: transform, remove, create new tokens

Simple Pipeline Example: index
• CharFilters: PatternReplaceCharFilterFactory
– pattern: ([a-z]+)ing
• Tokenizer: StandardTokenizerFactory
• Filters: LowerCaseFilterFactory, EdgeNGramFilterFactory

Simple Pipeline Example: query
• CharFilters: PatternReplaceCharFilterFactory
– pattern: ([a-z]+)ing
• Tokenizer: StandardTokenizerFactory
• Filters: LowerCaseFilterFactory
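
To see how the index and query pipelines line up, here is a plain-Python emulation of the three stages. It is a sketch, not Solr code. The slides give only the pattern, so the replacement (keep group 1, dropping "ing") is an assumption, and the tokenizer is a crude stand-in for StandardTokenizerFactory.

```python
import re

ING_PATTERN = re.compile(r"([a-z]+)ing")


def char_filter(text):
    # Assumption: the replacement keeps group 1, i.e. drops the "ing" suffix.
    return ING_PATTERN.sub(r"\1", text)


def tokenize(text):
    # Crude stand-in for StandardTokenizerFactory: runs of word characters.
    return re.findall(r"\w+", text)


def edge_ngrams(token):
    # Every prefix of the token, shortest first.
    return [token[:n] for n in range(1, len(token) + 1)]


def analyze_index(text):
    tokens = [t.lower() for t in tokenize(char_filter(text))]
    return [gram for t in tokens for gram in edge_ngrams(t)]


def analyze_query(text):
    # Same char filter and tokenizer, but no ngram expansion at query time.
    return [t.lower() for t in tokenize(char_filter(text))]


indexed = analyze_index("Breaking Bad")
query = analyze_query("breaking b")
print(indexed)
print(query)
```

Edge ngrams are produced only on the index side; the query side emits whole tokens, each of which then matches one of the indexed prefixes.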

Character Mapping Filter Cases
• Prefix Removal
– Arabic ‫ال‬ (alef lam)
• Suffix folding
– Japanese ァ (katakana small a) => ア (a)
• Character decomposition
– Korean ᅟᅰ (jungseong we) => ㅜ (u) and ㅔ (e)
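
The three cases can be illustrated with small string functions. This is only a sketch of the kinds of mappings listed, with hypothetical function names and one-entry tables; it does not reflect how the LUCENE-7321 patch is implemented.

```python
ARABIC_ARTICLE = "\u0627\u0644"  # alef + lam
SMALL_TO_FULL_KATAKANA = {"\u30a1": "\u30a2"}  # small a -> a
JAMO_DECOMPOSITION = {"\u1170": "\u315c\u3154"}  # jungseong we -> u + e


def remove_prefix(token, prefix=ARABIC_ARTICLE):
    """Drop the prefix when the token starts with it and more text follows."""
    if token.startswith(prefix) and len(token) > len(prefix):
        return token[len(prefix):]
    return token


def fold_suffix(token):
    """Replace a trailing small katakana with its full-size form."""
    if token and token[-1] in SMALL_TO_FULL_KATAKANA:
        return token[:-1] + SMALL_TO_FULL_KATAKANA[token[-1]]
    return token


def decompose(token):
    """Expand each mapped character into its component characters."""
    return "".join(JAMO_DECOMPOSITION.get(ch, ch) for ch in token)


print(remove_prefix("\u0627\u0644\u0628\u064a\u062a"))
print(fold_suffix("\u30d5\u30a1"))
print(decompose("\u1170"))
```

Escaped code points are used instead of literals so that the right-to-left Arabic text and the conjoining jamo cannot be reordered or merged by an editor.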

Character Mapping Filter Cases
• Stemmer implementation, or extension
– Character mapper reference implementation of the Russian stemmer
• Patch to Lucene
– LUCENE-7321

Query Testing Framework
• Open source project
• Google Spreadsheets based UI
• Unit tests for language queries
• Regression testing after changes, upgrades
• 20K queries
• 7K titles
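
The idea of unit-testing queries against expected titles can be sketched in a few lines. Everything below is hypothetical: the `evaluate` helper, the toy `prefix_search` function, and the three sample titles are illustrations and are not the API or data of the open source framework.

```python
def evaluate(search, cases):
    """Run each (query, expected titles) case and report precision and recall."""
    report = []
    for query, expected in cases:
        returned = set(search(query))
        relevant = set(expected)
        hits = returned & relevant
        precision = len(hits) / len(returned) if returned else 0.0
        recall = len(hits) / len(relevant) if relevant else 1.0
        report.append((query, precision, recall))
    return report


TITLES = ["Breaking Bad", "Bad Boys", "Breakfast Club"]  # toy catalog


def prefix_search(query):
    """Toy search: a title matches when any of its words starts with the query."""
    q = query.lower()
    return [t for t in TITLES if any(w.startswith(q) for w in t.lower().split())]


cases = [
    ("brea", ["Breaking Bad"]),
    ("bad", ["Breaking Bad", "Bad Boys"]),
]
report = evaluate(prefix_search, cases)
for query, precision, recall in report:
    print(query, precision, recall)
```

Running the same cases before and after a configuration change or upgrade, then comparing the two reports, gives the kind of regression check the slide describes.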

Google Spreadsheets as Input

Google Spreadsheets as Detail Report: Diff

Google Spreadsheets as Summary Report: Diff

Summary
• Use case: short fields, autocomplete, P/R
• Configuration, scoring
• Language challenges
• Character Mapper patch (LUCENE-7321)
• Query testing framework
https://github.com/Netflix/q

References
• Query testing framework
• Chris Manning IR Book, LM Chapter
• Trey Grainger’s presentation on Semantic & Multilingual Strategies in Lucene/Solr
• Character Mapping Patch and Documentation
• Java Internationalization, March 25, 2001, by David Czarnecki, Andy Deitsch


Presented at Lucene Revolution, October 11-14, 2016, Boston, MA
