More on "More Like This"
Recommendations in SOLR
Oana Brezai
Software Engineer @ eSolutions
Outline
Use Case
How does Search work
How does MLT work
A limitation of MLT
Quality of the results
Conclusions
Use Case:
Build a
Recommendation
Application
Requirements
● Movie Store
● ~ 85 K Movies
● Use Open Source Software
Solution
● Fast
● High Quality Results
Why
Apache SOLR ?
Solr (NoSQL DB)
● Popular
● Blazing-fast
● Highly scalable
● Open source enterprise search platform
● Built on Apache Lucene
Who Uses SOLR
“Movie Store”
Use Case
When
● A user views the details of a movie
Then
● The application recommends
“similar” movies
Example
Target Movie
● The Lord of the Rings: The
Fellowship of the Ring
Recommendations
1) The Lord of the Rings: The Return of
the King
2) The Lord of the Rings: The Two
Towers
3) The Lord of the Rings
4) Lord of War
5) The Lord Protector
What Does
“Similar”
Mean?
Target Movie
● “The Lord of the Rings: The Fellowship of the Ring”
○ Action / Adventure / Drama
○ 8.8 on IMDB
Recommended (Similar) Movies
● The same words in the title
● The same movie genre
● The same words in the description
● Similar IMDB vote
Questions
Questions for our
Recommendation System
● Do all the words have the
same importance?
● Do all the fields have the same
importance?
● How does the engine
differentiate between results?
Let’s START!
Add Data
to SOLR
Create a Collection (~Table)
● movie_content
Populate the Collection with
Data
● 85855 movies
Data
Structure
Movie Fields
● imdb_title_id (movie id)
● original_title
● description
● genre
● avg_vote (imdb vote)
Movie Fields -> with Types
● imdb_title_id -> string
● original_title -> “analyzed” text
● description -> “analyzed” text
● genre -> array of strings
● avg_vote -> number
String vs “Analyzed” Text Field Types
● Field Type: String
● Example: “Comedy” (field: genre)
○ Indexed: “Comedy”
● Field Type: “Analyzed” Text
● Example: “The Lord of the Rings: The Fellowship of the Ring” (field: original_title)
○ Indexed (lowercased and without stopwords): “lord”, “rings”, “fellowship”, “ring”
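The effect of an “analyzed” text field can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a hypothetical one, and the real tokenizer and filters depend on how the field type is configured in the Solr schema.

```python
import re

# Hypothetical stop-word list; the real one comes from the Solr field type configuration.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "from", "in", "on"}


def analyze(text: str) -> list[str]:
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(analyze("The Lord of the Rings: The Fellowship of the Ring"))
# ['lord', 'rings', 'fellowship', 'ring']
```

A string field such as genre would skip this step and index the value unchanged.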
“The Lord of the Rings: The Fellowship of the Ring”
● Movie Id (imdb_title_id): tt0120737
● Original Title
○ “The Lord of the Rings: The Fellowship of the Ring”
● Description
○ “A meek Hobbit from the Shire and eight companions set out on a journey to destroy the powerful One Ring and save Middle-earth from the Dark Lord Sauron.”
● Genre
○ “Action, Adventure, Drama”
● IMDB vote (avg_vote): 8.8
“More Like
This” Feature
in SOLR
More Like This
● Given a movie id => list
“similar” movies
● Uses the “Search” functionality
How Does
“Search”
Work in SOLR?
“Search”
Example 1:
Query
original_title: “Lord of the Rings”
Results
● No movies found
“Search”
Example 2:
Query
original_title: “Lord” AND
original_title: “Rings”
Results (4)
1) "The Lord of the Rings"
2) "The Lord of the Rings: The
Fellowship of the Ring"
3) "The Lord of the Rings: The
Return of the King"
4) "The Lord of the Rings: The Two
Towers”
Execution time: 21 ms
How Does the Search original_title: “Lord”
AND original_title: “Rings” Work?
● Searches the original_title index for all the movies that contain
the words “lord” AND “rings” (lowercased!)
● Computes search score based on Boosting, Term Frequency (TF)
and Inverse Document Frequency (IDF)
● Displays the results in descending order of the score
The TF / IDF Scoring Formula
score[movie] = ∑(boost(field[j]) * tf(word[i]) * idf(word[i]))
where:
boost(field[j]) = custom weight given to the field j
tf(word[i]) = countTermFreq / (countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength))
idf(word[i]) = log(1 + (docCount - countDocumentFreq + 0.5) / (countDocumentFreq + 0.5))
word[i] = every word in the field, excluding stop words (in our case)
countTermFreq = number of times the word occurs in the field of this movie
countDocumentFreq = number of movies whose field contains the word
docCount = total number of movies that have the field
fieldLength = count of words in the field, excluding stop words (in our case)
avgFieldLength = average length of the field over all movies
Debug the Scoring Formula
score[movie] = ∑(boost(field[j]) * tf(word[i]) * idf(word[i]))
original_title = “The Lord of the Rings”
genre = “Animation, Adventure, Fantasy”
description = “The Fellowship of the Ring embark ...”
score = 1 * tf(“lord”) * idf(“lord”) +
1 * tf(“rings”) * idf(“rings”) +
1 * tf(“Animation”) * idf(“Animation”) + ...
Debug the TF / IDF Formula for the
QUERY = original_title:Lord AND original_title:Rings
Original title | CTF (Field): Lord | CTF (Field): Rings | CDF (Corpus): Lord | CDF (Corpus): Rings | Field Length | Score
The Lord of the Rings | 1 | 1 | 26 | 10 | 2 | 8.29
The Lord of the Rings: The Fellowship of the Ring | 1 | 1 | 26 | 10 | 4 | 6.06
The Lord of the Rings: The Return of the King | 1 | 1 | 26 | 10 | 4 | 6.06
The Lord of the Rings: The Two Towers | 1 | 1 | 26 | 10 | 4 | 6.06
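tf(word[i]) = countTermFreq / (countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength))
idf(word[i]) = log(1 + (docCount - countDocumentFreq + 0.5) / (countDocumentFreq + 0.5))
(docCount = total number of movies that have the field; countDocumentFreq = number of movies whose field contains the word)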
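The scores in the table can be approximated with a short Python sketch. The constants 1.2 and 0.75 and the document frequencies 26 and 10 come from the slides. The corpus size of 85855 is taken as the document count, and the average title length of 2.36 is an assumption chosen for illustration, since the slides do not state it.

```python
import math

K1 = 1.2   # term-frequency saturation constant, as on the slide
B = 0.75   # length-normalisation constant, as on the slide


def tf(term_freq: float, field_length: float, avg_field_length: float) -> float:
    """Term-frequency part: shorter fields give a higher value for the same count."""
    norm = 1 - B + B * field_length / avg_field_length
    return term_freq / (term_freq + K1 * norm)


def idf(doc_freq: int, doc_count: int) -> float:
    """Inverse document frequency: rarer words get a higher value."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


DOC_COUNT = 85855                      # number of movies in the collection (from the slides)
AVG_FIELD_LENGTH = 2.36                # ASSUMPTION: average title length, not given in the slides
DOC_FREQ = {"lord": 26, "rings": 10}   # CDF column of the table


def score(field_length: int) -> float:
    """Score of a title that contains each query word once (boost = 1)."""
    return sum(
        tf(1, field_length, AVG_FIELD_LENGTH) * idf(df, DOC_COUNT)
        for df in DOC_FREQ.values()
    )


for title, length in [
    ("The Lord of the Rings", 2),
    ("The Lord of the Rings: The Fellowship of the Ring", 4),
]:
    print(f"{score(length):.2f}  {title}")
```

Both titles get the same idf values, so the only difference is the field length: the two-word title scores higher than the four-word titles, which matches the ordering in the table.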
“Search”
in SOLR
High Quality
● Scoring Formula
○ TF / IDF
○ Boosting
Fast
● Inverted Index
Inverted Index (original_title)
Id (imdb_title_id) | Title (original_title)
tt0120737 | The Lord of the Rings: The Fellowship of the Ring
tt0167260 | The Lord of the Rings: The Return of the King
tt0167261 | The Lord of the Rings: The Two Towers
tt0077869 | The Lord of the Rings
Word | Ids (imdb_title_id)
lord | tt0120737, tt0167260, tt0167261, tt0077869
rings | tt0120737, tt0167260, tt0167261, tt0077869
ring | tt0120737
fellowship | tt0120737
return | tt0167260
king | tt0167260
towers | tt0167261
two | tt0167261
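A minimal Python sketch of the same idea, assuming a simplified analyzer with a hypothetical two-word stop list: the index maps every word to the set of movie ids whose title contains it, and an AND query becomes a set intersection.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "of"}  # hypothetical stop-word list, enough for these titles

TITLES = {
    "tt0120737": "The Lord of the Rings: The Fellowship of the Ring",
    "tt0167260": "The Lord of the Rings: The Return of the King",
    "tt0167261": "The Lord of the Rings: The Two Towers",
    "tt0077869": "The Lord of the Rings",
}


def analyze(text):
    """Lowercase, tokenize, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search_and(index, *words):
    """Return the ids of the documents that contain all of the given words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


index = build_index(TITLES)
print(sorted(search_and(index, "Lord", "Rings")))  # all four movie ids
print(sorted(search_and(index, "ring")))           # ['tt0120737']
```

The lookup touches only the posting lists of the query words, not every title, which is why the search stays fast on a large collection.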
How Does
“More Like This”
Work in SOLR?
“More Like
This”
Example
Query
● q = imdb_title_id:tt0120737
(“The Lord of the Rings: The
Fellowship of the Ring”)
● Other parameters:
○ mlt = true
○ mlt.fl = original_title, description, genre, avg_vote
○ mlt.mintf = 1
○ mlt.count = 5
“More Like
This”
Example URL
http://localhost:8983/solr/movie_content/select?
mlt=true&mlt.mintf=1
&mlt.fl=original_title,description,genre,avg_vote
&q=imdb_title_id:tt0120737
&mlt.count=5
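The same request can be assembled in Python, which also takes care of URL-encoding the parameter values. This is a sketch: the function name `build_mlt_url` and its arguments are illustrative, while the host, collection, and parameter names are those of the URL above.

```python
from urllib.parse import urlencode

# Select endpoint of the movie_content collection, as in the example URL.
SOLR_SELECT = "http://localhost:8983/solr/movie_content/select"


def build_mlt_url(movie_id, fields, count=5, mintf=1, extra=None):
    """Build a More Like This request URL for one movie (hypothetical helper)."""
    params = {
        "q": f"imdb_title_id:{movie_id}",
        "mlt": "true",
        "mlt.fl": ",".join(fields),
        "mlt.mintf": mintf,
        "mlt.count": count,
    }
    if extra:
        params.update(extra)  # any further request parameters, passed through as-is
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = build_mlt_url(
    "tt0120737",
    ["original_title", "description", "genre", "avg_vote"],
)
print(url)
```

The printed URL can be opened in a browser or passed to any HTTP client; the sketch itself does not contact Solr.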
Results
Results (“The Lord of the
Rings: The Fellowship of the
Ring”)
● Execution Time: <100 ms
● Total Results: 62387
Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
24.49 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
14.78 | The Ring Thing | 2004 | Adventure / Comedy | 3.5
13.11 | The Dork of the Rings | 2006 | Adventure / Comedy / Fantasy | 3.2
12.65 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
11.23 | The Lord Protector | 1996 | Action / Adventure / Fantasy | 4.2
Improve Query:
Add Boosting
Boost Fields (Add Weight)
● original_title
● description
● genre
● avg_vote
Importance of Fields
avg_vote >> genre >> original_title >> description
Scoring Formula
Boosting factors:
● avg_vote -> 40
● genre -> 30
● original_title -> 20
● description -> 1
For every word in (original_title, description, genre) do
score += boosting(field) * tf(word) * idf(word)
Debug Scoring Formula with Boosting
genre = “Animation, Adventure, Fantasy” -- BOOSTING 30
original_title = “The Lord of the Rings” -- BOOSTING 20
description = “The Fellowship of the Ring embark ...” -- BOOSTING 1
score = 30 * tf(“Animation”) * idf(“Animation”) +
30 * tf(“Adventure”) * idf(“Adventure”) +
30 * tf(“Fantasy”) * idf(“Fantasy”) +
20 * tf(“lord”) * idf(“lord”) + ...
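The boosted sum can be written as a small Python function. The boost factors are the ones from the slides; the tf and idf values in the example are made up purely to show how the weights shift the total, and are not taken from Solr.

```python
# Boost factors per field, as chosen in the slides.
BOOSTS = {"genre": 30, "original_title": 20, "description": 1}


def boosted_score(matches, boosts=BOOSTS):
    """Sum boost * tf * idf over all matched words.

    matches is a list of (field, tf, idf) tuples, one per matched word.
    Fields without an explicit boost count with weight 1.
    """
    return sum(boosts.get(field, 1) * tf * idf for field, tf, idf in matches)


# Illustrative values only: one matched word per field.
matches = [
    ("genre", 0.5, 2.0),
    ("original_title", 0.4, 8.0),
    ("description", 0.3, 6.0),
]

print(boosted_score(matches, boosts={}))  # every field weighted 1
print(boosted_score(matches))             # with the boosts from the slides
```

With the boosts applied, a match on a common genre word can contribute more than a match on a rare description word, which is the intended effect of ranking fields by importance.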
SOLR: More Like This URL Request
http://localhost:8983/solr/movie_content/select?
mlt=true&mlt.mindf=1&mlt.mintf=1
&mlt.fl=original_title,description,genre,avg_vote
&q=imdb_title_id:tt0120737
&mlt.boost=true&mlt.qf=avg_vote^40 genre^30 original_title^20 description
&mlt.count=5
Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
1132 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
894 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
881 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
667 | Rings | 2017 | Drama / Horror / Mystery | 4.5
661 | The Dork of the Rings | 2006 | Adventure / Comedy / Fantasy | 3.2
A Limitation of
“More Like This”
Numeric Fields
Ignored in MLT
Issue
● Only text fields are used in MLT
queries
Solution
● Rewrite the whole query as a
search query that also includes
the numeric fields
More on
“More Like This”
in SOLR
“More Like This”
Steps
1) Extract the “interesting terms”
from the target movie
2) Add boostings / field (as given in
the query) for every interesting term
3) Perform a Search with those words
and boostings
“More Like This” Step 1
1) Extract the “interesting terms” from the target movie (from the field list in
the query): take all the words from all the fields and compute their relevance. Keep
the top 25.
Ex: the word “ring” is very relevant for the movie “The Lord of the Rings: The
Fellowship of the Ring” because it occurs in the movie but is rare in the corpus:
- 2 occurrences in the movie: once in “original_title” and once in “description”
- in the whole corpus of 85855 movies:
- 35 times in the field “original_title” and
- 282 times in the field “description”
2) Add boostings / field (as given in the query) for every interesting term
3) Perform a Search with those words and boostings
List of Interesting Terms for MovieId
tt0120737
genre:Drama
genre:Action
genre:Adventure
description:one
description:set
description:save
description:journey
description:middle
description:meek
description:hobbit
description:shire
description:sauron
original_title:fellowship
original_title:ring
original_title:lord
original_title:rings
description:dark
description:earth
description:powerful
description:destroy
description:lord
description:ring
description:eight
description:companions
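How such a list could be ranked is sketched below. The relevance measure (term frequency times the log of corpus size over document frequency) is a simplification chosen for illustration and is not claimed to be the exact formula Solr uses. The counts for “ring”, “lord” and “rings” are the ones quoted in the slides, treated here as document frequencies; the count for “one” is hypothetical.

```python
import math

CORPUS_SIZE = 85855  # number of movies in the collection


def relevance(term_freq, doc_freq, corpus_size=CORPUS_SIZE):
    """Simplified relevance: frequent in the movie, rare in the corpus."""
    return term_freq * math.log(corpus_size / doc_freq)


def interesting_terms(candidates, max_terms=25):
    """Rank (field, word, term_freq, doc_freq) tuples and keep the best ones."""
    ranked = sorted(candidates, key=lambda c: relevance(c[2], c[3]), reverse=True)
    return [f"{field}:{word}" for field, word, _, _ in ranked[:max_terms]]


candidates = [
    ("original_title", "ring", 1, 35),
    ("description", "ring", 1, 282),
    ("original_title", "lord", 1, 26),
    ("original_title", "rings", 1, 10),
    ("description", "one", 1, 20000),  # hypothetical count for a common word
]

for term in interesting_terms(candidates):
    print(term)
```

Rare words such as “rings” end up at the top, while common words fall to the bottom and are cut once more than `max_terms` candidates are available.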
“More Like This” Step 2
1) Extract the “interesting terms” from the target movie (from the field list in
the query)
2) Add boostings / field (as given in the query) for every interesting term:
avg_vote^40 genre^30 original_title^20 description
3) Perform a Search with those words and boostings
Interesting Terms for tt0120737 with Boosting
genre:Drama^30
genre:Action^30
genre:Adventure^30
description:one
description:set
description:save
description:journey
description:middle
description:meek
description:hobbit
description:shire
description:sauron
original_title:fellowship^20
original_title:ring^20
original_title:lord^20
original_title:rings^20
description:dark
description:earth
description:powerful
description:destroy
description:lord
description:ring
description:eight
description:companions
“More Like This” Step 3
1) Extract the “interesting terms” from the target movie (from the field list in
the query)
2) Add boostings / field (as given in the query) for every interesting term
3) Perform a Search with those words and boostings
Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
1132 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
894 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
881 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
667 | Rings | 2017 | Drama / Horror / Mystery | 4.5
661 | The Dork of the Rings | 2006 | Adventure / Comedy / Fantasy | 3.2
Add Numeric
Fields to
“More Like This”
1) SOLR Request 1: perform an MLT query and get the “interesting terms”
2) Add boostings
3) Add numeric fields with their boostings
4) SOLR Request 2: perform a Search with numeric fields and “interesting terms” with their respective boostings
Example of Numeric Field Syntax
Target movie: avg_vote = 8.8
=> a similar movie would have:
avg_vote:[8.8 - 1.5 TO 8.8 + 1.5]
=> add boosting factor:
avg_vote:[7.3 TO 10.3]^40
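The range clause can be generated from the target movie's value. A minimal sketch, with a hypothetical helper name and the tolerance of 1.5 and boost of 40 from the example above:

```python
def range_clause(field, value, tolerance, boost):
    """Build a boosted range clause around a numeric value (hypothetical helper)."""
    low = value - tolerance
    high = value + tolerance
    # One decimal is enough for IMDB votes and hides floating-point noise.
    return f"{field}:[{low:.1f} TO {high:.1f}]^{boost}"


print(range_clause("avg_vote", 8.8, 1.5, 40))
# avg_vote:[7.3 TO 10.3]^40
```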
Final SOLR Search Query
Q =
genre:Drama^30
genre:Action^30
genre:Adventure^30
description:one
description:set
description:save
description:journey
description:middle
description:meek
description:hobbit
description:shire
description:sauron
original_title:fellowship^20
original_title:ring^20
original_title:lord^20
original_title:rings^20
description:dark
description:earth
description:powerful
description:destroy
description:lord
description:ring
description:eight
description:companions
avg_vote:[7.3 TO 10.3]^40
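Putting the pieces together, the final query string can be assembled from the interesting terms, the field boosts, and the numeric clause. This is a sketch with hypothetical helper names; it joins the clauses with spaces, which assumes that the query parser treats them as optional (OR) clauses.

```python
# Field boosts from the slides; description keeps the default weight.
BOOSTS = {"genre": 30, "original_title": 20}


def boost_term(term, boosts=BOOSTS):
    """Append ^boost to a field:word term if its field has a boost."""
    field = term.split(":", 1)[0]
    boost = boosts.get(field)
    return f"{term}^{boost}" if boost else term


def build_query(terms, numeric_clauses):
    """Join the boosted interesting terms and the numeric range clauses."""
    return " ".join([boost_term(t) for t in terms] + list(numeric_clauses))


# A short excerpt of the interesting terms, to keep the example readable.
terms = ["genre:Drama", "description:hobbit", "original_title:ring"]
q = build_query(terms, ["avg_vote:[7.3 TO 10.3]^40"])
print(q)
# genre:Drama^30 description:hobbit original_title:ring^20 avg_vote:[7.3 TO 10.3]^40
```

The resulting string is what would be sent as the q parameter of the second request.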
Final Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score | Title | Year | Genre | Vote
249 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
246 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
222 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
161 | Lord of War | 2005 | Action / Crime / Drama | 7.6
157 | The Lord Protector | 1996 | Action / Adventure / Fantasy | 4.2
Quality of the
Results
Quality
Recommended Products
Ordered
● Based on history of sales
Recommended Products
Viewed
● Based on history of browsing
Conclusions
MLT in SOLR
● Inverted Index
● TF/IDF Scoring Formula
● Boosting
Quality Measurement
Feedback Loop
● Recommended Products Ordered
● Recommended Products Viewed
References
● https://solr.apache.org/
● https://lucidworks.com/post/who-uses-lucenesolr/
● https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+ratings.csv
● https://www.esolutions.ro/streaming-expressions-in-apache-solr
● https://github.com/oanabrezai/moreLikeThisSOLR
Thank you
Oana Brezai
oana.brezai@esolutions.ro

More on "More Like This" Recommendations in SOLR

  • 1. More on "More Like This" Recommendations in SOLR Oana Brezai Software Engineer @ eSolutions
  • 2. Outline Use Case How does Search work How does MLT work A limitation of MLT Quality of the results Conclusions
  • 3. Use Case: Build a Recommendation Application Requirements ● Movie Store ● ~ 85 K Movies ● Use Open Source Software Solution ● Fast ● High Quality Results
  • 4. Why Apache SOLR ? Solr (NoSQL DB) ● Popular ● Blazing-fast ● Highly scalable ● Open source enterprise search platform ● Built on Apache Lucene
  • 6. “Movie Store” Use Case When ● A user visualizes the details of a movie Then ● The application recommends “similar” movies
  • 7. Example Target Movie ● The Lord of the Rings: The Fellowship of the Ring Recommendations 1) The Lord of the Rings: The Return of the King 2) The Lord of the Rings: The Two Towers 3) The Lord of the Rings 4) Lord of War 5) The Lord Protector
  • 8. What Does “Similar” Mean? Target Movie ● “The Lord of the Rings: The Fellowship of the Ring”  Action / Adventure / Drama  8.8 on IMDB Recommended (Similar) Movies ● The same words in the title ● The same movie genre ● The same words in the description ● Similar IMDB vote
  • 9. Questions Questions for our Recommendation System ● Do all the words have the same importance? ● Do all the fields have the same importance? ● How does the engine differentiate between results?
  • 11. Add Data to SOLR Create a Collection (~Table) ● movie_content Populate the Collection with Data ● 85855 movies
  • 12. Data Structure Movie Fields ● imdb_title_id (movie id) ● original_title ● description ● genre ● avg_vote (imdb vote)
  • 13. Movie Fields -> with Types ● imdb_title_id -> string ● original_title -> “analyzed” text ● description -> “analyzed” text ● genre -> array of strings ● avg_vote -> number
  • 14. String vs “Analyzed” Text Field Types ● Field Type: String ● Example: “Comedy” (field: genre)  Indexed: “Comedy” ● Field Type: “Analyzed” Text ● Example: “The Lord of the Rings: The Fellowship of the Ring” (field: original_title)  Indexed (lowercased and without stopwords): ○ “lord” ○ “rings” ○ “fellowship” ○ “ring”
  • 15. “The Lord of the Rings: The Fellowship of the Ring” ● Movie Id (imdb_title_id): tt0120737 ● Original Title  “The Lord of the Rings: The Fellowship of the Ring” ● Description  “A meek Hobbit from the Shire and eight companions set out on a journey to destroy the powerful One Ring and save Middle-earth from the Dark Lord Sauron.” ● Genre  “Action, Adventure, Drama” ● Imdb vote (avg_vote): 8.8
  • 16.
  • 17. “More Like This” Feature in SOLR More Like This ● Given a movie id => list “similar” movies ● Uses the “Search” functionality
  • 19. “Search” Example 1: Query original_title: “Lord of the Rings” Results ● No movies found
  • 20. “Search” Example 2: Query original_title: “Lord” AND original_title: “Rings” Results (4) 1) "The Lord of the Rings" 2) "The Lord of the Rings: The Fellowship of the Ring" 3) "The Lord of the Rings: The Return of the King" 4) "The Lord of the Rings: The Two Towers” Execution time: 21 ms
  • 21. How Does the Search original_title: “Lord” AND original_title: “Rings” Function? ● Searches in the original_title index all the movies that contain the words “lord” AND “rings” (lowercased!) ● Computes search score based on Boosting, Term Frequency (TF) and Inverse Document Frequency (IDF) ● Displays the results in descending order of the score
  • 22. The TF / IDF Scoring Formula score[movie] =∑(boost(field[j]) * tf(word[i]) * idf(word[i])) where: boost(field[j]) = custom weight given to the field j tf(word[i]) = countTermFreq/(countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength/avgFieldLength)) idf(word[i]) = log(1 + (countDocumentFreq - countTermFreq + 0.5) / (countTermFreq + 0.5)) word[i] = every word in the field, excluding stop words (in our case) fieldLength = count of words in the field, excluding stop words (in our case) avgFieldLength = average length of field
  • 23. original_title = “The Lord of the Rings” genre = “Animation, Adventure, Fantasy” description = “The Fellowship of the Ring embark ...” score = 1 * tf(“lord”) * idf(“lord”) + 1 * tf(“rings”) * idf(“rings”) + 1 * tf(“Animation”) * idf(“Animation”) + ... Debug the Scoring Formula score[movie] =∑(boost(field[j]) * tf(word[i]) * idf(word[i]))
  • 24. Debug the TF / IDF Formula for the QUERY = original_title:Lord AND original_title:Rings Original title CTF (Field) Lord Rings CDF (Corpus) Lord Rings Field Length Score The Lord of the Rings 1 1 26 10 2 8.29 The Lord of the Rings: The Fellowship of the Ring 1 1 26 10 4 6.06 The Lord of the Rings: The Return of the King 1 1 26 10 4 6.06 The Lord of the Rings: The Two Towers 1 1 26 10 4 6.06 tf(word[i]) = countTermFreq/(countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength)) idf(word[i]) = log(1 + (countDocumentFreq - countTermFreq + 0.5) / (countTermFreq + 0.5))
  • 25. “Search” in SOLR High Quality ● Scoring Formula  TF / IDF  Boosting Fast ● Inverted Index
  • 26. Inverted Index (original_title) Id (imdb_title_id) Tile (original_title) tt0120737 The Lord of the Rings: The Fellowship of the Ring tt0167260 The Lord of the Rings: The Return of the King tt0167261 The Lord of the Rings: The Two Towers tt0077869 The Lord of the Rings Word Ids (imbd_title_id) lord tt0120737, tt0167260, tt0167261, tt0077869 rings tt0120737, tt0167260, tt0167261, tt0077869 ring tt0120737 fellowship tt0120737 return tt0167260 king tt0167260 towers tt0167261 two tt0167261
  • 27. How Does “More Like This” Work in SOLR?
  • 28. “More Like This” Example Query ● q = imdb_title_id:tt0120737 (“The Lord of the Rings: The Fellowship of the Ring”) ● Other parameters:  mlt = true  mlt.fl=original_title, description, genre, avg_vote  mlt.mintf = 1  mlt.count = 5
  • 30. Results Results (“The Lord of the Rings: The Fellowship of the Ring”) ● Execution Time: <100 ms ● Total Results: 62387
  • 31. Score Title Year Genre Vote 24.49 The Lord of the Rings 1978 Animation / Adventure / Fantasy 6.2 14.78 The Ring Thing 2004 Adventure / Comedy 3.5 13.11 The Dork of the Rings 2006 Adventure / Comedy / Fantasy 3.2 12.65 The Lord of the Rings: The Return of the King 2003 Action / Adventure / Drama 8.9 11.23 The Lord Protector 1996 Action / Adventure / Fantasy 4.2 Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
  • 32. Score Title Year Genre Vote 24.49 The Lord of the Rings 1978 Animation / Adventure / Fantasy 6.2 14.78 The Ring Thing 2004 Adventure / Comedy 3.5 13.11 The Dork of the Rings 2006 Adventure / Comedy / Fantasy 3.2 12.65 The Lord of the Rings: The Return of the King 2003 Action / Adventure / Drama 8.9 11.23 The Lord Protector 1996 Action / Adventure / Fantasy 4.2 Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
  • 33. Improve Query: Add Boosting Boost Fields (Add Weight) ● original_title ● description ● genre ● avg_vote Importance of Fields avg_vote >> genre >> original_title >> description
  • 34. Boosting factors: ● avg_vote -> 40 ● genre -> 30 ● original_title -> 20 ● description -> 1 For every word in (original_title, description, genre) do score + = boosting(field) * tf(word) * idf(word) Scoring Formula
  • 35. genre = “Animation, Adventure, Fantasy” -- BOOSTING 30 original_title = “The Lord of the Rings” --- BOOSTING 20 description = “The Fellowship of the Ring embark ...” -- BOOSTING 1 score = 30 * tf(“Animation”) * idf(“Animation”) + 30 * tf(“Adventure”) * idf(“Adventure”) + 30 * tf(“Fantasy”) * idf(“Fantasy”) + 20 * tf(“lord”) * idf(“lord”) + ... Debug Scoring Formula with Boosting
  • 37. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8) Score Title Year Genre Vote 1132 The Lord of the Rings 1978 Animation / Adventure / Fantasy 6.2 894 The Lord of the Rings: The Return of the King 2003 Action / Adventure / Drama 8.9 881 The Lord of the Rings: The Two Towers 2002 Action / Adventure / Drama 8.7 667 Rings 2017 Drama / Horror / Mystery 4.5 661 The Dork of the Rings 2006 Adventure / Comedy / Fantasy 3.2
  • 38. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8) Score Title Year Genre Vote 1132 The Lord of the Rings 1978 Animation / Adventure / Fantasy 6.2 894 The Lord of the Rings: The Return of the King 2003 Action / Adventure / Drama 8.9 881 The Lord of the Rings: The Two Towers 2002 Action / Adventure / Drama 8.7 667 Rings 2017 Drama / Horror / Mystery 4.5 661 The Dork of the Rings 2006 Adventure / Comedy / Fantasy 3.2
  • 39. A Limitation of “More Like This”
  • 40. Numeric Fields Ignored in MLT Issue ● Only text fields are used in MLT queries Solution ● Rewrite the whole query as a search query and include also the numeric fields
  • 41. More on “More Like This” in SOLR
  • 42. “More Like This” Steps 1) Extract the “interesting terms” from the target movie 2) Add boostings / field (as given in the query) for every interesting term 3) Perform a Search with those words and boostings
  • 43. “More Like This” Step 1 1) Extract the “interesting terms” from the target movie (from the field list in the query): take all the words from all the fields and compute their relevance. Keep the first 25. Ex: word “ring” -> very relevant for the movie: “The Lord of the Rings: The Fellowship of the Ring”: - 2 occurrences: once in “original_title” and once in “description” - in the whole corpus of 85855 movies: - 35 times in the field “original_title” and - 282 times in the field “description” 2) Add boostings / field (as given in the query) for every interesting term 3) Perform a Search with those words and boostings
  • 44. List of Interesting Terms for MovieId tt0120737
    genre:Drama genre:Action genre:Adventure
    original_title:fellowship original_title:ring original_title:lord original_title:rings
    description:one description:set description:save description:journey description:middle description:meek description:hobbit description:shire description:sauron description:dark description:earth description:powerful description:destroy description:lord description:ring description:eight description:companions
  • 45. “More Like This” Step 2
    1) Extract the “interesting terms” from the target movie (from the field list in the query)
    2) Add a per-field boost (as given in the query) to every interesting term: avg_vote^40 genre^30 original_title^20 description
    3) Perform a Search with those terms and boosts
  • 46. Interesting Terms for tt0120737 with Boosting
    genre:Drama^30 genre:Action^30 genre:Adventure^30
    original_title:fellowship^20 original_title:ring^20 original_title:lord^20 original_title:rings^20
    description:one description:set description:save description:journey description:middle description:meek description:hobbit description:shire description:sauron description:dark description:earth description:powerful description:destroy description:lord description:ring description:eight description:companions
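Step 2 amounts to appending `^boost` to every term whose field has a boost. A minimal sketch follows; the helper name `apply_boosts` is hypothetical, and the boost values are the ones listed on the slides. Fields without a boost (here `description`) are left unchanged.

```python
def apply_boosts(terms, boosts):
    """Append the per-field boost to each 'field:term' string.

    terms  -- list of strings such as "genre:Drama"
    boosts -- mapping from field name to boost factor
    """
    boosted = []
    for term in terms:
        field, _, _ = term.partition(":")
        boost = boosts.get(field)
        boosted.append(f"{term}^{boost}" if boost is not None else term)
    return boosted


# Boosts from the slides; "description" gets no explicit boost.
FIELD_BOOSTS = {"genre": 30, "original_title": 20}

sample_terms = ["genre:Drama", "original_title:ring", "description:hobbit"]
print(" ".join(apply_boosts(sample_terms, FIELD_BOOSTS)))
```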
  • 47. “More Like This” Step 3
    1) Extract the “interesting terms” from the target movie (from the field list in the query)
    2) Add a per-field boost (as given in the query) to every interesting term
    3) Perform a Search with those terms and boosts
  • 48. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
    Score | Title | Year | Genre | Vote
    1132 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
    894 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
    881 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
    667 | Rings | 2017 | Drama / Horror / Mystery | 4.5
    661 | The Dork of the Rings | 2006 | Adventure / Comedy / Fantasy | 3.2
  • 49. Add Numeric Fields to “More Like This”
    1) SOLR Request 1: perform an MLT query and get the “interesting terms”
    2) Add boosts
    3) Add numeric fields with their boosts
    4) SOLR Request 2: perform a Search with the numeric fields and the “interesting terms”, each with its respective boost
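The two requests can be sketched as URL builders. Several things here are assumptions rather than details from the slides: a Solr instance on `localhost:8983`, an `/mlt` request handler configured for the `movie_content` collection, and the optional `fq` filter that keeps the target movie out of its own recommendations. The `mlt.*` parameter names are standard MoreLikeThis parameters; the function names are hypothetical. The shape of the interesting-terms response depends on the response writer settings, so parsing is left out.

```python
from urllib.parse import urlencode

# Assumed location of the collection; adjust to the actual deployment.
SOLR_BASE = "http://localhost:8983/solr/movie_content"


def mlt_request_url(movie_id, fields, max_terms=25):
    """Request 1: ask the MLT handler for the interesting terms only."""
    params = {
        "q": f"imdb_title_id:{movie_id}",
        "mlt.fl": ",".join(fields),        # fields to mine for terms
        "mlt.interestingTerms": "details",  # return terms with their scores
        "mlt.maxqt": max_terms,             # keep at most this many terms
        "rows": 0,                          # the similar documents are not needed yet
    }
    return f"{SOLR_BASE}/mlt?{urlencode(params)}"


def search_request_url(clauses, exclude_id=None, rows=5):
    """Request 2: a plain search over the boosted terms and numeric clauses."""
    params = {
        "q": " ".join(clauses),  # clauses separated by spaces, as on the slides
        "fl": "original_title,genre,avg_vote,score",
        "rows": rows,
    }
    if exclude_id is not None:
        # Assumption: filter out the target movie itself.
        params["fq"] = f"-imdb_title_id:{exclude_id}"
    return f"{SOLR_BASE}/select?{urlencode(params)}"


mlt_url = mlt_request_url("tt0120737", ["genre", "original_title", "description"])
search_url = search_request_url(
    ["genre:Drama^30", "original_title:ring^20", "avg_vote:[7.3 TO 10.3]^40"],
    exclude_id="tt0120737",
)
print(mlt_url)
print(search_url)
```

Splitting the work into two requests keeps the term extraction inside Solr while leaving the final query fully under the application's control, which is what makes it possible to add the numeric clause.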
  • 50. Example of Numeric Field Syntax
    Target movie: avg_vote = 8.8
    => a similar movie would have a vote within 1.5 of it: avg_vote:[8.8 - 1.5 TO 8.8 + 1.5]
    => compute the bounds and add the boosting factor: avg_vote:[7.3 TO 10.3]^40
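The range clause above can be generated from the target movie's vote. A minimal sketch, assuming a symmetric window and one decimal of precision; the helper name `range_clause` is hypothetical. Formatting with one decimal also hides floating-point noise such as `7.300000000000001`.

```python
def range_clause(field, value, delta, boost):
    """Build a boosted range clause 'field:[low TO high]^boost' around value."""
    low = value - delta
    high = value + delta
    return f"{field}:[{low:.1f} TO {high:.1f}]^{boost}"


# Values from the slides: vote 8.8, window of 1.5, boost 40.
clause = range_clause("avg_vote", 8.8, 1.5, 40)
print(clause)
```

The upper bound exceeds the maximum IMDB vote of 10, which is harmless: the range simply matches every movie rated 7.3 or higher.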
  • 51. Final SOLR Search Query
    q = genre:Drama^30 genre:Action^30 genre:Adventure^30 description:one description:set description:save description:journey description:middle description:meek description:hobbit description:shire description:sauron original_title:fellowship^20 original_title:ring^20 original_title:lord^20 original_title:rings^20 description:dark description:earth description:powerful description:destroy description:lord description:ring description:eight description:companions avg_vote:[7.3 TO 10.3]^40
  • 53. Final Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
    Score | Title | Year | Genre | Vote
    249 | The Lord of the Rings: The Return of the King | 2003 | Action / Adventure / Drama | 8.9
    246 | The Lord of the Rings: The Two Towers | 2002 | Action / Adventure / Drama | 8.7
    222 | The Lord of the Rings | 1978 | Animation / Adventure / Fantasy | 6.2
    161 | Lord of War | 2005 | Action / Crime / Drama | 7.6
    157 | The Lord Protector | 1996 | Action / Adventure / Fantasy | 4.2
  • 55. Quality
    Recommended Products Ordered
    ● Based on history of sales
    Recommended Products Viewed
    ● Based on history of browsing
  • 57. Conclusions
    MLT in SOLR
    ● Inverted Index
    ● TF/IDF Scoring Formula
    ● Boosting
    Quality Measurement Feedback Loop
    ● Recommended Products Ordered
    ● Recommended Products Viewed
  • 58. References
    ● https://solr.apache.org/
    ● https://lucidworks.com/post/who-uses-lucenesolr/
    ● https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+ratings.csv
    ● https://www.esolutions.ro/streaming-expressions-in-apache-solr
    ● https://github.com/oanabrezai/moreLikeThisSOLR