More on "More Like This"
Recommendations in SOLR
Oana Brezai
Software Engineer @ eSolutions
Outline
Use Case
How does Search work
How does MLT work
A limitation of MLT
Quality of the results
Conclusions
Use Case: Build a Recommendation Application
Requirements
● Movie Store
● ~ 85 K Movies
● Use Open Source Software
Solution
● Fast
● High Quality Results
Why Apache SOLR?
Solr (NoSQL DB)
● Popular
● Blazing-fast
● Highly scalable
● Open source enterprise search platform
● Built on Apache Lucene
Who Uses SOLR
“Movie Store” Use Case
When
● A user views the details of a movie
Then
● The application recommends “similar” movies
Example
Target Movie
● The Lord of the Rings: The Fellowship of the Ring
Recommendations
1) The Lord of the Rings: The Return of the King
2) The Lord of the Rings: The Two Towers
3) The Lord of the Rings
4) Lord of War
5) The Lord Protector
What Does “Similar” Mean?
Target Movie
● “The Lord of the Rings: The Fellowship of the Ring”
  ○ Action / Adventure / Drama
  ○ 8.8 on IMDB
Recommended (Similar) Movies
● The same words in the title
● The same movie genre
● The same words in the description
● Similar IMDB vote
Questions for our Recommendation System
● Do all the words have the same importance?
● Do all the fields have the same importance?
● How does the engine differentiate between results?
Let’s START!
Add Data to SOLR
Create a Collection (~Table)
● movie_content
Populate the Collection with Data
● 85855 movies
Data Structure
Movie Fields
● imdb_title_id (movie id)
● original_title
● description
● genre
● avg_vote (imdb vote)
Movie Fields -> with Types
● imdb_title_id -> string
● original_title -> “analyzed” text
● description -> “analyzed” text
● genre -> array of strings
● avg_vote -> number
String vs “Analyzed” Text Field Types
● Field Type: String
  ○ Example: “Comedy” (field: genre)
  ○ Indexed: “Comedy”
● Field Type: “Analyzed” Text
  ○ Example: “The Lord of the Rings: The Fellowship of the Ring” (field: original_title)
  ○ Indexed (lowercased and without stopwords): “lord”, “rings”, “fellowship”, “ring”
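The difference between the two field types can be sketched in a few lines of Python. This is only an illustration of the idea, not SOLR's analyzer chain: the stopword set and the tokenizing regular expression are assumptions chosen so that the example title produces the tokens listed above.

```python
import re

# ASSUMPTION: a tiny illustrative stopword list, not SOLR's configured one.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in"}


def index_string(value):
    """A string field is indexed as one exact, unmodified token."""
    return [value]


def index_analyzed_text(value):
    """An analyzed text field is lowercased, split into words,
    and stripped of stopwords before indexing."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return [token for token in tokens if token not in STOPWORDS]


title = "The Lord of the Rings: The Fellowship of the Ring"
print(index_string("Comedy"))
print(index_analyzed_text(title))
```

Because the string field keeps “Comedy” intact, a genre only matches on its exact value, while the analyzed title can be matched word by word.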
“The Lord of the Rings: The Fellowship of the Ring”
● Movie Id (imdb_title_id): tt0120737
● Original Title
  ○ “The Lord of the Rings: The Fellowship of the Ring”
● Description
  ○ “A meek Hobbit from the Shire and eight companions set out on a journey to destroy the powerful One Ring and save Middle-earth from the Dark Lord Sauron.”
● Genre
  ○ “Action, Adventure, Drama”
● Imdb vote (avg_vote): 8.8
“More Like This” Feature in SOLR
More Like This
● Given a movie id => list “similar” movies
● Uses the “Search” functionality
How Does “Search” Work in SOLR?
“Search” Example 1
Query
original_title: “Lord of the Rings”
Results
● No movies found
“Search” Example 2
Query
original_title: “Lord” AND original_title: “Rings”
Results (4)
1) "The Lord of the Rings"
2) "The Lord of the Rings: The Fellowship of the Ring"
3) "The Lord of the Rings: The Return of the King"
4) "The Lord of the Rings: The Two Towers"
Execution time: 21 ms
How Does the Search original_title: “Lord” AND original_title: “Rings” Work?
● Finds, in the original_title index, all the movies that contain both words “lord” and “rings” (lowercased!)
● Computes a search score for each movie based on Boosting, Term Frequency (TF) and Inverse Document Frequency (IDF)
● Displays the results in descending order of score
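The TF / IDF Scoring Formula
score[movie] = ∑(boost(field[j]) * tf(word[i]) * idf(word[i]))
where:
boost(field[j]) = custom weight given to the field j
tf(word[i]) = countTermFreq / (countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength))
idf(word[i]) = log(1 + (docCount - countDocumentFreq + 0.5) / (countDocumentFreq + 0.5))
word[i] = every word in the field, excluding stop words (in our case)
countTermFreq = number of occurrences of the word in the field
countDocumentFreq = number of movies whose field contains the word
docCount = total number of movies in the collection
fieldLength = count of words in the field, excluding stop words (in our case)
avgFieldLength = average length of the field over all movies
The constants 1.2 and 0.75 are the default k1 and b parameters of the BM25 similarity.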
Debug the Scoring Formula
score[movie] = ∑(boost(field[j]) * tf(word[i]) * idf(word[i]))
original_title = “The Lord of the Rings”
genre = “Animation, Adventure, Fantasy”
description = “The Fellowship of the Ring embark ...”
score = 1 * tf(“lord”) * idf(“lord”) +
        1 * tf(“rings”) * idf(“rings”) +
        1 * tf(“Animation”) * idf(“Animation”) + ...
Debug the TF / IDF Formula for the QUERY = original_title:Lord AND original_title:Rings
(CTF = countTermFreq in the field, CDF = countDocumentFreq in the corpus)
Original title                                      CTF lord  CTF rings  CDF lord  CDF rings  Field length  Score
The Lord of the Rings                               1         1          26        10         2             8.29
The Lord of the Rings: The Fellowship of the Ring   1         1          26        10         4             6.06
The Lord of the Rings: The Return of the King       1         1          26        10         4             6.06
The Lord of the Rings: The Two Towers               1         1          26        10         4             6.06
tf(word[i]) = countTermFreq / (countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength))
idf(word[i]) = log(1 + (docCount - countDocumentFreq + 0.5) / (countDocumentFreq + 0.5)), with docCount = total number of movies in the collection
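The numbers in the table can be checked with a short Python sketch of the formula. The term and document frequencies, field lengths and the collection size of 85855 come from the slides; the average title length of 2.36 is an assumption (the slides do not state it) picked so that the computed scores land close to the reported ones, and the natural logarithm is assumed for log.

```python
import math

K1 = 1.2   # BM25 k1, as in the tf formula
B = 0.75   # BM25 b, as in the tf formula


def tf(term_freq, field_length, avg_field_length):
    """Term-frequency part, normalized by the length of the field."""
    norm = 1 - B + B * field_length / avg_field_length
    return term_freq / (term_freq + K1 * norm)


def idf(doc_freq, doc_count):
    """Inverse document frequency: rare words weigh more."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def score(terms, field_length, avg_field_length, doc_count, boost=1.0):
    """Sum boost * tf * idf over (term_freq, doc_freq) pairs of the query words."""
    return sum(
        boost * tf(term_freq, field_length, avg_field_length) * idf(doc_freq, doc_count)
        for term_freq, doc_freq in terms
    )


DOC_COUNT = 85855            # movies in the collection (from the slides)
AVG_TITLE_LENGTH = 2.36      # ASSUMPTION: not given in the slides
QUERY_TERMS = [(1, 26), (1, 10)]  # (CTF, CDF) for "lord" and "rings"

short_title = score(QUERY_TERMS, 2, AVG_TITLE_LENGTH, DOC_COUNT)
long_title = score(QUERY_TERMS, 4, AVG_TITLE_LENGTH, DOC_COUNT)
print(round(short_title, 2), round(long_title, 2))
```

Under these assumptions both values fall within a few hundredths of the 8.29 and 6.06 shown in the table, which also illustrates why the shorter title ranks first: the same two matches make up a larger share of a shorter field.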
“Search” in SOLR
High Quality
● Scoring Formula
  ○ TF / IDF
  ○ Boosting
Fast
● Inverted Index
Inverted Index (original_title)
Id (imdb_title_id)  Title (original_title)
tt0120737           The Lord of the Rings: The Fellowship of the Ring
tt0167260           The Lord of the Rings: The Return of the King
tt0167261           The Lord of the Rings: The Two Towers
tt0077869           The Lord of the Rings
Word        Ids (imdb_title_id)
lord        tt0120737, tt0167260, tt0167261, tt0077869
rings       tt0120737, tt0167260, tt0167261, tt0077869
ring        tt0120737
fellowship  tt0120737
return      tt0167260
king        tt0167260
towers      tt0167261
two         tt0167261
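A minimal Python sketch of how such an inverted index can be built and queried follows. It is an illustration of the data structure only, not of Lucene's implementation; the stopword set and the tokenizer are assumptions that reproduce the word list above.

```python
import re
from collections import defaultdict

# ASSUMPTION: minimal stopword list, enough for these four titles.
STOPWORDS = {"the", "of"}

TITLES = {
    "tt0120737": "The Lord of the Rings: The Fellowship of the Ring",
    "tt0167260": "The Lord of the Rings: The Return of the King",
    "tt0167261": "The Lord of the Rings: The Two Towers",
    "tt0077869": "The Lord of the Rings",
}


def tokenize(text):
    """Lowercase, split into words, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def build_inverted_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for word in tokenize(title):
            index[word].add(doc_id)
    return index


def search_and(index, *words):
    """Return the ids of the documents that contain all the given words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(TITLES)
print(sorted(search_and(index, "Lord", "Rings")))
print(sorted(search_and(index, "Lord", "King")))
```

The lookup cost depends on the length of the posting lists for the query words rather than on the size of the whole collection, which is what makes the search fast.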
How Does “More Like This” Work in SOLR?
“More Like This” Example
Query
● q = imdb_title_id:tt0120737 (“The Lord of the Rings: The Fellowship of the Ring”)
● Other parameters:
  ○ mlt = true
  ○ mlt.fl = original_title, description, genre, avg_vote
  ○ mlt.mintf = 1
  ○ mlt.count = 5
“More Like This” Example URL
http://localhost:8983/solr/movie_content/select?
mlt=true&mlt.mintf=1
&mlt.fl=original_title,description,genre,avg_vote
&q=imdb_title_id:tt0120737
&mlt.count=5
Results (“The Lord of the Rings: The Fellowship of the Ring”)
● Execution Time: <100 ms
● Total Results: 62387
Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score  Title                                          Year  Genre                            Vote
24.49  The Lord of the Rings                          1978  Animation / Adventure / Fantasy  6.2
14.78  The Ring Thing                                 2004  Adventure / Comedy               3.5
13.11  The Dork of the Rings                          2006  Adventure / Comedy / Fantasy     3.2
12.65  The Lord of the Rings: The Return of the King  2003  Action / Adventure / Drama       8.9
11.23  The Lord Protector                             1996  Action / Adventure / Fantasy     4.2
Improve Query: Add Boosting
Boost Fields (Add Weight)
● original_title
● description
● genre
● avg_vote
Importance of Fields
avg_vote >> genre >> original_title >> description
Scoring Formula
Boosting factors:
● avg_vote -> 40
● genre -> 30
● original_title -> 20
● description -> 1
For every word in (original_title, description, genre) do
score += boosting(field) * tf(word) * idf(word)
Debug Scoring Formula with Boosting
genre = “Animation, Adventure, Fantasy” -- BOOSTING 30
original_title = “The Lord of the Rings” -- BOOSTING 20
description = “The Fellowship of the Ring embark ...” -- BOOSTING 1
score = 30 * tf(“Animation”) * idf(“Animation”) +
        30 * tf(“Adventure”) * idf(“Adventure”) +
        30 * tf(“Fantasy”) * idf(“Fantasy”) +
        20 * tf(“lord”) * idf(“lord”) + ...
SOLR: More Like This URL Request
http://localhost:8983/solr/movie_content/select?
mlt=true&mlt.mindf=1&mlt.mintf=1
&mlt.fl=original_title,description,genre,avg_vote
&q=imdb_title_id:tt0120737
&mlt.boost=true&mlt.qf=avg_vote^40 genre^30 original_title^20 description
&mlt.count=5
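The request can also be assembled programmatically, which takes care of URL-encoding the spaces and the ^ characters in mlt.qf. The sketch below uses only the Python standard library and the parameters shown in the slides; the helper name build_mlt_url is hypothetical, and the host and collection are the ones from the example URL. It only builds the URL and does not contact a SOLR server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR_SELECT_URL = "http://localhost:8983/solr/movie_content/select"


def build_mlt_url(movie_id, fields, boosts=None, count=5, min_tf=1, min_df=1):
    """Build a More Like This request URL for one movie id.

    `boosts` maps field name -> weight; a weight of 1 is written
    without a ^ suffix, as in the slides.
    """
    params = {
        "q": f"imdb_title_id:{movie_id}",
        "mlt": "true",
        "mlt.fl": ",".join(fields),
        "mlt.mintf": min_tf,
        "mlt.mindf": min_df,
        "mlt.count": count,
    }
    if boosts:
        params["mlt.boost"] = "true"
        params["mlt.qf"] = " ".join(
            field if weight == 1 else f"{field}^{weight}"
            for field, weight in boosts.items()
        )
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


FIELDS = ["original_title", "description", "genre", "avg_vote"]
BOOSTS = {"avg_vote": 40, "genre": 30, "original_title": 20, "description": 1}

url = build_mlt_url("tt0120737", FIELDS, BOOSTS)
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["mlt.qf"][0])
```

Calling build_mlt_url without boosts yields the unboosted request from the earlier example (plus an explicit mlt.mindf).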
Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score  Title                                          Year  Genre                            Vote
1132   The Lord of the Rings                          1978  Animation / Adventure / Fantasy  6.2
894    The Lord of the Rings: The Return of the King  2003  Action / Adventure / Drama       8.9
881    The Lord of the Rings: The Two Towers          2002  Action / Adventure / Drama       8.7
667    Rings                                          2017  Drama / Horror / Mystery         4.5
661    The Dork of the Rings                          2006  Adventure / Comedy / Fantasy     3.2
A Limitation of “More Like This”
Numeric Fields Ignored in MLT
Issue
● Only text fields are used in MLT queries
Solution
● Rewrite the whole query as a search query that also includes the numeric fields
More on “More Like This” in SOLR
“More Like This” Steps
1) Extract the “interesting terms” from the target movie
2) Add the boost of each field (as given in the query) to every interesting term
3) Perform a Search with those terms and boosts
“More Like This” Step 1
1) Extract the “interesting terms” from the target movie (from the field list in the query): take all the words from all the fields, compute their relevance and keep the top 25.
Ex: the word “ring” is very relevant for the movie “The Lord of the Rings: The Fellowship of the Ring”:
- 2 occurrences in the movie: once in “original_title” and once in “description”
- in the whole corpus of 85855 movies:
  - 35 times in the field “original_title” and
  - 282 times in the field “description”
2) Add the boost of each field (as given in the query) to every interesting term
3) Perform a Search with those terms and boosts
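Step 1 can be sketched as ranking every field:word pair of the target movie by a relevance weight and keeping the best ones. The sketch below assumes a plain tf × idf weight and treats the corpus counts as document frequencies; SOLR's actual term selection may differ in its details. The counts for “ring” (35 and 282) and the corpus size come from the slides, while the entry for “one” is a hypothetical value added only to show a common word ranking lower.

```python
import math
from collections import Counter


def term_weight(term_freq, doc_freq, num_docs):
    """ASSUMPTION: classic tf * idf weight; rare words score higher."""
    return term_freq * (1 + math.log((num_docs + 1) / (doc_freq + 1)))


def interesting_terms(field_tokens, doc_freqs, num_docs, max_terms=25):
    """Rank 'field:word' pairs of one document and keep the top `max_terms`."""
    scored = []
    for field, tokens in field_tokens.items():
        for word, term_freq in Counter(tokens).items():
            doc_freq = doc_freqs.get((field, word), 0)
            if doc_freq == 0:
                continue  # word unknown to the index: nothing to match against
            weight = term_weight(term_freq, doc_freq, num_docs)
            scored.append((weight, f"{field}:{word}"))
    # Highest weight first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


NUM_DOCS = 85855
FIELD_TOKENS = {
    "original_title": ["ring"],
    "description": ["ring", "one"],
}
DOC_FREQS = {
    ("original_title", "ring"): 35,   # from the slides
    ("description", "ring"): 282,     # from the slides
    ("description", "one"): 20000,    # HYPOTHETICAL value for illustration
}

print(interesting_terms(FIELD_TOKENS, DOC_FREQS, NUM_DOCS))
```

The same word is scored separately per field, which is why “ring” appears twice in the list of interesting terms, once for original_title and once for description.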
List of Interesting Terms for MovieId tt0120737
genre:Drama
genre:Action
genre:Adventure
description:one
description:set
description:save
description:journey
description:middle
description:meek
description:hobbit
description:shire
description:sauron
original_title:fellowship
original_title:ring
original_title:lord
original_title:rings
description:dark
description:earth
description:powerful
description:destroy
description:lord
description:ring
description:eight
description:companions
“More Like This” Step 2
1) Extract the “interesting terms” from the target movie (from the field list in the query)
2) Add the boost of each field (as given in the query) to every interesting term: avg_vote^40 genre^30 original_title^20 description
3) Perform a Search with those terms and boosts
Interesting Terms for tt0120737 with Boosting
genre:Drama^30
genre:Action^30
genre:Adventure^30
description:one
description:set
description:save
description:journey
description:middle
description:meek
description:hobbit
description:shire
description:sauron
original_title:fellowship^20
original_title:ring^20
original_title:lord^20
original_title:rings^20
description:dark
description:earth
description:powerful
description:destroy
description:lord
description:ring
description:eight
description:companions
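Step 2 is a simple per-field lookup. As a rough sketch, assuming the interesting terms arrive as field:word strings and the boosts as a field-to-weight mapping (the helper name apply_boosts is hypothetical):

```python
def apply_boosts(terms, boosts):
    """Append ^boost to every 'field:word' term whose field has a boost other than 1."""
    boosted = []
    for term in terms:
        field, _, _word = term.partition(":")
        boost = boosts.get(field, 1)
        boosted.append(term if boost == 1 else f"{term}^{boost}")
    return boosted


BOOSTS = {"avg_vote": 40, "genre": 30, "original_title": 20, "description": 1}
TERMS = ["genre:Drama", "description:hobbit", "original_title:ring"]

print(apply_boosts(TERMS, BOOSTS))
```

Terms from description keep the implicit boost of 1, so they are written without a suffix, matching the list above.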
“More Like This” Step 3
1) Extract the “interesting terms” from the target movie (from the field list in the query)
2) Add the boost of each field (as given in the query) to every interesting term
3) Perform a Search with those terms and boosts
Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score  Title                                          Year  Genre                            Vote
1132   The Lord of the Rings                          1978  Animation / Adventure / Fantasy  6.2
894    The Lord of the Rings: The Return of the King  2003  Action / Adventure / Drama       8.9
881    The Lord of the Rings: The Two Towers          2002  Action / Adventure / Drama       8.7
667    Rings                                          2017  Drama / Horror / Mystery         4.5
661    The Dork of the Rings                          2006  Adventure / Comedy / Fantasy     3.2
Add Numeric Fields to “More Like This”
1) SOLR Request 1: perform an MLT query and get the “interesting terms”
2) Add the boosts
3) Add the numeric fields with their boosts
4) SOLR Request 2: perform a Search with the numeric fields and the “interesting terms”, each with its respective boost
Example of Numeric Field Syntax
Target movie: avg_vote = 8.8
=> a similar movie would have:
avg_vote:[8.8 - 1.5 TO 8.8 + 1.5]
=> compute the bounds and add the boosting factor:
avg_vote:[7.3 TO 10.3]^40
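The range clause can be generated from the target movie's value. A small sketch, assuming a symmetric tolerance around the value (1.5 in the example) and one decimal of precision; the helper name range_clause is hypothetical:

```python
def range_clause(field, value, delta, boost, decimals=1):
    """Build a boosted range clause 'field:[low TO high]^boost' around `value`."""
    # Formatting with a fixed number of decimals hides floating-point
    # noise such as 8.8 - 1.5 == 7.300000000000001.
    low = f"{value - delta:.{decimals}f}"
    high = f"{value + delta:.{decimals}f}"
    return f"{field}:[{low} TO {high}]^{boost}"


clause = range_clause("avg_vote", 8.8, 1.5, 40)
print(clause)
```

The upper bound of 10.3 lies above the maximum IMDB vote of 10; this is harmless for matching, though the bounds could be clamped to the valid range if desired.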
Final SOLR Search Query
Q =
genre:Drama^30
genre:Action^30
genre:Adventure^30
description:one
description:set
description:save
description:journey
description:middle
description:meek
description:hobbit
description:shire
description:sauron
original_title:fellowship^20
original_title:ring^20
original_title:lord^20
original_title:rings^20
description:dark
description:earth
description:powerful
description:destroy
description:lord
description:ring
description:eight
description:companions
avg_vote:[7.3 TO 10.3]^40
Final Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
Score  Title                                          Year  Genre                            Vote
249    The Lord of the Rings: The Return of the King  2003  Action / Adventure / Drama       8.9
246    The Lord of the Rings: The Two Towers          2002  Action / Adventure / Drama       8.7
222    The Lord of the Rings                          1978  Animation / Adventure / Fantasy  6.2
161    Lord of War                                    2005  Action / Crime / Drama           7.6
157    The Lord Protector                             1996  Action / Adventure / Fantasy     4.2
Quality of the Results
Quality
Recommended Products Ordered
● Based on history of sales
Recommended Products Viewed
● Based on history of browsing
Conclusions
MLT in SOLR
● Inverted Index
● TF/IDF Scoring Formula
● Boosting
Quality Measurement
Feedback Loop
● Recommended Products Ordered
● Recommended Products Viewed
References
● https://solr.apache.org/
● https://lucidworks.com/post/who-uses-lucenesolr/
● https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+ratings.csv
● https://www.esolutions.ro/streaming-expressions-in-apache-solr
● https://github.com/oanabrezai/moreLikeThisSOLR
Thank you
Oana Brezai
oana.brezai@esolutions.ro

More on "More Like This" Recommendations in SOLR

  • 1. More on "More Like This" Recommendations in SOLR Oana Brezai Software Engineer @ eSolutions
  • 2. Outline Use Case How does Search work How does MLT work A limitation of MLT Quality of the results Conclusions
  • 3. Use Case: Build a Recommendation Application Requirements ● Movie Store ● ~ 85 K Movies ● Use Open Source Software Solution ● Fast ● High Quality Results
  • 4. Why Apache SOLR ? Solr (NoSQL DB) ● Popular ● Blazing-fast ● Highly scalable ● Open source enterprise search platform ● Built on Apache Lucene
  • 6. “Movie Store” Use Case When ● A user visualizes the details of a movie Then ● The application recommends “similar” movies
  • 7. Example Target Movie ● The Lord of the Rings: The Fellowship of the Ring Recommendations 1) The Lord of the Rings: The Return of the King 2) The Lord of the Rings: The Two Towers 3) The Lord of the Rings 4) Lord of War 5) The Lord Protector
  • 8. What Does “Similar” Mean? Target Movie ● “The Lord of the Rings: The Fellowship of the Ring”  Action / Adventure / Drama  8.8 on IMDB Recommended (Similar) Movies ● The same words in the title ● The same movie genre ● The same words in the description ● Similar IMDB vote
  • 9. Questions Questions for our Recommendation System ● Do all the words have the same importance? ● Do all the fields have the same importance? ● How does the engine differentiate between results?
  • 11. Add Data to SOLR Create a Collection (~Table) ● movie_content Populate the Collection with Data ● 85855 movies
  • 12. Data Structure Movie Fields ● imdb_title_id (movie id) ● original_title ● description ● genre ● avg_vote (imdb vote)
  • 13. Movie Fields -> with Types ● imdb_title_id -> string ● original_title -> “analyzed” text ● description -> “analyzed” text ● genre -> array of strings ● avg_vote -> number
  • 14. String vs “Analyzed” Text Field Types ● Field Type: String ● Example: “Comedy” (field: genre)  Indexed: “Comedy” ● Field Type: “Analyzed” Text ● Example: “The Lord of the Rings: The Fellowship of the Ring” (field: original_title)  Indexed (lowercased and without stopwords): ○ “lord” ○ “rings” ○ “fellowship” ○ “ring”
  • 15. “The Lord of the Rings: The Fellowship of the Ring” ● Movie Id (imdb_title_id): tt0120737 ● Original Title  “The Lord of the Rings: The Fellowship of the Ring” ● Description  “A meek Hobbit from the Shire and eight companions set out on a journey to destroy the powerful One Ring and save Middle-earth from the Dark Lord Sauron.” ● Genre  “Action, Adventure, Drama” ● Imdb vote (avg_vote): 8.8
  • 16.
  • 17. “More Like This” Feature in SOLR More Like This ● Given a movie id => list “similar” movies ● Uses the “Search” functionality
  • 19. “Search” Example 1: Query original_title: “Lord of the Rings” Results ● No movies found
  • 20. “Search” Example 2: Query original_title: “Lord” AND original_title: “Rings” Results (4) 1) "The Lord of the Rings" 2) "The Lord of the Rings: The Fellowship of the Ring" 3) "The Lord of the Rings: The Return of the King" 4) "The Lord of the Rings: The Two Towers” Execution time: 21 ms
  • 21. How Does the Search original_title: “Lord” AND original_title: “Rings” Function? ● Searches in the original_title index all the movies that contain the words “lord” AND “rings” (lowercased!) ● Computes search score based on Boosting, Term Frequency (TF) and Inverse Document Frequency (IDF) ● Displays the results in descending order of the score
  • 22. The TF / IDF Scoring Formula score[movie] =∑(boost(field[j]) * tf(word[i]) * idf(word[i])) where: boost(field[j]) = custom weight given to the field j tf(word[i]) = countTermFreq/(countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength/avgFieldLength)) idf(word[i]) = log(1 + (countDocumentFreq - countTermFreq + 0.5) / (countTermFreq + 0.5)) word[i] = every word in the field, excluding stop words (in our case) fieldLength = count of words in the field, excluding stop words (in our case) avgFieldLength = average length of field
  • 23. original_title = “The Lord of the Rings” genre = “Animation, Adventure, Fantasy” description = “The Fellowship of the Ring embark ...” score = 1 * tf(“lord”) * idf(“lord”) + 1 * tf(“rings”) * idf(“rings”) + 1 * tf(“Animation”) * idf(“Animation”) + ... Debug the Scoring Formula score[movie] =∑(boost(field[j]) * tf(word[i]) * idf(word[i]))
  • 24. Debug the TF / IDF Formula for the QUERY = original_title:Lord AND original_title:Rings Original title CTF (Field) Lord Rings CDF (Corpus) Lord Rings Field Length Score The Lord of the Rings 1 1 26 10 2 8.29 The Lord of the Rings: The Fellowship of the Ring 1 1 26 10 4 6.06 The Lord of the Rings: The Return of the King 1 1 26 10 4 6.06 The Lord of the Rings: The Two Towers 1 1 26 10 4 6.06 tf(word[i]) = countTermFreq/(countTermFreq + 1.2 * (1 - 0.75 + 0.75 * fieldLength / avgFieldLength)) idf(word[i]) = log(1 + (countDocumentFreq - countTermFreq + 0.5) / (countTermFreq + 0.5))
  • 25. “Search” in SOLR High Quality ● Scoring Formula  TF / IDF  Boosting Fast ● Inverted Index
  • 26. Inverted Index (original_title) Id (imdb_title_id) Tile (original_title) tt0120737 The Lord of the Rings: The Fellowship of the Ring tt0167260 The Lord of the Rings: The Return of the King tt0167261 The Lord of the Rings: The Two Towers tt0077869 The Lord of the Rings Word Ids (imbd_title_id) lord tt0120737, tt0167260, tt0167261, tt0077869 rings tt0120737, tt0167260, tt0167261, tt0077869 ring tt0120737 fellowship tt0120737 return tt0167260 king tt0167260 towers tt0167261 two tt0167261
  • 27. How Does “More Like This” Work in SOLR?
  • 28. “More Like This” Example Query ● q = imdb_title_id:tt0120737 (“The Lord of the Rings: The Fellowship of the Ring”) ● Other parameters:  mlt = true  mlt.fl=original_title, description, genre, avg_vote  mlt.mintf = 1  mlt.count = 5
  • 30. Results Results (“The Lord of the Rings: The Fellowship of the Ring”) ● Execution Time: <100 ms ● Total Results: 62387
  • 31. Score Title Year Genre Vote 24.49 The Lord of the Rings 1978 Animation / Adventure / Fantasy 6.2 14.78 The Ring Thing 2004 Adventure / Comedy 3.5 13.11 The Dork of the Rings 2006 Adventure / Comedy / Fantasy 3.2 12.65 The Lord of the Rings: The Return of the King 2003 Action / Adventure / Drama 8.9 11.23 The Lord Protector 1996 Action / Adventure / Fantasy 4.2 Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
  • 32. Score Title Year Genre Vote 24.49 The Lord of the Rings 1978 Animation / Adventure / Fantasy 6.2 14.78 The Ring Thing 2004 Adventure / Comedy 3.5 13.11 The Dork of the Rings 2006 Adventure / Comedy / Fantasy 3.2 12.65 The Lord of the Rings: The Return of the King 2003 Action / Adventure / Drama 8.9 11.23 The Lord Protector 1996 Action / Adventure / Fantasy 4.2 Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
  • 33. Improve Query: Add Boosting Boost Fields (Add Weight) ● original_title ● description ● genre ● avg_vote Importance of Fields avg_vote >> genre >> original_title >> description
  • 34. Scoring Formula
      Boosting factors:
      ● avg_vote -> 40
      ● genre -> 30
      ● original_title -> 20
      ● description -> 1
      For every word in (original_title, description, genre):
          score += boosting(field) * tf(word) * idf(word)
  • 35. Debug Scoring Formula with Boosting
      genre = “Animation, Adventure, Fantasy” -- BOOSTING 30
      original_title = “The Lord of the Rings” -- BOOSTING 20
      description = “The Fellowship of the Ring embark ...” -- BOOSTING 1
      score = 30 * tf(“Animation”) * idf(“Animation”)
            + 30 * tf(“Adventure”) * idf(“Adventure”)
            + 30 * tf(“Fantasy”) * idf(“Fantasy”)
            + 20 * tf(“lord”) * idf(“lord”)
            + ...
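The boosted sum above can be sketched in Python. This is a simplified reading of the formula, not Solr's actual scoring code: `tf` is a raw term count, the `idf` shape is an assumed textbook variant, only query terms that actually occur in the candidate contribute, and every document frequency in `DOC_FREQS` is hypothetical. Recent Solr versions use BM25 by default, so real scores will differ from anything computed here.

```python
import math

# Field boosts from the slides; avg_vote is numeric and is not scored here.
BOOSTS = {"genre": 30.0, "original_title": 20.0, "description": 1.0}
NUM_DOCS = 85855  # corpus size from the slides

def idf(doc_freq, num_docs=NUM_DOCS):
    """Simplified inverse document frequency (assumed form, not Solr's exact one)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query_terms, candidate, doc_freqs):
    """Sum boost(field) * tf * idf over the query terms found in the candidate."""
    total = 0.0
    for field, term in query_terms:
        tf = candidate.get(field, []).count(term)
        if tf > 0:
            total += BOOSTS[field] * tf * idf(doc_freqs[(field, term)])
    return total

# Hypothetical document frequencies, for illustration only.
DOC_FREQS = {
    ("genre", "adventure"): 4000,
    ("original_title", "lord"): 120,
    ("original_title", "rings"): 20,
    ("description", "ring"): 282,
}
QUERY = list(DOC_FREQS)

# Candidate: "The Lord of the Rings" (1978), already analyzed into terms.
candidate = {
    "genre": ["animation", "adventure", "fantasy"],
    "original_title": ["lord", "rings"],
    "description": ["fellowship", "ring", "embark"],
}
print(round(score(QUERY, candidate, DOC_FREQS), 2))
```

Because the genre and title boosts are 30 and 20 while the description boost is 1, a single matching genre or title term outweighs many matching description terms.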
  • 37. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
      Score   Title                                           Year   Genre                             Vote
      1132    The Lord of the Rings                           1978   Animation / Adventure / Fantasy   6.2
      894     The Lord of the Rings: The Return of the King   2003   Action / Adventure / Drama        8.9
      881     The Lord of the Rings: The Two Towers           2002   Action / Adventure / Drama        8.7
      667     Rings                                           2017   Drama / Horror / Mystery          4.5
      661     The Dork of the Rings                           2006   Adventure / Comedy / Fantasy      3.2
  • 39. A Limitation of “More Like This”
  • 40. Numeric Fields Ignored in MLT
      Issue
      ● Only text fields are used in MLT queries
      Solution
      ● Rewrite the whole query as a search query that also includes the numeric fields
  • 41. More on “More Like This” in SOLR
  • 42. “More Like This” Steps
      1) Extract the “interesting terms” from the target movie
      2) Add the per-field boosts (as given in the query) to every interesting term
      3) Perform a Search with those words and boosts
  • 43. “More Like This” Step 1
      1) Extract the “interesting terms” from the target movie (from the field list in the query): take all the words from all the fields, compute their relevance, and keep the top 25.
         Example: the word “ring” is very relevant for the movie “The Lord of the Rings: The Fellowship of the Ring” (present in the movie, rare in the corpus):
         - 2 occurrences in the movie: once in “original_title” and once in “description”
         - in the whole corpus of 85855 movies: 35 times in the field “original_title” and 282 times in the field “description”
      2) Add the per-field boosts (as given in the query) to every interesting term
      3) Perform a Search with those words and boosts
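Step 1 can be sketched as a TF-IDF ranking over the target movie's own terms. This is a simplified sketch, not Lucene's MoreLikeThis implementation: the `idf` shape is an assumed textbook form, the corpus counts for “ring” from the slide (35 and 282) are treated as document frequencies, and all other frequencies are hypothetical. The cap of 25 terms follows the slide.

```python
import math
from collections import Counter

NUM_DOCS = 85855  # corpus size from the slides

def interesting_terms(doc_fields, doc_freqs, max_terms=25, min_tf=1):
    """Rank (field, term) pairs of one document by tf * idf and keep the best ones."""
    scored = []
    for field, tokens in doc_fields.items():
        for term, tf in Counter(tokens).items():
            df = doc_freqs.get((field, term), 0)
            if tf < min_tf or df == 0:
                continue  # too infrequent in the document, or unknown to the index
            idf = 1.0 + math.log(NUM_DOCS / (df + 1))
            scored.append((tf * idf, field, term))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(field, term) for _, field, term in scored[:max_terms]]

# Target movie, already analyzed (description shortened for the example).
target = {
    "original_title": ["lord", "rings", "fellowship", "ring"],
    "description": ["ring", "dark", "lord"],
}
# 35 and 282 come from the slide; the remaining values are hypothetical.
DOC_FREQS = {
    ("original_title", "fellowship"): 5,
    ("original_title", "rings"): 20,
    ("original_title", "ring"): 35,
    ("original_title", "lord"): 120,
    ("description", "ring"): 282,
    ("description", "lord"): 900,
    ("description", "dark"): 3000,
}
top_terms = interesting_terms(target, DOC_FREQS, max_terms=3)
print(top_terms)
```

With equal term counts in the document, the rarest terms in the corpus rank first, which matches the intuition that “fellowship” says more about this movie than “dark”.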
  • 44. List of Interesting Terms for MovieId tt0120737
      genre:Drama genre:Action genre:Adventure
      description:one description:set description:save description:journey description:middle
      description:meek description:hobbit description:shire description:sauron
      original_title:fellowship original_title:ring original_title:lord original_title:rings
      description:dark description:earth description:powerful description:destroy
      description:lord description:ring description:eight description:companions
  • 45. “More Like This” Step 2
      1) Extract the “interesting terms” from the target movie (from the field list in the query)
      2) Add the per-field boosts (as given in the query) to every interesting term: avg_vote^40 genre^30 original_title^20 description
      3) Perform a Search with those words and boosts
  • 46. Interesting Terms for tt0120737 with Boosting
      genre:Drama^30 genre:Action^30 genre:Adventure^30
      description:one description:set description:save description:journey description:middle
      description:meek description:hobbit description:shire description:sauron
      original_title:fellowship^20 original_title:ring^20 original_title:lord^20 original_title:rings^20
      description:dark description:earth description:powerful description:destroy
      description:lord description:ring description:eight description:companions
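Attaching the boosts is a plain string transformation on each interesting term. A minimal sketch, assuming the terms arrive as `(field, term)` pairs; the helper name `boosted_clause` is hypothetical, and the boost values are the ones from the slides (description keeps the implicit boost of 1, so nothing is appended).

```python
# Boosts from the slides; fields missing here keep the default boost of 1.
FIELD_BOOSTS = {"genre": 30, "original_title": 20}

def boosted_clause(field, term, boosts=FIELD_BOOSTS):
    """Render one query clause, appending ^boost when the field has one."""
    boost = boosts.get(field)
    if boost is None:
        return f"{field}:{term}"
    return f"{field}:{term}^{boost}"

terms = [("genre", "Drama"), ("original_title", "ring"), ("description", "hobbit")]
clauses = [boosted_clause(field, term) for field, term in terms]
print(" ".join(clauses))
```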
  • 47. “More Like This” Step 3
      1) Extract the “interesting terms” from the target movie (from the field list in the query)
      2) Add the per-field boosts (as given in the query) to every interesting term
      3) Perform a Search with those words and boosts
  • 48. Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
      Score   Title                                           Year   Genre                             Vote
      1132    The Lord of the Rings                           1978   Animation / Adventure / Fantasy   6.2
      894     The Lord of the Rings: The Return of the King   2003   Action / Adventure / Drama        8.9
      881     The Lord of the Rings: The Two Towers           2002   Action / Adventure / Drama        8.7
      667     Rings                                           2017   Drama / Horror / Mystery          4.5
      661     The Dork of the Rings                           2006   Adventure / Comedy / Fantasy      3.2
  • 49. Add Numeric Fields to “More Like This”
      1) SOLR Request 1: perform an MLT query and get the “interesting terms”
      2) Add boosts
      3) Add numeric fields with their boosts
      4) SOLR Request 2: perform a Search with the numeric fields and the “interesting terms”, each with its respective boost
  • 50. Example of Numeric Field Syntax
      Target movie: avg_vote = 8.8
      => a similar movie would have avg_vote within 1.5 of the target: [8.8 - 1.5 TO 8.8 + 1.5]
      => as a range clause with its boosting factor: avg_vote:[7.3 TO 10.3]^40
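The range clause can be generated from the target movie's vote. A minimal sketch: the helper name `vote_range_clause` is hypothetical, while the tolerance of 1.5 and the boost of 40 are the values from the slide. The bounds are formatted to one decimal because `8.8 - 1.5` is not exactly `7.3` in floating point.

```python
def vote_range_clause(avg_vote, tolerance=1.5, boost=40):
    """Build a boosted range clause centered on the target movie's vote."""
    low = avg_vote - tolerance
    high = avg_vote + tolerance
    # One decimal is enough for IMDb-style votes and hides float rounding noise.
    return f"avg_vote:[{low:.1f} TO {high:.1f}]^{boost}"

clause = vote_range_clause(8.8)
print(clause)  # avg_vote:[7.3 TO 10.3]^40
```

As on the slide, the upper bound is not clamped to 10; votes never exceed 10, so the extra range matches nothing and does no harm.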
  • 51. Final SOLR Search Query
      q = genre:Drama^30 genre:Action^30 genre:Adventure^30
          description:one description:set description:save description:journey description:middle
          description:meek description:hobbit description:shire description:sauron
          original_title:fellowship^20 original_title:ring^20 original_title:lord^20 original_title:rings^20
          description:dark description:earth description:powerful description:destroy
          description:lord description:ring description:eight description:companions
          avg_vote:[7.3 TO 10.3]^40
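The second request can be sketched as joining the clauses into one `q` value and encoding it as request parameters. The function name `search_params`, the `rows` value, and the returned field list in `fl` are assumptions for illustration; the clauses are a shortened version of the query above. The sketch only prepares the parameters and does not send them.

```python
from urllib.parse import urlencode

def search_params(term_clauses, numeric_clauses, rows=5):
    """Join all clauses with spaces into a single q parameter."""
    q = " ".join(list(term_clauses) + list(numeric_clauses))
    # "score" in fl asks Solr to return the relevance score with each hit.
    return {"q": q, "rows": rows, "fl": "original_title,genre,avg_vote,score"}

params = search_params(
    ["genre:Drama^30", "original_title:ring^20", "description:hobbit"],
    ["avg_vote:[7.3 TO 10.3]^40"],
)
query_string = urlencode(params)
print(query_string)
```

With the standard query parser's default operator (OR), a movie does not need to match every clause; each matching clause simply adds its boosted contribution to the score.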
  • 53. Final Results for “The Lord of the Rings: The Fellowship of the Ring” (Action, Adventure, Drama - 8.8)
      Score   Title                                           Year   Genre                             Vote
      249     The Lord of the Rings: The Return of the King   2003   Action / Adventure / Drama        8.9
      246     The Lord of the Rings: The Two Towers           2002   Action / Adventure / Drama        8.7
      222     The Lord of the Rings                           1978   Animation / Adventure / Fantasy   6.2
      161     Lord of War                                     2005   Action / Crime / Drama            7.6
      157     The Lord Protector                              1996   Action / Adventure / Fantasy      4.2
  • 55. Quality
      Recommended Products Ordered
      ● Based on history of sales
      Recommended Products Viewed
      ● Based on history of browsing
  • 57. Conclusions
      MLT in SOLR
      ● Inverted Index
      ● TF/IDF Scoring Formula
      ● Boosting
      Quality Measurement Feedback Loop
      ● Recommended Products Ordered
      ● Recommended Products Viewed
  • 58. References
      ● https://solr.apache.org/
      ● https://lucidworks.com/post/who-uses-lucenesolr/
      ● https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset?select=IMDb+ratings.csv
      ● https://www.esolutions.ro/streaming-expressions-in-apache-solr
      ● https://github.com/oanabrezai/moreLikeThisSOLR