Series-O-Rama
Search & Recommend TV series with SQL
http://bit.ly/series-o-rama
Guillaume Cabanac
guillaume.cabanac@univ-tlse3.fr
Toulouse: A Picture is Worth a Thousand Words
Series-O-Rama: Search & Recommend TV series with SQL Guillaume Cabanac
Toulouse: population 437 000, students 97 000
Capbreton: 3h ride
Ax-les-Thermes: 1h40 ride
Collioure: 2h30 ride
Telly Addicts Need Help to Find TV Series
• Main topics of Grey’s Anatomy?
   • Text mining, visualization
• Series about ‘plane crash island’
   • Search engine
• What should I watch next?
   • Recommender system
Text Mining: Let’s Crunch Subtitles
Examples: Cold Case, Grey’s Anatomy
What’s in a Subtitle File?
• Title – Season – Episode – Language.srt
• 1 episode = 1 plain text file
   • Synchronization: start --> stop
   • Dialogue
• We can easily extract words:
[ a, again*2, and, but, com, cuban,
different, favorite, food, for*2, forum,
going, great, happen*2, has, hungry, i*2,
is, it, love, m, my, nice, night*2, miami,
now, pork, s*2, sandwiches, something, the,
to*2, tonight, town, www ]
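A minimal Python sketch of that extraction, assuming the usual .srt layout (a cue number, a `start --> stop` line, then dialogue lines, with blank lines between cues). The sample cues and the function name `extract_words` are made up for illustration and are not taken from the talk.

```python
import re
from collections import Counter

# Hypothetical two-cue subtitle snippet (not from the talk).
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
I'm hungry. Something great
is going to happen tonight.

2
00:00:04,000 --> 00:00:06,000
I love Cuban food!
"""

# Matches a synchronization line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def extract_words(srt_text):
    """Return a Counter of lowercase words found in the dialogue lines."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and synchronization lines.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

words = extract_words(SAMPLE_SRT)
print(sorted(words.items()))
```

Splitting on letters only turns "I'm" into the two tokens `i` and `m`, which is the same kind of output as the `m` and `s*2` entries in the word list above.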
DB technology at Work! [Home]
7 527 files = 337 MB
100% Java and Oracle
DB technology at Work! [Search engine]
Ranked list of results
DB technology at Work! [Infos]
Most popular terms
Most related series
DB technology at Work! [Recommendations]
I liked / I disliked
What should I watch next?
Ranked list of recommendations
How Does this Work?
Architecture and Data Model
Architecture: DB at the center
• offline: indexing of subtitles
• online: searching, browsing, recommending (GUI)

Data model:
Dict = { idT, term }
  idT  term
  8    plane
  27   killer
  29   crash
Posting = { idT*, idS*, nb }
  idT  idS  nb
  27   45   89
  8    45   3
  8    12   90
(* and ⊆ mark references: Posting.idT ⊆ Dict.idT)
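The two relations can be tried out with the sketch below. It uses Python's built-in `sqlite3` module in place of Oracle (the talk's system is Java and Oracle), and the column types and key constraints are assumptions; only the table names, column names, and rows come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dict (
    idT  INTEGER PRIMARY KEY,
    term TEXT NOT NULL UNIQUE
);
CREATE TABLE posting (
    idT INTEGER NOT NULL REFERENCES dict(idT),
    idS INTEGER NOT NULL,   -- series identifier
    nb  INTEGER NOT NULL,   -- occurrences of the term in the series
    PRIMARY KEY (idT, idS)  -- assumed: one row per (term, series) pair
);
""")
conn.executemany("INSERT INTO dict VALUES (?, ?)",
                 [(8, "plane"), (27, "killer"), (29, "crash")])
conn.executemany("INSERT INTO posting VALUES (?, ?, ?)",
                 [(27, 45, 89), (8, 45, 3), (8, 12, 90)])

# In which series does 'plane' occur, and how often?
rows = conn.execute("""
    SELECT p.idS, p.nb
    FROM posting p JOIN dict d ON p.idT = d.idT
    WHERE d.term = ?
    ORDER BY p.nb DESC
""", ("plane",)).fetchall()
print(rows)  # [(12, 90), (45, 3)]
```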
Theory − Text Indexing Pipeline
1. Tokenization + lowercase → [the, plane, crashed, ..., planes, ..., is]
2. Stopwords removal → [plane, crashed, ..., planes, ...]
3. Stemming → [plane, crash, ..., plane, ...]
4. Counting → {(plane, 48), (crash, 15) ...}

Porter’s Stemmer (1980): http://qaa.ath.cx/porter_js_demo.html

Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
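The four pipeline steps can be sketched in a few lines of Python. This is an illustration under stated assumptions: the stopword list is a tiny made-up one, and `naive_stem` is a crude suffix stripper standing in for Porter's stemmer, not an implementation of it.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "and", "on", "to", "of"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemmer, not the real algorithm."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index(text):
    """Tokenize + lowercase, remove stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    stems = [naive_stem(t) for t in kept]
    return Counter(stems)

counts = index("The plane crashed on the island. Planes crash!")
print(counts)
```

Here "crashed" and "crash" are counted together, as are "plane" and "planes", which is the purpose of the stemming step.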
Theory − Similarity of Paired Series
• Dice’s Coefficient (1945)
   • Based on set theory
• Example: let us model a series as a set of terms
   House = {hospital, doctor, crazy, psycho}
   Grey’s = {doctor, care, hospital}

A Big Limitation
• The distribution of terms among series is ignored: it makes no difference whether a term occurs 1 time or 1,000,000 times
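For the two example sets, Dice's coefficient, 2·|A ∩ B| / (|A| + |B|), can be computed as in this short Python sketch. The function name `dice` and the handling of two empty sets are choices made here for illustration.

```python
def dice(a, b):
    """Dice's coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

score = dice(house, greys)
print(round(score, 3))  # 0.571
```

The two series share 2 terms out of 4 + 3, hence 4/7. Because sets only record whether a term is present, the score would be the same if "doctor" occurred once or a million times.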
Theory − Vector Space Model, Term Weighting
Each series is a vector of term frequencies (TF) over the vocabulary; max is the frequency of its most frequent term.
Term: survive
• Raw TF: dexter > lost
• Normalization, TF / max(TF): dexter < lost
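Written out, the normalization reads as follows. The notation nb(t, S) for the number of occurrences of term t in series S is an assumption, named after the `nb` column of the data model.

```latex
\mathrm{tf}(t, S) = \frac{\mathrm{nb}(t, S)}{\max_{t'} \mathrm{nb}(t', S)}
```

With this definition the most frequent term of a series has tf = 1 and every other term falls between 0 and 1, so series with long and short subtitle corpora become comparable.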
Theory − Best Match Retrieval
1 TV series = 1 vector (components indexed 1 … 45 … 1467 … 6790 … n)
Now, we know how to:
• Find the most popular terms for a TV series
• Compute similarity between TV series
• Find TV series matching a query
Theory − More on Term Weighting
1 TV series = 1 vector (components indexed 1 … 45 … 1467 … 6790 … n)
• All terms are supposed to be equally representative
   … but ‘survive’ is way more unusual than ‘people’
   ⇒ ‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency
Theory − The Big Picture: TF*IDF
An important term for series S is frequent in S and globally unusual.
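A small Python sketch of that idea. The slides do not state which IDF formula is used, so the common variant ln(N / df) is assumed here, where N is the number of series and df the number of series containing the term. All numbers are hypothetical.

```python
import math

def idf(n_series, df):
    """Inverse document frequency, here ln(N / df); one common variant (assumed)."""
    return math.log(n_series / df)

def tf_idf(nb, max_nb, n_series, df):
    """Weight of a term in a series: normalized TF times IDF."""
    return (nb / max_nb) * idf(n_series, df)

# Hypothetical numbers: 100 series; 'people' occurs in 95 of them, 'survive' in 10.
N = 100
w_people = tf_idf(nb=300, max_nb=500, n_series=N, df=95)
w_survive = tf_idf(nb=120, max_nb=500, n_series=N, df=10)
print(w_people < w_survive)  # True
```

Even though 'people' is more frequent inside the series in this made-up example, its near-zero IDF pushes its weight below that of 'survive'.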
Theory … and Practice
Series = { idS, name, maxNb }
  idS  name    maxNb
  12   Lost    540
  45   Dexter  125
Dict = { idT, term, idf }
  idT  term    idf
  8    plane   1.25
  27   killer  2.87
  29   crash   3.07
Posting = { idT*, idS*, nb, tf }
  idT  idS  nb  tf
  27   45   89  0.71
  8    45   3   0.02
  8    12   90  0.16
Posting.idT ⊆ Dict.idT, Posting.idS ⊆ Series.idS
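With tf and idf stored, the terms of a series can be ranked by tf * idf. The Python sketch below does this over the rows shown on the slide; the in-memory dictionaries and the function name `describe` are illustrative stand-ins for the database tables, not the author's code.

```python
# Rows copied from the slide.
series = {12: ("Lost", 540), 45: ("Dexter", 125)}        # idS -> (name, maxNb)
dictionary = {8: ("plane", 1.25), 27: ("killer", 2.87),  # idT -> (term, idf)
              29: ("crash", 3.07)}
posting = [(27, 45, 89, 0.71), (8, 45, 3, 0.02), (8, 12, 90, 0.16)]  # (idT, idS, nb, tf)

def describe(name):
    """Terms of a series ranked by tf * idf, highest first."""
    ids = next(i for i, (n, _) in series.items() if n == name)
    scored = [(dictionary[idT][0], tf * dictionary[idT][1])
              for idT, idS, nb, tf in posting if idS == ids]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

top = describe("Dexter")
for term, score in top:
    print(term, round(score, 4))
```

For Dexter, 'killer' (0.71 × 2.87) ranks far above 'plane' (0.02 × 1.25).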
Description of a TV Series
Example series: Lost (result of a join query, ⋈)
• Many surnames need to be filtered out
Retrieval of TV Series − queries with 1 term
Query: survive (join query, ⋈)
Importance of normalization
• Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
• Blade: nb/maxNb = 9/163 = 0.05521
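The arithmetic can be replayed with a few lines of Python, using only the counts shown on the slide:

```python
# (nb, maxNb) for the term 'survive', taken from the slide.
counts = {"Stargate Atlantis": (63, 1116), "Blade": (9, 163)}

by_tf = sorted(counts, key=lambda s: counts[s][0] / counts[s][1], reverse=True)

for name in by_tf:
    nb, max_nb = counts[name]
    print(f"{name}: {nb}/{max_nb} = {nb / max_nb:.5f}")
```

The raw counts differ by a factor of seven (63 vs 9), yet after dividing by each series' maxNb the two scores are almost tied.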
Retrieval of TV Series − queries with n terms
Query: survive mulder (join query, ⋈)

67 | The Vampire Diaries
  term     tf     idf    tf * idf
  survive  0.028  0.107  0.003
  mulder   0.007  3.977  0.028
  sum                    0.031

18 | X-Files
  term     tf     idf    tf * idf
  survive  0.014  0.107  0.001
  mulder   1.000  3.977  3.977
  sum                    3.978
⁞
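The scoring of a multi-term query is a sum of tf * idf over the query terms. The Python sketch below reproduces the two totals from the slide's figures; the dictionaries stand in for the database tables, and the name `rsv` follows the column alias used in the corresponding SQL query.

```python
# idf per query term and tf per (series, term), as shown on the slide.
idf = {"survive": 0.107, "mulder": 3.977}
tf = {
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
    "X-Files": {"survive": 0.014, "mulder": 1.000},
}

def rsv(series_tf, query_terms):
    """Score of a series for a query: sum of tf * idf over the query terms."""
    return sum(series_tf.get(t, 0.0) * idf[t] for t in query_terms)

query = ["survive", "mulder"]
ranking = sorted(tf, key=lambda name: rsv(tf[name], query), reverse=True)
for name in ranking:
    print(name, round(rsv(tf[name], query), 3))
```

The rare term 'mulder' (high idf) dominates the score, so X-Files comes out far ahead even though The Vampire Diaries has the higher tf for 'survive'.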
Similar to House?
Computing Similarities Among TV Series 1/2
First, let’s compute the numerator Σ Ai × Bi (join query, ⋈), where:
Ai = Terms from House
Bi = Terms from another TV series
Similar to House?
Computing Similarities Among TV Series 2/2
⋈ ⋈ ⋈
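The SQL in the notes divides the numerator by the product of two square roots of summed squares, which is the cosine of the angle between two tf*idf vectors. A minimal Python sketch of that computation follows; the weights are hypothetical and the sparse `{term: weight}` representation is a choice made here for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    numerator = sum(w * b[t] for t, w in a.items() if t in b)  # common terms only
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Hypothetical tf*idf weights, for illustration only.
house = {"hospital": 0.9, "doctor": 0.8, "psycho": 0.3}
greys = {"hospital": 0.7, "doctor": 0.9, "care": 0.4}
lost = {"plane": 0.9, "crash": 0.8, "doctor": 0.1}

scores = {name: cosine(house, vec)
          for name, vec in [("Grey's Anatomy", greys), ("Lost", lost)]}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

Only terms present in both series contribute to the numerator, which mirrors the self-join on `idT` in the SQL query.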
Thank you
http://www.irit.fr/~Guillaume.Cabanac

Editor's notes

  1. select term, tf*idf score
     from posting p, dict d
     where p.idT = d.idT
       and idS = (select idS from series where name = 'Lost')
     order by 2 desc, 1;

  2. select name, term, nb, tf
     from posting p, series s, dict d
     where p.idS = s.idS
       and p.idT = d.idT
       and term = 'survive'
     order by tf desc, name;

  3. select name, sum(tf*idf) rsv
     from posting p, series s, dict d
     where p.idS = s.idS
       and p.idT = d.idT
       and term in ('survive', 'mulder')
     group by p.idS, name
     order by 2 desc, 1;

  4. Same query as note 5, with the aliases named pLost and idLostS instead of pHouse and idHouseS (the series queried is still 'House').

  5. with numerator as (
       select pHouse.idS idHouseS, pOther.idS idOtherS,
              sum(pHouse.tf*idf * pOther.tf*idf) numValue
       from posting pHouse, posting pOther, dict d
       where pHouse.idT = pOther.idT -- common terms
         and pHouse.idT = d.idT -- for IDF
         and pHouse.idS <> pOther.idS
         and pHouse.idS = (select idS from series where name = 'House')
       group by pHouse.idS, pOther.idS
     )
     select name,
            numValue / (
              sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
                    where p.idT = d.idT and p.idS = n.idHouseS))
            * sqrt((select sum(power(tf*idf, 2)) from posting p, dict d
                    where p.idT = d.idT and p.idS = n.idOtherS))) score
     from numerator n, series s
     where n.idOtherS = s.idS
     order by 2 desc, 1;