Search Engines: How They Work and Why You Need Them
Agenda
1. Why build search engines?
2. Search indexes
3. Open source tools
4. Interesting challenges
What do you
even do all day?
We have Google.
@scarletdrive
Not all search engines are
web search engines.
google.com                       potatoparcel.com
Large scope (entire internet)    Small scope (just a few potatoes)
No control over content          Total control over content
Many use cases                   Optimize for selling potatoes
Most websites have a
custom search engine.
Why build search engines?
● Keep it local and customize it
Let’s try to
search my store.
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
Query: “cat”
SELECT *
FROM items
WHERE title LIKE '%cat%'
n = items in database
m = max length of title strings
Cost of scanning every title with LIKE: O(n·m)
With m fixed (m = 250), this is O(n)
n            n · m (m = 250)
10           2,500
100          25,000
1,000        250,000
10,000       2,500,000
100,000      25,000,000
1,000,000    250,000,000
Why build search engines?
● Keep it local and customize it
● Improve performance
● Search for “cat” incorrectly
returns “vacation hat for dog”
● Search for “cat” doesn’t return
“kitten mittens”
● Search for “cats” doesn’t return
“cat hat” or “red cat mittens”
SELECT *
FROM items
WHERE title LIKE '%cats%'
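The three problems above can be reproduced with a small sketch. It assumes an in-memory SQLite database created with Python's standard sqlite3 module; the helper name like_search is made up for illustration, and the table simply mirrors the items table from the slides.

```python
import sqlite3

# Assumption: an in-memory SQLite table standing in for the store's "items" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (1, "red cat mittens", 14.99),
        (2, "blue dog mittens", 24.99),
        (3, "blue hat for cats", 8.00),
        (4, "vacation hat for dog", 12.99),
        (5, "cat hat", 5.00),
        (6, "red and blue dog hat", 10.49),
        (7, "kitten mittens", 11.99),
        (8, "dog booties", 11.99),
    ],
)


def like_search(pattern):
    """Hypothetical helper: return the ids whose title matches a LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM items WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


cat_ids = like_search("%cat%")    # includes 4 ("vaCATion"), misses 7 ("kitten")
cats_ids = like_search("%cats%")  # misses 1 and 5, which only say "cat"
print(cat_ids, cats_ids)
```

Substring matching has no idea where words begin or end, which is why id 4 sneaks in and the plural query loses most of the cat items.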
SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...
Why build search engines?
● Keep it local and customize it
● Improve performance
● Improve quality of results
But how?
Agenda
1. Why build search engines? ✓
2. Search indexes
3. Open source tools
4. Interesting challenges
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
Inverted Index
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
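As a rough sketch of the data structure, the inverted index can be modeled as a dictionary from term to a sorted list of document IDs. The per-document term lists below were reduced by hand (plurals stripped, "for" and "and" dropped), and the names terms_by_doc and build_inverted_index are made up for illustration.

```python
from collections import defaultdict

# Assumption: terms per document, already tokenized and normalized by hand.
terms_by_doc = {
    1: ["red", "cat", "mitten"],
    2: ["blue", "dog", "mitten"],
    3: ["blue", "hat", "cat"],
    4: ["vacation", "hat", "dog"],
    5: ["cat", "hat"],
    6: ["red", "blue", "dog", "hat"],
    7: ["kitten", "mitten"],
    8: ["dog", "boot"],
}


def build_inverted_index(terms_by_doc):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(terms_by_doc):
        # set() so a term repeated inside one document is recorded once
        for term in set(terms_by_doc[doc_id]):
            index[term].append(doc_id)
    return dict(index)


inverted_index = build_inverted_index(terms_by_doc)
print(inverted_index["cat"])
```

The table is read once, up front; after that a query never has to touch the titles again.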
Terminology
● A document is a single searchable unit
  (example: the row “7 kitten mittens 11.99”)
● A field is a defined value in a document
  (example: id, title, price)
● A term is a value extracted from the source in order to build the index
  (example: “kitten” and “mitten” from the title “kitten mittens”)
● An inverted index is an internal data structure which maps terms to IDs
  (example: kitten [7])
● An index is a collection of documents (including many inverted indexes)
Terminology
● A search index can have many inverted indexes
● A search engine can have many search indexes
items index
  id title price
  1 red cat mittens 14.99
  2 blue dog mittens 24.99
  ... ... ...
  title inverted index
    red [1, 6]
    cat [1, 3, 5]
    mitten [1, 2, 7]
    blue [2, 3, 6]
    ... ...
  price inverted index
    5.00 [5]
    8.00 [3]
    0-10.00 [3, 5]
    11.99 [7, 8]
    ... ...
blog-posts index
  title inverted index
  post inverted index
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance
● Improve quality of results
Query: “cat”
Look up the term in the inverted index: O(1)
cat [1, 3, 5]
Fetch the matching documents:
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
r = number of results found
O(1+r)
...but we usually only ask for a fixed number of results at a time
O(25) → O(1)
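A minimal sketch of the query side, assuming the inverted index is a plain Python dict (average-case constant-time lookup) and that the caller asks for at most a fixed page of results. The function name search and the limit parameter are made up for illustration.

```python
# Assumption: a tiny hand-written slice of the title inverted index.
inverted_index = {
    "cat": [1, 3, 5],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
}


def search(term, limit=25):
    """Return at most `limit` document IDs for a single term."""
    doc_ids = inverted_index.get(term, [])  # one dictionary lookup
    return doc_ids[:limit]                  # work is bounded by the page size


print(search("cat"))
print(search("cat", limit=2))
print(search("bird"))
```

Because the slice never returns more than limit IDs, the work per query stays bounded no matter how many items the store holds.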
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
● Improve quality of results
But at
what cost?
Trade-offs
● Space
● System complexity
● Pre-processing time
Query time: O(1)
Index time: O(n·m·p)
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results
Let’s talk about
how we build it.
How did we do this??
Step 1: Tokenization
string: “cat hat”
array: [“cat”, “hat”]
Image from aliexpress.com
Step 2: Normalization
● Stemming
○ “cats” → “cat”
○ “walking” → “walk”
● Stop words
○ Remove “the”, “and”, “to”, etc...
Step 3: Filters
● Lowercase
○ “Dog” → “dog”
● Synonyms
○ “colour” → “color”
○ “t-shirt” → “tshirt”
○ “canadian” → “canada”
○ “kitten” → “cat”
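The three steps can be strung together in a short sketch. Everything here is a toy under stated assumptions: the stemmer is a naive suffix stripper (not a real stemming algorithm), the stop-word list adds "for" and "a" to the words named on the slide, and the sketch lowercases first so that stop-word matching works regardless of capitalization. The names tokenize, stem, and analyze are made up for illustration.

```python
# Assumption: "for" and "a" are added to the slide's stop words for this demo.
STOP_WORDS = {"the", "and", "to", "for", "a"}
# Synonym mappings taken from the slide.
SYNONYMS = {
    "colour": "color",
    "t-shirt": "tshirt",
    "canadian": "canada",
    "kitten": "cat",
}


def tokenize(text):
    """Step 1: split a string into tokens on whitespace."""
    return text.split()


def stem(token):
    """Toy stemmer: strip one common suffix ("ies", "ing", or "s")."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, stem, then map synonyms."""
    terms = []
    for token in tokenize(text):
        token = token.lower()
        if token in STOP_WORDS:
            continue
        token = stem(token)
        terms.append(SYNONYMS.get(token, token))
    return terms


print(analyze("Blue hat for cats"))
print(analyze("kitten mittens"))
```

Real analyzers handle punctuation, irregular plurals, and many languages; the point is only that each title becomes a short list of clean terms before it reaches the index.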
Quality Problems
1. “cat” search returned “vacation hat for dog”
id title price
4 vacation hat for dog 12.99
cat [1, 3, 5]
hat [4]
dog [4]
vacation [4]
Document 4 is indexed under “vacation”, “hat”, and “dog” only, so the query “cat” ([1, 3, 5]) no longer returns it.
Quality Problems
2. “cats” search does not return “red cat mittens”
id title price
1 red cat mittens 14.99
red [1]
cat [1]
mitten [1]
All transformations performed on the input data for the index are also performed on the query:
“cats” → “cat”
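A small sketch of why the query has to go through the same transformations as the indexed text. The analyze function is a deliberately tiny stand-in (lowercase plus stripping a trailing "s"), and treating a multi-term query as a union (OR) of term matches is an assumption made for this demo. The names search_raw and search are made up.

```python
def analyze(text):
    """Toy analysis: lowercase, split on whitespace, strip a trailing "s"."""
    return [
        token[:-1] if token.endswith("s") and len(token) > 3 else token
        for token in text.lower().split()
    ]


titles = {1: "red cat mittens", 5: "cat hat"}

# Build the index with analyze(), so it stores "cat" and "mitten".
index = {}
for doc_id, title in titles.items():
    for term in analyze(title):
        index.setdefault(term, []).append(doc_id)


def search_raw(query):
    """Look up the query string as typed, with no analysis."""
    return index.get(query, [])


def search(query):
    """Analyze the query first, then collect IDs for every resulting term."""
    results = set()
    for term in analyze(query):
        results.update(index.get(term, []))
    return sorted(results)


print(search_raw("cats"))  # the index has no term "cats"
print(search("cats"))      # "cats" becomes "cat" before the lookup
```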
Quality Problems
3. “cat” search does not return “kitten mittens”
id title price
7 kitten mittens 11.99
cat [7]
mitten [7]
With the synonym filter “kitten” → “cat”, document 7 is indexed under “cat”, so a search for “cat” returns it.
3 ½. Search for “kitten” still returns “kitten mittens”, because the query is transformed too: “kitten” → “cat”
Did we solve it?
● Keep it local ✓ and customize it ✓
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results ✓
○ By performing special pre-processing steps
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools
4. Interesting challenges
I want a search engine...
do I have to build it myself?
● Inverted index
● Basic tokenization,
normalization, and filters
● Replication, sharding, and
distribution
● Caching and warming
● Advanced tokenization,
normalization, and filters
● Plugins
● ...and more!
Which one should I pick?
It doesn’t matter
● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/
Which one should I pick?
Better for advanced
customization
Easier to learn, faster to
start up, better docs
~ ~ WARNING: Toria’s personal opinion ~ ~
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges
Interesting Challenge:
Scalability
Too much traffic?
Replication
(every update has to reach each replica)
Too much data?
Sharding
Distribution
Replication, Sharding, and Distribution
8 shards
(A,B,C,D,E,F,G,H)
3 replicas each
6 servers
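To make the numbers concrete: 8 shards with 3 replicas each is 24 shard copies, or 4 per server across 6 servers. The sketch below uses a simple strided round-robin placement so that no server holds two copies of the same shard. This scheme and the name place_replicas are assumptions for illustration only, not how any particular search engine allocates shards.

```python
from collections import defaultdict


def place_replicas(shards, replicas, servers):
    """Assign each shard copy to a server using a strided round-robin.

    Assumes replicas <= servers, so copies of one shard land on
    different servers.
    """
    stride = servers // replicas
    placement = defaultdict(list)
    for i, shard in enumerate(shards):
        for r in range(replicas):
            server = (i + r * stride) % servers
            placement[server].append(f"{shard}{r}")  # e.g. "A0" = shard A, copy 0
    return dict(placement)


placement = place_replicas("ABCDEFGH", replicas=3, servers=6)
for server in sorted(placement):
    print(server, placement[server])
```

With this layout, losing any one server still leaves two copies of every shard it held.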
Interesting Challenge:
Relevance
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
22 feather cat toy 7.99
124 cat and mouse t-shirt 24.50
128 cat t-shirt 31.80
329 “cats rule” sticker 0.99
420 catnip joint for cats 5.99
455 cat toy 7.00
... ... ...
When there are many results, what order should we display them in?
tf-idf
TF(term) = (# times this term appears in doc) / (total # terms in doc)
IDF(term) = log_e(total number of docs / # docs which contain this term)
Relevance with tf-idf
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give
every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0, so here the order follows TF alone
Query: “cat”
Result order = [1, 3, 2]
Relevance with tf-idf
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give
every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0, so here the order follows TF alone
Query: “cat”
Result order = [2, 1, 3]
Relevance with tf-idf
1. The orange cat is a good cat.
2. My cat ate an orange.
(assume 100 records which all contain “cat” in them)
Query: “orange cat”
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
Result order = [2, 1]
TF alone (both terms counted): doc 1 = 3/7 = 0.43, doc 2 = 2/5 = 0.40
With IDF, only “orange” contributes: doc 1 = 1/7 = 0.14, doc 2 = 1/5 = 0.20
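The “orange cat” example can be checked with a few lines of Python. This is a sketch of the plain TF and IDF formulas from the slides, not of what any search engine ships. The 100-record collection is not materialized; the document counts (100 records containing “cat”, 2 containing “orange”) are passed in directly as the slide's assumption. The names tf, idf, and score are made up for illustration.

```python
import math


def tf(term, tokens):
    """Term frequency: occurrences of the term / total terms in the doc."""
    return tokens.count(term) / len(tokens)


def idf(total_docs, docs_with_term):
    """Inverse document frequency, using the natural logarithm."""
    return math.log(total_docs / docs_with_term)


doc1 = "the orange cat is a good cat".split()
doc2 = "my cat ate an orange".split()

# Assumption from the slide: 100 records, all contain "cat", 2 contain "orange".
TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}


def score(query_terms, tokens):
    """Sum of TF * IDF over the query terms."""
    return sum(tf(t, tokens) * idf(TOTAL_DOCS, DOC_FREQ[t]) for t in query_terms)


query = ["orange", "cat"]
scores = {1: score(query, doc1), 2: score(query, doc2)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Without rounding the intermediate values, doc 1 comes out near 0.56 rather than the slide's 0.55; the ranking is the same either way.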
tf-idf
bm25
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
Relevance Challenges
● Prevent keyword stuffing or other “gaming the system”
● Phrase matching
● Fuzzy matching
● User factors: language, location
● Other factors: quality, recency, randomness, diversity
Interesting Challenges
We covered:
● Scalability
● Relevance
We did not cover:
● Query understanding
● Numeric range search
● Faceted search
● Autocomplete
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges ✓
Thanks!

Search Engines: How They Work and Why You Need Them

  • 1. Search Engines How They Work and Why You Need Them
  • 2.
  • 3.
  • 4. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 5. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 6. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 7. What do you even do all day? We have Google. @scarletdrive
  • 8. Not all search engines are web search engines. @scarletdrive
  • 9. google.com potatoparcel.com Large scope (entire internet) Small scope (just a few potatoes) No control over content Total control over content Many use cases Optimize for selling potatoes
  • 10.
  • 11.
  • 12. Most websites have a custom search engine. @scarletdrive
  • 13. Why build search engines? ● Keep it local and customize it
  • 14.
  • 15. Let’s try to search my store. @scarletdrive
  • 16. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 17. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 cat SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 18. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 cat SELECT * FROM items WHERE title LIKE ‘%cat%’
  • 19. n = items in database; m = max length of title strings; work for a full scan: n·m
  • 20. n = items in database; m = max length of title strings = 250 (a constant), so the scan is O(n)
  • 21. Work for a full scan with m = 250:
         n            n · m
         10           2 500
         100          25 000
         1 000        250 000
         10 000       2 500 000
         100 000      25 000 000
         1 000 000    250 000 000
  • 22. Why build search engines? ● Keep it local and customize it ● Improve performance
  • 23. The same query over the items table (slide 16): SELECT * FROM items WHERE title LIKE '%cat%'
  • 24. Search for “cat” incorrectly returns “vacation hat for dog” (the substring “cat” appears inside “vacation”)
  • 25. Search for “cat” doesn’t return “kitten mittens”
  • 26. Search for “cats” doesn’t return “cat hat” or “red cat mittens”: SELECT * FROM items WHERE title LIKE '%cats%'
  • 27. SELECT * FROM items WHERE title LIKE 'cat' OR title LIKE 'cats' OR title LIKE 'cat %' OR title LIKE 'cats %' OR title LIKE '% cat' OR title LIKE '% cats' OR title LIKE '% cat %' OR title LIKE '% cats %' OR title LIKE '% cat.%' OR title LIKE '% cats.%' OR title LIKE '%.cat %' OR title LIKE '%.cats %' OR title LIKE '%.cat.%' OR title LIKE '%.cats.%' OR title LIKE '% cat,%' OR title LIKE '% cats,%' OR title LIKE '%,cat %' OR title LIKE '%,cats %' OR title LIKE '%,cat,%' OR title LIKE '%,cats,%' OR title LIKE '% cat-%' OR title LIKE '% cats-%' OR title LIKE '%-cat %' OR title LIKE '%-cats %' OR title LIKE '%-cat-%' OR title LIKE '%-cats-%' ...
  • 28. Why build search engines? ● Keep it local and customize it ● Improve performance ● Improve quality of results
  • 30. Agenda 1. Why build search engines? ✓ 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 31. Index built from the items table (slide 16): red [1, 6]; cat [1, 3, 5]; mitten [1, 2, 7]; blue [2, 3, 6]; hat [3, 4, 5, 6]; dog [2, 4, 6, 8]; vacation [4]; kitten [7]; boot [8]
  • 32. Inverted Index: red [1, 6]; cat [1, 3, 5]; mitten [1, 2, 7]; blue [2, 3, 6]; hat [3, 4, 5, 6]; dog [2, 4, 6, 8]; vacation [4]; kitten [7]; boot [8]
  • 33. Terminology ● A document is a single searchable unit (example: 7, kitten mittens, 11.99)
  • 34. Terminology ● A field is a defined value in a document (example fields: id, title, price)
  • 35. Terminology ● A term is a value extracted from the source in order to build the index
  • 36. Terminology ● An inverted index is an internal data structure which maps terms to document IDs (example: mitten [1, 2, 7])
  • 37. Terminology ● An index is a collection of documents (including many inverted indexes). Title inverted index: red [1, 6]; cat [1, 3, 5]; mitten [1, 2, 7]; blue [2, 3, 6]; ... Price inverted index: 5.00 [5]; 8.00 [3]; 0-10.00 [3, 5]; 11.99 [7, 8]; ... Documents: 1 red cat mittens 14.99; 2 blue dog mittens 24.99; ...
  • 38. Terminology ● A search index can have many inverted indexes ● A search engine can have many search indexes. Example: the items index holds a title inverted index and a price inverted index; the blog-posts index holds a title inverted index and a post inverted index
  • 39. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ● Improve quality of results
  • 40. Query “cat” against the inverted index: red [1, 6]; cat [1, 3, 5]; mitten [1, 2, 7]; blue [2, 3, 6]; hat [3, 4, 5, 6]; dog [2, 4, 6, 8]; vacation [4]; kitten [7]; boot [8]
  • 41. Looking up the term: O(1)
  • 42. The entry cat [1, 3, 5] points to the matching documents: 1 red cat mittens 14.99; 3 blue hat for cats 8.00; 5 cat hat 5.00
  • 43. r = number of results found; lookup plus fetching the results: O(1+r)
  • 44. ...but we usually only ask for a fixed number of results at a time: O(25) → O(1)
  • 45. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ● Improve quality of results
  • 47. Trade-offs ● Space ● System complexity ● Pre-processing time
  • 49. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results
  • 50. Let’s talk about how we build it. @scarletdrive
  • 51. How did we get from the items table (slide 16) to this index? red [1, 6]; cat [1, 3, 5]; mitten [1, 2, 7]; blue [2, 3, 6]; hat [3, 4, 5, 6]; dog [2, 4, 6, 8]; vacation [4]; kitten [7]; boot [8]
  • 52. Step 1: Tokenization. string: “cat hat” → array: [“cat”, “hat”] (Image from aliexpress.com)
  • 53. Step 2: Normalization ● Stemming: “cats” → “cat”, “walking” → “walk” ● Stop words: remove “the”, “and”, “to”, etc. (Image from aliexpress.com)
  • 54. Step 3: Filters ● Lowercase: “Dog” → “dog” ● Synonyms: “colour” → “color”, “t-shirt” → “tshirt”, “canadian” → “canada”, “kitten” → “cat” (Image from aliexpress.com)
  • 55. Quality Problems 1. “cat” search returned “vacation hat for dog”
  • 56. Quality Problems 1. “cat” search returned “vacation hat for dog”. Document 4 (vacation hat for dog, 12.99) is indexed under hat [4], dog [4], and vacation [4], not under cat [1, 3, 5]
  • 57. Quality Problems 1. “cat” search returned “vacation hat for dog”. A query for “cat” reads only cat [1, 3, 5], so document 4 (vacation hat for dog, 12.99) is no longer returned
  • 58. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens”
  • 59. Quality Problems 2. “cats” search does not return “red cat mittens”. Document 1 (red cat mittens, 14.99) → red [1]; cat [1]; mitten [1]
  • 60. All transformations performed on the input data for the index are also performed on the query
  • 61. Quality Problems 2. “cats” search does not return “red cat mittens”. The query “cats” is stemmed to “cat”, which matches cat [1] for document 1 (red cat mittens, 14.99)
  • 62. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens” 3. “cat” search does not return “kitten mittens”
  • 63. Quality Problems 3. “cat” search does not return “kitten mittens”. Document 7 (kitten mittens, 11.99) → cat [7]; mitten [7]
  • 64. Quality Problems 3. “cat” search does not return “kitten mittens”. A query for “cat” now matches cat [7] for document 7 (kitten mittens, 11.99)
  • 65. Quality Problems 3 ½. Search for “kitten” still returns “kitten mittens”: the query “kitten” is mapped to “cat”, which matches cat [7] for document 7 (kitten mittens, 11.99)
  • 66. Did we solve it? ● Keep it local ✓ and customize it ✓ ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results ✓ ○ By performing special pre-processing steps
  • 67. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools 4. Interesting challenges
  • 68. I want a search engine... do I have to build it myself? @scarletdrive
  • 70. ● Inverted index ● Basic tokenization, normalization, and filters ● Replication, sharding, and distribution ● Caching and warming ● Advanced tokenization, normalization, and filters ● Plugins ● ...and more!
  • 71. Which one should I pick? It doesn’t matter
  • 72. Which one should I pick? ● Most projects work well with either ● Getting configuration right is most important ● Test with your own data, your own queries Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability Solr vs. Elasticsearch by Kelvin Tan http://solr-vs-elasticsearch.com/
  • 73. Which one should I pick? Better for advanced customization Easier to learn, faster to start up, better docs ~ ~ WARNING: Toria’s personal opinion ~ ~
  • 74. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges
  • 79. Replication, Sharding, and Distribution: 8 shards (A, B, C, D, E, F, G, H), 3 replicas each, 6 servers
  • 82. When there are many results, what order should we display them in?
         id    title                     price
         1     red cat mittens           14.99
         3     blue hat for cats          8.00
         5     cat hat                    5.00
         22    feather cat toy            7.99
         124   cat and mouse t-shirt     24.50
         128   cat t-shirt               31.80
         329   “cats rule” sticker        0.99
         420   catnip joint for cats      5.99
         455   cat toy                    7.00
         ...   ...                        ...
  • 84. Relevance with tf-idf. TF(term) = (# times this term appears in doc) / (total # terms in doc). IDF(term) = log_e(total number of docs / # docs which contain this term). Documents: 1. The orange cat is a very good cat. 2. My cat ate an orange. 3. Cats are the best and I will give every cat a special cat toy. Query: “cat”. TF(cat): doc 1 = 2/8 = 0.25; doc 2 = 1/5 = 0.20; doc 3 = 3/14 = 0.21. IDF(cat) = log_e(3/3), identical for all three documents, so the ranking here follows TF alone. Result order = [1, 3, 2]
  • 85. Relevance with tf-idf. Same query “cat”, but document 2 now reads “My cat ate an orange. Cat cat cat!”. TF(cat): doc 1 = 2/8 = 0.25; doc 2 = 4/8 = 0.50; doc 3 = 3/14 = 0.21. IDF(cat) = log_e(3/3). Result order = [2, 1, 3]
  • 86. Relevance with tf-idf. Documents: 1. The orange cat is a good cat. 2. My cat ate an orange. (Assume 100 records which all contain “cat”.) Query: “orange cat”. IDF(cat) = log_e(100/100) = 0.0; IDF(orange) = log_e(100/2) = 3.9
  • 87. Relevance with tf-idf. Query: “orange cat”. IDF(cat) = log_e(100/100) = 0.0; IDF(orange) = log_e(100/2) = 3.9. Doc 1: score = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55. Doc 2: score = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
  • 88. Relevance with tf-idf. Query: “orange cat”. Doc 1: score = 0.29*0.0 + 0.14*3.9 = 0.55. Doc 2: score = 0.20*0.0 + 0.20*3.9 = 0.78. Result order = [2, 1]. Share of terms matching either query word: doc 1 = 3/7 = 0.43, doc 2 = 2/5 = 0.40. Share of terms matching “orange”: doc 1 = 1/7 = 0.14, doc 2 = 1/5 = 0.20
  • 90. Relevance Challenges ● Prevent keyword stuffing or other “gaming the system” ● Phrase matching ● Fuzzy matching ● User factors: language, location ● Other factors: quality, recency, randomness, diversity
  • 91. Interesting Challenges. We covered: ● Scalability ● Relevance. We did not cover: ● Query understanding ● Numeric range search ● Faceted search ● Autocomplete
  • 92. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges ✓
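
The indexing steps (tokenization, normalization, filters) and the tf-idf formulas from the slides can be sketched in Python. This is a minimal illustration under stated assumptions, not how Elasticsearch or Solr implement them: the stop-word list, the suffix rules in toy_stem, the "all terms must match" behaviour of search, and every function name are made up for this example. The document frequencies passed to tf_idf are the ones assumed on slides 86 to 88.

```python
import math
import re
from collections import defaultdict

# Catalogue titles copied from the slides.
ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}

STOP_WORDS = {"the", "and", "to", "for"}  # assumed list, for illustration only
SYNONYMS = {"kitten": "cat"}              # synonym taken from the Filters slide


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(token):
    """Very naive suffix stripping (assumption); real engines use proper stemmers."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3]
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text):
    """Tokenize, drop stop words, stem, then map synonyms."""
    terms = []
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        stemmed = toy_stem(token)
        terms.append(SYNONYMS.get(stemmed, stemmed))
    return terms


def build_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, title in docs.items():
        for term in analyze(title):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return IDs of documents containing every analyzed query term (assumed AND)."""
    result = None
    for term in analyze(query):
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


def tf_idf(query, doc, doc_freq, n_docs):
    """Sum TF * IDF over the query terms, using the formulas from the slides."""
    tokens = tokenize(doc)
    score = 0.0
    for term in tokenize(query):
        tf = tokens.count(term) / len(tokens)
        idf = math.log(n_docs / doc_freq[term])
        score += tf * idf
    return score


index = build_index(ITEMS)
print(search(index, "cats"))  # the query goes through the same analysis as the titles

# Assumed as on the slides: 100 docs, all containing "cat", 2 containing "orange".
df = {"cat": 100, "orange": 2}
doc1 = "The orange cat is a good cat."
doc2 = "My cat ate an orange."
print(tf_idf("orange cat", doc1, df, 100), tf_idf("orange cat", doc2, df, 100))
```

Because the query is analyzed the same way as the titles, “cat”, “cats”, and “kitten” all resolve to the same entry, and “vacation hat for dog” no longer matches. The two tf-idf scores come out at roughly 0.56 and 0.78, so document 2 ranks first; the slides show 0.55 for document 1 because they round the intermediate values.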