Search Engines: How They Work and Why You Need Them
Agenda
1. Why build search engines?
2. Search indexes
3. Open source tools
4. Interesting challenges
What do you even do all day? We have Google.
@scarletdrive
Not all search engines are web search engines.
google.com
● Large scope (entire internet)
● No control over content
● Many use cases
potatoparcel.com
● Small scope (just a few potatoes)
● Total control over content
● Optimize for selling potatoes
Most websites have a custom search engine.
Why build search engines?
● Keep it local and customize it
Let’s try to search my store.
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99
Query: “cat”
SELECT *
FROM items
WHERE title LIKE '%cat%'
n = items in database
m = max length of title strings
Cost of scanning every title: O(n·m)
With m fixed (m = 250), this is O(n)
n            n · m (m = 250)
10           2 500
100          25 000
1 000        250 000
10 000       2 500 000
100 000      25 000 000
1 000 000    250 000 000
Why build search engines?
● Keep it local and customize it
● Improve performance
● Search for “cat” incorrectly returns “vacation hat for dog”
● Search for “cat” doesn’t return “kitten mittens”
● Search for “cats” doesn’t return “cat hat” or “red cat mittens”
SELECT *
FROM items
WHERE title LIKE '%cats%'
SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...
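The quality problems with substring matching can be reproduced with a minimal sketch. It assumes Python's built-in sqlite3 module and an in-memory copy of the items table from the slides; the helper name like_search is hypothetical.

```python
import sqlite3

# The eight items from the store example.
ITEMS = [
    (1, "red cat mittens", 14.99),
    (2, "blue dog mittens", 24.99),
    (3, "blue hat for cats", 8.00),
    (4, "vacation hat for dog", 12.99),
    (5, "cat hat", 5.00),
    (6, "red and blue dog hat", 10.49),
    (7, "kitten mittens", 11.99),
    (8, "dog booties", 11.99),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", ITEMS)


def like_search(word):
    """Return the ids whose title contains `word` as a raw substring."""
    rows = conn.execute(
        "SELECT id FROM items WHERE title LIKE ? ORDER BY id",
        (f"%{word}%",),
    )
    return [row[0] for row in rows]


cat_ids = like_search("cat")    # includes id 4 ("vacation"), misses id 7 ("kitten")
cats_ids = like_search("cats")  # only id 3; misses "cat hat" and "red cat mittens"
print(cat_ids, cats_ids)
```

The substring scan has no notion of words, plurals, or synonyms, which is why each fix has to be bolted on as another LIKE clause.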
Why build search engines?
● Keep it local and customize it
● Improve performance
● Improve quality of results
But how?
Agenda
1. Why build search engines? ✓
2. Search indexes
3. Open source tools
4. Interesting challenges
Inverted Index
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
hat [3, 4, 5, 6]
dog [2, 4, 6, 8]
vacation [4]
kitten [7]
boot [8]
Terminology
● A document is a single searchable unit (for example, the row “7 kitten mittens 11.99”)
● A field is a defined value in a document (for example, id, title, or price)
● A term is a value extracted from the source in order to build the index
● An inverted index is an internal data structure which maps terms to IDs
● An index is a collection of documents (including many inverted indexes)
items index
id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
... ... ...
title inverted index
red [1, 6]
cat [1, 3, 5]
mitten [1, 2, 7]
blue [2, 3, 6]
... ...
price inverted index
5.00 [5]
8.00 [3]
0-10.00 [3, 5]
11.99 [7, 8]
... ...
Terminology
● A search index can have many inverted indexes
● A search engine can have many search indexes
items index: title inverted index, price inverted index
blog-posts index: title inverted index, post inverted index
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance
● Improve quality of results
Query: “cat”
Look up the term in the inverted index: O(1)
cat [1, 3, 5]
Fetch the matching documents:
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
r = number of results found
O(1+r)
...but we usually only ask for a fixed number of results at a time
O(25) → O(1)
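The lookup cost above can be sketched as follows, assuming the inverted index is held in a Python dict (a hash map with average-case O(1) lookup). The function name search and the page_size parameter are hypothetical; the data is copied from the slides.

```python
# Part of the title inverted index and the matching rows, taken from the slides.
INDEX = {
    "cat": [1, 3, 5],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
}
ITEMS = {
    1: ("red cat mittens", 14.99),
    2: ("blue dog mittens", 24.99),
    3: ("blue hat for cats", 8.00),
    4: ("vacation hat for dog", 12.99),
    5: ("cat hat", 5.00),
    6: ("red and blue dog hat", 10.49),
    8: ("dog booties", 11.99),
}


def search(term, page_size=25, offset=0):
    """Look up one term, then fetch at most `page_size` documents."""
    ids = INDEX.get(term, [])               # O(1) on average: one hash lookup
    page = ids[offset:offset + page_size]   # bounded by page_size, not by r
    return [(doc_id, *ITEMS[doc_id]) for doc_id in page]


results = search("cat")
first_two_dogs = search("dog", page_size=2)
print(results)
```

Because the page size is a constant, the fetch step stays bounded no matter how many documents match the term.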
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
● Improve quality of results
But at what cost?
Trade-offs
● Space
● System complexity
● Pre-processing time
Query time: O(1)
Index time: O(n·m·p)
Did we solve it?
● Keep it local ✓ and customize it
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results
Let’s talk about how we build it.
How did we do this??
Step 1: Tokenization
string: “cat hat”
array: [“cat”, “hat”]
Step 2: Normalization
● Stemming
○ “cats” → “cat”
○ “walking” → “walk”
● Stop words
○ Remove “the”, “and”, “to”, etc...
Step 3: Filters
● Lowercase
○ “Dog” → “dog”
● Synonyms
○ “colour” → “color”
○ “t-shirt” → “tshirt”
○ “canadian” → “canada”
○ “kitten” → “cat”
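The three steps above can be sketched as a small analysis pipeline that builds the title inverted index. This is a toy illustration under stated assumptions: the stemmer is a naive suffix stripper rather than a real algorithm such as Porter's, “for” is added to the stop words so the output matches the example index, lowercasing is applied first for simplicity, and the names stem, analyze, and build_index are hypothetical.

```python
from collections import defaultdict

STOP_WORDS = {"the", "and", "to", "for"}  # "for" is an assumption, not from the slides
SYNONYMS = {"colour": "color", "kitten": "cat"}

ITEMS = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def stem(token):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, synonyms=None):
    """Lowercase, tokenize on whitespace, drop stop words, stem, map synonyms."""
    synonyms = synonyms or {}
    terms = []
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        term = stem(token)
        terms.append(synonyms.get(term, term))
    return terms


def build_index(docs, synonyms=None):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # dict.fromkeys removes repeated terms while keeping their order
        for term in dict.fromkeys(analyze(docs[doc_id], synonyms)):
            index[term].append(doc_id)
    return dict(index)


index = build_index(ITEMS)                       # without the synonym filter
index_with_synonyms = build_index(ITEMS, SYNONYMS)
print(index["cat"], index_with_synonyms["cat"])
```

Without synonyms, “kitten” keeps its own entry; with the synonym filter enabled, document 7 is indexed under “cat” instead.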
Quality Problems
1. “cat” search returned “vacation hat for dog”
id title price
4 vacation hat for dog 12.99
cat [1, 3, 5]
hat [4]
dog [4]
vacation [4]
The query “cat” now matches only the term cat [1, 3, 5]; “vacation” is indexed as its own term, so document 4 is not returned.
Quality Problems
2. “cats” search does not return “red cat mittens”
id title price
1 red cat mittens 14.99
red [1]
cat [1]
mitten [1]
All transformations performed on the input data for the index are also performed on the query:
cats → cat
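A minimal sketch of the query-side rule, assuming a deliberately tiny analyzer (lowercase, split on whitespace, strip a trailing “s”) shared by indexing and search. The names analyze and search are hypothetical, and matching any query term (OR semantics) is an assumption.

```python
# Index entries for document 1, as shown on the slide.
INDEX = {"red": [1], "cat": [1], "mitten": [1]}


def analyze(text):
    """Toy analyzer: lowercase, split, and strip a plural "s"."""
    return [
        token[:-1] if token.endswith("s") and len(token) > 3 else token
        for token in text.lower().split()
    ]


def search(query, index):
    """Run the query through the same analyzer, then union the postings."""
    ids = set()
    for term in analyze(query):
        ids.update(index.get(term, []))
    return sorted(ids)


raw_lookup = INDEX.get("cats", [])   # the raw query string is not a term
analyzed = search("cats", INDEX)     # "cats" is analyzed to "cat" first
print(raw_lookup, analyzed)
```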
Quality Problems
3. “cat” search does not return “kitten mittens”
id title price
7 kitten mittens 11.99
cat [7]
mitten [7]
The synonym filter indexes “kitten” as “cat”, so the query “cat” now matches document 7.
3 ½. Search for “kitten” still returns “kitten mittens”, because the query is transformed too:
kitten → cat
Did we solve it?
● Keep it local ✓ and customize it ✓
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results ✓
○ By performing special pre-processing steps
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools
4. Interesting challenges
I want a search engine... do I have to build it myself?
● Inverted index
● Basic tokenization, normalization, and filters
● Replication, sharding, and distribution
● Caching and warming
● Advanced tokenization, normalization, and filters
● Plugins
● ...and more!
Which one should I pick?
It doesn’t matter
Which one should I pick?
● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/
Which one should I pick?
Better for advanced customization
Easier to learn, faster to start up, better docs
~ ~ WARNING: Toria’s personal opinion ~ ~
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges
Interesting Challenge: Scalability
Too much traffic? Replication
Too much data? Sharding
Distribution
Replication, Sharding, and Distribution
8 shards (A,B,C,D,E,F,G,H)
3 replicas each
6 servers
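One way to picture the numbers above is a round-robin placement: 8 shards with 3 copies each gives 24 shard copies, or 4 per server across 6 servers. The sketch below is only an illustration; the function place_replicas is hypothetical, it reads “3 replicas each” as three copies in total, and the layout on the original slide may differ.

```python
def place_replicas(shards, replicas, servers):
    """Assign every copy of every shard to a server, round-robin.

    Consecutive copies of a shard land on consecutive servers, so no server
    holds two copies of the same shard as long as replicas <= servers.
    """
    if replicas > servers:
        raise ValueError("need at least as many servers as replicas")
    placement = {server: [] for server in range(servers)}
    slot = 0
    for shard in shards:
        for _ in range(replicas):
            placement[slot % servers].append(shard)
            slot += 1
    return placement


placement = place_replicas("ABCDEFGH", replicas=3, servers=6)
for server, held in placement.items():
    print(server, held)
```

With this layout, losing any single server still leaves two copies of every shard it held.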
Interesting Challenge: Relevance
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00
22 feather cat toy 7.99
124 cat and mouse t-shirt 24.50
128 cat t-shirt 31.80
329 “cats rule” sticker 0.99
420 catnip joint for cats 5.99
455 cat toy 7.00
... ... ...
When there are many results, what order should we display them in?
tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
Relevance with tf-idf
Query: “cat”
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3)
Result order = [1, 3, 2]
Relevance with tf-idf
Query: “cat”
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3)
Result order = [2, 1, 3]
Relevance with tf-idf
Query: “orange cat”
1. The orange cat is a good cat.
2. My cat ate an orange.
(assume 100 records which all contain “cat” in them)
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
score(doc1) = TF(cat, doc1)*IDF(cat) + TF(orange, doc1)*IDF(orange) = 0.29*0.0 + 0.14*3.9 = 0.55
score(doc2) = TF(cat, doc2)*IDF(cat) + TF(orange, doc2)*IDF(orange) = 0.20*0.0 + 0.20*3.9 = 0.78
Result order = [2, 1]
3/7 = 0.43
2/5 = 0.40
1/7 = 0.14
1/5 = 0.20
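The “orange cat” example can be checked with a short sketch. It assumes a simple lowercase letters-only tokenizer with no stemming, and takes the corpus statistics from the slide (100 records, all containing “cat”, 2 containing “orange”) as given constants; the function names are hypothetical. Full-precision arithmetic gives about 0.56 for document 1, where the slide's 0.55 comes from rounding the intermediate values.

```python
import math
import re

DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",
}
TOTAL_DOCS = 100                        # assumed corpus size, from the slide
DOC_FREQ = {"cat": 100, "orange": 2}    # docs containing each term, from the slide


def tokenize(text):
    """Lowercase and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def tf(term, doc):
    """Occurrences of the term divided by the total number of terms in the doc."""
    tokens = tokenize(doc)
    return tokens.count(term) / len(tokens)


def idf(term):
    """Natural log of (total docs / docs containing the term)."""
    return math.log(TOTAL_DOCS / DOC_FREQ[term])


def score(query, doc):
    """Sum of TF * IDF over the query terms."""
    return sum(tf(term, doc) * idf(term) for term in tokenize(query))


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because every record contains “cat”, its IDF is zero and the ranking is decided entirely by how much of each document is “orange”.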
tf-idf
bm25
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
Relevance Challenges
● Prevent keyword stuffing or other “gaming the system”
● Phrase matching
● Fuzzy matching
● User factors: language, location
● Other factors: quality, recency, randomness, diversity
Interesting Challenges
We covered:
● Scalability
● Relevance
We did not cover:
● Query understanding
● Numeric range search
● Faceted search
● Autocomplete
Agenda
1. Why build search engines? ✓
2. Search indexes ✓
3. Open source tools ✓
4. Interesting challenges ✓
Thanks!

Search Engines: How They Work and Why You Need Them

  • 1. Search Engines How They Work and Why You Need Them
  • 2.
  • 3.
  • 4. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 5. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 6. Agenda 1. Why build search engines? 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 7. What do you even do all day? We have Google. @scarletdrive
  • 8. Not all search engines are web search engines. @scarletdrive
  • 9. google.com potatoparcel.com Large scope (entire internet) Small scope (just a few potatoes) No control over content Total control over content Many use cases Optimize for selling potatoes
  • 12. Most websites have a custom search engine. @scarletdrive
  • 13. Why build search engines? ● Keep it local and customize it
  • 15. Let’s try to search my store. @scarletdrive
  • 16. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 17. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 cat SELECT * FROM items WHERE title LIKE '%cat%'
  • 19. n = items in database m = max length of title strings n·m
  • 20. n = items in database m = max length of title strings = 250 O(n)
  • 21. n → n · m (m = 250): 10 → 2 500; 100 → 25 000; 1 000 → 250 000; 10 000 → 2 500 000; 100 000 → 25 000 000; 1 000 000 → 250 000 000
  • 22. Why build search engines? ● Keep it local and customize it ● Improve performance
  • 23. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 SELECT * FROM items WHERE title LIKE '%cat%'
  • 24. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” SELECT * FROM items WHERE title LIKE '%cat%'
  • 25. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” SELECT * FROM items WHERE title LIKE '%cat%'
  • 26. id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 ● Search for “cat” incorrectly returns “vacation hat for dog” ● Search for “cat” doesn’t return “kitten mittens” ● Search for “cats” doesn’t return “cat hat” or “red cat mittens” SELECT * FROM items WHERE title LIKE '%cats%'
  • 27. SELECT * FROM items WHERE title LIKE 'cat' OR title LIKE 'cats' OR title LIKE 'cat %' OR title LIKE 'cats %' OR title LIKE '% cat' OR title LIKE '% cats' OR title LIKE '% cat %' OR title LIKE '% cats %' OR title LIKE '% cat.%' OR title LIKE '% cats.%' OR title LIKE '%.cat %' OR title LIKE '%.cats %' OR title LIKE '%.cat.%' OR title LIKE '%.cats.%' OR title LIKE '% cat,%' OR title LIKE '% cats,%' OR title LIKE '%,cat %' OR title LIKE '%,cats %' OR title LIKE '%,cat,%' OR title LIKE '%,cats,%' OR title LIKE '% cat-%' OR title LIKE '% cats-%' OR title LIKE '%-cat %' OR title LIKE '%-cats %' OR title LIKE '%-cat-%' OR title LIKE '%-cats-%' ...
  • 28. Why build search engines? ● Keep it local and customize it ● Improve performance ● Improve quality of results
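The two quality problems of the `LIKE` query can be reproduced with a small script. This is a minimal sketch, assuming an in-memory SQLite database; the `like_search` helper and the table layout are illustrative choices, not something the slides prescribe.

```python
import sqlite3

# The eight store items from the slides.
ITEMS = [
    (1, "red cat mittens", 14.99),
    (2, "blue dog mittens", 24.99),
    (3, "blue hat for cats", 8.00),
    (4, "vacation hat for dog", 12.99),
    (5, "cat hat", 5.00),
    (6, "red and blue dog hat", 10.49),
    (7, "kitten mittens", 11.99),
    (8, "dog booties", 11.99),
]


def like_search(word):
    """Return the ids whose title contains `word` as a raw substring."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", ITEMS)
    rows = conn.execute(
        "SELECT id FROM items WHERE title LIKE ? ORDER BY id",
        (f"%{word}%",),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# Substring matching finds "cat" inside "vacation" and misses "kitten".
print(like_search("cat"))
# Only the one title that literally contains "cats" matches.
print(like_search("cats"))
```

Every call scans all titles, which is the n·m cost discussed earlier, and the matches are decided by characters rather than by words.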
  • 30. Agenda 1. Why build search engines? ✓ 2. Search indexes 3. Open source tools 4. Interesting challenges
  • 31. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99
  • 32. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] Inverted Index
  • 33. Terminology ● A document is a single searchable unit red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] 7 kitten mittens 11.99
  • 34. Terminology ● A document is a single searchable unit ● A field is a defined value in a document red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 7 kitten mittens 11.99
  • 35. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 7 kitten mittens 11.99
  • 36. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to IDs red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8]
  • 37. Terminology ● A document is a single searchable unit ● A field is a defined value in a document ● A term is a value extracted from the source in order to build the index ● An inverted index is an internal data structure which maps terms to IDs ● An index is a collection of documents (including many inverted indexes) red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] ... ... 5.00 [5] 8.00 [3] 0-10.00 [3, 5] 11.99 [7, 8] ... ... id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 ... ... ...
  • 38. Terminology ● A search index can have many inverted indexes ● A search engine can have many search indexes. items index: title inverted index, price inverted index. blog-posts index: title inverted index, post inverted index
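The terminology above maps naturally onto nested dictionaries: one index holds one inverted index per field, and each inverted index maps a term to the IDs of the documents that contain it. The following is a minimal sketch under that assumption; `DOCUMENTS`, `build_index`, and `items_index` are hypothetical names, and the terms are listed by hand because term extraction is a separate step.

```python
from collections import defaultdict

# Two documents from the slides, with their terms already extracted per field.
DOCUMENTS = {
    1: {"title": ["red", "cat", "mitten"], "price": ["14.99"]},
    2: {"title": ["blue", "dog", "mitten"], "price": ["24.99"]},
}


def build_index(documents):
    """Build one inverted index (term -> ascending doc ids) per field."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(documents):
        for field, terms in documents[doc_id].items():
            for term in terms:
                postings = index[field][term]
                if doc_id not in postings:  # list each document once per term
                    postings.append(doc_id)
    return index


items_index = build_index(DOCUMENTS)
print(dict(items_index["title"]))  # the title inverted index
print(dict(items_index["price"]))  # the price inverted index
```

A search engine hosting several indexes (for example an items index and a blog-posts index) would simply hold several such structures side by side.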
  • 39. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ● Improve quality of results
  • 40. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] cat
  • 41. O(1)
  • 42. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] cat id title price 1 red cat mittens 14.99 3 blue hat for cats 8.00 5 cat hat 5.00
  • 43. r = number of results found O(1+r)
  • 44. ...but we usually only ask for a fixed number of results at a time O(25) → O(1)
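The lookup cost described above can be illustrated with a plain dictionary, whose key lookup takes constant time on average. This is a sketch only: `TITLE_INDEX` holds a hand-copied subset of the slide's inverted index, and `search` with its `page_size` parameter is a hypothetical helper.

```python
# A subset of the title inverted index from the slides.
TITLE_INDEX = {
    "cat": [1, 3, 5],
    "dog": [2, 4, 6, 8],
    "hat": [3, 4, 5, 6],
}

# Stored titles, used to turn matching ids back into results.
TITLES = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}


def search(term, page_size=25):
    """Look up one term and return at most `page_size` (id, title) pairs."""
    doc_ids = TITLE_INDEX.get(term, [])  # one dictionary lookup
    page = doc_ids[:page_size]           # bounded by the fixed page size
    return [(doc_id, TITLES[doc_id]) for doc_id in page]


print(search("cat"))
print(search("dog", page_size=2))
```

Because the page size is a constant, the work per query no longer grows with the number of items in the store.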
  • 45. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ● Improve quality of results
  • 47. Trade-offs ● Space ● System complexity ● Pre-processing time
  • 49. Did we solve it? ● Keep it local ✓ and customize it ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results
  • 50. Let’s talk about how we build it. @scarletdrive
  • 51. red [1, 6] cat [1, 3, 5] mitten [1, 2, 7] blue [2, 3, 6] hat [3, 4, 5, 6] dog [2, 4, 6, 8] vacation [4] kitten [7] boot [8] id title price 1 red cat mittens 14.99 2 blue dog mittens 24.99 3 blue hat for cats 8.00 4 vacation hat for dog 12.99 5 cat hat 5.00 6 red and blue dog hat 10.49 7 kitten mittens 11.99 8 dog booties 11.99 How did we do this??
  • 52. Step 1: Tokenization string: “cat hat” array: [“cat”, “hat”] Image from aliexpress.com
  • 53. Image from aliexpress.com Step 2: Normalization ● Stemming ○ “cats” → “cat” ○ “walking” → “walk” ● Stop words ○ Remove “the”, “and”, “to”, etc...
  • 54. Image from aliexpress.com Step 3: Filters ● Lowercase ○ “Dog” → “dog” ● Synonyms ○ “colour” → “color” ○ “t-shirt” → “tshirt” ○ “canadian” → “canada” ○ “kitten” → “cat”
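The three steps can be chained into a single analysis function. The sketch below is a toy under stated assumptions: the suffix-stripping `stem` is not a real stemmer, the stop-word and synonym tables are tiny hand-written samples (with "for" and "a" added to the slide's stop words), and lowercasing is applied first so that "The" is dropped as a stop word too, which reorders the steps slightly relative to the slides.

```python
import re

# Assumed sample tables; a real engine ships much larger, configurable ones.
STOP_WORDS = {"the", "and", "to", "for", "a"}
SYNONYMS = {
    "colour": "color",
    "t-shirt": "tshirt",
    "canadian": "canada",
    "kitten": "cat",
}

# Words made of letters or digits, optionally joined by hyphens ("t-shirt").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def stem(token):
    """Strip a few common English suffixes (toy rule set)."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Turn raw text into index terms."""
    tokens = TOKEN_PATTERN.findall(text.lower())         # lowercase + tokenize
    stems = [stem(token) for token in tokens]            # stemming
    kept = [t for t in stems if t not in STOP_WORDS]     # stop words
    return [SYNONYMS.get(term, term) for term in kept]   # synonyms


print(analyze("cat hat"))
print(analyze("Kitten mittens"))
print(analyze("The dog booties"))
```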
  • 55. Quality Problems 1. “cat” search returned “vacation hat for dog”
  • 56. Quality Problems 1. “cat” search returned “vacation hat for dog” id title price 4 vacation hat for dog 12.99 cat [1, 3, 5] hat [4] dog [4] vacation [4]
  • 57. Quality Problems 1. “cat” search returned “vacation hat for dog” cat [1, 3, 5] hat [4] dog [4] vacation [4] cat id title price 4 vacation hat for dog 12.99
  • 58. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens”
  • 59. Quality Problems 2. “cats” search does not return “red cat mittens” id title price 1 red cat mittens 14.99 red [1] cat [1] mitten [1] →
  • 60. All transformations performed on the input data for the index are also performed on the query
  • 61. Quality Problems 2. “cats” search does not return “red cat mittens” id title price 1 red cat mittens 14.99 red [1] cat [1] mitten [1] cats cat
  • 62. Quality Problems 1. “cat” search returned “vacation hat for dog” 2. “cats” search does not return “red cat mittens” 3. “cat” search does not return “kitten mittens”
  • 63. Quality Problems 3. “cat” search does not return “kitten mittens” id title price 7 kitten mittens 11.99 cat [7] mitten [7]
  • 64. Quality Problems 3. “cat” search does not return “kitten mittens” cat [7] mitten [7] id title price 7 kitten mittens 11.99 cat
  • 65. Quality Problems 3 ½ search for “kitten” still returns “kitten mittens” cat [7] mitten [7] id title price 7 kitten mittens 11.99 kitten cat
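All three quality problems go away once the same transformations run on both the indexed titles and the query. The following is a minimal sketch of that idea, assuming a deliberately tiny analyzer (plural stripping, two stop words, one synonym); `analyze`, `build_index`, and `search` are hypothetical helpers, not the API of any particular search engine.

```python
import re
from collections import defaultdict

TITLES = {
    1: "red cat mittens",
    2: "blue dog mittens",
    3: "blue hat for cats",
    4: "vacation hat for dog",
    5: "cat hat",
    6: "red and blue dog hat",
    7: "kitten mittens",
    8: "dog booties",
}
STOP_WORDS = {"and", "for"}   # assumed sample
SYNONYMS = {"kitten": "cat"}  # assumed sample


def analyze(text):
    """Toy analyzer shared by indexing and querying."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token.endswith("ies"):
            token = token[:-3]  # "booties" -> "boot"
        elif token.endswith("s"):
            token = token[:-1]  # "cats" -> "cat"
        if token not in STOP_WORDS:
            terms.append(SYNONYMS.get(token, token))
    return terms


def build_index(titles):
    """Map each analyzed term to the ascending ids of matching documents."""
    index = defaultdict(list)
    for doc_id in sorted(titles):
        for term in analyze(titles[doc_id]):
            if doc_id not in index[term]:
                index[term].append(doc_id)
    return index


INDEX = build_index(TITLES)


def search(query):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return []
    matches = set(INDEX.get(terms[0], []))
    for term in terms[1:]:
        matches &= set(INDEX.get(term, []))
    return sorted(matches)


print(search("cat"))     # whole-word match, so "vacation" is not a hit
print(search("cats"))    # the query is stemmed just like the titles
print(search("kitten"))  # the synonym maps the query onto "cat"
```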
  • 66. Did we solve it? ● Keep it local ✓ and customize it ✓ ● Improve performance ✓ ○ At the expense of space, complexity, and pre-processing effort ● Improve quality of results ✓ ○ By performing special pre-processing steps
  • 67. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools 4. Interesting challenges
  • 68. I want a search engine... do I have to build it myself? @scarletdrive
  • 70. ● Inverted index ● Basic tokenization, normalization, and filters ● Replication, sharding, and distribution ● Caching and warming ● Advanced tokenization, normalization, and filters ● Plugins ● ...and more!
  • 71. Which one should I pick? It doesn’t matter
  • 72. Which one should I pick? ● Most projects work well with either ● Getting configuration right is most important ● Test with your own data, your own queries. References: “Side by Side with Elasticsearch and Solr” by Rafał Kuć and Radu Gheorghe, https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr and https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability; “Solr vs. Elasticsearch” by Kelvin Tan, http://solr-vs-elasticsearch.com/
  • 73. Which one should I pick? Better for advanced customization Easier to learn, faster to start up, better docs ~ ~ WARNING: Toria’s personal opinion ~ ~
  • 74. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges
  • 79. Replication, Sharding, and Distribution 8 shards (A,B,C,D,E,F,G,H) 3 replicas each 6 servers
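One way to picture the numbers on this slide is to spread the 8 × 3 = 24 shard copies over the 6 servers. The sketch below uses a simple round-robin policy, which is purely an assumption for illustration; the slides do not say how the copies were actually laid out, and real engines use their own allocation rules. `place_replicas` is a hypothetical helper.

```python
from collections import defaultdict

SHARDS = list("ABCDEFGH")
REPLICAS_PER_SHARD = 3
SERVER_COUNT = 6


def place_replicas(shards, replicas_per_shard, server_count):
    """Assign shard copies to servers round-robin (hypothetical policy)."""
    if replicas_per_shard > server_count:
        raise ValueError("not enough servers to keep copies of a shard apart")
    placement = defaultdict(list)
    slot = 0
    for shard in shards:
        # Consecutive slots land on different servers, so no server ever
        # holds two copies of the same shard.
        for _ in range(replicas_per_shard):
            placement[slot % server_count].append(shard)
            slot += 1
    return dict(placement)


PLACEMENT = place_replicas(SHARDS, REPLICAS_PER_SHARD, SERVER_COUNT)
for server in sorted(PLACEMENT):
    print(server, PLACEMENT[server])
```

With these numbers each server ends up holding four shard copies, and losing any single server still leaves two copies of every shard it held.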
  • 82. id title price 1 red cat mittens 14.99 3 blue hat for cats 8.00 5 cat hat 5.00 22 feather cat toy 7.99 124 cat and mouse t-shirt 24.50 128 cat t-shirt 31.80 329 “cats rule” sticker 0.99 420 catnip joint for cats 5.99 455 cat toy 7.00 ... ... ... When there are many results, what order should we display them in?
  • 84. TF(term) = # times this term appears in doc / total # terms in doc IDF(term) = log_e(total number of docs / # docs which contain this term) Relevance with tf-idf 1. The orange cat is a very good cat. 2. My cat ate an orange. 3. Cats are the best and I will give every cat a special cat toy. 1. TF(cat) = 2/8 = 0.25 2. TF(cat) = 1/5 = 0.20 3. TF(cat) = 3/14 = 0.21 IDF(cat) = log_e(3/3) Result order = [1, 3, 2] Query: “cat”
  • 85. TF(term) = # times this term appears in doc / total # terms in doc IDF(term) = log_e(total number of docs / # docs which contain this term) Relevance with tf-idf 1. The orange cat is a very good cat. 2. My cat ate an orange. Cat cat cat! 3. Cats are the best and I will give every cat a special cat toy. 1. TF(cat) = 2/8 = 0.25 2. TF(cat) = 4/8 = 0.50 3. TF(cat) = 3/14 = 0.21 IDF(cat) = log_e(3/3) Result order = [2, 1, 3] Query: “cat”
  • 86. TF(term) = # times this term appears in doc / total # terms in doc IDF(term) = log_e(total number of docs / # docs which contain this term) Relevance with tf-idf 1. The orange cat is a good cat. 2. My cat ate an orange. (assume 100 records which all contain “cat” in them) IDF(cat) = log_e(100/100) = 0.0 IDF(orange) = log_e(100/2) = 3.9 Query: “orange cat”
  • 87. TF(term) = # times this term appears in doc / total # terms in doc IDF(term) = log_e(total number of docs / # docs which contain this term) Relevance with tf-idf 1. The orange cat is a good cat. 2. My cat ate an orange. Query: “orange cat” IDF(cat) = log_e(100/100) = 0.0 IDF(orange) = log_e(100/2) = 3.9 score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55 score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78
  • 88. TF(term) = # times this term appears in doc / total # terms in doc IDF(term) = log_e(total number of docs / # docs which contain this term) Relevance with tf-idf 1. The orange cat is a good cat. 2. My cat ate an orange. Result order = [2, 1] Query: “orange cat” IDF(cat) = log_e(100/100) = 0.0 IDF(orange) = log_e(100/2) = 3.9 score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55 score(doc2) = score(cat, doc2) + score(orange, doc2) = 0.20*0.0 + 0.20*3.9 = 0.78 3/7 = 0.43 2/5 = 0.40 1/7 = 0.14 1/5 = 0.20
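The "orange cat" example can be recomputed in a few lines. This is a sketch under assumptions: `terms` is a toy analyzer (lowercase, split on non-letters, strip a trailing "s"), and the document frequencies are the ones the slides assume (100 records, all containing "cat", two containing "orange"). The unrounded scores differ slightly from the slide's arithmetic, which multiplies already rounded factors.

```python
import math
import re


def terms(text):
    """Toy analyzer: lowercase, split on non-letters, strip a trailing 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def tf(term, text):
    """Share of the document's terms that are equal to `term`."""
    doc_terms = terms(text)
    return doc_terms.count(term) / len(doc_terms)


def idf(total_docs, docs_with_term):
    """Natural log of (total docs / docs containing the term)."""
    return math.log(total_docs / docs_with_term)


def score(query, text, total_docs, doc_freq):
    """Sum of tf * idf over the query terms."""
    return sum(
        tf(term, text) * idf(total_docs, doc_freq[term])
        for term in terms(query)
    )


DOC1 = "The orange cat is a good cat."
DOC2 = "My cat ate an orange."
TOTAL_DOCS = 100
DOC_FREQ = {"cat": 100, "orange": 2}  # assumed counts from the slides

SCORES = {
    1: score("orange cat", DOC1, TOTAL_DOCS, DOC_FREQ),
    2: score("orange cat", DOC2, TOTAL_DOCS, DOC_FREQ),
}
RANKING = sorted(SCORES, key=SCORES.get, reverse=True)
print(SCORES)
print(RANKING)
```

Since every record contains "cat", its IDF is zero and the ranking is decided entirely by how prominent "orange" is in each document.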
  • 90. Relevance Challenges ● Prevent keyword stuffing or other “gaming the system” ● Phrase matching ● Fuzzy matching ● User factors: language, location ● Other factors: quality, recency, randomness, diversity
  • 91. Interesting Challenges. We covered: ● Scalability ● Relevance. We did not cover: ● Query understanding ● Numeric range search ● Faceted search ● Autocomplete
  • 92. Agenda 1. Why build search engines? ✓ 2. Search indexes ✓ 3. Open source tools ✓ 4. Interesting challenges ✓