Search Engines: How They Work and Why You Need Them

Agenda
1. Why build search engines?
2. Search indexes
3. Open source tools
4. Interesting challenges

What do you
even do all day?
We have Google.
@scarletdrive

Not all search engines are
web search engines.

google.com                  potatoparcel.com
Large scope                 Small scope
(entire internet)           (just a few potatoes)
No control over content     Total control over content
Many use cases              Optimize for selling potatoes

Most websites have a
custom search engine.

Why build search engines?
● Keep it local and customize it

Let’s try to
search my store.

id title price
1 red cat mittens 14.99
2 blue dog mittens 24.99
3 blue hat for cats 8.00
4 vacation hat for dog 12.99
5 cat hat 5.00
6 red and blue dog hat 10.49
7 kitten mittens 11.99
8 dog booties 11.99

Search: cat
SELECT *
FROM items
WHERE title LIKE '%cat%'

n = items in database
m = max length of title strings
Cost of the scan: O(n·m)

n = items in database
m = max length of title strings = 250
With m fixed, the cost is O(n)

n n · m (m=250)
10 2 500
100 25 000
1 000 250 000
10 000 2 500 000
100 000 25 000 000
1 000 000 250 000 000

● Improve performance

● Search for “cat” incorrectly
returns “vacation hat for dog”
(“cat” is a substring of “vacation”)
SELECT *
FROM items
WHERE title LIKE '%cat%'

● Search for “cat” doesn’t return
“kitten mittens”
● Search for “cats” doesn’t return
“cat hat” or “red cat mittens”
SELECT *
FROM items
WHERE title LIKE '%cats%'
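The quality problems listed above can be reproduced with a few lines of Python. This is only a sketch: the hypothetical `like_search` helper mimics `LIKE '%…%'` with a plain substring test over the eight example titles, and it is not how a database executes the query.

```python
# Items from the example store: (id, title).
ITEMS = [
    (1, "red cat mittens"),
    (2, "blue dog mittens"),
    (3, "blue hat for cats"),
    (4, "vacation hat for dog"),
    (5, "cat hat"),
    (6, "red and blue dog hat"),
    (7, "kitten mittens"),
    (8, "dog booties"),
]


def like_search(query):
    """Mimic WHERE title LIKE '%query%': scan every title for the substring."""
    return [item_id for item_id, title in ITEMS if query in title]


print(like_search("cat"))   # [1, 3, 4, 5]  item 4 matches via "vacation", item 7 is missed
print(like_search("cats"))  # [3]           items 1 and 5 are missed
```

Every call scans all n titles, which is the O(n·m) cost discussed earlier.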

SELECT * FROM items
WHERE title LIKE 'cat' OR title LIKE 'cats'
OR title LIKE 'cat %' OR title LIKE 'cats %'
OR title LIKE '% cat' OR title LIKE '% cats'
OR title LIKE '% cat %' OR title LIKE '% cats %'
OR title LIKE '% cat.%' OR title LIKE '% cats.%'
OR title LIKE '%.cat %' OR title LIKE '%.cats %'
OR title LIKE '%.cat.%' OR title LIKE '%.cats.%'
OR title LIKE '% cat,%' OR title LIKE '% cats,%'
OR title LIKE '%,cat %' OR title LIKE '%,cats %'
OR title LIKE '%,cat,%' OR title LIKE '%,cats,%'
OR title LIKE '% cat-%' OR title LIKE '% cats-%'
OR title LIKE '%-cat %' OR title LIKE '%-cats %'
OR title LIKE '%-cat-%' OR title LIKE '%-cats-%'
...

● Improve quality of results

Agenda
1. Why build search engines? ✓
2. Search indexes

Inverted Index
term      item IDs
red       [1, 6]
cat       [1, 3, 5]
mitten    [1, 2, 7]
blue      [2, 3, 6]
hat       [3, 4, 5, 6]
dog       [2, 4, 6, 8]
vacation  [4]
kitten    [7]
boot      [8]

Terminology
● A document is a single searchable unit
(here, one item in the store)

Terminology
● A field is a defined value in a document
Fields in the example: id, title, price

Terminology
● A term is a value extracted from the
source in order to build the index
Terms from the title field: red, cat, mitten, blue, hat, dog, vacation, kitten, boot

Terminology
● An inverted index is an internal data
structure which maps terms to IDs
● An index is a collection of documents
(including many inverted indexes)
title inverted index    price inverted index
red     [1, 6]          5.00     [5]
cat     [1, 3, 5]       8.00     [3]
mitten  [1, 2, 7]       0-10.00  [3, 5]
blue    [2, 3, 6]       11.99    [7, 8]
...     ...             ...      ...
documents
id   title  price
...  ...    ...

Terminology
● A search index can have
many inverted indexes
● A search engine can have
many search indexes
items index
    title inverted index
    price inverted index
blog-posts index
    title inverted index
    post inverted index
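The nesting described above (search engine → search indexes → inverted indexes) could be pictured as nested dictionaries. This is a minimal sketch for illustration only. The `SEARCH_ENGINE` structure and `lookup` helper are hypothetical names, the entries are a subset of the ones shown in the slides, and real engines store this very differently.

```python
# search engine -> search indexes -> one inverted index per field
SEARCH_ENGINE = {
    "items": {
        "title": {"red": [1, 6], "cat": [1, 3, 5], "blue": [2, 3, 6]},
        "price": {"5.00": [5], "8.00": [3], "11.99": [7, 8]},
    },
    "blog-posts": {
        "title": {},  # contents are not shown in the slides
        "post": {},
    },
}


def lookup(index_name, field, term):
    """Return the document IDs listed for `term` in one field's inverted index."""
    return SEARCH_ENGINE[index_name][field].get(term, [])


print(lookup("items", "title", "cat"))    # [1, 3, 5]
print(lookup("items", "price", "11.99"))  # [7, 8]
```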

Did we solve it?
● Keep it local ✓ and customize it

Query: cat
cat → [1, 3, 5]
id title price
1 red cat mittens 14.99
3 blue hat for cats 8.00
5 cat hat 5.00

r = number of results found
O(1+r)

...but we usually only ask for a fixed
number of results at a time
O(25) → O(1)
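A rough sketch of why the lookup cost depends on the number of results rather than the number of items: the term is found with one dictionary access, and only the first page of IDs is read. The `INVERTED_INDEX` literal copies a few entries from the slides. The `lookup` function and its `limit` parameter are assumptions made for illustration.

```python
from itertools import islice

# A few entries of the inverted index from the slides (term -> item IDs).
INVERTED_INDEX = {
    "cat": [1, 3, 5],
    "hat": [3, 4, 5, 6],
    "dog": [2, 4, 6, 8],
}


def lookup(term, limit=25):
    """Find the term in O(1) on average, then read at most `limit` IDs."""
    postings = INVERTED_INDEX.get(term, [])
    return list(islice(postings, limit))


print(lookup("cat"))           # [1, 3, 5]
print(lookup("dog", limit=2))  # [2, 4]
print(lookup("potato"))        # []
```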

Did we solve it?
● Improve performance ✓

But at
what cost?

Trade-offs
● Space
● System complexity
● Pre-processing time

Query time: O(1)
Index time: O(n·m·p)

Did we solve it?
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort

Let’s talk about
how we build it.

How did we get from the items table to the inverted index?

Step 1: Tokenization
string: “cat hat”
array: [“cat”, “hat”]
Image from aliexpress.com
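The tokenization step could be sketched as below. The `tokenize` function is a hypothetical helper, and splitting on anything that is not a letter or digit is an assumption; real tokenizers have many more rules.

```python
import re


def tokenize(text):
    """Split a string into tokens on anything that is not a letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)


print(tokenize("cat hat"))                # ['cat', 'hat']
print(tokenize("red-and-blue dog hat."))  # ['red', 'and', 'blue', 'dog', 'hat']
```

Note that this simple rule would also split “t-shirt” into “t” and “shirt”, which is one reason search engines offer more advanced tokenizers.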

Step 2: Normalization
● Stemming
○ “cats” → “cat”
○ “walking” → “walk”
● Stop words
○ Remove “the”, “and”, “to”, etc...

Step 3: Filters
● Lowercase
○ “Dog” → “dog”
● Synonyms
○ “colour” → “color”
○ “t-shirt” → “tshirt”
○ “canadian” → “canada”
○ “kitten” → “cat”
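Putting the three steps together, an index build might look roughly like this. Everything here is a toy sketch under stated assumptions: the `stem` function only strips the suffixes this example needs, “for” is assumed to be a stop word, and lowercasing is done before stemming for simplicity (the slides list it under filters).

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "to", "for"}  # "for" is an assumption
SYNONYMS = {"kitten": "cat", "colour": "color"}

ITEMS = [
    (1, "red cat mittens"), (2, "blue dog mittens"),
    (3, "blue hat for cats"), (4, "vacation hat for dog"),
    (5, "cat hat"), (6, "red and blue dog hat"),
    (7, "kitten mittens"), (8, "dog booties"),
]


def stem(token):
    """Toy stemmer: strips only the suffixes needed for this example."""
    for suffix in ("ies", "ing", "s"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    tokens = [stem(t) for t in tokens]                   # stemming
    return [SYNONYMS.get(t, t) for t in tokens]          # synonyms


def build_index(items):
    """Map each term to the sorted list of item IDs whose title contains it."""
    index = defaultdict(list)
    for item_id, title in items:
        for term in sorted(set(analyze(title))):
            index[term].append(item_id)
    return dict(index)


index = build_index(ITEMS)
print(index["cat"])       # [1, 3, 5, 7]  item 7 arrives via the kitten synonym
print(index["boot"])      # [8]
print("kitten" in index)  # False
```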

Quality Problems
1. “cat” search returned “vacation hat for dog”

Quality Problems
Item 4 (“vacation hat for dog”) is indexed under whole terms:
vacation [4], hat [4], dog [4]
Query “cat” → cat [1, 3, 5], so item 4 is no longer returned

Quality Problems
2. “cats” search does not return “red cat mittens”

Quality Problems
Item 1 (“red cat mittens”) → red [1], cat [1], mitten [1]

All transformations performed on
the input data for the index
are also performed on the query

Quality Problems
Query “cats” → “cat”
cat [1] matches item 1 (“red cat mittens”)
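A small sketch of the rule that the query goes through the same transformations as the indexed data. The `INDEX` literal copies a few entries from the slides. The `analyze_query` helper is hypothetical and handles only a single-word query with lowercasing and a toy plural rule.

```python
# A few entries of the inverted index from the slides (term -> item IDs).
INDEX = {"red": [1, 6], "cat": [1, 3, 5], "hat": [3, 4, 5, 6]}


def analyze_query(query):
    """Apply the same toy normalization used at index time to a one-word query."""
    token = query.lower()
    if token.endswith("s"):
        token = token[:-1]  # "cats" -> "cat"
    return token


def search_raw(query):
    """Look up the query exactly as typed."""
    return INDEX.get(query, [])


def search(query):
    """Normalize the query first, then look it up."""
    return INDEX.get(analyze_query(query), [])


print(search_raw("cats"))  # []
print(search("cats"))      # [1, 3, 5]
print(search("Cat"))       # [1, 3, 5]
```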

Quality Problems
3. “cat” search does not return “kitten mittens”

Quality Problems
Item 7 (“kitten mittens”) → cat [7], mitten [7] (synonym: “kitten” → “cat”)
Query “cat” → cat [7] matches item 7

Quality Problems
3 ½. Search for “kitten” still returns “kitten mittens”
Query “kitten” → “cat”
cat [7] matches item 7

Did we solve it?
● Keep it local ✓ and customize it ✓
● Improve performance ✓
○ At the expense of space, complexity, and pre-processing effort
● Improve quality of results ✓
○ By performing special pre-processing steps

Agenda
2. Search indexes ✓

I want a search engine...
do I have to build it myself?

● Inverted index
● Basic tokenization, normalization, and filters
● Replication, sharding, and distribution
● Caching and warming
● Advanced tokenization, normalization, and filters
● Plugins
● ...and more!

Which one should I pick?
It doesn’t matter

● Most projects work well with either
● Getting configuration right is most important
● Test with your own data, your own queries
Side by Side with Elasticsearch and Solr by Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
Solr vs. Elasticsearch by Kelvin Tan
http://solr-vs-elasticsearch.com/

Better for advanced
customization
Easier to learn, faster to
start up, better docs
~ ~ WARNING: Toria’s personal opinion ~ ~

Agenda
3. Open source tools ✓

Interesting Challenge:
Scalability

Too much traffic?
Replication

Too much data?
Sharding
Distribution

Replication, Sharding, and Distribution
8 shards
(A,B,C,D,E,F,G,H)
3 replicas each
6 servers
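One way the layout above could be arrived at is sketched below. The slides do not say how documents are routed or how copies are placed, so both rules are assumptions: documents are routed by a CRC32 hash of their ID, “3 replicas each” is read as three copies of every shard, and copies are dealt out round-robin. The names `shard_for` and `place_copies` are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 8
NUM_REPLICAS = 3  # assumed to mean three copies of each shard in total
NUM_SERVERS = 6


def shard_for(doc_id):
    """Pick a shard by hashing the document ID (stable across runs)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % NUM_SHARDS


def place_copies():
    """Deal all 24 shard copies out to the servers round-robin."""
    placement = defaultdict(list)  # server -> list of (shard, replica)
    for replica in range(NUM_REPLICAS):
        for shard in range(NUM_SHARDS):
            copy_number = replica * NUM_SHARDS + shard
            placement[copy_number % NUM_SERVERS].append((shard, replica))
    return dict(placement)


placement = place_copies()
for server, copies in sorted(placement.items()):
    print(server, copies)
```

With these numbers each server ends up with 4 shard copies, and the three copies of any one shard land on three different servers, so losing one server never removes every copy of a shard.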

Interesting Challenge:
Relevance

id title price
5 cat hat 5.00
22 feather cat toy 7.99
124 cat and mouse t-shirt 24.50
128 cat t-shirt 31.80
329 “cats rule” sticker 0.99
420 catnip joint for cats 5.99
455 cat toy 7.00
... ... ...
When there are
many results, what
order should we
display them in?

Relevance with tf-idf
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total # docs / # docs which contain this term)
Query: “cat”
1. The orange cat is a very good cat.
2. My cat ate an orange.
3. Cats are the best and I will give
every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 1/5 = 0.20
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0.0
Result order = [1, 3, 2]
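The term-frequency numbers above can be checked with a short script. This is a sketch with assumptions: the `terms` helper lowercases, splits on non-letters, and uses a one-entry stemming table (`STEMS`) so that “Cats” counts as “cat”.

```python
import math
import re

DOCS = {
    1: "The orange cat is a very good cat.",
    2: "My cat ate an orange.",
    3: "Cats are the best and I will give every cat a special cat toy.",
}
STEMS = {"cats": "cat"}  # toy stemming, just enough for this example


def terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(t, t) for t in tokens]


def tf(term, text):
    """Times the term appears in the doc / total number of terms in the doc."""
    doc_terms = terms(text)
    return doc_terms.count(term) / len(doc_terms)


def idf(term):
    """Natural log of (total docs / docs that contain the term)."""
    containing = sum(1 for text in DOCS.values() if term in terms(text))
    return math.log(len(DOCS) / containing)


for doc_id, text in DOCS.items():
    print(doc_id, round(tf("cat", text), 2))  # 1 0.25, 2 0.2, 3 0.21

print(idf("cat"))  # 0.0
ranking = sorted(DOCS, key=lambda doc_id: tf("cat", DOCS[doc_id]), reverse=True)
print(ranking)     # [1, 3, 2]
```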

Query: “cat”
1. The orange cat is a very good cat.
2. My cat ate an orange. Cat cat cat!
3. Cats are the best and I will give
every cat a special cat toy.
1. TF(cat) = 2/8 = 0.25
2. TF(cat) = 4/8 = 0.50
3. TF(cat) = 3/14 = 0.21
IDF(cat) = log_e(3/3) = 0.0
Result order = [2, 1, 3]

Query: “orange cat”
1. The orange cat is a good cat.
(assume 100 records which all contain
“cat” in them)
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9

Query: “orange cat”
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
score(doc1) = score(cat, doc1) + score(orange, doc1) = 0.29*0.0 + 0.14*3.9 = 0.55

Query: “orange cat”
IDF(cat) = log_e(100/100) = 0.0
IDF(orange) = log_e(100/2) = 3.9
doc  TF(cat) + TF(orange)  TF(orange) only (IDF(cat) = 0.0)
1    3/7 = 0.43            1/7 = 0.14
2    2/5 = 0.40            1/5 = 0.20
Result order = [2, 1]
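The “orange cat” ranking can be reproduced with the sketch below. Two assumptions are made: document 2 is taken to be “My cat ate an orange.” from the earlier example (its 5 terms match the 2/5 and 1/5 figures), and the document frequencies are written in directly (`DOC_FREQUENCY`) instead of building 100 records.

```python
import math
import re

TOTAL_DOCS = 100
DOC_FREQUENCY = {"cat": 100, "orange": 2}  # number of docs containing each term
DOCS = {
    1: "The orange cat is a good cat.",
    2: "My cat ate an orange.",  # assumed to be doc 2 of the earlier example
}


def terms(text):
    return re.findall(r"[a-z]+", text.lower())


def tf(term, text):
    doc_terms = terms(text)
    return doc_terms.count(term) / len(doc_terms)


def idf(term):
    return math.log(TOTAL_DOCS / DOC_FREQUENCY[term])


def score(query, text):
    """Sum of tf * idf over the query terms."""
    return sum(tf(term, text) * idf(term) for term in terms(query))


scores = {doc_id: score("orange cat", text) for doc_id, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# Unrounded, doc 1 scores about 0.56; the slide's 0.55 comes from
# multiplying the rounded values 0.14 and 3.9.
print({doc_id: round(s, 2) for doc_id, s in scores.items()})  # {1: 0.56, 2: 0.78}
print(ranking)  # [2, 1]
```

Because every record contains “cat”, its IDF is 0 and only “orange” contributes, which is why the shorter document 2 moves ahead of document 1.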

tf-idf
bm25
https://elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables

Relevance Challenges
● Prevent keyword stuffing or other “gaming the system”
● Phrase matching
● Fuzzy matching
● User factors: language, location
● Other factors: quality, recency, randomness, diversity

Interesting Challenges
We covered:
● Scalability
● Relevance
We did not cover:
● Query understanding
● Numeric range search
● Faceted search
● Autocomplete

Agenda
3. Open source tools ✓
4. Interesting challenges ✓
