Often when a new user arrives on your website, the first place they go to find information is the search box! Whether they are searching for hotels on your travel site, products on your e-commerce site, or friends to connect with on your social media site, it is important to have fast, effective search in order to engage the user.
6. Agenda
Why Build Search Systems?
Search Indexes
Open Source Tools
Interesting Challenges in Search
8. “Isn’t search a solved problem? We have Google!”
All my friends
Photo by Alissa
loveherbyalissa.etsy.com
9. Google vs. Etsy
Google                       Etsy
Very, very large scope       Medium scope
No control over content      Some control over content
High intent                  Low intent
Optimize for Google users    Optimize for Etsy users
10. Why build search systems?
1. Customize the solution (your users, your data, your algorithms)
11. Database Example
id     description          price
001    red cat mittens      40.00
002    blue mittens         19.99
003    blue hat for cats    12.50
004    cat hat              25.00
005    red and blue hat     30.00
q="cat"
SELECT * FROM items
WHERE description
LIKE '%cat%';
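The query above can be tried end to end with a small sketch. It assumes an in-memory SQLite database and a table named `items` with the three columns from the slide; the `search` helper is a hypothetical name used only for illustration.

```python
import sqlite3

# In-memory copy of the example table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, description TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("001", "red cat mittens", 40.00),
        ("002", "blue mittens", 19.99),
        ("003", "blue hat for cats", 12.50),
        ("004", "cat hat", 25.00),
        ("005", "red and blue hat", 30.00),
    ],
)


def search(query):
    """Return ids of items whose description contains `query` as a substring."""
    pattern = f"%{query}%"  # same shape as LIKE '%cat%'
    rows = conn.execute(
        "SELECT id FROM items WHERE description LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


matching_ids = search("cat")
print(matching_ids)
```

The query is passed as a bound parameter rather than pasted into the SQL string, which keeps user input out of the statement text.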
12. Substring Search
O(n·m)
n = items in database
m = length of string
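A minimal sketch of why the cost grows this way, assuming m is the length of the description being scanned: every one of the n descriptions has to be visited, and each visit is a scan over that string. The names `ITEMS` and `substring_search` are illustrative only.

```python
# The five example descriptions, keyed by id.
ITEMS = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}


def substring_search(items, query):
    """Visit all n items; each check scans a description of length m."""
    matches = []
    for item_id, text in items.items():
        # No index is used, so no item can be skipped.
        if query in text:
            matches.append(item_id)
    return matches


print(substring_search(ITEMS, "cat"))
```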
18. Inverted Index
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
001 red cat mittens
002 blue mittens
003 blue hat for cats
004 cat hat
005 red and blue hat
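With the index above in place, a search becomes a dictionary lookup instead of a scan over every description. The sketch below hard-codes the posting lists from the slide; treating a multi-word query as an AND of its terms is an assumption, and `lookup` and `lookup_all` are hypothetical helper names.

```python
# Posting lists copied from the slide: term -> ids of documents containing it.
INVERTED_INDEX = {
    "red": ["001", "005"],
    "blue": ["002", "003", "005"],
    "cat": ["001", "003", "004"],
    "hat": ["003", "004", "005"],
    "mitten": ["001", "002"],
}


def lookup(term):
    """Return the posting list for one term (empty if the term is unknown)."""
    return INVERTED_INDEX.get(term, [])


def lookup_all(terms):
    """Return ids of documents that contain every term (assumed AND semantics)."""
    if not terms:
        return []
    result = set(lookup(terms[0]))
    for term in terms[1:]:
        result &= set(lookup(term))
    return sorted(result)


cat_ids = lookup("cat")
blue_hat_ids = lookup_all(["blue", "hat"])
print(cat_ids, blue_hat_ids)
```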
23. Terminology
● A document is a single searchable unit
● A field is a defined value in a document
● A term is a value extracted from the source in order to build the inverted index
● An inverted index is an internal data structure that maps terms of a field to document ids
● An index is a collection of documents
Inverted index for the description field:
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
Inverted index for the price field:
12.50 [003]
19.99 [002]
25.00 [004]
30.00 [005]
40.00 [001]
Documents:
id     description        price
001    red cat mittens    40.00
002    blue mittens       19.99
...    ...                ...
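The terminology can be made concrete with a short sketch: each document is a dictionary, each key is a field, and one inverted index is built per field. `build_field_index` and its `to_terms` parameter are hypothetical names, and indexing each price as a single exact-value term is an assumption based on the price listing above.

```python
from collections import defaultdict

# An index is a collection of documents; each document has three fields.
DOCUMENTS = [
    {"id": "001", "description": "red cat mittens", "price": 40.00},
    {"id": "002", "description": "blue mittens", "price": 19.99},
    {"id": "003", "description": "blue hat for cats", "price": 12.50},
    {"id": "004", "description": "cat hat", "price": 25.00},
    {"id": "005", "description": "red and blue hat", "price": 30.00},
]


def build_field_index(documents, field, to_terms):
    """Map each term extracted from `field` to the sorted ids containing it."""
    postings = defaultdict(set)
    for doc in documents:
        for term in to_terms(doc[field]):
            postings[term].add(doc["id"])
    return {term: sorted(postings[term]) for term in sorted(postings)}


# Each price becomes one exact-value term, formatted with two decimals.
price_index = build_field_index(DOCUMENTS, "price", lambda p: [f"{p:.2f}"])
print(price_index)
```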
24. How did we do this?
red [001, 005]
blue [002, 003, 005]
cat [001, 003, 004]
hat [003, 004, 005]
mitten [001, 002]
001 red cat mittens
002 blue mittens
003 blue hat for cats
004 cat hat
005 red and blue hat
29. Building an Inverted Index
For the query “cat”:
● Stemming: ✓ “hat for cats” matches
● Tokenization: ✗ “vacation” does not match
● Synonyms: ✓ “kitten hat” matches
By Ludwinus van den Arend
circuszoo.etsy.com
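The three steps on this slide can be sketched as a small analysis chain. Everything here is a toy under stated assumptions: the stop-word list, the one-entry synonym table, and the plural-stripping stemmer are stand-ins chosen so that the five example documents produce the index shown earlier, not what a production analyzer does.

```python
from collections import defaultdict

STOP_WORDS = {"for", "and"}   # assumption: words absent from the slide's index
SYNONYMS = {"kitten": "cat"}  # assumption: a one-entry synonym table


def tokenize(text):
    """Split text into lowercase whole words."""
    return text.lower().split()


def stem(token):
    """Toy stemmer: strip a plural "s" from words longer than three letters."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def analyze(text):
    """Tokenize, drop stop words, stem, then map synonyms to one term."""
    terms = []
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        token = stem(token)
        terms.append(SYNONYMS.get(token, token))
    return terms


def build_index(documents):
    """Build term -> list of document ids, in document order."""
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        for term in dict.fromkeys(analyze(text)):  # de-duplicate, keep order
            postings[term].append(doc_id)
    return dict(postings)


DOCUMENTS = {
    "001": "red cat mittens",
    "002": "blue mittens",
    "003": "blue hat for cats",
    "004": "cat hat",
    "005": "red and blue hat",
}
index = build_index(DOCUMENTS)
print(index)
```

Because terms are whole tokens, “vacation” is indexed as `vacation` and no longer matches a search for `cat`, which the substring query would have matched.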
33. Inverted index example
1. “Carlos Vives is the greatest singer alive”
2. “Shakira is the best dancer in the world”
3. “Sophía Vergara is the most famous Colombian in the United States”
carlos=[1]
vives=[1]
is=[1,2,3]
the=[1,2,3]
great=[1]
singer=[1]
alive=[1]
shakira=[2]
best=[2]
dancer=[2]
in=[2,3]
world=[2]
sophia=[3]
vergara=[3]
most=[3]
famous=[3]
colombia=[3]
unite=[3]
states=[3]
34. Did we solve it?
✓ Customize the solution (your users, your data, your algorithms)
✓ Improve performance
✓ Improve quality of results
35. Agenda
✓ Why Build Search Systems?
✓ Search Indexes
Open Source Tools
Interesting Challenges in Search
40. It Doesn’t Matter
● Most projects work well with either Elasticsearch or Solr
● Getting configuration right is more important
● Test with your own data and your own queries
Source
Side by Side with Elasticsearch and Solr
By Rafał Kuć and Radu Gheorghe
https://berlinbuzzwords.de/14/session/side-side-elasticsearch-and-solr
https://berlinbuzzwords.de/15/session/side-side-elasticsearch-solr-part-2-performance-scalability
See also
http://solr-vs-elasticsearch.com/
By Kelvin Tan
59. TF-IDF
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give every cat a special cat toy
TF(cat) in document 1 = 2/8 = 0.25
TF(cat) in document 2 = 1/5 = 0.20
TF(cat) in document 3 = 3/14 ≈ 0.21
IDF(cat) = log_e(3/3) = 0
IDF(cat) is the same for every document, so the order follows TF:
“cat” → [1, 3, 2]
60. TF-IDF
TF(term) = # times this term appears in doc / total # terms in doc
IDF(term) = log_e(total number of docs / # docs which contain this term)
1. The orange cat is a very good cat
2. My cat ate an orange
3. Cats are the best and I will give every cat a special cat toy cat cat cat cat cat
TF(cat) in document 1 = 2/8 = 0.25
TF(cat) in document 2 = 1/5 = 0.20
TF(cat) in document 3 = 8/19 ≈ 0.42
IDF(cat) = log_e(3/3) = 0
Repeating “cat” five more times in document 3 raises its TF and moves it to first place:
“cat” → [3, 1, 2]
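The numbers on the two TF-IDF slides can be reproduced with a short sketch. It assumes whitespace tokenization, lowercasing, and a toy stemmer that turns “Cats” into “cat”; the function names are illustrative. Because IDF is identical for every document in a single-term query, the sketch ranks by TF alone, as the slides do.

```python
import math
from fractions import Fraction


def terms_of(text):
    """Lowercase, split on whitespace, strip a plural "s" (toy stemmer)."""
    tokens = text.lower().split()
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def tf(term, text):
    """Occurrences of `term` divided by the total number of terms in the doc."""
    terms = terms_of(text)
    return Fraction(terms.count(term), len(terms))


def idf(term, docs):
    """Natural log of (number of docs / docs containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for text in docs.values() if term in terms_of(text))
    return math.log(len(docs) / containing)


def rank_by_tf(term, docs):
    """Document ids ordered from highest to lowest TF for `term`."""
    return sorted(docs, key=lambda doc_id: tf(term, docs[doc_id]), reverse=True)


DOCS = {
    1: "The orange cat is a very good cat",
    2: "My cat ate an orange",
    3: "Cats are the best and I will give every cat a special cat toy",
}
# Same corpus with "cat" appended five times to document 3.
STUFFED = dict(DOCS)
STUFFED[3] = DOCS[3] + " cat cat cat cat cat"

print(rank_by_tf("cat", DOCS), rank_by_tf("cat", STUFFED), idf("cat", DOCS))
```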
67. Query Understanding
● Tokenization and stemming
● Language identification
● Spelling correction
● Query rewriting (scoping, expansion, relaxation)
For more information
http://queryunderstanding.com/
By Daniel Tunkelang
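A few of these steps can be sketched on the mittens example. This is only an illustration under assumptions: the vocabulary and synonym table are hypothetical, and Python's `difflib.get_close_matches` stands in for a real spelling corrector. Language identification, scoping, and relaxation are not shown.

```python
from difflib import get_close_matches

VOCABULARY = ["red", "blue", "cat", "hat", "mitten"]  # terms in the toy index
SYNONYMS = {"kitten": "cat"}                          # assumed expansion table


def stem(token):
    """Toy stemmer: strip a plural "s" from words longer than three letters."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def understand(query):
    """Tokenize and stem, apply synonyms, then correct unknown terms."""
    terms = []
    for token in query.lower().split():
        token = stem(token)
        token = SYNONYMS.get(token, token)
        if token not in VOCABULARY:
            # Pick the closest known term, if one is similar enough.
            close = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            if close:
                token = close[0]
        terms.append(token)
    return terms


print(understand("kitten mitens"))
```

Synonyms are applied before spelling correction on purpose: “kitten” is close enough to “mitten” that correcting first would rewrite a valid word.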
70. How Etsy Uses Thermodynamics to Help You Search for “Geeky” by Fiona Condon
http://codeascraft.com/2015/08/31/how-etsy-uses-thermodynamics-to-help-you-search-for-geeky
72. Agenda
✓ Why Build Search Systems?
✓ Search Indexes
✓ Open Source Tools
✓ Interesting Challenges in Search
74. We Covered / We Did Not Cover
● Stemming
● Tokenization
● Synonyms
● Replication, distribution, and sharding
● Ranking for relevance
● Query understanding
● Faceting
● Field data
● Internationalization
● Spelling correction
● Autocomplete suggestions