A quick overview of Elasticsearch usage at Dailymotion for video search
Talk given at Elasticsearch Meetup France #7
June 10, 2014
http://www.meetup.com/elasticsearchfr/events/171946592/
4. search cluster
cluster and indexation
> 10 nodes for video search
_ one main video index
_ 5 shards with 1 replica
_ nodes : 32 cores, 48 gb RAM, 15k disks
> 600 to 1000 search requests per second
> end-to-end response time < 40 ms
6. search cluster
cluster and indexation
elasticsearch cluster
10 nodes
mysql
farm
elasticsearch
indexer
2 nodes
index
constantly
filter data for
search
update if hash
changed
hash data
9. search
query and score
Scorer
> custom scorer
_ only slightly alter the query score
_ take into account: recency, popularity, etc.
> boosted filters and scripts when testing
> native java for performance
10. search
query and score
Query
> we need to keep control of the query base score
> problem is our text content is thin
_ short title, a few tags
_ a more or less relevant description
> bare bones TF-IDF may not be suitable
_ TF not that relevant to us
11. search
query and score
> BM25: reduce importance of
document length
> why common terms query
_ increase performance
_ ignore popular terms when searching
_ but still use them for scoring
_ like a real time specialized stop words list
similarity:
my_bm25:
type: BM25
b: 0.001
12. > ignore inexistent terms in query
> boost repeated terms (TF) only if repeated in query
a doc titled “A A A game” has a better score than “A game”
only when explicitly searching for “A A A”
> boost term by position in query and documents
search
query and score
brown fox zerzer brown fox zerzer
the quick brown fox jumps.
^1.1 ^1.07 ^1.05 ^1.03 ^1.02
13. > keep both stemmed and original terms
> score them with dis_max (tie_breaker = 0)
> disable coord factor for consistent scoring
I like dogs
i like dogs dog token
1 2 3 position
search
query and score
"field": "dogs"
"dis_max": {
"tie_breaker" : 0,
"queries" : [
{ "term": { "field": "dogs" } },
{ "term": { "field": "dog" } } ...
15. sharding
what suits us
> less shards make a query slower
> but not 16 times slower (112 ms vs 12 ms)
index / 16 shards index / 1 shard
1 ms request handling 1 ms request handling
10 ms
shard 0
(9ms)
shard 1
(10ms)
shard 2
(6ms)
shard 3
(9ms)
110 ms shard 0
shard 4
(10ms)
shard 5
(10ms)
shard 6
(9ms)
shard 7
(10ms)
shard 8
(10ms)
shard 9
(8ms)
shard 10
(5ms)
shard 11
(10ms)
shard 12
(7ms)
shard 13
(7ms)
shard 14
(10ms)
shard 15
(10ms)
1 ms return result 1 ms return result
12 ms 112 ms
16. sharding
what suits us
> takes more resources
> everything runs at 100 % for each query
> less requests per second for the same hardware
9 ms
+ 10 ms
+ 6 ms
+ (…)
=
140 ms
shard 0
(9ms)
shard 1
(10ms)
shard 2
(6ms)
shard 3
(9ms)
110 ms shard 0
shard 4
(10ms)
shard 5
(10ms)
shard 6
(9ms)
shard 7
(10ms)
shard 8
(10ms)
shard 9
(8ms)
shard 10
(5ms)
shard 11
(10ms)
shard 12
(7ms)
shard 13
(7ms)
shard 14
(10ms)
shard 15
(10ms)
140 ms spent by the shards 110 ms spent
17. sharding
what suits us
Before
> we used to have 40 shards on 18 nodes
_ ~2 millions docs per shard
_ 3 gb by shards
_ ~ 120 gb total index size
> cluster was very loaded
_ every single query was hitting all the nodes
_ response times could have been better
18. sharding
what suits us
After
> we now have 5 shards on 10 nodes
> cluster run smoother, less load
_ only 5 nodes involved per query
_ it handles many times more requests
19. sharding
what suits us
less data!
_ ~10 millions docs per shard
_ 4 gb by shards
_ ~ 25 gb total index size
> only data we need right now
_ { "_source" : false }
_ round numbers and dates
_ { "precision_step" : 2147483647 }
> less updates, faster indexation, rebalance, merges...
20. sharding
what suits us
drawbacks
> queries taken individually are slower…
> but only marginally slower
_ eg: 7 ms instead of 5 ms
> but some slower queries became more noticeable
22. how do we test
benchmarks, tuning
load test
_ benchmark with Tsung
_ dedicated test cluster
_ run real queries, lots of them
_ aim for our expected load
_ monitor everything
_ reshard, change schema
_ set masters, data-only nodes...
repeat
23. how do we test
benchmarks, tuning
use warmers
> warm segments after each merge
_ prevent slow first queries
> set it up to build cache for the filters we use
> zero reasons for not using them
25. how do we test
benchmarks, tuning
query testing
> to test a particular query raw performance
_ one index, one shard
_ millions of simple documents
_ merged in one segment
_ with some deletes
26. ?
how do we test
benchmarks, tuning
{
"query": {
"filtered": {
"query": {
"match": { "title": "some very popular terms" }
},
"filter": {
"term": { "user": "cedric" }
}
}
}
}
27. !
how do we test
benchmarks, tuning
{
"query": {
"filtered": {
"strategy": "leap_frog",
"query": {
"match": { "title": "some very popular terms" }
},
"filter": {
"term": { "user": "cedric" }
}
}
}
}
28. how do we test
benchmarks, tuning
> we also use Elasticsearch to just filter and sort
> these queries match millions of documents
_ they are slow
_ even when terms are cached
_ iterating, scoring and sorting is tedious
29. how do we test
benchmarks, tuning
query
"sort": { "created": "desc" },
"query": {
"bool": {
"must": [
{ "term": { "public": true } }
]
}
}
30. how do we test
benchmarks, tuning
query result
"sort": { "created": "desc" },
"query": {
"bool": {
"must": [
{ "term": { "public": true } }
]
}
}
"took": 695
"hits": {
"total": 79582599
}
31. how do we test
benchmarks, tuning
> we know our data
> we can help our query
32. how do we test
benchmarks, tuning
query
"sort": { "created": "desc" },
"query": {
"bool": {
"must": [
{ "term": { "public": true } },
{ "range": {
"created": {
"from": "2014-06-03"
}
} }
]
...
> with a range filter
on the sorted field
33. how do we test
benchmarks, tuning
query result
"sort": { "created": "desc" },
"query": {
"bool": {
"must": [
{ "term": { "public": true } },
{ "range": {
"created": {
"from": "2014-06-03"
}
} }
]
...
"took": 15
"hits": {
"total": 92312
}
// Same top docs
// returned
34. how do we test
benchmarks, tuning
> what if there are not enough hits?
_ re-run the query without the filter
> we use a custom query to do just that!
_ breaks once it matches enough hits
_ runs at segment level
_ no round-trips
35. "sort": { "created": "desc" },
"query": {
"break_once": {
"minimum_hits": 100,
"query": { { "term": { "public": true } },
"filters": [
{ "range": { "created": { "from": "2014-06-03" } } },
{ "range": { "created": { "from": "2014-05-03” } } },
{ "range": { "created": { "from": "2014-01-03" } } }
]
}
}
how do we test
benchmarks, tuning
stop once
there are
enough
hits
36.
37. thank you
> index only what you need now
> shard today, reshard tomorrow
> benchmark to find what suits you best
> test and optimize your queries