2. agenda
part 1 Orange French search engine
part 2 why Elasticsearch?
part 3 conclusion
Orange French search engines and Elasticsearch
3. Orange search engine
4 million
~1 million
8 bn docs
FR
80 persons
~1000 servers
3 datacenters
4. search engine response page
§ one response page…
§ with a lot of data sources
§ and a lot of engines
6. web search and web graph
taken from Wikipedia
7. volume
§ vertical search engines
– 10m documents
– 5 engines in 2014
§ web graph
– 8bn urls
– 2bn internal vertices
– 6bn leaf vertices
– 100bn edges
10GB
13TB
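The web graph figures above can be cross-checked with a little arithmetic. This is only a sketch: the variable names are illustrative, and the per-internal-vertex ratio assumes that leaf vertices carry no outgoing edges, which the slides do not state.

```python
# Figures quoted on the "volume" slide.
internal_vertices = 2_000_000_000
leaf_vertices = 6_000_000_000
edges = 100_000_000_000

# Internal and leaf vertices together should account for every URL.
urls = internal_vertices + leaf_vertices

# Average number of edges per URL in the graph.
edges_per_url = edges / urls

# Assumption (not from the slides): only internal vertices have outgoing edges.
edges_per_internal_vertex = edges / internal_vertices

print(urls, edges_per_url, edges_per_internal_vertex)
```

The sum matches the 8bn urls quoted on the slide.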
8. agenda
part 1 Orange French search engine
part 2 why Elasticsearch?
part 3 conclusion
9. our needs
§ vertical search engines
– adopt one common technology
– lower maintenance cost
– prepare future needs
§ web graph
– gain insight on large dataset
– build analysis and visualization
– test new technology with large volume
10. Elasticsearch responses
§ rest interface
§ near real time distributed indexing and distributed search
§ native full text search
– with a lot of different queries and wildcards
§ facets… oops! aggregations!
– values distribution on a specific criterion
§ interactive mode while exploring a dataset
– short query response time
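As an illustration of the "values distribution on a specific criterion" point, here is a minimal sketch of a search body that combines a full-text query with a terms aggregation. The field names (`title`, `domain`) and the helper function are hypothetical. The body would be sent through the REST interface to an index's `_search` endpoint. Aggregations replaced facets from Elasticsearch 1.0 onward, so this would not apply as-is to v0.90.

```python
import json

def terms_aggregation_query(text, field, agg_field, size=10):
    """Build a search body: full-text match plus a terms aggregation.

    `field` and `agg_field` are hypothetical document fields.
    """
    return {
        # Full-text search on one field.
        "query": {"match": {field: text}},
        # Distribution of values of `agg_field` among the matching documents.
        "aggs": {
            "by_" + agg_field: {"terms": {"field": agg_field, "size": size}}
        },
    }

body = terms_aggregation_query("elasticsearch", "title", "domain")
print(json.dumps(body, indent=2))
```

The response would carry one bucket per distinct value of the aggregated field, each with a document count, which is what makes interactive exploration of a dataset practical.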
11. hardware architecture
(diagram: Elasticsearch cluster made of "store" nodes; multiplicity labels: x30, x30)
12. indexing with ES v0.90
§ performance
– starting at 160 doc/s (1 injector, 4 ES 2cpus, 4GB)
– with bulk 1000: 920 doc/s (1 injector, 4 ES 2cpus, 4GB)
– 3 injectors: 570 doc/s * 3 = 1700 doc/s
– 1 injector, 30 ES (8cpus, 16GB): 1700 doc/s
– 30 injectors, 30 ES (8cpus, 16GB): 32,000 doc/s
– 30 injectors, 60 ES (http-data) (8cpus, 16GB): 36,000 doc/s
– 240 injectors, 60 ES (http-data) (8cpus, 16GB): 75,000 doc/s, then 43,000 doc/s
– 1bn docs in 5h (55,000 doc/s)
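The jump from 160 to 920 doc/s comes from grouping documents into bulk requests of 1000. The sketch below shows one way an injector could build such requests. It is not the authors' injector: the index name, type name, and helper functions are hypothetical. The body follows the newline-delimited format of the bulk API (one action line, then one source line per document).

```python
import json

def batches(items, size=1000):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def bulk_body(index, docs, doc_type="doc"):
    """Serialize (doc_id, source) pairs into a bulk request body.

    `_type` is included because old releases such as v0.90 require it;
    recent releases no longer use it.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"

# Hypothetical documents: 2500 URLs split into bulks of 1000.
docs = [(str(i), {"url": "http://example.fr/%d" % i}) for i in range(2500)]
bodies = [bulk_body("webgraph", batch) for batch in batches(docs, 1000)]
print(len(bodies), "bulk requests")
```

Each body would then be sent to the `_bulk` endpoint. Running several injectors in parallel, as in the measurements above, amounts to distributing these batches across processes or machines.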
13. hardware architecture
(diagram: Elasticsearch cluster with nodes labelled "store" and "http", plus "data" nodes; multiplicity labels: x30, x30)
14. number of shards
(chart: time in seconds, 0 to 1200, versus number of shards, 0 to 30; annotated point: 321 sec for 12 shards)
16. searching
§ performance
– 2 req/s out of the box with 6.5TB index
– OS cache is mandatory
– 130 req/s in cache
– many requests are needed to load the cache
§ relevance
– good for vertical engines
– not significant in the web graph experiment
17. why Elasticsearch AND hadoop?
§ simply use existing bridge
– open-sourced by Elasticsearch
§ ability to choose best technology
– performance
– expressive power
§ examples
– compute and re-inject back links
– distribute Elasticsearch injections
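To make the "compute and re-inject back links" example concrete, here is a small in-memory sketch of the map and reduce steps such a job could use. It is written in plain Python rather than with hadoop, hive or pig, and the function names and sample edges are hypothetical. The actual jobs are not described in the slides.

```python
from collections import defaultdict

def map_edge(edge):
    """Map step: turn a (source, target) link into a (target, source) pair."""
    source, target = edge
    return target, source

def reduce_back_links(pairs):
    """Reduce step: group the sources by target to get each page's back links."""
    back_links = defaultdict(list)
    for target, source in pairs:
        back_links[target].append(source)
    return {target: sorted(sources) for target, sources in back_links.items()}

# Hypothetical edges of the web graph: (linking page, linked page).
edges = [("a.fr", "b.fr"), ("c.fr", "b.fr"), ("a.fr", "c.fr")]
back_links = reduce_back_links(map(map_edge, edges))
print(back_links)
```

On a real cluster, the grouping by target would be done by the shuffle phase, and the resulting lists could be re-injected into Elasticsearch as an extra field on each document.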
18. hardware architecture
(diagram: Elasticsearch cluster and hadoop cluster; "http" nodes, a "master" node, and "data" nodes each labelled hdfs, hadoop, hive, pig; multiplicity labels: x30, x30, x1)
19. agenda
part 1 Orange French search engine
part 2 why Elasticsearch?
part 3 conclusion
20. conclusion
§ vertical engines
– migration decided 09/13
– first set in production 01/14
§ web graph
– experimentation decided 09/13
– 1bn docs indexed 12/13, significant queries 03/14
§ professional community
§ connectors to other technologies
§ flexibility
– production and experimentation
– high volume