2. Background
• Technologist at The HumanGeo
• We use elasticsearch to build social media
analysis tools
• 100MM documents indexed
• 600GB+ index size
• Author of Python elasticsearch driver “rawes”
https://github.com/humangeo/rawes
3. Overview
• What is elasticsearch?
• Scaling with elasticsearch
• How can I use elasticsearch to help with
analytics?
• Use Case: Social Media Analytics
5. Search Engine
• Open source
• Distributed
• Automatic failover
• Crazy fast
6. Search Engine
• Actively maintained
• REST API
• JSON messages
• Lucene based
7. Search
[Diagram: an elasticsearch “cluster” with a single host holding the index “Articles”]
• Simple case: one host
• One index containing a set of articles
8. Distributed Search
[Diagram: an elasticsearch “cluster” with two hosts; one holds shard Articles (a), the other holds shard Articles (b)]
• Too much data?
• Add another host
• Indices can be broken up into “shards” and live on different
machines
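A rough sketch of how a document might be assigned to a shard: elasticsearch hashes a routing value (the document id by default) and takes it modulo the number of primary shards. Here `zlib.crc32` is only a stand-in hash (elasticsearch uses its own hash function internally), and `pick_shard` is a hypothetical helper, not part of any client library.

```python
import zlib


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number in the range [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # crc32 is a stand-in; elasticsearch uses a different hash internally.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Assign a few made-up article ids to two shards, (a) = 0 and (b) = 1.
assignments = {doc_id: pick_shard(doc_id, 2)
               for doc_id in ["article-1", "article-2", "article-3"]}
print(assignments)
```

The same id always lands on the same shard, which is why the number of primary shards is normally fixed when the index is created.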
9. Redundancy
[Diagram: an elasticsearch cluster with two hosts; each host holds a copy of both Articles (a) and Articles (b)]
• Shards can be replicated to improve
availability
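Shard and replica counts are set per index. A minimal sketch of the settings body for the two-shard, one-replica layout pictured here; `index_settings` is a hypothetical helper, and the body would typically be sent when creating the index (for example with an HTTP PUT to the index URL).

```python
import json


def index_settings(num_shards: int, num_replicas: int) -> dict:
    """Build an index-creation body with shard and replica counts."""
    if num_shards < 1 or num_replicas < 0:
        raise ValueError("need at least 1 shard and 0 or more replicas")
    return {
        "settings": {
            "index": {
                "number_of_shards": num_shards,      # e.g. Articles (a) and (b)
                "number_of_replicas": num_replicas,  # extra copies of each shard
            }
        }
    }


body = index_settings(num_shards=2, num_replicas=1)
print(json.dumps(body, indent=2))
```

With one replica, every shard exists twice in the cluster, so losing a single host does not lose data.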
10. Node Auto Discovery
[Diagram: an elasticsearch cluster with three hosts; copies of shards Articles (a) and Articles (b) are spread across them]
• Say we add a third host
• elasticsearch will automatically start moving
shards to this new host to distribute load
11. Failover
[Diagram: an elasticsearch cluster with three hosts holding copies of shards Articles (a) and Articles (b)]
• Say a host goes down
• Shards on that host are no longer available for search
• Elasticsearch automatically rebuilds these two shards on other
hosts
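A toy model of the recovery step, to make the idea concrete. This is not elasticsearch's actual allocation algorithm; `recover`, the host names, and the layout are all made up for illustration. After a host fails, each shard it held is copied to a surviving host that does not already have it.

```python
def recover(cluster: dict, failed_host: str, copies_wanted: int) -> dict:
    """Return a new host -> shards layout after failed_host is removed.

    Toy model only: for each lost shard, add copies on the least-loaded
    surviving hosts until copies_wanted copies exist (or hosts run out).
    """
    survivors = {h: set(s) for h, s in cluster.items() if h != failed_host}
    for shard in sorted(cluster.get(failed_host, set())):
        have = sum(1 for shards in survivors.values() if shard in shards)
        while have < copies_wanted:
            # Never place two copies of the same shard on one host.
            candidates = [h for h, shards in survivors.items()
                          if shard not in shards]
            if not candidates:
                break  # not enough hosts left to restore full redundancy
            target = min(candidates, key=lambda h: (len(survivors[h]), h))
            survivors[target].add(shard)
            have += 1
    return survivors


before = {"host1": {"a", "b"}, "host2": {"a"}, "host3": {"b"}}
after = recover(before, failed_host="host1", copies_wanted=2)
print(after)
```

In this example both shards from the failed host are rebuilt, so each surviving host ends up with a copy of (a) and (b).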
12. Querying
[Diagram: a client (web application) sends a query to a three-host elasticsearch cluster holding shards Articles (a) and Articles (b)]
• Client (web application) searches for articles
• Query: “Barack Obama”
• Can query against any host
• That host sends the request to other shards if needed
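A toy sketch of the scatter/gather idea: the receiving host asks every shard for its matches, then merges them into one ranked list. The scoring (counting occurrences of the phrase) and the helper names are invented for illustration and are far simpler than what Lucene does.

```python
import heapq


def search_shard(shard_docs: dict, term: str) -> list:
    """Score each document in one shard by counting occurrences of term."""
    term = term.lower()
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def scatter_gather(shards: list, term: str, size: int = 10) -> list:
    """Query every shard, then merge the hits into one top-`size` list."""
    all_hits = []
    for shard_docs in shards:
        all_hits.extend(search_shard(shard_docs, term))
    # Highest score first; ties broken by document id for stable output.
    return heapq.nsmallest(size, all_hits, key=lambda h: (-h[0], h[1]))


shard_a = {"1": "Barack Obama wins debate", "2": "Local weather report"}
shard_b = {"3": "Barack Obama visits Ohio. Barack Obama speaks."}
results = scatter_gather([shard_a, shard_b], "Barack Obama", size=2)
print(results)  # [(2, '3'), (1, '1')]
```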
13. REST API
• JSON query syntax
• Developer friendly
• Easy to get started
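For a feel of the JSON query syntax, here is a minimal sketch of a search body for the phrase “Barack Obama” using a `query_string` query. `phrase_query` is a hypothetical helper; the body would typically be sent to the index's `_search` endpoint.

```python
import json


def phrase_query(phrase: str, size: int = 10) -> dict:
    """Build a search body that matches an exact phrase.

    Assumes the phrase itself contains no double quotes.
    """
    return {
        # Wrapping the text in double quotes asks for a phrase match.
        "query": {"query_string": {"query": '"%s"' % phrase}},
        "size": size,
    }


body = phrase_query("Barack Obama")
print(json.dumps(body))
```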
17. Analytics and elasticsearch
• Date Histograms
• Statistical facets
• Geospatial queries
• All with arbitrary search parameters
• Again: Fast
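A sketch of one request that combines these features, written in the facet syntax of elasticsearch releases from the era of this talk (later releases replaced facets with aggregations). The field names `date`, `tags.sentiment.positive`, and `tags.geotags.location` are assumptions modelled on the sample document in this deck, and `analytics_request` is a hypothetical helper.

```python
import json


def analytics_request(search_text: str, lat: float, lon: float,
                      distance: str = "50km") -> dict:
    """Build a search body with a geo filter plus two facets."""
    return {
        "query": {
            "filtered": {
                # Arbitrary search parameters go here.
                "query": {"query_string": {"query": search_text}},
                # Geospatial restriction: only hits near (lat, lon).
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        "tags.geotags.location": {"lat": lat, "lon": lon},
                    }
                },
            }
        },
        "facets": {
            # Date histogram: number of matching documents per day.
            "posts_per_day": {
                "date_histogram": {"field": "date", "interval": "day"}
            },
            # Statistical facet: min, max, mean, etc. of a numeric field.
            "positive_sentiment": {
                "statistical": {"field": "tags.sentiment.positive"}
            },
        },
        "size": 0,  # only the facet results are needed, not the hits
    }


request = analytics_request("Mohamed Morsi", lat=30.0566, lon=31.2262)
print(json.dumps(request, indent=2))
```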
18. Use Case: Social Media Analysis
• Use social media APIs to search for data on a
topic of interest
• 100MM documents indexed
• Sentiment analysis
• Location extraction (“Geotagging”)
19. Sample Document
import rawes

# Assumes an elasticsearch node reachable at localhost:9200.
es = rawes.Elastic('localhost:9200')

# Index one Facebook post into the "articles" index, type "facebook".
es.post('articles/facebook', data={
    "date": "2012-09-01 08:37:55",
    "tags": {
        "sentiment": {
            "positive": 0.36,
            "negative": 0.10
        },
        "geotags": [{
            "term": "Cairo",
            "location": "30.0566,31.2262",
            "type": "geo_point"
        }],
        "search_terms": [
            "Mohamed Morsi"
        ]
    },
    "item": {
        "publisher": "Facebook",
        "source_domain": "www.facebook.com",
        "author": "James Smith",
        "source_url": "http://www.facebook.com/5551231234/posts/414141414141",
        "content_text": "Mohamed Morsi visits Iran for first time since 1979 ....",
        "title": "James Smith posted a note to Facebook",
        "author_url": "http://www.facebook.com/profile.php?id=5551231234"
    }
})
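For the geotag to be usable in geospatial queries, the `location` field generally needs to be declared as a `geo_point` in the mapping, and the space-separated timestamp likely needs an explicit date format. A minimal sketch of such a mapping for the `facebook` type, in the type-keyed mapping style of older elasticsearch releases; `facebook_mapping` is a hypothetical helper and the exact mapping used for this project is not shown in the slides.

```python
import json


def facebook_mapping() -> dict:
    """Build a mapping body for the 'facebook' type (legacy typed style)."""
    return {
        "facebook": {
            "properties": {
                # Matches timestamps such as "2012-09-01 08:37:55".
                "date": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
                "tags": {
                    "properties": {
                        "geotags": {
                            "properties": {
                                # "lat,lon" strings are indexed as points.
                                "location": {"type": "geo_point"}
                            }
                        }
                    }
                },
            }
        }
    }


mapping = facebook_mapping()
print(json.dumps(mapping, indent=2))
```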