2. Pricing & Text Analytics Platform
• Mission - Ingest, enrich, store, analyze everything. Provide a
single platform for search and analytics capabilities over any
hosted content. Serve as a platform for future innovation.
• Content
• Twitter (~675 Tweets/sec, 15 days history)
• News (~40 articles/sec, 18 months history)
• Research (40 million docs, 3 million/year)
• Filings (29 million docs, 2.5 million/year)
• Trade data (500k RICS, 30K/sec, 10 years)
• Various metadata and derived content sets
8. Maximum Shard Size
• This same experiment will also give you the ratio of data to
index size, which is great for planning. Just make sure you’re
using your real analyzer settings.
• The rest is just math!
• Don’t forget to account for:
• Memory required to facet & sort
• Replica shards
• Data compression
Max Total Index Size / Max Shard Size = # Nodes
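As a rough sketch of "the rest is just math": the function below applies the slide's formula after accounting for the index-to-data ratio and replica shards. Strictly speaking, the division yields the number of maximum-size shards; it equals the node count only under the assumption of one shard per node. The function name, parameters, and input figures are hypothetical and are not measurements from this deck.

```python
import math


def estimate_shard_count(raw_data_gb, index_to_data_ratio, replicas, max_shard_gb):
    """Apply the slide's formula: total index size / max shard size.

    index_to_data_ratio is assumed to come from the single-shard
    experiment run with the real analyzer settings.
    """
    primary_index_gb = raw_data_gb * index_to_data_ratio
    # Every replica is a full extra copy of the primary data.
    total_index_gb = primary_index_gb * (1 + replicas)
    return math.ceil(total_index_gb / max_shard_gb)


# Hypothetical inputs: 1 TB of raw data, index 1.5x the raw size,
# 1 replica, 50 GB maximum shard size.
print(estimate_shard_count(1000, 1.5, 1, 50))  # 60
```

Memory for faceting and sorting is not modeled here; it would lower the usable maximum shard size rather than change the formula.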
11. Cluster Allocation
• Elasticsearch will figure out which node should host which shard. Let it! It’s
better than you at figuring this out and moving shards around.
• Well, mostly…
• Let’s say you have indices A – D, 4 shards each, 0 replicas, 4 nodes.
Elasticsearch might arrange your shards like this based on the size of each
shard.
[Diagram: the 16 shards of indices A to D spread across the 4 nodes, with shards of different indices mixed on each node.]
12. Cluster Allocation
• But what about other considerations?
• Hot spotting
• Access frequency
• Connectivity for River-based ingestion
• Heterogeneous hardware
13. Cluster Allocation – Heterogeneous Hardware
• Suppose you know that indices A and B get queried 1000s of times per
second, but C and D are only hit about once a second. Maybe you bought some better
hardware to host A and B and don’t want to waste those machines on C and
D.
• Is this a good allocation?
[Diagram: two Slow HW nodes and two Fast HW nodes, with shards of A, B, C and D mixed across all four.]
14. Cluster Allocation – Heterogeneous Hardware
• Not really. The slower machines will slow all queries to A & B. And I’m not
getting my money’s worth from that better hardware!
15. Cluster Allocation – Heterogeneous Hardware
• Wouldn’t this be better?
• Shard allocation settings allow us to “control” which nodes host which indices
without ever specifying specific machines or IPs.
[Diagram: the shards of C and D on the two Slow HW nodes; the shards of A and B on the two Fast HW nodes.]
16. Cluster Allocation – Heterogeneous Hardware
Node settings (Slow HW nodes): node.hardware: slow
Node settings (Fast HW nodes): node.hardware: fast
Index settings (A & B): index.routing.allocation.require.hardware: fast
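To make the effect of these settings concrete, here is a toy model of the "require" filter. It is not Elasticsearch's allocator: it only does exact matching of one value per attribute, and the node names and the `eligible_nodes` helper are hypothetical.

```python
REQUIRE_PREFIX = "index.routing.allocation.require."


def eligible_nodes(nodes, index_settings):
    """Return the nodes whose attributes satisfy every 'require' filter."""
    required = {
        key[len(REQUIRE_PREFIX):]: value
        for key, value in index_settings.items()
        if key.startswith(REQUIRE_PREFIX)
    }
    return [
        name
        for name, attrs in nodes.items()
        if all(attrs.get(attr) == value for attr, value in required.items())
    ]


# Node attributes as set via node.hardware on each machine.
nodes = {
    "node1": {"hardware": "slow"},
    "node2": {"hardware": "slow"},
    "node3": {"hardware": "fast"},
    "node4": {"hardware": "fast"},
}

hot_index = {"index.routing.allocation.require.hardware": "fast"}  # A and B
cold_index = {}  # C and D: no filter set

print(eligible_nodes(nodes, hot_index))   # ['node3', 'node4']
print(eligible_nodes(nodes, cold_index))  # ['node1', 'node2', 'node3', 'node4']
```

Note that in this model an index with no filter remains eligible for every node, fast ones included; only the filtered indices are restricted.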
17. Cluster Allocation – Heterogeneous Hardware
[Diagram: one Slow HW node and three Fast HW nodes hosting the shards of A, B, C and D.]
• Is this ok? …Sure, why not?!
18. Cluster Allocation – Archive Example
• We can use the same feature for large data sets of a time-based feed. Say
we keep an index for all news ever. People are generally searching the
most recent 12 months, not the last 30 years.
[Diagram: a large pool of Slow HW nodes alongside a small group of Fast HW nodes.]
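One way the archive idea could be sketched, assuming the feed is split into one index per month (an assumption; the slide does not say how the data is partitioned). Each index gets the same "require" setting, with the value chosen by age. The function names and the 12-month default are illustrative only.

```python
from datetime import date


def hardware_for_index(index_month, today, hot_months=12):
    """Pick 'fast' for indices younger than hot_months, else 'slow'."""
    age_in_months = (today.year - index_month.year) * 12 + (
        today.month - index_month.month
    )
    return "fast" if age_in_months < hot_months else "slow"


def allocation_settings(index_month, today, hot_months=12):
    """Build the index-level allocation setting for one monthly index."""
    return {
        "index.routing.allocation.require.hardware": hardware_for_index(
            index_month, today, hot_months
        )
    }


today = date(2014, 6, 1)  # hypothetical "current" month
print(allocation_settings(date(2014, 1, 1), today))
print(allocation_settings(date(2013, 6, 1), today))
```

Re-running this as months roll over and applying the result to each index would move aging indices from the fast nodes to the slow pool.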