In the world of Big Data, scaling out is the norm. However, many Big Data deployments remain trapped in sprawling clusters of small boxes.
With the advent of scalable platforms like Scylla, node performance is no longer the limiting factor: doubling the size of a node doubles the available storage, memory, and processing power. So what stops people from going big in the Cloud Native world?
Watch this webinar to learn the pros and cons of large nodes and to explore why people resist big machines, including:
- Is the cost of recovering from failures higher in larger nodes?
- Does performance increase linearly as machines get bigger?
- Does cluster performance suffer for the entire time of recovery from failures?
Webinar: Does it Still Make Sense to do Big Data with Small Nodes?
1. Go Big or Go Home!
Does it still make sense to do Big Data with small nodes?
WEBINAR
2. Glauber Costa
Glauber Costa is a Principal Architect at ScyllaDB. He splits his time between the engineering department, working on upcoming Scylla features, and helping customers succeed.
Before ScyllaDB, Glauber worked on virtualization in the Linux kernel for 10 years, with contributions ranging from the Xen hypervisor to all sorts of guest functionality and containers.
3. About ScyllaDB
+ Next-generation NoSQL database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of the KVM hypervisor
+ HQs: Palo Alto, CA; Herzliya, Israel
5-6. A long, long time ago...
+ NoSQL allows Big Data with commodity HW.
+ 2008: Intel Core, 2 cores.
+ 2018: Samsung S8, octa-core, fits in your pocket.
+ Need to store 200TB of data (a quick comparison of the two layouts follows below):
+ 200 nodes with 4 cores and 1TB each, or
+ 20 nodes with 40 cores and 10TB each?
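Both layouts buy the same aggregate hardware; the choice is purely about how many boxes it is spread across. A trivial sketch of the arithmetic (the layout names are mine, not from the slides):

```python
# Two ways to provision 200TB (from the slide above): many small nodes vs. few big ones.
layouts = {
    "many small nodes": {"nodes": 200, "cores_per_node": 4,  "tb_per_node": 1},
    "few big nodes":    {"nodes": 20,  "cores_per_node": 40, "tb_per_node": 10},
}

for name, spec in layouts.items():
    total_cores = spec["nodes"] * spec["cores_per_node"]
    total_tb = spec["nodes"] * spec["tb_per_node"]
    print(f"{name:>17}: {spec['nodes']:>3} nodes, "
          f"{total_cores} cores total, {total_tb} TB total")
```

Both lines print 800 cores and 200 TB in total; only the node count differs.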
7. Big vs Small?
+ Bigger nodes have fewer noisy neighbors.
+ Bigger nodes see economies of scale.
+ Fewer nodes increase manageability.
+ But I do small nodes because each of them only has 500GB of disk anyway!
13. More nodes mean more failures
+ MTBF is a constant, so twice as many nodes means twice as many failures (a back-of-the-envelope sketch follows below).
+ Even assuming each individual failure takes longer to recover:
+ How many failures per year in a 3-node cluster with 20TB each?
+ How many failures per year in a 60-node cluster with 1TB each?
+ Part of the cost is per failure, not per size:
+ How do you like being paged every week instead of twice a year?
+ Security fix, kernel update: a rolling update across 3 nodes vs. 60 nodes.
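To make the paging-frequency question concrete, here is a back-of-the-envelope sketch. The 5% per-node annual failure rate is an assumption chosen only for illustration (it is not a number from the webinar); the point is just that expected failures scale with node count.

```python
# Expected node failures per year for the two cluster layouts above.
# ASSUMPTION: 5% annual failure rate per node, failures independent; illustration only.
AFR_PER_NODE = 0.05

def failures_per_year(node_count: int, afr: float = AFR_PER_NODE) -> float:
    """Expected number of node failures per year."""
    return node_count * afr

print(f" 3 x 20TB nodes: ~{failures_per_year(3):.2f} failures/year")
print(f"60 x  1TB nodes: ~{failures_per_year(60):.2f} failures/year (20x the pages)")
```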
15-20. Let’s do some experiments
1 c4.4xlarge, 250 threads, QUORUM writes 1,000,000,000 partitions
Cluster is 3 x i3.xlarge, RF=3
latency mean : 12.9
latency 95th percentile : 20.2
latency 99th percentile : 26.2
latency 99.9th percentile : 40.0
Total operation time : 14:19:02
21-27. Let’s do some experiments
2 c4.4xlarge, 250 threads, QUORUM writes 2,000,000,000 partitions
Cluster is 3 x i3.2xlarge, RF=3 (max between all clients)
latency mean : 13.6
latency 95th percentile : 21.2
latency 99th percentile : 27.3
latency 99.9th percentile : 38.8
Total operation time : 15:09:49 (+ 6%)
28-31. Let’s do some experiments
4 c4.4xlarge, 250 threads, QUORUM writes 4,000,000,000 partitions
Cluster is 3 x i3.4xlarge, RF=3 (max between all clients)
latency mean : 10.6
latency 95th percentile : 16.8
latency 99th percentile : 21.5
latency 99.9th percentile : 26.5
Total operation time : 11:44:26 (- 22%)
32-33. Let’s do some experiments
8 c4.4xlarge, 250 threads, QUORUM writes 8,000,000,000 partitions
Cluster is 3 x i3.8xlarge, RF=3 (max between all clients)
Total operation time : 11:48:11 (+ 5%)
34-35. Let’s do some experiments
16 c4.4xlarge, 250 threads, QUORUM writes 16,000,000,000 partitions
Cluster is 3 x i3.16xlarge, RF=3 (max between all clients)
Total operation time : 12:30:04 (+ 6%)
36. Let’s do some experiments
Linear scale-up capability, as much as scale-out: it pays to scale up (a quick check from the reported numbers follows below).
Total data size per node in the i3.16xlarge case is 4.8TB.
[Chart: time to ingest 1B, 2B, 4B, 8B, and 16B rows across the five cluster sizes]
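The scale-up claim can be sanity-checked from the run times reported above. A minimal sketch (row counts and total operation times are copied from the slides; the vCPU counts are the standard AWS figures for the i3 family, from 4 on i3.xlarge doubling up to 64 on i3.16xlarge):

```python
# Ingest throughput per cluster size, from the totals reported on the slides.
runs = {
    # instance type: (rows ingested, total operation time, vCPUs per node)
    "i3.xlarge":   (1_000_000_000,  "14:19:02",  4),
    "i3.2xlarge":  (2_000_000_000,  "15:09:49",  8),
    "i3.4xlarge":  (4_000_000_000,  "11:44:26", 16),
    "i3.8xlarge":  (8_000_000_000,  "11:48:11", 32),
    "i3.16xlarge": (16_000_000_000, "12:30:04", 64),
}

def seconds(hms: str) -> int:
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

for node, (rows, duration, vcpus) in runs.items():
    rate = rows / seconds(duration)
    # Each cluster has 3 nodes, so per-vCPU throughput divides by 3 * vcpus.
    print(f"{node:>12}: {rate:>9,.0f} rows/s cluster-wide, "
          f"{rate / (3 * vcpus):>7,.0f} rows/s per vCPU")
```

The per-vCPU rate stays roughly flat across the five configurations, which is what "linear scale-up" means here.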
41. Let’s do some experiments
On one of the nodes from the previous experiment:
nodetool compact from quiescent state, 293GB, i3.xlarge: 1:45:27
42. Let’s do some experiments
nodetool compact from quiescent state, 587GB i3.2xlarge: 1:47:05
43. Let’s do some experiments
nodetool compact from quiescent state, 1.2TB i3.4xlarge: 2:00:41
44. Let’s do some experiments
nodetool compact from quiescent state, 2.4TB i3.8xlarge: 2:02:59
45. Let’s do some experiments
nodetool compact from quiescent state, 4.8TB i3.16xlarge: 2:11:34
[Chart: time to fully compact the node at 0.3TB, 0.6TB, 1.2TB, 2.4TB, and 4.8TB]
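The same kind of check works for compaction: the per-node data set doubles at every step while the wall-clock time barely grows, so compaction throughput scales roughly linearly with node size. A small sketch using the figures from the slides above:

```python
# Compaction throughput per node size, from the nodetool compact timings above.
compactions = {
    # instance type: (data compacted in GB, wall-clock time)
    "i3.xlarge":   (293,  "1:45:27"),
    "i3.2xlarge":  (587,  "1:47:05"),
    "i3.4xlarge":  (1200, "2:00:41"),
    "i3.8xlarge":  (2400, "2:02:59"),
    "i3.16xlarge": (4800, "2:11:34"),
}

def hours(hms: str) -> float:
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600

for node, (gb, duration) in compactions.items():
    print(f"{node:>12}: {gb / hours(duration):>6.0f} GB/hour")
```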
47. Heat-weighted Load Balancing
+ A replica goes down and comes back up.
+ Its caches are cold.
+ Never sending requests to the node means its caches never warm up.
+ Mathematically optimize the desired hit ratio so that caches warm up while keeping latencies down (an illustrative sketch follows below).
[Figure: node restart]
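To illustrate the idea on the slide (and only the idea: this is not Scylla's actual heat-weighted formula, and every number below is an assumption), here is a toy sketch that routes reads to replicas with probability proportional to each replica's cache hit ratio, with a small floor so a restarted, cold-cache node still receives some traffic and warms up:

```python
import random

# Toy illustration of heat-weighted routing: the colder a replica's cache, the
# smaller its share of reads, but never zero, so it can warm up over time.
MIN_SHARE = 0.05  # assumption: always give the cold node at least a 5% weight

def pick_replica(hit_ratios: dict[str, float]) -> str:
    """Pick a replica with probability proportional to its cache hit ratio."""
    weights = {node: max(ratio, MIN_SHARE) for node, ratio in hit_ratios.items()}
    total = sum(weights.values())
    r = random.uniform(0, total)
    for node, w in weights.items():
        r -= w
        if r <= 0:
            return node
    return node  # fallback for floating-point edge cases

# Example: node C just restarted and its cache is nearly empty.
hit_ratios = {"A": 0.92, "B": 0.90, "C": 0.02}
sample = [pick_replica(hit_ratios) for _ in range(10_000)]
for node in hit_ratios:
    print(node, f"{sample.count(node) / len(sample):.1%} of reads")
```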
56. Conclusion
+ Scylla scales linearly with the amount of resources.
+ Linear scalability also reflects on compaction performance.
+ During failures, features like Heat-weighted Load Balancing help the cluster keep its SLAs,
+ so the fact that it takes longer to recover is not an issue.
59-60. The real cost of streaming
+ Same clusters as the previous experiments.
+ Destroy the compacted node, rebuild it from the remaining two (see the sketch after the chart below).
[Chart: time to rebuild a node by streaming, from 1B rows (0.3TB) to 16B rows (4.8TB)]
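Reading the streaming chart together with the failure-rate argument from slide 13: each failure of a big node streams more data, but a big-node cluster has proportionally fewer failures, so the expected amount of data re-streamed per year comes out the same. A minimal sketch (the 5% per-node annual failure rate is, again, an assumption for illustration):

```python
# Expected data re-streamed for recovery per year, for the two layouts from slide 6.
# ASSUMPTION: 5% annual failure rate per node, failures independent; illustration only.
AFR = 0.05

def streamed_tb_per_year(nodes: int, tb_per_node: float, afr: float = AFR) -> float:
    """Expected TB streamed per year = failures/year x data rebuilt per failure."""
    return nodes * afr * tb_per_node

print(f"200 x  1TB nodes: ~{streamed_tb_per_year(200, 1):.0f} TB re-streamed/year")
print(f" 20 x 10TB nodes: ~{streamed_tb_per_year(20, 10):.0f} TB re-streamed/year")
```

Both layouts come out at roughly the same 10 TB re-streamed per year under this assumption, which is the point of the revised conclusion below.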
61. Conclusion (revised)
+ Scylla scales linearly with the amount of resources.
+ Linear scalability also reflects on compaction performance.
+ During recovery, features like Heat-weighted Load Balancing help keep SLAs,
+ so the fact that it takes longer to recover is not an issue.
+ Larger nodes are not more expensive to recover after failures. That’s a myth.