MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)
1. MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads
Shunsuke Nakamura (Tokyo Institute of Technology, NHN Japan)
Kazuyuki Shudo (Tokyo Institute of Technology)
Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)
2. Cloud Storage
Distributed data stores processing large amounts of data
- NoSQL, Key-Value Storage (KVS), Document-Oriented DB, GraphDB
- Examples: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Giraph, Oracle Coherence, and others (> 100 products)
Characteristics: "limited functions, massive volume, high performance"
- Data access only by primary key
- No luxury features such as joins or global transactions
- Scalable to much larger data volumes and numbers of nodes
3. Design policies of cloud storages
There are many trade-offs:
- Data model: key/value vs. multi-dimensional map vs. document vs. graph
- Performance: write vs. read
- Latency vs. persistence
  - Latency: memory and disk utilization
  - Persistence: synchronous vs. asynchronous (snapshot)
- Replication: synchronous vs. asynchronous
- Consistency between replicas: strong vs. weak
- Data partitioning: row vs. column
- Distribution: master/slave vs. decentralized
4. MyCassandra focuses on the performance trade-off
- Data model: key/value vs. multi-dimensional map vs. document vs. graph
- Performance: write vs. read (focus)
- Latency vs. persistence (focus)
  - Latency: memory and disk utilization
  - Persistence: synchronous vs. asynchronous (snapshot)
- Replication: synchronous vs. asynchronous
- Consistency between replicas: strong vs. weak
- Data partitioning: row vs. column
- Distribution: master/slave vs. decentralized
5. Performance trade-off
Write-optimized vs. read-optimized
- A cloud storage with persistence is designed to optimize either write or read workloads.
- The storage engine determines which workload a cloud storage handles efficiently.
                   Write-optimized                 Read-optimized
Systems            Bigtable, Cassandra, HBase      MySQL, Yahoo! Sherpa
Indexing           Log-Structured Merge Tree       B-Tree [R. Bayer '70]
                   [P. O'Neil '96]
Write to disk      append                          random reads + writes
Read from disk     random reads + merge            random read
Storage engine     Bigtable clone                  MySQL
8. Research overview
Contribution:
- A technique to build a cloud storage that performs well with both read and write workloads
Steps:
1. MyCassandra: Apache Cassandra extended with pluggable storage engine support
2. MyCassandra Cluster: a heterogeneous cluster of nodes with different storage engines
[Figure: MyCassandra selects a write-optimized or read-optimized storage engine; MyCassandra Cluster combines them to be both read- and write-optimized]
9. Apache Cassandra
- Open-sourced by Facebook in 2008
- A top-level project of the Apache Software Foundation
Features:
- Scalability up to hundreds of servers across multiple racks/datacenters
- High availability without a SPOF, by adopting a decentralized architecture
- Write-optimized
[Figure: clustering across multiple racks/DCs (dc1, dc2, dc3); replication strategy based on region]
10. Apache Cassandra
A decentralized cloud storage without a SPOF
Consistent hashing (a decentralized algorithm; see the code sketch below):
- Assigns identifiers to both nodes and data on a circular ID space (hash values A-Z).
- With the number of replicas = 3, a key with hash(key) = Q is stored on a primary node and two secondary nodes found on the ring.
Roles of each node:
- Proxy, serving clients
- Primary/secondary data nodes
[Figure: circular ID space with nodes at A, F, N, Q, V, Z; the key's values are stored on the primary and on secondaries 1 and 2]
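To make the ring walk concrete, here is a minimal Java sketch of consistent hashing under the same "primary plus clockwise successors" rule. The class, method names, and toy hash are illustrative assumptions, not Cassandra's code (which uses MD5/Murmur-style token hashing).

```java
import java.util.*;

public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    /** The primary replica plus (n - 1) successors, clockwise on the ring. */
    public List<String> replicasFor(String key, int n) {
        List<String> replicas = new ArrayList<>();
        Iterator<String> it = ring.tailMap(hash(key)).values().iterator();
        while (replicas.size() < Math.min(n, ring.size())) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around the ring
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // toy stand-in for MD5/Murmur hashing
    }
}
```

Walking clockwise from the key's position yields the primary node first and the secondaries after it, which is how the figure assigns three replicas for hash(key) = Q.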
11. Apache Cassandra
Write-optimized storage engine, a Bigtable clone
O(1) fast write operation: writes an update to disk sequentially
- Fast, because there is no random disk I/O
- Always writable, because there is no write lock
Write path (see the code sketch after this slide):
1. Append the update to the CommitLog for persistence (only sequential disk writes).
2. Update the Memtable, an in-memory map, for quick reading.
3. Acknowledge the client.
4. Asynchronously flush the Memtable to an SSTable on disk.
5. Delete the flushed data from the CommitLog and the Memtable.
[Figure: updates <k1, v1>, <k1, v2> are synced to the CommitLog on disk and merged in the Memtable as <k1, obj(v1+v2)>; the Memtable is asynchronously flushed to SSTables 1-3, holding <k1,obj1>, <k1,obj2>, <k1,obj3>]
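As a rough illustration of these five steps, here is a minimal single-node sketch in Java. The in-memory lists stand in for on-disk structures, the flush is done inline rather than asynchronously, and all names (LsmStore, FLUSH_THRESHOLD) are assumptions, not Cassandra's code.

```java
import java.util.*;

public class LsmStore {
    private final List<String> commitLog = new ArrayList<>();   // stands in for the on-disk log
    private final NavigableMap<String, String> memtable = new TreeMap<>();
    private final List<NavigableMap<String, String>> sstables = new ArrayList<>();
    private static final int FLUSH_THRESHOLD = 4;               // tiny, for illustration

    public void write(String key, String value) {
        commitLog.add(key + "=" + value);   // 1. append to the CommitLog (sequential only)
        memtable.put(key, value);           // 2. update the in-memory Memtable
        // 3. the client is acknowledged here, before any flush happens
        if (memtable.size() >= FLUSH_THRESHOLD) flush();  // 4. (asynchronous in reality)
    }

    private void flush() {
        sstables.add(new TreeMap<>(memtable));  // 4. write the Memtable out as an SSTable
        memtable.clear();                       // 5. drop the flushed data from the Memtable...
        commitLog.clear();                      //    ...and its entries from the CommitLog
    }
}
```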
12. Apache Cassandra
Write-optimized storage engine, a Bigtable clone
Slow read operation: reads data from the Memtable and multiple SSTables, and merges the results (sketched below)
- Slow, because of multiple random disk I/Os
[Figure: a read for k1 consults the Memtable in memory and SSTables 1-3 on disk (<k1,obj1>, <k1,obj2>, <k1,obj3>), issuing multiple random I/Os, and merges the results into <k1,obj>]
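Extending the LsmStore sketch above, the read path might look like the following. For brevity this version returns the newest value found, scanning SSTables newest-first; real Cassandra merges per-column by timestamp, so treat this as an assumption-laden simplification.

```java
// Read-path counterpart to LsmStore.write(), added inside LsmStore:
// a read must consult the Memtable and every SSTable that may hold the key.
public String read(String key) {
    String v = memtable.get(key);                // cheap in-memory lookup;
    if (v != null) return v;                     // the Memtable holds the newest data
    for (int i = sstables.size() - 1; i >= 0; i--) {   // newest SSTable first;
        String hit = sstables.get(i).get(key);         // each probe is a random disk I/O
        if (hit != null) return hit;
    }
    return null;                                 // key not present
}
```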
13. Performance of original Cassandra
Write performance is much higher. YCSB results show:
- Average: write is 9x as fast as read.
- 99.9th percentile: write is 43.5x as fast as read.
[Figure: latency histograms (number of operations vs. latency in ms) for read and write. Average: write 0.69 ms vs. read 6.16 ms; 99.9th percentile: write 2.0 ms vs. read 86.9 ms.]
14. 1. Storage Engine Support: MyCassandra
[Figure: MyCassandra selects either a read-optimized or a write-optimized storage engine]
15. MyCassandra: A modular cloud storage
Storage engines are supported:
- The storage-engine feature is inspired by MySQL.
- An engine is an independent, pluggable component that performs the disk I/O.
- A cloud storage can be made either write-optimized or read-optimized by selecting a storage engine.
MyCassandra keeps Cassandra's original distribution architecture and data model:
- Decentralized
- Consistent hashing
- Gossip protocol
[Figure: analogy with MySQL. MySQL offers selectable storage engines (InnoDB, MyISAM, Memory, ...); MyCassandra = decentralized layer + selectable storage engine (Bigtable, MySQL, Redis, ...)]
16. MyCassandra implementation
- A Storage Engine Interface is introduced between Cassandra's original distribution architecture and the storage engines.
- Each storage engine implements this interface (sketched below).
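A minimal sketch, in Java, of what such a pluggable interface could look like; the method names and signatures are assumptions for illustration, not MyCassandra's actual interface.

```java
/**
 * Illustrative storage-engine boundary: the distribution layer only
 * talks to this interface, never to a concrete backend.
 */
public interface StorageEngine {
    void put(String table, String key, byte[] row);  // persist one row
    byte[] get(String table, String key);            // fetch one row, or null if absent
    void delete(String table, String key);
}
```

Each backend (the Bigtable-style engine, MySQL via JDBC, Redis, ...) would then provide its own implementation, which is what makes the engine selectable per node.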
17. Performance of each storage engine
Storage engines:
- Bigtable: write-optimized (original Cassandra 0.7.5)
- MySQL: read-optimized (MySQL 6.0 with InnoDB, JDBC API, stored procedures)
- Redis: in-memory KVS (Redis 2.2.8)
Setup: 6 nodes, Crucial SSDs, 6 GB of 8 GB memory allocated; data set of 1 KB records x 36 million
[Figure: latency of each engine per workload, with annotated performance gaps of 11.79x and 9.87x]
18. 2. Heterogeneous Cluster of Different Storage Engines: MyCassandra Cluster
[Figure: MyCassandra Cluster combines storage engines to be both read- and write-optimized]
19. Basic idea
(W: write-optimized, R: read-optimized, RW: in-memory)
- Replicate data on nodes with different storage engines.
- Route a query to the nodes that process it efficiently:
  - synchronously to the nodes that process it quickly,
  - asynchronously to the nodes that process it slowly.
  → Exploit each node's advantage.
- Furthermore, maintain consistency between replicas as much as the original Cassandra does.
  - Quorum protocol: (write agreements) + (read agreements) > (number of replicas) guarantees retrieval of the latest data (see the sketch below).
- Consequence: at least one node must process both read and write queries synchronously and quickly → the in-memory (RW) nodes play this role.
[Figure: a write query is routed synchronously to W and RW, asynchronously to R]
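The quorum condition on this slide is simple enough to state in code. A minimal sketch, using the N = 3, W = R = 2 configuration from later in the talk; the class and method names are mine, not the paper's.

```java
public class Quorum {
    /** True iff every read quorum intersects every write quorum. */
    static boolean quorumsOverlap(int replicas, int writeAcks, int readAcks) {
        return writeAcks + readAcks > replicas;
    }

    public static void main(String[] args) {
        // N = 3, W = R = 2: since 2 + 2 > 3, any two readers always include
        // at least one of the two synchronously written replicas.
        System.out.println(quorumsOverlap(3, 2, 2));  // true
    }
}
```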
20. Cluster design
(W: write-optimized, R: read-optimized, RW: in-memory)
- Combine nodes with different storage engines: write-optimized (W), read-optimized (R), in-memory (RW).
- Disseminate the storage engine type of each node: the type is attached to gossip messages.
- Place replicas on nodes with different storage engines. The proxy (any node that received the request) selects the storing nodes (see the sketch after this list):
  1. The primary node is determined from the queried key.
  2. The N - 1 secondary nodes are chosen so that they have different storage engines.
- Multiple nodes share a single server, for load balancing.
[Figure: cluster configuration with N = 3; the proxy (any node) learns engine types via gossip and routes to the primary and two secondaries (W, RW, R) responsible for the key]
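Extending the consistent-hashing sketch from slide 10, engine-aware placement might look like this: walk clockwise from the key's position, but accept a node only if its engine type is not yet represented among the chosen replicas. The Node record and engine tags are assumptions for illustration.

```java
import java.util.*;

public class ReplicaPlacement {
    // engine is "W", "R", or "RW"
    record Node(String name, String engine) {}

    /** First n nodes clockwise from the key's position, one per engine type. */
    static List<Node> placeReplicas(List<Node> clockwiseFromKey, int n) {
        List<Node> replicas = new ArrayList<>();
        Set<String> engines = new HashSet<>();
        for (Node node : clockwiseFromKey) {
            if (replicas.size() == n) break;
            if (engines.add(node.engine())) {  // skip engines already represented
                replicas.add(node);
            }
        }
        return replicas;
    }
}
```

The first node accepted is the primary (fixed purely by the key's position on the ring); the later picks ensure that with N = 3 every record lands on one W, one RW, and one R replica.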
21. Process for a write access (sketched below)
(W: write-optimized, R: read-optimized, RW: in-memory)
Quorum parameters: N = 3, W = R = 2; replica ratio W:RW:R = 1:1:1
1) A proxy receives a write query for a single record from a client. The proxy routes it to the nodes storing the record.
2) The proxy waits for ACKs; the W and RW nodes usually reply quickly.
3-a) If writing succeeds and the proxy receives two ACKs, it returns a success message.
3-b) If a data node fails to write, the proxy waits for ACKs including the R node and then returns a success message.
4) After returning, the proxy asynchronously waits for ACKs from the remaining nodes.
Write latency: max(W, RW)
[Figure: client → proxy → synchronous writes to the W and RW replicas, asynchronous write to the R replica]
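A minimal sketch of this routing in Java, assuming N = 3 and a write quorum of 2: the proxy sends the write to every replica but unblocks the client as soon as two ACKs arrive, which in the normal case are the fast W and RW replies. The Replica interface and class names are illustrative assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

public class WriteProxy {
    interface Replica { void write(String key, String value); }

    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Route a write to all replicas; return once `required` ACKs arrive. */
    void write(List<Replica> replicas, String key, String value, int required)
            throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(required);
        for (Replica r : replicas) {
            pool.submit(() -> {      // every replica is written, eventually
                r.write(key, value);
                acks.countDown();    // fast W/RW replicas reach here first
            });
        }
        acks.await();                // 2 of 3 ACKs -> success to the client
        // the slow R replica's ACK is absorbed asynchronously after return
    }
}
```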
22. Process for a read access (sketched below)
(W: write-optimized, R: read-optimized, RW: in-memory)
Quorum parameters: N = 3, W = R = 2; replica ratio W:RW:R = 1:1:1
1) A proxy receives a read query for a single record and routes it to the storing nodes.
2) The proxy waits for ACKs; the R and RW nodes reply quickly.
3-a) If the returned values are consistent, the proxy returns the value.
3-b) If the values are mismatched, the proxy waits for consistent values, including the W node's reply.
4) After returning, the proxy waits for the remaining nodes. If the proxy notices inconsistent values, it asynchronously updates them to the consistent one (Cassandra's ReadRepair feature does this).
Read latency: max(R, RW)
[Figure: client → proxy → synchronous reads from the R and RW replicas, asynchronous consistency check against the W replica]
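The read side, sketched under the same assumptions: values carry timestamps, the two fast replicas (RW and R) are read synchronously, and the slow W replica is consulted only on a mismatch, with stale replicas repaired in the background in the spirit of ReadRepair. All types and names here are illustrative, not the paper's code.

```java
import java.util.*;
import java.util.concurrent.*;

public class ReadProxy {
    record Versioned(String value, long timestamp) {}
    interface Replica {
        Versioned read(String key);
        void write(String key, Versioned v);  // used only for read repair
    }

    /** replicas.get(0) = RW, get(1) = R, get(2) = W (fastest readers first). */
    Versioned read(List<Replica> replicas, String key) {
        Versioned a = replicas.get(0).read(key);       // fast, in-memory
        Versioned b = replicas.get(1).read(key);       // fast, read-optimized
        if (a.timestamp() == b.timestamp()) return a;  // consistent -> done
        Versioned c = replicas.get(2).read(key);       // mismatch: include W
        Versioned latest = newest(a, newest(b, c));
        for (Replica r : replicas)                     // repair stale replicas
            CompletableFuture.runAsync(() -> r.write(key, latest));
        return latest;
    }

    private static Versioned newest(Versioned x, Versioned y) {
        return x.timestamp() >= y.timestamp() ? x : y;
    }
}
```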
23. Performance Evaluation
Goal: demonstrate that a heterogeneous cluster performs well with both read- and write-heavy workloads.
Targets:
- MyCassandra Cluster: 3 different nodes/server x 6 servers
- Cassandra: 1 node/server x 6 servers
Quorum parameters: N = 3, W = R = 2
Storage engines: Bigtable (W), MySQL/InnoDB (R), Redis (RW)
Yahoo! Cloud Serving Benchmark (YCSB) [SOCC '10]:
1. Load data (1 KB records: 10 x 100-byte columns) from a YCSB client
2. Warm up
3. Run the benchmark and measure response times at the client
24. YCSB workloads

Workload       Application example   Operation ratio         Record selection
Write-Only     Log                   Write 100% / Read 0%    Zipfian
Write-Heavy    Session store         Write 50% / Read 50%    Zipfian
Read-Heavy     Photo tagging         Read 95% / Write 5%     Zipfian
Read-Only      Cache                 Read 100% / Write 0%    Zipfian

Zipfian distribution: the access frequency of each datum is determined by its popularity, not by its freshness.
26. Throughput
[Figure: throughput (queries/sec) for 40 clients, Cassandra vs. MyCassandra Cluster, across workloads [write:read] = [100:0], [50:50], [5:95], [0:100]. MyCassandra Cluster relative to Cassandra: 0.87x (Write-Only), 2.16x (Write-Heavy), 4.07x (Read-Heavy), 11.00x (Read-Only).]
- 11.0 times the throughput of Cassandra in the Read-Only workload
- Write performance is comparable with Cassandra's.
27. Conclusion
- A cloud storage supporting both write-heavy and read-heavy workloads, built by combining nodes with different storage engines.
- MyCassandra Cluster achieved better throughput than the original Cassandra on read-heavy workloads.
- With a read-heavy workload:
  - Read latency: up to 90.4% lower
  - Throughput: up to 11.0 times higher
28. Related Work
Indexing algorithms whose goals include both write and read performance:
- FD-Tree: Tree Indexing on Flash Disks, VLDB '10
- bLSM: A General Purpose Log Structured Merge Tree, SIGMOD '12
- Fractal Tree: implemented in TokuDB (a MySQL storage engine)
Modular data stores:
- MySQL
- Anvil, SOSP '09
- Cloudy, VLDB '10
- Dynamo, SOSP '07
Fractured Mirrors replicates data in row- and column-oriented layouts; analogously, MyCassandra (SYSTOR '12) replicates across read- and write-optimized engines.
29. Discussion 1: the slightly higher write latency
The cause is load balancing:
- Cassandra: a write goes to any of the N nodes, so synchronous operations are distributed equally.
- MyCassandra Cluster: a write goes synchronously to the specific W and RW nodes, so the synchronous load is fixed on those nodes.
However, this cost is well worth it for the improved read performance.
[Figure: in Cassandra, the synchronously accessed replicas for reads and writes are equally distributed; in MyCassandra Cluster they are fixed (writes: W and RW; reads: R and RW)]
30. Discussion 2: in-memory nodes
Q. Memory overflow?
A. An in-memory node acts as an LRU-like cache; swapped-out data is recovered from the other, persistent nodes by read repair.
Q. Fault tolerance?
A. 1) Write to an alternative node; when the failed node recovers, the inconsistency is resolved using values from that node. 2) Asynchronous snapshots (a Redis feature).
Q. What if all nodes are in-memory?
A. In that case, the cluster's capacity is limited by the memory capacity.