MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads (SYSTOR 2012)
1. MyCassandra: A Cloud Storage Supporting both Read Heavy and Write Heavy Workloads
Shunsuke Nakamura (Tokyo Institute of Technology, NHN Japan)
Kazuyuki Shudo (Tokyo Institute of Technology)
Session 6 - Storage, SYSTOR 2012 (Haifa, Israel, Jun 4-6)
2. Cloud Storage
Distributed data stores processing large amounts of data
- NoSQL, Key-Value Storage (KVS), Document-Oriented DB, GraphDB
- Examples: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Giraph, Oracle Coherence, and others (> 100 products)
Characteristics: "limited functions, massive volume, high performance"
- Data access only by primary key
- No luxury features such as joins or global transactions
- Scalable to much larger data volumes and numbers of nodes
3. Design policies of cloud storages
There are many trade-offs:
- Data model: key/value vs. multi-dimensional map vs. document vs. graph
- Performance: write vs. read
- Latency vs. persistence
  - Latency: memory and disk utilization
  - Persistence: synchronous vs. asynchronous (snapshot)
- Replication: synchronous vs. asynchronous
- Consistency between replicas: strong vs. weak
- Data partitioning: row vs. column
- Distribution: master/slave vs. decentralized
4. MyCassandra focuses on the performance trade-off
- Data model: key/value vs. multi-dimensional map vs. document vs. graph
- Performance: write vs. read (focus)
- Latency vs. persistence (focus)
  - Latency: memory and disk utilization
  - Persistence: synchronous vs. asynchronous (snapshot)
- Replication: synchronous vs. asynchronous
- Consistency between replicas: strong vs. weak
- Data partitioning: row vs. column
- Distribution: master/slave vs. decentralized
5. Performance trade-off
Write-optimized vs. read-optimized
- A cloud storage with persistence is designed to optimize either write or read workloads.
- The storage engine determines which workload a cloud storage handles efficiently.
                   Write-optimized                 Read-optimized
Systems            Bigtable, Cassandra, HBase      MySQL, Yahoo! Sherpa
Indexing           Log-Structured Merge Tree       B-Tree [R. Bayer '70]
                   [P. O'Neil '96]
Write to disk      append                          random reads + writes
Read from disk     random reads + merge            random read
Storage engine     Bigtable clone                  MySQL
8. Research overview
Contribution:
- A technique to build a cloud storage that performs well with both read and write workloads
Steps:
1. MyCassandra: Apache Cassandra extended with pluggable storage engine support
2. MyCassandra Cluster: a heterogeneous cluster of nodes with different storage engines
[Figure: MyCassandra selects a write-optimized or read-optimized storage engine; MyCassandra Cluster combines them to be both read- and write-optimized]
9. Apache Cassandra
- Open-sourced by Facebook in 2008
- A top-level project of the Apache Software Foundation
Features:
- Scalability up to hundreds of servers across multiple racks/datacenters
- High availability without a SPOF, by adopting a decentralized architecture
- Write-optimized
[Figure: clustering across multiple racks/DCs (dc1, dc2, dc3); replication strategy based on region]
10. Apache Cassandra
A decentralized cloud storage without a SPOF
Consistent hashing (a decentralized algorithm; see the code sketch below):
- Assigns identifiers to both nodes and data on a circular ID space (hash values A-Z).
- With the number of replicas = 3, a key with hash(key) = Q is stored on a primary node and two secondary nodes found on the ring.
Roles of each node:
- Proxy, serving clients
- Primary/secondary data nodes
[Figure: circular ID space with nodes at A, F, N, Q, V, Z; the key's values are stored on the primary and on secondaries 1 and 2]
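To make the ring walk concrete, here is a minimal Java sketch of consistent hashing under the same "primary plus clockwise successors" rule. The class, method names, and toy hash are illustrative assumptions, not Cassandra's code (which uses MD5/Murmur-style token hashing).

```java
import java.util.*;

public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    /** The primary replica plus (n - 1) successors, clockwise on the ring. */
    public List<String> replicasFor(String key, int n) {
        List<String> replicas = new ArrayList<>();
        Iterator<String> it = ring.tailMap(hash(key)).values().iterator();
        while (replicas.size() < Math.min(n, ring.size())) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around the ring
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // toy stand-in for MD5/Murmur hashing
    }
}
```

Walking clockwise from the key's position yields the primary node first and the secondaries after it, which is how the figure assigns three replicas for hash(key) = Q.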
11. Apache Cassandra
Write-optimized storage engine, a Bigtable clone
O(1) fast write operation: writes an update to disk sequentially
- Fast, because there is no random disk I/O
- Always writable, because there is no write lock
Write path (see the code sketch after this slide):
1. Append the update to the CommitLog for persistence (only sequential disk writes).
2. Update the Memtable, an in-memory map, for quick reading.
3. Acknowledge the client.
4. Asynchronously flush the Memtable to an SSTable on disk.
5. Delete the flushed data from the CommitLog and the Memtable.
[Figure: updates <k1, v1>, <k1, v2> are synced to the CommitLog on disk and merged in the Memtable as <k1, obj(v1+v2)>; the Memtable is asynchronously flushed to SSTables 1-3, holding <k1,obj1>, <k1,obj2>, <k1,obj3>]
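As a rough illustration of these five steps, here is a minimal single-node sketch in Java. The in-memory lists stand in for on-disk structures, the flush is done inline rather than asynchronously, and all names (LsmStore, FLUSH_THRESHOLD) are assumptions, not Cassandra's code.

```java
import java.util.*;

public class LsmStore {
    private final List<String> commitLog = new ArrayList<>();   // stands in for the on-disk log
    private final NavigableMap<String, String> memtable = new TreeMap<>();
    private final List<NavigableMap<String, String>> sstables = new ArrayList<>();
    private static final int FLUSH_THRESHOLD = 4;               // tiny, for illustration

    public void write(String key, String value) {
        commitLog.add(key + "=" + value);   // 1. append to the CommitLog (sequential only)
        memtable.put(key, value);           // 2. update the in-memory Memtable
        // 3. the client is acknowledged here, before any flush happens
        if (memtable.size() >= FLUSH_THRESHOLD) flush();  // 4. (asynchronous in reality)
    }

    private void flush() {
        sstables.add(new TreeMap<>(memtable));  // 4. write the Memtable out as an SSTable
        memtable.clear();                       // 5. drop the flushed data from the Memtable...
        commitLog.clear();                      //    ...and its entries from the CommitLog
    }
}
```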
12. Apache Cassandra
Write-optimized storage engine, a Bigtable clone
Slow read operation: reads data from the Memtable and multiple SSTables, and merges the results (sketched below)
- Slow, because of multiple random disk I/Os
[Figure: a read for k1 consults the Memtable in memory and SSTables 1-3 on disk (<k1,obj1>, <k1,obj2>, <k1,obj3>), issuing multiple random I/Os, and merges the results into <k1,obj>]
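Extending the LsmStore sketch above, the read path might look like the following. For brevity this version returns the newest value found, scanning SSTables newest-first; real Cassandra merges per-column by timestamp, so treat this as an assumption-laden simplification.

```java
// Read-path counterpart to LsmStore.write(), added inside LsmStore:
// a read must consult the Memtable and every SSTable that may hold the key.
public String read(String key) {
    String v = memtable.get(key);                // cheap in-memory lookup;
    if (v != null) return v;                     // the Memtable holds the newest data
    for (int i = sstables.size() - 1; i >= 0; i--) {   // newest SSTable first;
        String hit = sstables.get(i).get(key);         // each probe is a random disk I/O
        if (hit != null) return hit;
    }
    return null;                                 // key not present
}
```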
13. Performance of original Cassandra
Write performance is much higher. YCSB results show:
- Average: write is 9x as fast as read.
- 99.9th percentile: write is 43.5x as fast as read.
[Figure: latency histograms (number of operations vs. latency in ms) for read and write. Average: write 0.69 ms vs. read 6.16 ms; 99.9th percentile: write 2.0 ms vs. read 86.9 ms.]
14. 1. Storage Engine Support: MyCassandra
[Figure: MyCassandra selects either a read-optimized or a write-optimized storage engine]
15. MyCassandra: A modular cloud storage
Storage engines are supported:
- The storage-engine feature is inspired by MySQL.
- An engine is an independent, pluggable component that performs the disk I/O.
- A cloud storage can be made either write-optimized or read-optimized by selecting a storage engine.
MyCassandra keeps Cassandra's original distribution architecture and data model:
- Decentralized
- Consistent hashing
- Gossip protocol
[Figure: analogy with MySQL. MySQL offers selectable storage engines (InnoDB, MyISAM, Memory, ...); MyCassandra = decentralized layer + selectable storage engine (Bigtable, MySQL, Redis, ...)]
16. MyCassandra implementation
- A Storage Engine Interface is introduced between Cassandra's original distribution architecture and the storage engines.
- Each storage engine implements this interface (sketched below).
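A minimal sketch, in Java, of what such a pluggable interface could look like; the method names and signatures are assumptions for illustration, not MyCassandra's actual interface.

```java
/**
 * Illustrative storage-engine boundary: the distribution layer only
 * talks to this interface, never to a concrete backend.
 */
public interface StorageEngine {
    void put(String table, String key, byte[] row);  // persist one row
    byte[] get(String table, String key);            // fetch one row, or null if absent
    void delete(String table, String key);
}
```

Each backend (the Bigtable-style engine, MySQL via JDBC, Redis, ...) would then provide its own implementation, which is what makes the engine selectable per node.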
17. Performance of each storage engine
Storage engines:
- Bigtable: write-optimized (original Cassandra 0.7.5)
- MySQL: read-optimized (MySQL 6.0 with InnoDB, JDBC API, stored procedures)
- Redis: in-memory KVS (Redis 2.2.8)
Setup: 6 nodes, Crucial SSDs, 6 GB of 8 GB memory allocated; data set of 1 KB records x 36 million
[Figure: latency of each engine per workload, with annotated performance gaps of 11.79x and 9.87x]
18. 2. Heterogeneous Cluster of Different Storage Engines: MyCassandra Cluster
[Figure: MyCassandra Cluster combines storage engines to be both read- and write-optimized]
19. Basic idea
(W: write-optimized, R: read-optimized, RW: in-memory)
- Replicate data on nodes with different storage engines.
- Route a query to the nodes that process it efficiently:
  - synchronously to the nodes that process it quickly,
  - asynchronously to the nodes that process it slowly.
  → Exploit each node's advantage.
- Furthermore, maintain consistency between replicas as much as the original Cassandra does.
  - Quorum protocol: (write agreements) + (read agreements) > (number of replicas) guarantees retrieval of the latest data (see the sketch below).
- Consequence: at least one node must process both read and write queries synchronously and quickly → the in-memory (RW) nodes play this role.
[Figure: a write query is routed synchronously to W and RW, asynchronously to R]
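The quorum condition on this slide is simple enough to state in code. A minimal sketch, using the N = 3, W = R = 2 configuration from later in the talk; the class and method names are mine, not the paper's.

```java
public class Quorum {
    /** True iff every read quorum intersects every write quorum. */
    static boolean quorumsOverlap(int replicas, int writeAcks, int readAcks) {
        return writeAcks + readAcks > replicas;
    }

    public static void main(String[] args) {
        // N = 3, W = R = 2: since 2 + 2 > 3, any two readers always include
        // at least one of the two synchronously written replicas.
        System.out.println(quorumsOverlap(3, 2, 2));  // true
    }
}
```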
20. Cluster design
(W: write-optimized, R: read-optimized, RW: in-memory)
- Combine nodes with different storage engines: write-optimized (W), read-optimized (R), in-memory (RW).
- Disseminate the storage engine type of each node: the type is attached to gossip messages.
- Place replicas on nodes with different storage engines. The proxy (any node that received the request) selects the storing nodes (see the sketch after this list):
  1. The primary node is determined from the queried key.
  2. The N - 1 secondary nodes are chosen so that they have different storage engines.
- Multiple nodes share a single server, for load balancing.
[Figure: cluster configuration with N = 3; the proxy (any node) learns engine types via gossip and routes to the primary and two secondaries (W, RW, R) responsible for the key]
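Extending the consistent-hashing sketch from slide 10, engine-aware placement might look like this: walk clockwise from the key's position, but accept a node only if its engine type is not yet represented among the chosen replicas. The Node record and engine tags are assumptions for illustration.

```java
import java.util.*;

public class ReplicaPlacement {
    // engine is "W", "R", or "RW"
    record Node(String name, String engine) {}

    /** First n nodes clockwise from the key's position, one per engine type. */
    static List<Node> placeReplicas(List<Node> clockwiseFromKey, int n) {
        List<Node> replicas = new ArrayList<>();
        Set<String> engines = new HashSet<>();
        for (Node node : clockwiseFromKey) {
            if (replicas.size() == n) break;
            if (engines.add(node.engine())) {  // skip engines already represented
                replicas.add(node);
            }
        }
        return replicas;
    }
}
```

The first node accepted is the primary (fixed purely by the key's position on the ring); the later picks ensure that with N = 3 every record lands on one W, one RW, and one R replica.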
21. Process for a write access (sketched below)
(W: write-optimized, R: read-optimized, RW: in-memory)
Quorum parameters: N = 3, W = R = 2; replica ratio W:RW:R = 1:1:1
1) A proxy receives a write query for a single record from a client. The proxy routes it to the nodes storing the record.
2) The proxy waits for ACKs; the W and RW nodes usually reply quickly.
3-a) If writing succeeds and the proxy receives two ACKs, it returns a success message.
3-b) If a data node fails to write, the proxy waits for ACKs including the R node and then returns a success message.
4) After returning, the proxy asynchronously waits for ACKs from the remaining nodes.
Write latency: max(W, RW)
[Figure: client → proxy → synchronous writes to the W and RW replicas, asynchronous write to the R replica]
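A minimal sketch of this routing in Java, assuming N = 3 and a write quorum of 2: the proxy sends the write to every replica but unblocks the client as soon as two ACKs arrive, which in the normal case are the fast W and RW replies. The Replica interface and class names are illustrative assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

public class WriteProxy {
    interface Replica { void write(String key, String value); }

    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Route a write to all replicas; return once `required` ACKs arrive. */
    void write(List<Replica> replicas, String key, String value, int required)
            throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(required);
        for (Replica r : replicas) {
            pool.submit(() -> {      // every replica is written, eventually
                r.write(key, value);
                acks.countDown();    // fast W/RW replicas reach here first
            });
        }
        acks.await();                // 2 of 3 ACKs -> success to the client
        // the slow R replica's ACK is absorbed asynchronously after return
    }
}
```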
22. Process for a read access (sketched below)
(W: write-optimized, R: read-optimized, RW: in-memory)
Quorum parameters: N = 3, W = R = 2; replica ratio W:RW:R = 1:1:1
1) A proxy receives a read query for a single record and routes it to the storing nodes.
2) The proxy waits for ACKs; the R and RW nodes reply quickly.
3-a) If the returned values are consistent, the proxy returns the value.
3-b) If the values are mismatched, the proxy waits for consistent values, including the W node's reply.
4) After returning, the proxy waits for the remaining nodes. If the proxy notices inconsistent values, it asynchronously updates them to the consistent one (Cassandra's ReadRepair feature does this).
Read latency: max(R, RW)
[Figure: client → proxy → synchronous reads from the R and RW replicas, asynchronous consistency check against the W replica]
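The read side, sketched under the same assumptions: values carry timestamps, the two fast replicas (RW and R) are read synchronously, and the slow W replica is consulted only on a mismatch, with stale replicas repaired in the background in the spirit of ReadRepair. All types and names here are illustrative, not the paper's code.

```java
import java.util.*;
import java.util.concurrent.*;

public class ReadProxy {
    record Versioned(String value, long timestamp) {}
    interface Replica {
        Versioned read(String key);
        void write(String key, Versioned v);  // used only for read repair
    }

    /** replicas.get(0) = RW, get(1) = R, get(2) = W (fastest readers first). */
    Versioned read(List<Replica> replicas, String key) {
        Versioned a = replicas.get(0).read(key);       // fast, in-memory
        Versioned b = replicas.get(1).read(key);       // fast, read-optimized
        if (a.timestamp() == b.timestamp()) return a;  // consistent -> done
        Versioned c = replicas.get(2).read(key);       // mismatch: include W
        Versioned latest = newest(a, newest(b, c));
        for (Replica r : replicas)                     // repair stale replicas
            CompletableFuture.runAsync(() -> r.write(key, latest));
        return latest;
    }

    private static Versioned newest(Versioned x, Versioned y) {
        return x.timestamp() >= y.timestamp() ? x : y;
    }
}
```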
23. Performance Evaluation
Goal: demonstrate that a heterogeneous cluster performs well with both read- and write-heavy workloads.
Targets:
- MyCassandra Cluster: 3 different nodes/server x 6 servers
- Cassandra: 1 node/server x 6 servers
Quorum parameters: N = 3, W = R = 2
Storage engines: Bigtable (W), MySQL/InnoDB (R), Redis (RW)
Yahoo! Cloud Serving Benchmark (YCSB) [SOCC '10]:
1. Load data (1 KB records: 10 x 100-byte columns) from a YCSB client
2. Warm up
3. Run the benchmark and measure response times at the client
24. YCSB workloads

Workload       Application example   Operation ratio         Record selection
Write-Only     Log                   Write 100% / Read 0%    Zipfian
Write-Heavy    Session store         Write 50% / Read 50%    Zipfian
Read-Heavy     Photo tagging         Read 95% / Write 5%     Zipfian
Read-Only      Cache                 Read 100% / Write 0%    Zipfian

Zipfian distribution: the access frequency of each datum is determined by its popularity, not by its freshness.
26. Throughput
[Figure: throughput (queries/sec) for 40 clients, Cassandra vs. MyCassandra Cluster, across workloads [write:read] = [100:0], [50:50], [5:95], [0:100]. MyCassandra Cluster relative to Cassandra: 0.87x (Write-Only), 2.16x (Write-Heavy), 4.07x (Read-Heavy), 11.00x (Read-Only).]
- 11.0 times the throughput of Cassandra in the Read-Only workload
- Write performance is comparable with Cassandra's.
27. Conclusion
- A cloud storage supporting both write-heavy and read-heavy workloads, built by combining nodes with different storage engines.
- MyCassandra Cluster achieved better throughput than the original Cassandra on read-heavy workloads.
- With a read-heavy workload:
  - Read latency: up to 90.4% lower
  - Throughput: up to 11.0 times higher
28. Related Work
Indexing algorithms whose goals include both write and read performance:
- FD-Tree: Tree Indexing on Flash Disks, VLDB '10
- bLSM: A General Purpose Log Structured Merge Tree, SIGMOD '12
- Fractal Tree: implemented in TokuDB (a MySQL storage engine)
Modular data stores:
- MySQL
- Anvil, SOSP '09
- Cloudy, VLDB '10
- Dynamo, SOSP '07
Fractured Mirrors replicates data in row- and column-oriented layouts; analogously, MyCassandra (SYSTOR '12) replicates across read- and write-optimized engines.
29. Discussion 1: the slightly higher write latency
The cause is load balancing:
- Cassandra: a write goes to any of the N nodes, so synchronous operations are distributed equally.
- MyCassandra Cluster: a write goes synchronously to the specific W and RW nodes, so the synchronous load is fixed on those nodes.
However, this cost is well worth it for the improved read performance.
[Figure: in Cassandra, the synchronously accessed replicas for reads and writes are equally distributed; in MyCassandra Cluster they are fixed (writes: W and RW; reads: R and RW)]
30. Discussion 2: in-memory nodes
Q. Memory overflow?
A. An in-memory node acts as an LRU-like cache; swapped-out data is recovered from the other, persistent nodes by read repair.
Q. Fault tolerance?
A. 1) Write to an alternative node; when the failed node recovers, the inconsistency is resolved using values from that node. 2) Asynchronous snapshots (a Redis feature).
Q. What if all nodes are in-memory?
A. In that case, the cluster's capacity is limited by the memory capacity.