Cloud Storage That Delivers Both Read Performance and Write Performance (SACSIS2011-A6-1)
Shun Nakamura
2. +
NoSQL, Key-Value Store (KVS), Document-Oriented DB, GraphDB
Examples: memcached, Google Bigtable, Amazon Dynamo, Amazon SimpleDB, Apache Cassandra, Voldemort, Ringo, Vpork, MongoDB, CouchDB, Tokyo Tyrant, Flare, ROMA, kumofs, Kai, Redis, LevelDB, Hadoop HBase, Hypertable, Yahoo! PNUTS, Scalaris, Dynomite, ThruDB, Neo4j, IBM ObjectGrid, Oracle Coherence, Velocity, … 100
: ↔
join, transaction
/
MyCassandra
3. +
Data model: key/value vs. multi-dimensional map vs. document vs. graph
vs.
vs. – fsync
vs. (snapshot)
vs.
Consistency: strong vs. weak
Data partition: row vs. column
Node organization: master/slave vs. decentralized
5. +
write-optimized vs. read-optimized
Systems: Bigtable, Cassandra, HBase vs. MySQL, Sherpa
Index: Log-Structured Merge Tree [P. O’Neil ’96] vs. B-Trees [R. Bayer ’70]
Write: append to disk (buffering) vs. random disk I/O
Read: n random disk I/Os + merge vs. 1 random disk I/O
Example: Bigtable vs. MySQL
6. +
[Figure: Write-Heavy workload, read-optimized vs. write-optimized stores, with a "Better" direction marker. Source: Yahoo! Cloud Serving Benchmark, SOCC ’10]
- mycassandra -
7. +
[Figure: Read-Heavy workload, write-optimized vs. read-optimized stores, with a "Better" direction marker. Source: Yahoo! Cloud Serving Benchmark, SOCC ’10]
8. +
1. MyCassandra: read-optimized or write-optimized
2. MyCassandra Cluster: read and write-optimized
9. +
Apache Cassandra
[Figure: nodes placed across data centers dc1, dc2, and dc3; labels: rack/dc, region]
10. +
Apache Cassandra
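Consistent Hashing
Node IDs (A~Z) are placed on a ring; number of replicas N := 3
• request proxy
• primary node
• secondary node
[Figure: ring with nodes A, F, N, Q, V, Z; hash(key) = Q makes Q the primary node for <key, values>, and secondary 1 and secondary 2 hold the other replicas]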
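The ring lookup on this slide can be sketched as follows. This is a minimal illustration, not Cassandra's partitioner: the `Ring` class, the use of MD5, and the rule "secondaries are the next nodes clockwise" are assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical helper: a sorted ring of node IDs with N replicas per key."""

    def __init__(self, node_ids, replicas=3):
        self.replicas = replicas
        # Sort nodes by ring position so lookups can use binary search.
        self._ring = sorted((ring_position(n), n) for n in node_ids)
        self._positions = [p for p, _ in self._ring]

    def replica_nodes(self, key):
        """Return [primary, secondary 1, secondary 2, ...] for a key."""
        # First node at or after the key's position; wraps around via modulo.
        start = bisect.bisect_left(self._positions, ring_position(key))
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = Ring(["A", "F", "N", "Q", "V", "Z"], replicas=3)
primary, *secondaries = ring.replica_nodes("some-key")
```

Any node can act as the request proxy here, because every node can compute the same replica list from the key alone.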
11. +
Google Bigtable
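Write path: O(1), sequential write I/O
Always writable: no write-lock
1. sync: sequential write of <k1, v1>, <k1, v2> to the Commit Log (disk)
2. The Memtable (memory) holds the merged object <k1, obj (v1+v2)>
3. async flush: the Memtable is written to disk as a new SSTable
LSM-Tree [P. O’Neil ’96]
[Figure: disk holds the Commit Log and SSTable 1 <k1,obj1>, SSTable 2 <k1,obj2>, SSTable 3 <k1,obj3>; memory holds the Memtable]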
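A toy model of this write path may help. It is a sketch under simplifying assumptions: the `LsmStore` class and its `memtable_limit` parameter are hypothetical, Python lists and dicts stand in for on-disk files, and the flush runs inline rather than asynchronously.

```python
class LsmStore:
    """Hypothetical in-memory stand-in for commit log + memtable + SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory buffer: key -> {column: value}
        self.sstables = []     # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, columns):
        # 1. Sequential append to the commit log (the synchronous step).
        self.commit_log.append((key, dict(columns)))
        # 2. Merge the new columns into the in-memory object for this key.
        self.memtable.setdefault(key, {}).update(columns)
        # 3. Flush once the memtable holds enough keys (async in a real store).
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a new key-sorted, immutable SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed entries no longer need replay


store = LsmStore(memtable_limit=2)
store.write("k1", {"c1": "v1"})
store.write("k1", {"c2": "v2"})   # same key: merged in the memtable
store.write("k2", {"c1": "v3"})   # second distinct key triggers a flush
```

No step reads existing data or seeks on disk, which is why the write cost stays constant.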
12. +
Google Bigtable
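Read path: for a Key, the value in the Memtable and the values in every SSTable are collected and merged
Each SSTable lookup costs disk I/O
[Figure: the client receives the merge <k1,obj+obj1~3> built from the Memtable <k1,obj> (memory) and SSTable 1 <k1,obj1>, SSTable 2 <k1,obj2>, SSTable 3 <k1,obj3> (disk); the Commit Log is on disk]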
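The merge on the read path can be sketched as below. The function name `read` and the column-dictionary layout are assumptions for illustration; real stores resolve columns by timestamp and skip tables with Bloom filters, which this sketch leaves out.

```python
def read(key, memtable, sstables):
    """Merge every fragment of `key`; `sstables` is ordered oldest first."""
    merged = {}
    found = False
    # Each SSTable lookup stands in for one disk read.
    for table in sstables:
        if key in table:
            merged.update(table[key])   # newer tables overwrite older columns
            found = True
    # The memtable holds the most recent writes, so it is applied last.
    if key in memtable:
        merged.update(memtable[key])
        found = True
    return merged if found else None


sstables = [
    {"k1": {"a": 1}},   # SSTable 1 (oldest)
    {"k1": {"b": 2}},   # SSTable 2
    {"k1": {"a": 3}},   # SSTable 3 (newest on disk)
]
memtable = {"k1": {"c": 4}}
result = read("k1", memtable, sstables)
```

The loop over `sstables` is the cost the slide points at: the more tables hold a fragment of the key, the more disk reads a single query needs.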
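13. + Cassandra: read and write latency (average / 99.9 percentile)
Write latency is about 1/9 of read latency
avg.: write 0.69 ms, read 6.16 ms
99.9 percentile: write 2.0 ms, read 86.9 ms
[Figure: histograms of Number of queries vs. Latency (ms) for read and write, with a "Better" direction marker]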
14. + 1. MyCassandra
read-optimized or write-optimized
11.4.14
15. + MyCassandra:
Cassandra's distribution layer is kept: Consistent Hashing, Gossip Protocol
Cassandra's storage engine (Bigtable) becomes selectable: Bigtable, MySQL, Redis, …
Compare MySQL's storage engines: InnoDB, MyISAM, Memory, …
18. : Cassandra
: . JDBC API / stored procedure
: key-value store
• ….
19. + 2. MyCassandra Cluster
read and write-optimized
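20.
• W: write-optimized node
• R: read-optimized node
• RW: read and write-optimized node
write query: sync to W, async to R
Quorum Protocol: (write quorum) + (read quorum) > (number of replicas)
write: W, RW / read: RW, R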
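The quorum inequality can be checked by brute force. This is a minimal sketch: the function and parameter names are mine, and `write_quorum` / `read_quorum` are replica counts, not the W and R node types used on these slides.

```python
from itertools import combinations


def quorums_always_overlap(n_replicas, write_quorum, read_quorum):
    """Check that every possible write set shares a node with every read set."""
    nodes = range(n_replicas)
    return all(
        set(w) & set(r)   # an empty intersection is falsy
        for w in combinations(nodes, write_quorum)
        for r in combinations(nodes, read_quorum)
    )


# With 3 replicas, quorums of 2 and 2 satisfy 2 + 2 > 3 and always overlap.
ok = quorums_always_overlap(3, 2, 2)
# Quorums of 1 and 2 give 1 + 2 = 3, so a read can miss the latest write.
not_ok = quorums_always_overlap(3, 1, 2)
```

The overlapping node is what lets a read observe the most recent acknowledged write.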
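21.
Node types: (W) / (R) / (RW)
Membership (join/dead) is exchanged by the gossip protocol
1. primary node: chosen from the key
2. secondary nodes: × N-1
[Figure: a Proxy and a ring of W, RW, and R nodes connected by gossip; with N=3, each key has one primary and two secondary nodes]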
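22.
Nodes: W (write-optimized), RW (read and write-optimized), R (read-optimized); W:RW:R = 1:1:1
Replicas = 3, write quorum = 2
1) Client → Proxy: write request
2) Proxy sends the write and waits for ACKs from W, RW
3a) Proxy replies to the Client
3b) R completes the write asynchronously and returns its ACK
Write latency: max (W, RW)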
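The latency rule for writes is small enough to state as code. This is a sketch only: the function name is hypothetical and the per-node latencies are made-up illustration values, not measurements from the evaluation.

```python
def write_latency(node_latencies_ms, sync_types=("W", "RW")):
    """Latency seen by the client: the slowest node the proxy waits for."""
    return max(node_latencies_ms[t] for t in sync_types)


# Hypothetical per-node write latencies in milliseconds (not measured values).
example = {"W": 0.4, "RW": 0.3, "R": 5.0}

latency = write_latency(example)                          # waits for W and RW
all_sync = write_latency(example, ("W", "RW", "R"))       # waits for every node
```

Leaving the read-optimized node out of the synchronous set keeps its slower write from reaching the client.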
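23.
Nodes: W (write-optimized), RW (read and write-optimized), R (read-optimized); W:RW:R = 1:1:1
Replicas = 3, read quorum = 2
1) Client → Proxy: read request
2) Proxy requests the data from R, RW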
3a)
3b) or
W RW R W
4)
Read latency: max (R, RW)
(Cassandra read repair)
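A sketch of the consistency check on the read path follows. How the proxy handles an inconsistent reply is not recoverable from this slide, so returning the value with the newest timestamp is an assumption borrowed from timestamp-based reconciliation in general; the function name and tuple layout are also hypothetical.

```python
def quorum_read(replies):
    """replies: {node_type: (timestamp, value, latency_ms)} for the nodes waited on."""
    # The client waits for the slowest of the synchronous replies.
    latency = max(lat for _, _, lat in replies.values())
    versions = {(ts, val) for ts, val, _ in replies.values()}
    consistent = len(versions) == 1
    # Assumption: on disagreement, the newest timestamp wins.
    _, newest_val = max(versions, key=lambda v: v[0])
    return newest_val, consistent, latency


# Hypothetical replies (timestamps and latencies are illustration values).
stale = {"R": (10, "old", 2.0), "RW": (12, "new", 0.5)}
fresh = {"R": (12, "new", 2.0), "RW": (12, "new", 0.5)}

stale_result = quorum_read(stale)
fresh_result = quorum_read(fresh)
```

When `consistent` is false, a real system would also schedule a repair of the stale replica.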
24. +
MyCassandra Cluster: 6×3 = 18 nodes / 6 hosts (W:R:RW = 6 : 6 : 6)
Cassandra: 6 nodes / 6 hosts
Replicas = 3, write quorum = read quorum = 2
Storage engines: Bigtable (W), MySQL / InnoDB (R), Redis (RW)
Benchmark: YCSB (Yahoo! Cloud Serving Benchmark) [SOCC ’10]
1. MyCassandra/Cassandra×6 YCSB Client×1
2. 1KB values(100[Bytes]×10[columns])+key 1,000
3.
4. YCSB
5. YCSB Stat
25. + YCSB: 4 workloads
Workload     Application Example   Operation Ratio
Write-Only   Log                   Read: 0%, Write: 100%
Write-Heavy  Session Store         Read: 50%, Write: 50%
Read-Heavy   Photo tagging         Read: 95%, Write: 5%
Read-Only    Cache                 Read: 100%, Write: 0%
Record Selection: Zipfian
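Zipfian record selection means a few records receive most of the requests. The sketch below illustrates the distribution only and is not YCSB's generator: the function names are hypothetical, and the exponent 0.99 is an assumed skew value.

```python
import random
from itertools import accumulate


def zipfian_weights(n_records, exponent=0.99):
    """Probability of picking each record by rank (index 0 is the most popular)."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_records + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_records(n_records, n_samples, seed=0):
    """Draw record indices with Zipfian skew, reproducibly for a given seed."""
    rng = random.Random(seed)
    cumulative = list(accumulate(zipfian_weights(n_records)))
    return rng.choices(range(n_records), cum_weights=cumulative, k=n_samples)


weights = zipfian_weights(1000)
picks = sample_records(1000, 10_000)
```

Because hot records dominate, caches and in-memory nodes absorb a large share of reads under this workload.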
26. Average write / read latency
[Figure, top: avg. write-latency (ms), Cassandra vs. MyCassandra Cluster, for write:100%, write:50%, write:5%, write:0%; labels: 0.36ms, 9.3%, 26.2%, 46.2%, MySQL + Redis]
[Figure, bottom: avg. read-latency (ms), Cassandra vs. MyCassandra Cluster, for read:0% (Write-Only), read:50% (Write-Heavy), read:95% (Read-Heavy), read:100% (Read-Only); labels: 8.59ms, 35.7%, 82.6%, 84.9%]
27. Throughput
[Figure: max. qps for 40 clients (query/sec), Cassandra vs. MyCassandra Cluster, for [write:read] = [100:0] Write-Only, [50:50] Write-Heavy, [5:95] Read-Heavy, [0:100] Read-Only; ratio labels: 0.90, 0.93, 1.54, 6.49]
• 6.49
28. +
1:
Cassandra
N
MyCassandra Cluster
:
:
MyCassandra
Cassandra Cluster
write read write read
N R,W
W RW R
29. +
2:
Q.
A. LRU like cache
Swap read repair
Q.
A. 1)
2) Redis fsync
( )
30. +
Read-Heavy workload: average read latency reduced by up to 84.9%, throughput up to 6.49 times that of Cassandra
31. 31
index algorithm
FD-Tree: Tree Indexing on Flash Disks, VLDB ’10
B+tree + LSM-tree
SSD
Fractal-Tree / TokuDB (MySQL )
MySQL: RDBMS
Anvil, SOSP ’09: 1
Cloudy, VLDB ’10:
Dynamo, SOSP ’07: vs.
MyCassandra ( ): vs. +
32. +
:
1.
2. (MySQL + memcached)
: MyCassandra Cluster
Web Table
movie-id name thumb-name tag count
704122313 movieA EY37lHk5bgU sport, soccer, FIFA, 169,374
704122314 movieB Zk3BSYMWjzQ music, jazz, … 472,803
34. +
: ( )
5 6
twitter: @MyCassandraJP
35. MyCassandra / MyCassandra Cluster
                   Cassandra   1. MyCassandra       2. MyCassandra Cluster
data model         multi-dimensional map (Column Family)
throughput         write       write or read        write and read
latency            low         lower in some cases  lower
persistence        yes         yes or no            yes
consistency        weak (eventual, quorum)
replication        sync / async
data partition     row
node organization  decentralized
Rows that differ: throughput, latency
36. host
(1) 1 /1
node
☓
☓ storage
(2) 1 /k
ID [Amazon Dynamo, SOSP ’07]
☓
(3) 1
Fault Tolerance (FT) space
FT space FT space
1storage / 1node / 1 host
(2) (3)
(1)
virtual node
1 node / host
k storages / node
k nodes / host
1 storage / node
37. : HDD vs. SSD
[Figure: throughput (qps) of Cassandra and of MyCassandra Cluster, each on HDD and SSD, with a "Better" direction marker]
IOZone benchmark   HDD: Western Digital   SSD: Crucial
seq. write         86,277 qps             96,401 qps
seq. read          108,914 qps            216,099 qps
random write       2,485 qps              29,045 qps
random read        926 qps                21,751 qps