This document provides an overview of distributed databases and the Yahoo! Cloud Serving Benchmark (YCSB). It discusses NoSQL databases Cassandra and HBase and how YCSB can be used to benchmark their performance. Experiments were conducted on Amazon EC2 using YCSB to load data and run workloads on Cassandra and HBase clusters. The results showed Cassandra had lower latency and higher throughput than HBase. YCSB provides a way to compare the performance of different databases.
3. Distributed Databases
Traditional RDBMS
• ACID transactions
• Query language (SQL)
• Data tightly coupled to the schema (hard to analyze)
• Scalable to a limit
Distributed Databases
• Not ACID
• Not Relational
• Column oriented (key-value)
• CAP (Consistency, Availability, Partition tolerance)
• Big Data (Massively scalable)
5. Distributed Databases
• NoSQL databases have different designs and architectures
• Cassandra: Thrift, Gossip, Token ring, …
• HBase: HDFS, ZooKeeper, Hadoop (MapReduce)
• BigTable: GFS, Chubby (lock service), MapReduce
6. Cassandra
• Highlights
• High availability
• Incremental scalability
• Eventually consistent
• Tradeoffs between consistency and latency
• Minimal administration
• No SPOF (Single Point of Failure)
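The tradeoff between consistency and latency can be illustrated with the replica-overlap rule common to Dynamo-style stores: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The sketch below is a minimal illustration under that assumption; the function names are hypothetical and not part of any Cassandra API.

```python
def quorum(n_replicas: int) -> int:
    """Smallest majority of n_replicas."""
    return n_replicas // 2 + 1


def reads_see_latest_write(n_replicas: int, read_replicas: int, write_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    for count in (read_replicas, write_replicas):
        if not 1 <= count <= n_replicas:
            raise ValueError("replica counts must be between 1 and n_replicas")
    return read_replicas + write_replicas > n_replicas


# Illustrative settings for a replication factor of 3 (hypothetical cluster).
N = 3
for label, r, w in [
    ("read ONE / write ONE", 1, 1),                       # lowest latency, may read stale data
    ("read QUORUM / write QUORUM", quorum(N), quorum(N)),  # overlap guaranteed
    ("read ONE / write ALL", 1, N),                       # fast reads, slow writes
]:
    print(label, "->", reads_see_latest_write(N, r, w))
```

Contacting fewer replicas lowers latency but gives up the overlap guarantee, which is the tradeoff the slide refers to.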
8. Cassandra
• Data Model
• Cluster:
• Machines (nodes) in a logical Cassandra instance
• Can contain multiple keyspaces
• Keyspace:
• Namespace for ColumnFamilies
• ColumnFamilies:
• Contain multiple columns, each with a name, value and timestamp, referenced by row keys
• Analogous to a table in an RDBMS
• SuperColumns:
• Columns with subcolumns
• Rows and columns (example):
keyA: Column1, Column2, Column3
keyB: Column5, Column6, Column10
• Column structure: Byte[] Name, Byte[] Value, I64 Timestamp
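The data model above can be pictured as nested maps: a keyspace maps ColumnFamily names to ColumnFamilies, and a ColumnFamily maps row keys to named columns. The following is a minimal in-memory sketch, not Cassandra's implementation; the `insert` helper and the `"users"` ColumnFamily name are assumptions made for illustration, and the newest-timestamp-wins rule is the reason each column carries a timestamp.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Column:
    name: bytes
    value: bytes
    timestamp: int  # the slide's I64 timestamp


# ColumnFamily: row key -> {column name -> Column}
ColumnFamily = Dict[bytes, Dict[bytes, Column]]
# Keyspace: ColumnFamily name -> ColumnFamily
Keyspace = Dict[str, ColumnFamily]


def insert(cf: ColumnFamily, row_key: bytes, name: bytes, value: bytes,
           timestamp: Optional[int] = None) -> None:
    """Write a column; an existing column is replaced only by a newer-or-equal timestamp."""
    ts = timestamp if timestamp is not None else time.time_ns() // 1000
    row = cf.setdefault(row_key, {})
    current = row.get(name)
    if current is None or ts >= current.timestamp:
        row[name] = Column(name, value, ts)


keyspace: Keyspace = {"users": {}}  # hypothetical ColumnFamily name
users = keyspace["users"]
insert(users, b"keyA", b"Column1", b"v1", timestamp=1)
insert(users, b"keyA", b"Column1", b"v2", timestamp=2)
insert(users, b"keyA", b"Column1", b"stale", timestamp=0)  # ignored: older timestamp
insert(users, b"keyB", b"Column5", b"x", timestamp=1)      # rows need not share columns
print(users[b"keyA"][b"Column1"].value)
```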
10. HBase
“HBase is more a datastore than a database”
• It lacks many of the features of RDBMS
• Distributed and scalable big data store.
• Regions model
• Strong consistency
12. HBase
• The NameNode is responsible for maintaining the filesystem metadata.
• The DataNodes are responsible for storing HDFS blocks.
13. HBase
Note: In our case study, we were only interested in the HDFS layer.
16. HBase
• Data is stored into HBase tables.
• Tables are made of rows and columns.
• All columns belong to a particular column family.
Important note: All column family members are stored together.
• Queries on a single column family perform better, since its members are stored together
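One way to picture why column families matter for performance is to keep a separate store per family, so that reading one family never touches the data of another. The sketch below is a toy in-memory model under that assumption, not HBase's storage engine; the `Table` class, its methods, and the family names are hypothetical.

```python
from collections import defaultdict


class Table:
    """Toy table that keeps one store per column family."""

    def __init__(self, families):
        # family -> row key -> qualifier -> value
        self.stores = {family: defaultdict(dict) for family in families}

    def put(self, row, family, qualifier, value):
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        self.stores[family][row][qualifier] = value

    def get_family(self, row, family):
        """Read every column of one family for a row, touching only that family's store."""
        return dict(self.stores[family].get(row, {}))


table = Table(families=["info", "stats"])  # hypothetical family names
table.put("row1", "info", "name", "alice")
table.put("row1", "info", "email", "alice@example.com")
table.put("row1", "stats", "visits", 42)
print(table.get_family("row1", "info"))
```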
17. YCSB General View
• Which is the best NoSQL DB?
• How to compare?
• Yahoo! Cloud Serving Benchmark (YCSB)
• Benchmarking tool
• Evaluates the performance of key-value and cloud DBs on a common set of workloads
• Client – an extensible workload generator
• Yahoo! Research
• Brian F. Cooper - cooperb@yahoo-inc.com
• Joint work with Adam Silberstein, Erwin Tam, Raghu Ramakrishnan and Russell Sears
18. YCSB Details
• How does it work?
• YCSB Client: Workload Executor, Client Threads, Statistics, DB Interface Layer
• The client issues operations to the Cloud Serving Store through the DB Interface Layer
• Workload file: read/write mix, record size, popularity distribution, …
• Command line: DB to use, workload to use, target throughput, number of threads, …
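The client architecture described above can be sketched in a few dozen lines. YCSB itself is written in Java; the Python below is only an illustration of how the pieces relate, and every class and function name in it is hypothetical. An in-memory dictionary stands in for the cloud serving store, and keys are chosen uniformly rather than with a configurable popularity distribution.

```python
import random
import threading
import time
from abc import ABC, abstractmethod


class DB(ABC):
    """Stand-in for the DB interface layer: one subclass per store."""

    @abstractmethod
    def read(self, key): ...

    @abstractmethod
    def update(self, key, value): ...


class InMemoryDB(DB):
    """Placeholder for the cloud serving store."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def read(self, key):
        with self._lock:
            return self._data.get(key)

    def update(self, key, value):
        with self._lock:
            self._data[key] = value


class Statistics:
    """Collects per-operation latencies from all client threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.latencies = {"read": [], "update": []}

    def record(self, op, seconds):
        with self._lock:
            self.latencies[op].append(seconds)


def client_thread(db, stats, ops, read_proportion, record_count, seed):
    rng = random.Random(seed)
    for _ in range(ops):
        key = f"user{rng.randrange(record_count)}"  # uniform key choice
        op = "read" if rng.random() < read_proportion else "update"
        start = time.perf_counter()
        if op == "read":
            db.read(key)
        else:
            db.update(key, "x" * 100)  # fixed record size of 100 characters
        stats.record(op, time.perf_counter() - start)


def run_workload(db, threads=4, ops_per_thread=1000, read_proportion=0.5, record_count=1000):
    """Workload executor: start the client threads and report overall throughput."""
    stats = Statistics()
    workers = [
        threading.Thread(target=client_thread,
                         args=(db, stats, ops_per_thread, read_proportion, record_count, i))
        for i in range(threads)
    ]
    start = time.perf_counter()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    elapsed = time.perf_counter() - start
    return stats, (threads * ops_per_thread) / elapsed


stats, throughput = run_workload(InMemoryDB())
print(f"{throughput:.0f} ops/sec")
```

Benchmarking a different store in this sketch means writing another `DB` subclass, which mirrors the extensibility the slides attribute to YCSB.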
19. YCSB Details
Benchmark Tiers
• Performance
• Measure latency/throughput curve
• Increase throughput until saturation
• Scalability
• Scale up: increase hardware, data size and throughput proportionally
• Elastic speedup: add servers while running a workload
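For the performance tier, "increase throughput until saturation" can be read as: raise the target throughput step by step and stop where the store no longer keeps up. The helper below is one possible way to locate that point from a series of runs; the function name, the 5% tolerance, and the sample numbers are all assumptions for illustration, not measurements from this study.

```python
def saturation_point(measurements, tolerance=0.05):
    """Return the lowest target throughput that the store failed to sustain.

    measurements: iterable of (target_ops_per_sec, achieved_ops_per_sec, avg_latency_ms).
    A target counts as missed when achieved throughput falls short of it by more
    than `tolerance`. Returns None if every target was sustained.
    """
    for target, achieved, _latency_ms in sorted(measurements):
        if achieved < target * (1 - tolerance):
            return target
    return None


# Placeholder numbers for illustration only; NOT results from the experiments.
example_runs = [
    (1000, 998.0, 2.1),
    (2000, 1990.0, 2.4),
    (4000, 3100.0, 9.8),
]
print(saturation_point(example_runs))
```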
20. YCSB Details
Load phase
- Load the database
$ ycsb load cassandra-10 -p hosts=127.0.0.1 -P workloadX
Transactions phase
- Execute the workload
$ ycsb run cassandra-10 -p hosts=127.0.0.1 -P workloadX
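When many runs are needed, the two command lines above can be generated rather than typed. The helper below is a hypothetical convenience wrapper that only builds the argument list; it does not invoke YCSB. `workloadX` is the placeholder used on the slide, and the `-threads` and `-target` options are, to our knowledge, accepted by the YCSB launcher but should be checked against the installed version.

```python
import shlex


def ycsb_command(phase, binding, workload_file, properties=None, threads=None, target=None):
    """Build the argument list for a `ycsb load` or `ycsb run` invocation."""
    if phase not in ("load", "run"):
        raise ValueError("phase must be 'load' or 'run'")
    argv = ["ycsb", phase, binding, "-P", workload_file]
    for key, value in (properties or {}).items():
        argv += ["-p", f"{key}={value}"]  # lowercase -p sets a single property
    if threads is not None:
        argv += ["-threads", str(threads)]
    if target is not None:
        argv += ["-target", str(target)]  # target throughput in ops/sec
    return argv


load_cmd = ycsb_command("load", "cassandra-10", "workloadX", {"hosts": "127.0.0.1"})
run_cmd = ycsb_command("run", "cassandra-10", "workloadX", {"hosts": "127.0.0.1"},
                       threads=8, target=1000)
print(shlex.join(load_cmd))
print(shlex.join(run_cmd))
```

The resulting list can be passed to `subprocess.run` once the path to the `ycsb` launcher is known.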
Random Load Distribution
36. Conclusions
• YCSB provides a common ground for benchmarking cloud DB services
• Good for learning and experimenting with different distributed databases
• Open source, extensible for new databases
• Lab work on Amazon EC2 provided good insight into setting up cloud services
• Challenges
• Installation problems
• Hard-to-follow documentation
• Working in a distributed environment requires a lot of configuration