This document provides a summary of a presentation that benchmarked the performance of three popular NoSQL databases: Apache Cassandra, Apache HBase, and MongoDB. It describes the architectures and data models of each database. Benchmark tests were run using the Yahoo Cloud Serving Benchmark and found that Apache Cassandra consistently outperformed the other databases across different workloads in terms of load time, read and write performance, and latency. The presentation emphasizes the importance of benchmarks for evaluating NoSQL database performance and choosing the right database based on application requirements.
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
1. Speaker
Institution, title of the talk 1
WP-Benchmarking Top NoSQL
Databases
Apache Cassandra, Apache HBase and MongoDB
Presented By
Athiq Ahamed
Supriya
Introduction
Enormous amounts of data (Big Data)
Scalability issues in RDBMS
Rise of NoSQL databases
Amazon Dynamo
Google Bigtable
CAP Theorem
BASE system
CAP Theorem
Consistency
Availability
Partition tolerance
The CAP theorem states that a distributed system can guarantee
at most two of these three properties at the same time.
RDBMS vs NoSQL
RDBMS | NoSQL
Supports a powerful query language | Supports a very simple query language
Has a fixed schema | No fixed schema
Follows ACID (Atomicity, Consistency, Isolation and Durability) | Is only eventually consistent
Supports transactions | Does not support transactions
Content: tutorialspoint.com
Basically available: System guarantees availability, in
terms of the CAP theorem
Soft state: State of the system may change over time,
because of eventual consistency model
Eventual consistency: System will become consistent over
time
BASE
Content:www.edureka.in
Fast performance is key.
Proof-of-concept (POC) processes should include the right benchmarks:
Configurations
Parameters
Workloads
Making the right choice!
Selection of NoSQL
Yahoo Cloud Serving Benchmark (YCSB)
Top 3 NoSQL databases: Apache Cassandra, Apache
HBase and MongoDB.
Amazon Web Services EC2 instances hosted the tests
Tests were performed 3 times on 3 different days
Benchmark configuration
The tests ran on large instances (15 GB RAM and 4
CPU cores)
Instances used a customized Ubuntu image with Oracle Java 1.6
installed as a base.
A custom script was written to drive the benchmark
processes
Benchmark configuration
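To make the benchmark procedure concrete, here is a minimal sketch of a YCSB-style run: a load phase followed by a timed read/update mix, repeated three times. It is not YCSB itself (which is a Java tool); the function run_workload, the plain dict standing in for the database, and the parameter names record_count, operation_count and read_proportion are assumptions chosen for illustration.

```python
import random
import time


def run_workload(store, record_count, operation_count, read_proportion, seed=0):
    """Load `record_count` records, then time a mixed read/update phase."""
    rng = random.Random(seed)

    # Load phase: insert every record once and time the whole phase.
    load_start = time.perf_counter()
    for i in range(record_count):
        store[f"user{i}"] = {"field0": rng.random()}
    load_seconds = time.perf_counter() - load_start

    # Transaction phase: pick a random key, then either read or update it.
    latencies = []
    reads = 0
    run_start = time.perf_counter()
    for _ in range(operation_count):
        key = f"user{rng.randrange(record_count)}"
        op_start = time.perf_counter()
        if rng.random() < read_proportion:
            _ = store[key]
            reads += 1
        else:
            store[key] = {"field0": rng.random()}
        latencies.append(time.perf_counter() - op_start)
    run_seconds = time.perf_counter() - run_start

    return {
        "load_seconds": load_seconds,
        "throughput_ops_per_sec": (
            operation_count / run_seconds if run_seconds > 0 else float("inf")
        ),
        "avg_latency_seconds": sum(latencies) / len(latencies),
        "reads": reads,
        "updates": operation_count - reads,
    }


# Repeat the run three times, mirroring the repetition described in the slides.
results = [run_workload({}, 1000, 5000, read_proportion=0.5, seed=s) for s in range(3)]
```

Swapping the dict for a real database client would turn the same loop into a rough comparison of load time, throughput and latency, the metrics the presentation reports.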
Each NoSQL system performs differently.
Components and internal workings.
Apache Cassandra: Columnar database model
Apache HBase: Columnar database model
MongoDB: Document storage database model
Understanding NoSQL Databases
Apache Cassandra
Cassandra is scalable, fault-tolerant, and tunably consistent. All
nodes are equal.
Its distribution design is based on Amazon’s Dynamo and
its data model on Google’s Bigtable.
Key components: Node, Cluster, Commit log, Mem-table,
SSTable and Bloom filter
Content:http://www.tutorialspoint.com/cassandra/cassandra_architecture.htm
Ring structure, peer to peer architecture
All nodes are equal
This improves general database availability
Scaling up and scaling down is easier
Cassandra is a key-value, column-oriented database
Apache Cassandra
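The ring structure can be sketched as a small consistent-hashing example. This is an illustration under stated assumptions, not Cassandra's implementation: the Ring class and token_for helper are hypothetical, MD5 is used because the speaker notes mention an MD5 hash, and node tokens are derived by hashing node names, whereas a real cluster assigns them through its partitioner and configuration.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Hash a key to an integer token with MD5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, node_names):
        # Each node owns one token here; every node plays the same role.
        self._tokens = sorted((token_for(name), name) for name in node_names)

    def owner(self, row_key: str) -> str:
        """Return the first node whose token is >= the key's token, wrapping around."""
        idx = bisect.bisect_left(self._tokens, (token_for(row_key), ""))
        return self._tokens[idx % len(self._tokens)][1]

    def replicas(self, row_key: str, replication_factor: int):
        """Walk clockwise from the owner to collect distinct replica nodes."""
        start = bisect.bisect_left(self._tokens, (token_for(row_key), ""))
        count = min(replication_factor, len(self._tokens))
        return [self._tokens[(start + i) % len(self._tokens)][1] for i in range(count)]
```

Adding a node to such a ring only moves the keys that fall just before the new node's token, which is one reason scaling up and down is comparatively easy.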
Apache Cassandra
Content: http://demoiselle.sourceforge.net/component/demoiselle-cassandra/1.0.0/images/datamodel1.png
Cassandra has an internal keyspace called system, which stores
metadata about the cluster.
Metadata:
The node's token
The cluster name
Keyspace and schema definitions (dynamic loading)
Whether or not the node is bootstrapped
Apache Cassandra
Content:https://www.edureka.co/blog/category/apache-cassandra/
Commit log: Crash recovery mechanism. Every write
operation is written to the commit log
Mem-table: A memory-resident data structure.
SSTable: A disk file to which the data is flushed from
the mem-table
Apache Cassandra
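The interplay of commit log, mem-table and SSTable can be sketched as follows. This is a toy model, assuming in-memory lists in place of disk files; the class TinyStore, its memtable_limit parameter and its method names are hypothetical and not part of Cassandra.

```python
class TinyStore:
    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of writes not yet flushed
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable sorted runs (lists stand in for disk files)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write for crash recovery
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when the mem-table is full

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None

    def recover(self):
        """Rebuild the mem-table from the commit log after a simulated crash."""
        self.memtable = dict(self.commit_log)
```

Losing the mem-table (for example by setting store.memtable = {}) and then calling recover() restores the unflushed writes, which is the role the slide assigns to the commit log.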
Bloom filters are used as a performance booster
Bloom filters are fast, probabilistic structures for testing whether an
element is a member of a set.
Bloom filters serve as a special kind of cache: quick
lookups/searches, as they reside in memory
Apache Cassandra
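A minimal Bloom filter sketch shows why the lookup is cheap: membership is answered from a small in-memory bit array, with possible false positives but no false negatives. The class name, the sizes and the use of SHA-256 are illustrative assumptions, not Cassandra's actual parameters.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive several bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))
```

A store could consult might_contain before opening an SSTable and skip the disk read entirely whenever the answer is False.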
Gossip protocol: Communication between nodes, coordination
and failure detection
Anti-entropy protocol: Replica synchronization mechanism ensuring
that data on different nodes is up to date (Merkle trees)
Snitches determine host proximity
Apache Cassandra
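The Merkle-tree comparison behind anti-entropy can be sketched like this: each replica hashes its rows per range, builds a root hash, and only ranges whose hashes differ need repair. The helper names, the bucketing of keys into num_ranges ranges and the use of SHA-256 are assumptions for illustration; a real implementation walks down the tree instead of comparing every leaf.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hashes(rows: dict, num_ranges: int):
    """Group rows into ranges by key hash and hash each range into one leaf."""
    buckets = [[] for _ in range(num_ranges)]
    for key in sorted(rows):
        idx = int.from_bytes(_h(key.encode("utf-8")), "big") % num_ranges
        buckets[idx].append(f"{key}={rows[key]}")
    return [_h("|".join(bucket).encode("utf-8")) for bucket in buckets]


def merkle_root(leaves):
    """Hash pairs of nodes level by level until a single root remains."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def ranges_to_repair(rows_a: dict, rows_b: dict, num_ranges: int = 8):
    """Return the indexes of ranges on which two replicas disagree."""
    leaves_a = leaf_hashes(rows_a, num_ranges)
    leaves_b = leaf_hashes(rows_b, num_ranges)
    if merkle_root(leaves_a) == merkle_root(leaves_b):
        return []  # equal roots: replicas are in sync, nothing to transfer
    return [i for i, (x, y) in enumerate(zip(leaves_a, leaves_b)) if x != y]
```

When the roots match, a single hash comparison is enough to confirm that two replicas agree, so data is only streamed for the ranges that actually differ.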
Sparse, distributed, consistent, multidimensional sorted
map.
HBase is a key/value store
Consists of row key, column family, columns and timestamp.
Apache HBase
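The "sparse, sorted, multidimensional map" idea can be sketched as nested dictionaries keyed by row key, column (family and qualifier) and timestamp. The TinyTable class and its methods are hypothetical and only illustrate the data model; they are not the HBase client API.

```python
class TinyTable:
    def __init__(self, column_families):
        self.column_families = set(column_families)  # fixed when the table is created
        self.cells = {}  # row key -> {(family, qualifier) -> {timestamp -> value}}

    def put(self, row, family, qualifier, timestamp, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.cells.setdefault(row, {}).setdefault((family, qualifier), {})
        versions[timestamp] = value

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, or None if the cell is absent."""
        versions = self.cells.get(row, {}).get((family, qualifier))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for row in sorted(self.cells):
            if start_row <= row < stop_row:
                yield row
```

Rows store only the cells that were actually written (sparse), and scans return row keys in order (sorted), which is what makes contiguous row ranges a natural unit of distribution.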
Apache HBase
Content:http://zhangjunhd.github.io/assets/2013-02-25-apache-hbase/rowkey-
Region: Contiguous rows form a region
Region server (RS): Serves one or more regions.
Master server: Daemon responsible for managing the HBase
cluster
HDFS: Distributed, open source file system that stores
HBase's data
ZooKeeper: Distributed, open source coordination service
for the master and region servers.
Apache HBase Components
Content: https://www.mapr.com/blog/in-depth-look-hbase-architecture
The client obtains the region server (RS) that hosts the meta table from ZooKeeper
The client asks the meta table which RS holds the corresponding row key
The client receives the row from the respective region server
The client caches this information along with the location of
the meta table server.
First Read/Write to HBase
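The lookup-then-cache sequence above can be sketched with a few plain objects. Cluster and Client are hypothetical stand-ins for ZooKeeper, the meta table and the HBase client; the server names and the assumption that the first region starts at the empty string are illustrative only.

```python
import bisect


class Cluster:
    """Hypothetical stand-in for ZooKeeper plus the meta table."""

    def __init__(self, region_starts, servers):
        # region_starts[i] is the first row key of region i, served by servers[i].
        # The first start key is assumed to be "" so every row key has a region.
        self.zookeeper = {"meta_server": "rs-meta"}
        self.region_starts = region_starts
        self.servers = servers
        self.meta_lookups = 0

    def meta_lookup(self, row_key):
        self.meta_lookups += 1
        idx = bisect.bisect_right(self.region_starts, row_key) - 1
        start = self.region_starts[idx]
        has_next = idx + 1 < len(self.region_starts)
        end = self.region_starts[idx + 1] if has_next else None
        return start, end, self.servers[idx]


class Client:
    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_server = None  # cached location of the meta table server
        self.region_cache = []   # cached (start, end, server) entries

    def locate(self, row_key):
        # Serve from the cache when a known region covers the row key.
        for start, end, server in self.region_cache:
            if start <= row_key and (end is None or row_key < end):
                return server
        if self.meta_server is None:
            self.meta_server = self.cluster.zookeeper["meta_server"]
        region = self.cluster.meta_lookup(row_key)
        self.region_cache.append(region)
        return region[2]
```

Only the first access to a region pays for the ZooKeeper and meta table round trips; later reads and writes in the same region go straight to the cached region server.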
WAL: The Write Ahead Log is a file on the distributed file
system. It stores new data that has not yet been persisted
Block Cache: The read cache. It stores frequently
read data in memory
MemStore: The write cache. It stores new data that has not
been written to disk yet.
HFiles store the rows as sorted key-values on disk
HBase RS Components
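The role of the read cache can be sketched with a small least-recently-used (LRU) cache placed in front of slower storage. The BlockCache class below is a hypothetical illustration, assuming LRU eviction and a load_from_disk callback that stands in for reading an HFile; it is not HBase code.

```python
from collections import OrderedDict


class BlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def get(self, key, load_from_disk):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = load_from_disk(key)      # slow path: read from the file
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value
```

Frequently read rows stay in memory and are served as hits, while rarely read rows fall out of the cache and are fetched from disk again when needed.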
The client writes the data to the WAL file stored on disk
The WAL is used to recover not-yet-persisted data if a
server crashes.
Once data is written to the WAL, it is placed in the MemStore
HBase Write steps (1)
All writes and reads go to/from the primary node.
HDFS replicates WAL and HFile blocks. Replication
happens automatically.
When data is written to HDFS, one copy is written locally,
then it is replicated to a secondary node and later to
a tertiary node.
HDFS Write steps (2)
Cassandra use case: availability and partition-tolerance
requirements.
Consistency is tunable and can be raised through the consistency-level setting
HBase use case: consistency and scalability. However, with a
small number of nodes/threads, availability is also high
Cassandra and Hbase
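Tunable consistency comes down to simple arithmetic on replica counts: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor. The sketch below encodes that rule; the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of 3, quorum reads and quorum writes (2 + 2 > 3) overlap on at least one replica, whereas reading and writing a single replica (1 + 1) favours availability and latency over consistency.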
Document-oriented database
High performance and automatic scaling
High consistency and partition tolerance
Replication and failover for high availability
Low latency
Flexible indexing
MongoDB
A document is the basic unit of data in MongoDB (comparable to a row)
A collection is similar to a table
A single instance can host multiple independent databases
Every document has a special key, “_id”
Powerful JavaScript shell for administration
The config database contains the cluster metadata
MongoDB Concepts
A mongos instance receives queries from applications
It uses metadata from the config server to locate the data
mongos directs write operations to a particular shard
mongos uses the cluster metadata from the config
database
Read/Write MongoDB
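The routing role described above can be sketched as a lookup of the shard key against range metadata. ConfigMetadata and Router are hypothetical stand-ins for the config database and a mongos process, and the user_id shard key, range bounds and shard names are assumptions for illustration, not MongoDB APIs.

```python
import bisect


class ConfigMetadata:
    """Hypothetical stand-in for the chunk ranges kept in the config database."""

    def __init__(self, lower_bounds, shard_names):
        # Chunk i covers [lower_bounds[i], lower_bounds[i + 1]); the last is open-ended.
        self.lower_bounds = lower_bounds
        self.shard_names = shard_names

    def shard_for(self, shard_key_value):
        idx = bisect.bisect_right(self.lower_bounds, shard_key_value) - 1
        if idx < 0:
            raise ValueError("shard key value is below the first chunk bound")
        return self.shard_names[idx]


class Router:
    """Hypothetical mongos-like router: it owns no data and only forwards operations."""

    def __init__(self, config):
        self.config = config
        self.shards = {name: {} for name in config.shard_names}

    def insert(self, document, shard_key="user_id"):
        target = self.config.shard_for(document[shard_key])
        self.shards[target][document["_id"]] = document
        return target

    def find_by_shard_key(self, value, shard_key="user_id"):
        target = self.config.shard_for(value)  # only one shard is consulted
        return [d for d in self.shards[target].values() if d[shard_key] == value]
```

Because every operation first resolves its target through the metadata, the router adds a lookup step that a peer-to-peer design such as Cassandra's ring does not need.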
Scalability
Availability
Partition tolerance
Consistency
Most important: performance
Yahoo Cloud Serving Benchmark (YCSB)
Recap Importance of Benchmark and Factors
Identify data model for the application
Corresponding data sets have to be known
Whether the application requires replication
Identify the performance requirements
Prototype the application
Test the performance of the prototype
Discussion
Conclusion
NoSQL has replaced traditional relational databases in many use cases
Performance is the key feature
Importance of benchmarks
Performance of the top three NoSQL databases was tested
Cassandra outperformed the other NoSQL databases tested
Decide based on application
Decide based on application
Managing the start up
Configuration and Termination of EC2 instances
Running the test on clients
Apache Cassandra: Columnar database model (combination of Amazon Dynamo and Google Bigtable)
Apache HBase: Columnar database model (Bigtable-inspired system on Hadoop)
In Cassandra, rows are split across nodes by row key (the primary key is hashed with MD5); each column family holds column names with a value and a timestamp. In HBase, data is split column-wise; each range of rows has a row key, and cells are addressed by column family, column qualifier and timestamp. HBase uses ordered distribution, not hash distribution. Frequently accessed columns are grouped together under a common family.
The system keyspace stores metadata for the local node. It cannot be modified or edited by users. The node's token is decided by the partitioner.
Memory reads are faster than disk reads. So when we look at the test results, Cassandra outperforms the others, and Bloom filters could be one of the reasons because of their fast in-memory lookups.
Cassandra nodes exchange Merkle trees with their neighbours. A Merkle tree is a hash tree representing the data in a column family. The trees are compared and, if there is any difference, a repair is launched for the ranges that do not agree. Read repair happens internally in the background. A snitch routes the client to the nearest node (there is no separate config database as in MongoDB, or ZooKeeper as in HBase, which may take additional time to respond). The snitch provides host proximity.