Performance Comparison of Hbase and Cassandra databases with YCSB

Data Storage and Management
on
Performance Comparison of Hbase and Cassandra databases
with YCSB
Yash Balaji Iyengar
x18124739
MSc Data Analytics – 2018/9
Submitted to: Dr. Muhammad Iqbal

National College of Ireland
Project Submission Sheet – 2017/2018
School of Computing
Student Name: Yash Balaji Iyengar
Student ID: x18124739
Programme: MSc Data Analytics
Year: 2018/9
Module: Data Storage and Management
Lecturer: Dr. Muhammad Iqbal
Submission Due Date: 22nd April 2019
Project Title: Performance Comparison of Hbase and Cassandra databases
with YCSB
I hereby certify that the information contained in this (my submission) is information
pertaining to my own individual work that I conducted for this project. All information
other than my own contribution is fully and appropriately referenced and listed in the
relevant bibliography section. I assert that I have not referred to any work(s) other than
those listed. I also include my TurnItIn report with this submission.
ALL materials used must be referenced in the bibliography section. Students are
encouraged to use the Harvard Referencing Standard supplied by the Library. To use
other author’s written or electronic work is an act of plagiarism and may result in disci-
plinary action. Students may be required to undergo a viva (oral examination) if there
is suspicion about the validity of their submitted work.
Signature:
Date: September 13, 2019

Performance Comparison of Hbase and Cassandra
databases with YCSB
Yash Balaji Iyengar
1234567
22nd April, 2019
Abstract
In recent times, easy and widespread access to the internet has allowed many social
media, mobile application and e-commerce businesses to emerge and prevail. This has
led to the generation and availability of large amounts of data, described by the term
Big Data, which in turn has led to the development of NoSQL databases alongside SQL
databases. Today there are hundreds of NoSQL database technologies on the market,
which makes it difficult to compare them and choose the one best suited to a given
business need. In this study two databases, HBase and Cassandra, are analysed and
compared. From a basic architectural perspective Cassandra has no master, whereas
HBase is master based. The performance comparison is carried out using the Yahoo!
Cloud Serving Benchmark (YCSB). Load and run tests are executed on both HBase and
Cassandra for Workload A, Workload B and Workload C at operation counts of 100,000,
250,000 and 500,000. Studying the results of these tests gives a better understanding
of which database technology performs better.
1 Introduction
Over time, a great deal of data has been generated in various forms such as music,
movies and social media content. To store and retrieve this data, companies invested
in different database technologies. Relational Database Management Systems were
used in the early internet age, but as the era progressed relational databases began to
fall short, because the query time required to pull large amounts of data is very high.
Horizontal scalability is also difficult with relational databases, which increases
management costs (Tang & Fan 2017). To counter these issues NoSQL databases have
emerged and are being adopted by many companies for data storage and organisation.
NoSQL has a major advantage in that it provides horizontal scalability. It is also more
flexible, as it can store unstructured or schema-less data. NoSQL databases can be
accessed from multiple machines without a dip in performance. There are four types of
NoSQL database: document databases, graph stores, key-value stores and wide-column
stores (Blogger 2019). However, since there are so many of these database technologies,
one cannot blindly rely on any single source when choosing between them. Therefore
we test HBase and Cassandra on various workloads at different operation counts and
compare the results to assess their performance.

2 Key Characteristics
2.1 HBase
HBase is a column-family based database with a dynamic, flexible schema. HBase
supports MapReduce and runs on top of HDFS. Its important features are listed below.
1. Consistency:
HBase performs consistent read and write operations, which makes it suitable for
high-speed data transfer (DataFlair 2019).
2. Atomic Read and Write:
HBase reads and writes rows atomically. This means that while a single read or write
operation is in progress, all other processes are prevented from performing read or
write operations on that row (DataFlair 2019).
3. Sharding:
HBase automatically splits regions into subregions in order to minimise overhead
and I/O time. This is called sharding (DataFlair 2019).
4. High Availability:
Multiple region servers are handled by one master server, which increases
availability (DataFlair 2019).
5. Scalability:
Another distinctive feature of HBase is linear and modular scaling (DataFlair 2019).
6. High Throughput:
HBase provides high throughput owing to its security and management characteristics
(DataFlair 2019).
7. Sorted Row Keys:
HBase uses three main operations, namely get, put and scan. These commands select
the appropriate data by row key, and row keys are kept in sorted order (DataFlair 2019).
8. Distributed Storage:
HBase stores data in distributed form because it runs on HDFS (DataFlair 2019).
2.2 Cassandra
Apache Cassandra is an open source, column-based NoSQL database that can handle
huge amounts of data. Its important features are discussed below.
1. Distributed:
Cassandra is built on a foundation of multiple nodes, which increases scalability, fault
tolerance and availability (Hasan 2019).
2. Multi-Master or Masterless:
Cassandra is based on a masterless architecture. A write operation is performed on
several nodes, which are assigned using a hash function, whereas a read operation is
performed on specific nodes (Hasan 2019).
3. Column family store:
Cassandra is a column-family based database. The data is stored and organised in
column-family format (Hasan 2019).
4. Linear Scaling:
Cassandra scales linearly, owing to its multi-master or masterless architecture. Its
write handling capacity increases in proportion when the number of nodes is doubled
(Hasan 2019).
5. High Write Availability:
This is a very important feature. Consider an example: in MongoDB, if a master node
crashes, write operations stop until a new master node is chosen. In Cassandra, because
of the masterless or multi-master architecture, if a node dies the write operation is
automatically rerouted to other nodes (Hasan 2019).
6. Design Time Schema:
Cassandra was schema-less when it was launched, but it now expects a schema and data
types to be provided during the design phase (Hasan 2019).
7. Hot Writes in RAM:
Cassandra's write performance increases considerably because it holds incoming write
operations in RAM (Hasan 2019).
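The hash-based assignment of writes to nodes described under the masterless feature can be sketched as a small token ring. This is only an illustration under stated assumptions: the node names, the replication factor and the use of MD5 as the hash are hypothetical choices made for the example, and a real Cassandra cluster uses its own partitioner and usually many tokens per node.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a ring position (MD5 is only a stand-in hash for this sketch)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy masterless ring: every node is a peer and owns one token."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self.tokens = sorted((token(node), node) for node in nodes)

    def replicas(self, partition_key: str):
        """Walk clockwise from the key's token and collect the replica nodes."""
        positions = [position for position, _ in self.tokens]
        start = bisect.bisect_left(positions, token(partition_key)) % len(self.tokens)
        return [
            self.tokens[(start + step) % len(self.tokens)][1]
            for step in range(self.replication_factor)
        ]


# Hypothetical four-node cluster; any node could coordinate this write.
ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user1001")

# If the first owner dies, the same key is still served by the remaining peers.
survivors = Ring([n for n in ["node-a", "node-b", "node-c", "node-d"] if n != owners[0]])
fallback_owners = survivors.replicas("user1001")
```

Because no node has a special role, removing one node only changes which peers hold a key; there is no master election step before writes can continue.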
3 Database Architecture
3.1 HBase
HBase is a column-family oriented database that can run in distributed mode because
it is built on HDFS. This reduces the risk of a single point of failure, because if
one node dies the HMaster assigns another node in its place. Let us now look at the
architecture of HBase.
The HBase architecture is divided into three main parts: the HMaster server, Zookeeper
and the region servers. Each of them is discussed in detail below.
1. Region Server:
A region is defined by assigning a range of row keys to it. A region therefore contains
all the column families for the set of rows whose keys fall in the range assigned to
that region. A region server is in charge of multiple regions and is responsible
for performing read and write operations on those regions (Sinha 2019).
2. HMaster:
The HMaster is in charge of multiple region servers, which run on various data nodes,
and is responsible for managing them. The HMaster creates and deletes tables and
assigns regions to region servers. It sometimes reassigns regions for load balancing
purposes. Together with Zookeeper it recovers data from a region if it goes down
(Sinha 2019).
Figure 1: HBase architecture (Sinha 2019)
3. Zookeeper:
Since the HBase environment is large and distributed, the HMaster cannot handle
everything alone. Zookeeper therefore coordinates with the region servers and with the
HMaster. Both the HMaster and the region servers send heartbeat signals at regular
intervals to report that they are active. One inactive HMaster, which is also connected
to Zookeeper, acts as a backup to the active HMaster. If the active HMaster fails, the
inactive HMaster replaces it. Zookeeper thus keeps track of the activity of the
HMasters and region servers and coordinates accordingly. Zookeeper also keeps a record
of the .META server path, which helps clients locate a particular region (Sinha 2019).
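The idea that a region covers a contiguous range of sorted row keys, and that a client must find the region server holding a given key, can be sketched as a binary search over region start keys. The start keys and server names below are hypothetical, and the sketch does not reproduce the actual HBase client or its .META lookup.

```python
import bisect

# Hypothetical layout: a region runs from its start key up to the next start key.
REGION_START_KEYS = ["", "user3000", "user6000"]  # kept in sorted order
REGION_SERVERS = ["regionserver-1", "regionserver-2", "regionserver-1"]


def locate_region(row_key):
    """Return (region index, region server) for the region whose range holds row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]


first = locate_region("user1000")
boundary = locate_region("user3000")
last = locate_region("user9999")
```

One region server can appear more than once in the list, which mirrors the statement above that a region server is in charge of multiple regions.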
3.2 Cassandra
Cassandra is designed to handle big data workloads across multiple nodes without any
single point of failure. It uses a peer-to-peer distributed system across its nodes,
and data is distributed among all the nodes in a cluster. All the nodes in a cluster
have the same role. Each node works independently and is at the same time
interconnected with the other nodes. Irrespective of where the data is located in the
cluster, each node can accept read and write requests. When a node stops functioning,
read and write requests can be served by other nodes in the network. Each node
continuously exchanges state information with other nodes across the cluster using the
peer-to-peer gossip communication protocol. A sequentially written commit log on each
node captures write activity to ensure data durability. Data is then ordered and
written to an in-memory structure called a memtable, which resembles a write-back
cache. Whenever the memtable is full, the data is written to disk in an SSTable data
file. All writes are automatically partitioned and replicated throughout the cluster.
Cassandra periodically consolidates SSTables using a process called compaction, which
discards obsolete data marked for deletion with a tombstone. To ensure that all data
in the cluster stays consistent, various repair mechanisms are employed (Educba 2019).
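The write path described above (commit log, memtable, flush to an SSTable, then compaction that drops tombstoned data) can be illustrated with a toy single-node model. This is a simplified sketch, not Cassandra's implementation: the class name, the memtable limit of two entries and the immediate removal of tombstones during compaction are assumptions made for the example.

```python
TOMBSTONE = object()  # marker written in place of a value when a key is deleted


class ToyNode:
    """Toy model of one node's write path; all names here are illustrative."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write, for durability
        self.memtable = {}     # in-memory structure holding recent writes
        self.sstables = []     # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, then update memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is a write of a tombstone

    def flush(self):
        """Write the full memtable to a new SSTable sorted by key, then clear it."""
        table = tuple(sorted(self.memtable.items(), key=lambda item: item[0]))
        self.sstables.append(table)
        self.memtable = {}

    def compact(self):
        """Merge all SSTables, keep the newest value per key, drop tombstones."""
        merged = {}
        for table in self.sstables:  # later tables overwrite earlier ones
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [tuple(sorted(live.items(), key=lambda item: item[0]))]

    def read(self, key):
        if key in self.memtable:
            value = self.memtable[key]
        else:
            value = None
            for table in reversed(self.sstables):  # newest SSTable first
                entries = dict(table)
                if key in entries:
                    value = entries[key]
                    break
        return None if value is TOMBSTONE else value


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)    # memtable is full, so it is flushed to the first SSTable
node.write("a", 10)
node.delete("b")      # second flush: contains the new value and a tombstone
tables_before = len(node.sstables)
node.compact()
```

After compaction only the newest live value for each key remains, which is why obsolete and deleted rows stop occupying disk space.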

Cassandra is a partitioned row store database, where rows are organised into tables
with a required primary key. Cassandra's architecture permits any authorised user to
connect to any node in any datacentre and access data using the CQL language. CQL
has a syntax similar to SQL and works with table data. Developers can access
CQL through cqlsh, DevCenter and drivers for application languages. Typically,
a cluster has one keyspace per application, consisting of several tables. Client read
or write requests can be sent to any node in the cluster. When a client connects to a
node with a request, that node serves as the coordinator for that particular operation.
The coordinator acts as a proxy between the client application and the nodes that own
the data being requested. The coordinator determines which nodes in the ring should
receive the request on the basis of the cluster configuration.
Key structures:
1. Node:
The location where data is stored. It is the fundamental component of Cassandra.
2. Datacentre:
A collection of related nodes. A datacentre can be a physical or a virtual datacentre.
Different workloads should use separate datacentres, and replication is set per
datacentre. Using separate datacentres prevents Cassandra transactions from being
affected by other workloads and keeps requests close to each other to lower latency
(Scnsoft 2019).
3. Cluster:
A collection of datacentres is called a cluster. A cluster can span physical sites.
4. Commit log:
All data is written first to the commit log for durability. After the data has been
flushed to SSTables, the commit log can be archived, deleted or recycled.
5. SSTable:
SSTable stands for sorted string table. It is an immutable data file to which
Cassandra writes memtables periodically. SSTables are append only, stored sequentially
on disk and maintained for each Cassandra table.
6. CQL Table:
A collection of ordered columns fetched by table row. A table consists of columns and
has a primary key.
Figure 2: Cassandra architecture (Tutorialspoint 2019)

4 Comparison of HBase and Cassandra
In this section HBase and Cassandra are compared in the following two areas.
1. Security:
Security is an important concern when choosing between databases. Over time, NoSQL
databases have tended to trade security features for high performance when compared
with RDBMS. HBase offers different types of security protocol, such as client
authentication and server authentication. In addition it provides role-based security,
meaning that access is granted according to the employee hierarchy in an organisation,
so not all users have the same level of access to the database. Cassandra, on the
other hand, provides authorisation based on object permission management, where access
is also granted on the basis of roles. Cassandra also provides authentication based on
Java Management Extensions (JMX) and secures the connection between clients and the
database using SSL encryption (DataStax 2019).
2. Scalability, Reliability, Availability:
Scalability of Cassandra:
Cassandra is linearly scalable, meaning capacity can be increased by adding new nodes.
A Cassandra deployment can be expanded by adding more datacentres as well as new nodes.
Scalability of HBase:
HBase distributes table data horizontally as the data grows. The design of HBase is
based on Google's Bigtable. HBase is able to distribute its tables dynamically.
Availability of Cassandra:
Cassandra offers higher availability than HBase, owing to its data replication
technique.
Availability of HBase:
Storage optimisation is one of the vital aspects affecting the availability of HBase.
Reliability of Cassandra:
Cassandra is used by major organisations because of the reliability it offers.
Reliability is offered at a large scale but is complex to achieve.
Reliability of HBase:
HBase provides a high degree of reliability. When configured with adequate redundancy,
HBase is considered fault tolerant, i.e. it can handle failures and continue to work
correctly.
5 Learning from Literature Survey
In recent times, rapid development in the internet era has generated vast amounts of
data that need to be stored. It is clear that relational databases are not sufficient
to handle Big Data, so advances have been made in the field of NoSQL databases. These
databases have been studied and tested on the basis of performance to give an idea of
which one to invest in. In one such study, Tang & Fan (2017) chose five popular NoSQL
databases, Redis, MongoDB, Couchbase, Cassandra and HBase, and ran the Yahoo! Cloud
Serving Benchmark on them. The authors selected Workload A, Workload C and Workload H
and kept a fixed count of 100,000 records. The test was run on five different Ubuntu
virtual machines. The results show that, of the five databases, HBase and Cassandra
had the slowest execution times compared with Redis, MongoDB and Couchbase. For
Workload C, however, they found that “Hbase was 1.58 times faster than Cassandra”
(Tang & Fan 2017). During data loading for Workload A, the throughput of Cassandra and
Couchbase increased faster than that of HBase and MongoDB.
In another study, Seriatos et al. (2016) ran YCSB tests for all six workload types on
MongoDB, HBase and Cassandra. For Workload A the throughput of HBase and Cassandra was
higher than that of MongoDB. For Workload B, which has a 95:5 ratio of reads to
updates, MongoDB performed significantly better than HBase and Cassandra. Gandini et
al. (2014) ran the YCSB test for Workload A on MongoDB, HBase and Cassandra on the
Amazon AWS cloud platform and showed that on a single node HBase has the largest
throughput error of the three databases.
According to Swaminathan & Elmasri (2016), NoSQL databases have become a mainstream
data platform for big data applications. They have emerged as an entry point for
alternative approaches outside traditional relational databases, and are characterised
by efficient horizontal scalability, a schema-less approach to data modelling,
high-performance data access and limited querying capabilities. The absence of
transactional semantics in NoSQL databases makes the choice of a particular consistency
model dependent on the application, so it is important to examine the options
methodically. The authors provide guidance for mapping application requirements to a
suitable NoSQL database. Three of the most widely used NoSQL databases, MongoDB,
Cassandra and HBase, are assessed using YCSB, and the horizontal scalability of the
three systems is measured under different workload conditions and dataset sizes. For
the 50% read, 50% write workload, Cassandra had better throughput; on small databases,
however, HBase gave 20% better throughput. For the 100% read workload, MongoDB, which
stores data as BSON (Binary JSON) documents, gave better performance. For the 100%
blind write workload, HBase had the best performance, up to 265% better than Cassandra
irrespective of the database volume; the difference was due to the way in which write
requests are handled. For the 100% read-modify-write workload, the behaviour was
identical to that of the 50% read, 50% write workload. For the 100% scan workload,
performance depended on the partitioning method used by the database, and Cassandra
performed best. Cassandra performed better on large databases, whereas HBase performed
better on small ones. The study concludes that databases with different design factors
produce different results under different experimental setups.
6 Performance Test Plan
The experimental setup for this test began with creating an account on the NCI
OpenStack. An instance was created with Image selected as the boot source and was
allocated in DSM-BaseProj2018. The m1.Medium flavour and the msc-data-net network were
selected. In the Key Pair section a new key pair was generated and saved. In the
configuration section the encrypted key was pasted and the home account password was
set. The instance was then launched and a floating IP was associated with it.
PuTTY, a terminal emulator, was used to connect to the virtual machine on OpenStack.
After connecting to the Ubuntu virtual machine using PuTTY, the server was updated and
a new version of Java was installed. The Java path was then set in the profile, and
secure shell was installed on the server. A new user named hduser was created. Hadoop
was then downloaded and installed on the server. Once Hadoop was fully installed and
hduser had been given permission to access the Hadoop files, HBase was installed on
the virtual machine. After the installation of HBase, a table named "usertable" with a
column family "cf1" was created in HBase. Python was then installed, followed by
Cassandra, on the Ubuntu virtual machine. A new keyspace called ycsb was created, and
in that keyspace a new table called "usertable" consisting of ten fields was created.
After that the Yahoo! Cloud Serving Benchmark (YCSB) was downloaded and installed on
the Ubuntu virtual machine.
WinSCP was then installed and synced with PuTTY. The testharness archive was
downloaded from Moodle onto the Windows machine, and this testharness.tgz file was
transferred to the Ubuntu virtual machine, extracted and moved to the /home/hduser
path. In the testharness directory, testdbs.txt was updated with the databases to be
tested, opcounts.txt was edited to contain the counts 100,000, 250,000 and 500,000,
and workloadlist.txt was updated with workloada, workloadb and workloadc. A directory
named output was created in the ycsb folder. The YCSB version was updated in the
runtest.sh file, and the testharness was then run multiple times to obtain the result
files, which are submitted via the Moodle Turnitin link.
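The test matrix described above (two databases, three workloads, three operation counts, each with a load and a run phase) can be sketched as a list of YCSB command lines. This is not the testharness script itself, whose contents are not shown here. The binding names `cassandra-cql` and `hbase12`, the `hosts=localhost` setting and the choice to set `recordcount` equal to `operationcount` are assumptions made for illustration, and binding names differ between YCSB releases.

```python
from itertools import product

WORKLOADS = ["workloada", "workloadb", "workloadc"]
OP_COUNTS = [100_000, 250_000, 500_000]

# Assumed YCSB binding names and connection properties (adjust to the installed release).
DATABASES = {
    "cassandra-cql": {"hosts": "localhost"},
    "hbase12": {"table": "usertable", "columnfamily": "cf1"},
}


def ycsb_command(phase, binding, workload, count):
    """Build the argument list for one YCSB invocation (phase is 'load' or 'run')."""
    properties = {"recordcount": count, "operationcount": count, **DATABASES[binding]}
    args = ["bin/ycsb", phase, binding, "-P", f"workloads/{workload}"]
    for name, value in properties.items():
        args += ["-p", f"{name}={value}"]
    return args


def test_matrix():
    """Yield a load command followed by a run command for every combination."""
    for binding, workload, count in product(DATABASES, WORKLOADS, OP_COUNTS):
        yield ycsb_command("load", binding, workload, count)
        yield ycsb_command("run", binding, workload, count)


commands = list(test_matrix())
```

Each entry in `commands` could be passed to `subprocess.run` from the YCSB installation directory, with its output redirected to a file in the output directory.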
7 Evaluation and Results
In this section the results have been explained using Excel Visualisations.
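Before charting, the average latencies have to be extracted from the YCSB output files. The following is a minimal sketch of such a step, assuming the usual YCSB summary lines of the form `[SECTION], Metric, value`. The sample text and all numbers in it are invented for illustration and are not measurements from this study.

```python
# Invented sample in the YCSB summary format; these are not results from this study.
SAMPLE_OUTPUT = """\
[OVERALL], RunTime(ms), 2000
[OVERALL], Throughput(ops/sec), 500.0
[READ], Operations, 480
[READ], AverageLatency(us), 812.5
[UPDATE], Operations, 520
[UPDATE], AverageLatency(us), 655.25
"""


def parse_ycsb(text):
    """Return {(section, metric): value} for every numeric summary line."""
    metrics = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # skip progress lines and anything that is not a summary row
        section = parts[0].strip("[]")
        try:
            metrics[(section, parts[1])] = float(parts[2])
        except ValueError:
            continue  # non-numeric value, ignore it
    return metrics


metrics = parse_ycsb(SAMPLE_OUTPUT)
read_latency_us = metrics[("READ", "AverageLatency(us)")]
```

The pairs of operation count and average latency obtained this way for each run are the values plotted in the figures of this section.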
7.1 Workload A Read Operation Comparison
Figure 3 compares the number of read operations for Workload A against average latency
for the Cassandra and HBase databases. For Cassandra, represented in orange, read
latency is at its maximum at 49,923 records and decreases gradually as the read count
increases, but from 125,187 records the average latency remains constant up to 250,159
records. For HBase, represented in blue, the average latency decreases slightly from
50,007 records to 124,813 records, but increases from 124,813 records to 249,841
records.
Figure 3: Results for Workload A Read operations

7.2 Workload A Update Operation Comparison
As illustrated in Figure 4, average latency is compared with update counts for
Workload A on both HBase and Cassandra. For Cassandra the average latency is very high
at 50,007 records, but it gradually decreases until the count reaches 124,813 records
and then remains constant up to 249,841 records. HBase initially shows a similar trend
to Cassandra, but after 124,868 records the average latency keeps increasing as the
count approaches 249,723 records.
Figure 4: Results for Workload A Update operation
7.3 Workload B Read Operation Comparison
The general findings are illustrated in Figure 5.
Here average latency for Workload B is compared with read operations. For Cassandra,
the average latency initially increases with the number of read operations up to
237,507 records, and from there decreases slightly up to 475,065 records. For HBase
the graph is linear: the average latency keeps increasing from 95,066 records to the
very end at 474,870 records.
Figure 5: Results for Workload B Read operation

7.4 Workload B Update Operation Comparison
The general findings are illustrated in Figure 6.
Because only 5% of the operations in Workload B are writes, the update record counts
are small. For Cassandra the average latency remains constant from 5,012 records to
12,493 records and then gradually decreases up to 24,935 records. For HBase there is
high latency at 4,934 records, which drops sharply by 12,506 records and then
decreases gradually up to 25,130 records.
Figure 6: Results for Workload B Update operation
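The record counts on the horizontal axes follow from the operation mix of each workload. As a worked check, the sketch below computes the expected number of reads and updates for each operation count, assuming the standard YCSB core workload proportions (Workload A 50% read and 50% update, Workload B 95% read and 5% update, Workload C 100% read). The function name is chosen for this example.

```python
# Read/update percentages of the YCSB core workloads used in this study.
WORKLOAD_MIX = {
    "workloada": {"read": 50, "update": 50},
    "workloadb": {"read": 95, "update": 5},
    "workloadc": {"read": 100, "update": 0},
}


def expected_operations(workload, operation_count):
    """Expected number of each operation type for a given total operation count."""
    mix = WORKLOAD_MIX[workload]
    return {op: operation_count * percent // 100 for op, percent in mix.items()}


expected_b = expected_operations("workloadb", 500_000)
expected_a = expected_operations("workloada", 100_000)
```

For Workload B at 500,000 operations this gives 475,000 reads and 25,000 updates, close to the 474,870 reads and 25,130 updates reported for HBase. The small differences arise because YCSB chooses each operation at random according to these proportions.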
8 Conclusion and Discussion
For Workload A, Cassandra experiences higher latency than HBase for both read and
write operations. Figures 3 and 4 show that both graphs follow a similar trend. We can
therefore conclude from these two figures that a company whose database is update heavy
should choose HBase. For Workload B, Cassandra has high latency for read operations
but remains comparatively stable, whereas for HBase the latency increases gradually.
Only 5% of the operations in Workload B are writes, and here too HBase has lower
latency than Cassandra. Based on this study we conclude that HBase should be used for
update-heavy and write-mostly workload environments.
References
Blogger, M. (2019), ‘Nosql databases explained’, https://www.mongodb.com/nosql-explained.
DataFlair, T. (2019), ‘Best features of hbase | why hbase is used? - dataflair.’,
https://data-flair.training/blogs/features-of-hbase/.
DataStax (2019), ‘Cassandra security features.’,
https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureIntro.html.
Educba (2019), ‘Cassandra security features.’, https://www.educba.com/.

Gandini, A., Gribaudo, M., Knottenbelt, W. J., Osman, R. & Piazzolla, P. (2014),
‘Performance evaluation of NoSQL databases’, Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics) 8721 LNCS, 16–29.
Hasan, H. (2019), ‘Apache cassandra, part 1: Introduction and key features.’,
https://blog.emumba.com/apache-cassandra-part-1-introduction-and-key-features-18d02ba0b8cc.
Scnsoft (2019), ‘Cassandra security features.’, https://www.scnsoft.com/.
Seriatos, G., Kousiouris, G., Menychtas, A., Kyriazis, D. & Varvarigou, T. (2016),
‘Comparison of database and workload types performance in cloud environments’, Lecture
Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence
and Lecture Notes in Bioinformatics) 9511, 138–150.
Sinha, S. (2019), ‘Hbase architecture | hbase data model | hbase read/write | edureka.’,
https://www.edureka.co/blog/hbase-architecture/.
Swaminathan, S. N. & Elmasri, R. (2016), ‘Quantitative analysis of scalable NoSQL
databases’, Proceedings - 2016 IEEE International Congress on Big Data, BigData
Congress 2016 pp. 323–326.
Tang, E. & Fan, Y. (2017), ‘Performance comparison between five NoSQL databases’,
Proceedings - 2016 7th International Conference on Cloud Computing and Big Data,
CCBD 2016 pp. 105–109.
Tutorialspoint (2019), ‘Cassandra architecture.’,
https://www.tutorialspoint.com/cassandra/cassandra_architecture.htm.
