Distributed Databases Overview
Achieving High Availability, Scalable Storage and
Performance at Portal do Aluno
- Distributed Databases Overview Study -
Luis Carlos Dill Junges¹, Ivan Linhares Martins¹
¹Certi Foundation – Federal University of Santa Catarina (UFSC)
Postal Box 5053 – 88.040-970 – Florianópolis – SC – Brazil
luis.junges@gmail.com, ilm@certi.org.br
Abstract. This document consolidates a study carried out at the Certi Foundation for
the federal project called Portal do Aluno. The project will be an internet portal
whose main objective is to spread knowledge among students between 12 and 18
years old from Brazilian elementary schools. Since around 5 million students are
expected to use it monthly, problems with the storage system and with the
availability of the portal are inevitable. With this in mind, a comprehensive study
was made of the new kind of distributed databases available on the market. The
results of that study are published in this document, with some considerations on
each system.
Resumo (translated). This document consolidates a study carried out at the Certi
Foundation for the Federal Government project called Portal do Aluno. The project is
aimed at students between 12 and 18 years old in Brazil's basic education system, with
the goal of becoming a portal for sharing and creating content related to the
students' education.
The project will have around 5 million users, which will inevitably bring problems
to the system backend regarding data scalability and high availability of the portal.
With this in mind, a detailed study of the current solutions among the new distributed
database systems was carried out, and its results are presented in this document.
1. Introduction
This study arose from problems faced by the Portal do Aluno project. The project
consists of a social network focused on spreading knowledge among students between
12 and 18 years old from the elementary schools of Brazil.
Those problems relate to the availability, the storage capacity and the performance
of the overall system. Smaller projects had been developed on standard relational
databases, which inevitably became the SPOF (Single Point of Failure) of the system;
this project required a better solution in order to meet its new requirements. This
study shows a way to overcome such problems by using a new kind of Open Source tool
available in the developer community.
This new set of tools has been driven by the NoSQL movement, which began around 2009
to address the limitations found when handling big data volumes and workloads. The
movement aims to redirect database development towards horizontal scalability by
relaxing some guarantees. For example, such systems often provide only eventual
consistency and are therefore not fully compliant with the ACID (Atomicity,
Consistency, Isolation, Durability) properties.
This article is organized as follows: Section 2 describes the project. Section 3
presents the problems and the motivation to study a new approach. Section 4 describes
the general characteristics of those distributed systems and Section 5 gives a brief
overview of the major Open Source players. Section 6 shows a comparative table of the
properties of each system. Section 7 presents the solutions that best meet the
requirements of Portal do Aluno. Finally, Section 8 gives the conclusion.
2. Portal do Aluno
Portal do Aluno is a social learning environment project from the Ministry of
Education of Brazil (known as MEC). It has the characteristics of a social network and
aims to provide an educational portal with collaborative tools for school tasks. It
will be an extension of elementary schools on the internet, promoting integration
among schools, students and teachers around Brazil by allowing them to form groups for
research, discussions and other common tasks.
The portal is subdivided into modules with specific content. Some of them allow
uploading files such as images and any other type of document, including video. As the
number of users of this portal is potentially high from the beginning, scalability and
availability are essential, which leads to the problem described in the next section.
3. Problem
Relational databases are powerful and robust, to the point that a wide range of
applications and systems use them. However, they show limitations when large sets of
data need to be stored and when high availability of the system is mandatory. On the
first issue, it is provably impossible (see Section 4.1, CAP Theorem) to keep the ACID
guarantees (atomicity, consistency, isolation, durability) together with full
availability while scaling across multiple machines. Until now this has typically been
handled by high-end RDBMSs (Relational Database Management Systems) through
replication with a master-slave architecture, as shown in Figure 1. Although this
approach works, it has a prominent SPOF (Single Point of Failure) in the master: if it
fails, the system goes down. The approach scales reads by forwarding them to free
slaves through a load balancer, but all writes still go to the master, which is
therefore also the bottleneck of the data flow.
The second issue is usually solved by hardware solutions based on RAID (Redundant
Array of Independent Disks). In short, this model replicates the data among several
hard drives and swaps them accordingly on a failure. RAID systems, however, are not a
complete safety solution: without a separate backup, the data does not survive if the
server holding the drives is lost to fire, flooding or any other cause.
Figure 1. Traditional Relational Database Scaling
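The master-slave arrangement of Figure 1 can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: the class and method names are hypothetical, replication is shown as synchronous, and no real database is involved. It is meant to show why the master is both the write bottleneck and the single point of failure.

```python
import itertools


class MasterSlaveCluster:
    """Toy model of Figure 1: one master takes writes, slaves serve reads."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self.master_up = True
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        # Every write goes through the single master: the bottleneck and SPOF.
        if not self.master_up:
            raise RuntimeError("master is down: no writes possible")
        self.master[key] = value
        for slave in self.slaves:  # replication, shown synchronously for brevity
            slave[key] = value

    def read(self, key):
        # Reads are spread over the slaves in round-robin fashion.
        return self.slaves[next(self._next_slave)].get(key)


cluster = MasterSlaveCluster(slave_count=3)
cluster.write("user:1", "Ana")
print(cluster.read("user:1"))      # served by a slave
cluster.master_up = False          # simulate a master failure
try:
    cluster.write("user:2", "Bruno")
except RuntimeError as error:
    print(error)                   # writes stop, reads still work
```

Adding slaves raises read capacity in this model, but no number of slaves helps once the master is unavailable.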
Those relational issues are claimed to be solved (or at least to be on the way to
being solved) by a new kind of distributed database that is relatively new in the
developer community. These systems promise to overcome the shortcomings of relational
systems efficiently by relaxing some characteristics, such as consistency, and by
taking node failures into account from the start. The next section introduces such
systems.
4. Distributed Databases
One of the major advantages of a distributed database over a traditional relational
database is the possibility of scaling reads and writes simply by adding new nodes to
the cluster. Relational databases can scale reads in this way, but scaling writes is
virtually impossible and in the end becomes too expensive.
A brief comparison between relational databases and those new systems is given in
Table 1.

Relational Databases     Table, columns, rows
                         ACID properties fully satisfied
                         Normalized to avoid data duplication
                         Strong storage schema
                         Queries fully supported

Distributed Databases    Table-like domain
                         Data identified only by a key
                         Schema-less
                         Data integrity in the application's code
                         Eventual consistency
                         Support for queries is limited

Table 1. Relational vs Distributed Database
Those systems also adopt a key-value model or a document-oriented approach:
Key-value: The data is associated with a key, as in a map. The data can only be
retrieved by knowing the key. These systems are usually able to retrieve the data in
constant time, independent of how many entries have been stored.
Document-Oriented: The data is stored in a format that represents a document. There
is no schema, and fields present in one document may not exist in other documents.
Some implementations use JSON or XML as the protocol layer for the data.
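The two models can be contrasted with plain Python data structures. This is a minimal sketch, not the API of any system discussed here; the keys and field names are made up for illustration.

```python
import json

# Key-value model: the store is an opaque map; lookup is only possible by key.
kv_store = {}
kv_store["student:42"] = b"opaque bytes the store does not interpret"
value = kv_store["student:42"]    # average constant time, whatever the store size

# Document model: each entry is a self-describing document without a fixed schema.
documents = {
    "student:42": {"name": "Ana", "age": 14, "school": "Example School"},
    "student:43": {"name": "Bruno", "groups": ["physics", "chess"]},  # no age
}

# Fields present in one document may be missing in another, so readers must cope.
ages = {key: doc.get("age") for key, doc in documents.items()}

# JSON used as the wire format, as some implementations do.
wire = json.dumps(documents["student:43"])
restored = json.loads(wire)
print(ages, restored)
```

In the key-value case the store knows nothing about the value, so any filtering by content has to happen in the application after the lookup.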
4.1. CAP Theorem
The CAP theorem [Gilbert and Lynch 2002] concerns three properties that a shared-data
system must choose among. The properties are as follows:
• Strong Consistency: All clients see the same view, even in the presence of updates.
• High Availability: All clients can find some replica of the data, even in the
presence of failures.
• Partition-Tolerance: The system properties hold even when the system is
partitioned by node failures, network problems or any other reason.
The theorem states that a distributed system can guarantee at most two of the three
CAP properties at the same time. Distributed databases usually choose Availability
and Partition-Tolerance. To handle consistency, some of them use versioning
systems [Manassiev and Amza 2005] [Amza et al. 2003] to resolve update conflicts.
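As an illustration of how versioning can expose conflicting updates, the sketch below uses vector clocks. This is an assumption made for the example: the text does not say which versioning scheme each system uses, and the function names are hypothetical.

```python
def compare(clock_a, clock_b):
    """Compare two vector clocks given as {node_name: counter} dicts.

    Returns "before", "after", "equal" or "concurrent".
    """
    nodes = set(clock_a) | set(clock_b)
    a_le_b = all(clock_a.get(n, 0) <= clock_b.get(n, 0) for n in nodes)
    b_le_a = all(clock_b.get(n, 0) <= clock_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def increment(clock, node):
    """Return a copy of the clock after one more update handled by `node`."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


base = increment({}, "node_a")       # first version, written through node_a
left = increment(base, "node_a")     # later update that only node_a has seen
right = increment(base, "node_b")    # later update that only node_b has seen

print(compare(base, left))    # before: left supersedes base
print(compare(left, right))   # concurrent: a conflict that must be resolved
```

A "concurrent" result is exactly the case in which an available, partition-tolerant store cannot decide on its own which version wins, so the decision is pushed to a resolution rule or to the application.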
5. Available Solutions
This section presents some distributed database approaches, explaining the particular
characteristics of each one and giving a practical example of where it is being used
at the moment. These new systems are based on a large set of Open Source technologies
[Bortnikov 2009], which makes them really attractive. Although there is no
consolidated benchmark accepted by the community yet [Binnig et al. 2009]
[Cryans et al. 2008], some points can still be made about each solution.
5.1. Voldemort
Voldemort is a relatively new Open Source project, released at the beginning of this
year (2009). It is written entirely in Java and is based on the key-value model, with
just two functions to interact with (set and get). In the developers' own words,
Voldemort is basically just a big, distributed, persistent, fault-tolerant hash
table. For data persistence it uses MySQL or BDB (Berkeley DB) as the backend on each
node. As it follows the concept of eventual consistency [Vogels 2008], it uses a
simple incremental versioning system for each update to the data. The application is
responsible for fixing integrity problems and other issues that may occur in the
stored data.
The project is currently in production use at Linkedin.com in some parts that require
high availability. The access rates observed in the production environment are on the
order of 19384 requests per second (req/sec) for reads and 16559 req/sec for writes.
Among the good points of this project is its well-written documentation. There is
also a good data replication scheme, which can be manually configured in terms of how
many writes and reads have to succeed in order to validate a store or a read
operation, respectively. For example, an exception will be thrown if the store is set
to write to at least 3 nodes and just 2 nodes are up.
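The configurable read and write counts described above can be sketched as follows. This is a toy model, not Voldemort's API: the class, parameter and exception names are hypothetical, and replication is reduced to writing to every live node.

```python
class InsufficientNodesError(Exception):
    """Raised when fewer nodes are available than the operation requires."""


class ReplicatedStore:
    """Toy replicated key-value store with configurable read/write counts."""

    def __init__(self, node_count, required_writes, required_reads):
        self.nodes = [{} for _ in range(node_count)]
        self.up = [True] * node_count
        self.required_writes = required_writes
        self.required_reads = required_reads

    def _live_nodes(self):
        return [node for node, is_up in zip(self.nodes, self.up) if is_up]

    def put(self, key, value):
        live = self._live_nodes()
        if len(live) < self.required_writes:
            raise InsufficientNodesError(
                f"need {self.required_writes} writes, only {len(live)} nodes up")
        for node in live:
            node[key] = value

    def get(self, key):
        live = self._live_nodes()
        if len(live) < self.required_reads:
            raise InsufficientNodesError(
                f"need {self.required_reads} reads, only {len(live)} nodes up")
        return live[0].get(key)


store = ReplicatedStore(node_count=3, required_writes=3, required_reads=1)
store.put("lesson:1", "fractions")
store.up[2] = False                      # one node goes down
try:
    store.put("lesson:2", "geometry")
except InsufficientNodesError as error:
    print(error)                         # need 3 writes, only 2 nodes up
print(store.get("lesson:1"))             # fractions
```

Lowering `required_writes` to 2 would let the second write succeed with one node down, trading some safety for availability.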
The project's design also takes into consideration working properly behind load
balancers in clustered deployments, as shown in Figure 2.
As one major advantage, this project has no SPOF and therefore delivers a highly
available system for critical applications.
The drawbacks include the impossibility of adding a new node to a live cluster, which
means the entire system has to be shut down in order to configure a new node. Another
point is that all code processes one value at a time in memory (no cursors or
streaming), meaning that values need to fit comfortably in memory.
Figure 2. Voldemort’s clustering architecture
5.2. HBase
HBase is the official Hadoop project database. It is an Open Source, distributed,
column-oriented store modeled after Google's BigTable [Chang et al. 2008]. Just as
BigTable leverages the distributed data storage provided by the Google File System
[Ghemawat et al. 2003], HBase provides BigTable-like capabilities on top of Hadoop
using HDFS (Hadoop Distributed File System).
HBase is a very good and powerful project that gives users the opportunity to run
parallel processing on the cluster through MapReduce jobs. The current release (0.20)
has removed the major drawbacks of having a SPOF and high read latency.
Its architecture distributes masters and region servers across the cluster's
machines, as shown in Figure 3.
HBase is currently in use at several places, including a Yahoo cluster with 10000
PCs. Some companies are also testing HBase on Amazon Elastic Compute Cloud (known as
Amazon EC2).
Figure 3. HBase’s architecture
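The MapReduce jobs mentioned for HBase follow a map, group and reduce pattern, sketched below on a single machine. The function names and sample records are illustrative assumptions; this does not use the Hadoop or HBase APIs, where the map and reduce phases would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, one pair per word."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["physics homework", "chess club", "physics club"])
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over the machines that already hold the data.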
5.3. Redis
Redis is a key-value distributed solution with the advantage of offering more
operations than just the traditional set and get API. These operations include
handling multiple sets and some simple queries on the stored dataset, with a guarantee
of atomicity for some of the operations. It also supports more data types than just
strings or binaries, including lists, sets and ordered sets.
Another major point is that it is incredibly fast, able to perform around 110000
SETs/second and around 81000 GETs/second according to the developer's test case. It
works with asynchronous writes, which means data can be lost between the time a write
is requested and the time it has definitely happened (in that sense, a write is not an
atomic operation). There is also the constraint that the whole dataset needs to fit on
a single device.
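The window in which asynchronously written data can be lost is illustrated by the toy store below. It is a sketch under stated assumptions: the class is hypothetical, "disk" is just a second dictionary, and nothing here reflects how Redis is actually implemented.

```python
class AsyncPersistedStore:
    """Acknowledges writes from memory and persists them later in batches."""

    def __init__(self):
        self.memory = {}
        self.disk = {}
        self.pending = []              # keys written since the last flush

    def set(self, key, value):
        self.memory[key] = value       # acknowledged immediately, not yet durable
        self.pending.append(key)

    def get(self, key):
        return self.memory.get(key)

    def flush(self):
        # In a real system this would run in the background at intervals.
        for key in self.pending:
            self.disk[key] = self.memory[key]
        self.pending.clear()

    def crash_and_restart(self):
        # Anything not yet flushed is gone after a restart.
        self.memory = dict(self.disk)
        self.pending.clear()


store = AsyncPersistedStore()
store.set("comment:1", "saved before the flush")
store.flush()
store.set("comment:2", "written after the flush")
store.crash_and_restart()
print(store.get("comment:1"))   # survives the crash
print(store.get("comment:2"))   # None: lost in the window before the next flush
```

The speed comes from answering out of memory; the price is that an acknowledged write is only as safe as the time elapsed since the last flush.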
5.4. Cassandra
Cassandra was born to solve Facebook's problems. It is a more complete key-value
database based on Dynamo's fully distributed design [DeCandia et al. 2007] and
BigTable's column-family data model [Chang et al. 2008].
This project offers high availability without a SPOF, and incremental scalability
through the option of adding new nodes to a live environment without disturbing the
applications currently running on the database. It also guarantees atomicity for
operations on a single Column Family.
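One way to see why nodes can be added to a live cluster is consistent hashing, the partitioning technique described in the Dynamo paper. The sketch below is an assumption about the general technique, not Cassandra's implementation; the class name, the number of virtual nodes and the choice of hash function are made up for the example.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed at several positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # The owner is the first ring entry at or after the key's position.
        index = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[index][1]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")                      # grow the cluster while it is "live"
after = {key: ring.node_for(key) for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(len(moved), "of", len(keys), "keys changed owner")
```

Only the keys that now belong to the new node change owner; all other keys stay where they were, which is what makes incremental growth cheap compared with rehashing everything.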
Drawbacks include the poor or nonexistent documentation and a very obscure and
difficult API that will go through heavy remodelling in the next releases.
Cassandra is currently in use at Facebook for inbox search, where it is claimed to
hold 40 TB of data distributed over 120 machines in separate data centers. It is also
in use at Rackspace and Digg.com.
5.5. MongoDB
MongoDB is a document-oriented approach to scalable distributed databases. It is an
Open Source implementation written entirely in C++, with commercial support. Its
major advantage is its query support, which makes it unique among these systems. It
works with a BSON (binary JSON) format, handles big data (photos and videos) and
supports MapReduce jobs.
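To give an idea of what querying schema-less documents by field looks like, the sketch below filters a list of dictionaries with a query document. It is a pure-Python illustration, not the MongoDB driver: the `matches` and `find` helpers are hypothetical, and only two comparison operators, written in MongoDB's `$gte` and `$lt` style, are handled.

```python
def matches(document, query):
    """Return True if the document satisfies every condition in the query.

    A condition is either a plain value (equality) or a dict of comparison
    operators. Only "$gte" and "$lt" are handled in this sketch.
    """
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            for operator, operand in condition.items():
                if operator not in ("$gte", "$lt"):
                    raise ValueError(f"unsupported operator: {operator}")
                if value is None:
                    return False          # a missing field never matches a range
                if operator == "$gte" and not value >= operand:
                    return False
                if operator == "$lt" and not value < operand:
                    return False
        elif value != condition:
            return False
    return True


def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]


students = [
    {"name": "Ana", "age": 14, "city": "Florianopolis"},
    {"name": "Bruno", "age": 17},
    {"name": "Carla", "city": "Florianopolis"},      # no age field
]
younger = find(students, {"age": {"$gte": 12, "$lt": 16}})
in_city = find(students, {"city": "Florianopolis"})
print([doc["name"] for doc in younger], [doc["name"] for doc in in_city])
```

With a pure key-value store the same question would require fetching every value and filtering in the application, which is the gap that query support closes.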
Figure 4. MongoDB’s Architecture Design
As a drawback, it has an intricate cluster scheme, shown in Figure 4, which has
several SPOFs. It is subdivided into config servers (which store metadata on which
mongo shard holds the data) and mongo shards, which store the data. There are also
the mongo instances that serve as entry points for clients. Right now it is a
relatively new implementation without full support for sharding, and data replication
is constrained in the number of nodes that can be used (2 nodes only).
5.6. Tokyo Cabinet/Tyrant
Tokyo Cabinet/Tyrant is an Open Source project claimed to be in use at mixi.jp, a
Japanese social network with 10000 updates/second through MemCache. It appears to be
used there to handle 20 million data entries (20 bytes each).
Although no test has been made in this study, this solution claims to be really fast
on write and read operations, performing around 58000 req/sec. It also seems to
support ACID properties, with several different storage approaches (Hash, B-Tree) for
each type of data being stored.
As drawbacks, it does not have good documentation and few projects are using it.
5.7. CouchDB
CouchDB is a very easy-to-run project with a document-oriented approach. It has a
totally unstructured, schema-less storage backend that uses the JSON format for data
handling. It is very similar to Amazon's SimpleDB solution, with asynchronous
replication of data. It also has a browser administration console where it is
possible to create MapReduce jobs, backup operations and view statements like the ones
found in relational systems.
It uses HTTP requests to manage the dataset, which makes it connectable to any
software able to perform HTTP requests. A major advantage of CouchDB is that it
provides a query-like engine which enables users to build their own queries to suit
the application being developed.
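Since the dataset is managed over HTTP, storing a document amounts to sending a JSON body with a PUT request. The sketch below only builds such a request with the Python standard library and does not send it; the server address, database name and document id are assumptions chosen for illustration.

```python
import json
import urllib.request


def build_put_document(base_url, database, doc_id, document):
    """Build (but do not send) the HTTP request that stores a JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{database}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_put_document(
    "http://localhost:5984",       # assumed address of a local server
    "portal",                      # hypothetical database name
    "comment-1",                   # hypothetical document id
    {"author": "Ana", "text": "Great lesson"},
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Any language with an HTTP client and a JSON encoder can do the same, which is the portability argument made above.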
As a big drawback, it does not satisfy the concept of scalability, because all the
data being stored needs to fit on a single device. The availability of the system is
achieved by a client router which forwards the queries to the desired backend service.
So CouchDB is not a distributed database at the moment, but it has some interesting
features that make it eligible for this listing. One of those features is the
MapReduce support. Another is the option of storing the entire dataset, or part of it,
directly on the client's computer with asynchronous replication. This reduces the
workload at the backend, because replication can happen at an appropriate moment. The
feature could also be used for mobile devices that synchronize at a base station
(Bluetooth, wireless, cable) and can then access a website without having to connect
to the internet. The user avoids spending money on a data carrier and avoids slow
connections, which leads to a better user experience on the website. There are open
issues regarding the type of data that can be handled with this approach, and whether
modifications made by the user can later be synchronized with the main data server
without consistency problems. Despite those issues, this approach seems interesting
for delivering content quickly on mobile devices. At Portal do Aluno, this feature
could be used to connect users to the dashboard, with the option of editing comments
on mobile devices that are later synchronized with the main server.
CouchDB is in use in several projects and websites because of the ease of use it
provides through HTTP requests. At the moment it is a very young project, in Alpha
development and with serious security problems. Even with such issues, it is a project
to keep an eye on.
5.8. MemCache
According to the developers, MemCache is an Open Source, high-performance,
distributed memory object caching system, generic in nature, but intended for use
in speeding up dynamic web applications by alleviating database load.
It is a really simple and robust solution for improving the performance of web
applications: read time drops because requests are served by the cache layer instead
of the database itself. As a real example, it has reached 38000 req/sec at Flickr.com.
It does not have a persistence layer, but it is able to balance load properly just by
adding new nodes to the cluster. It is in use in several big projects on the internet
and can be used directly through the API of some high-end RDBMSs (MySQL and
PostgreSQL) as a cache layer.
Figure 5. Solution using intermediate storage layers
Although MemCache does not provide persistence, it can be very useful in combination
with solutions that do provide storage, such as commercial services available on the
market like Amazon Simple Storage Service (Amazon S3). Using MemCache considerably
reduces the number of read operations that hit Amazon S3 and, consequently, the
monthly payment. Figure 5 shows this approach, using MemCache as an intermediate layer
for applications that require high availability but do not have the capacity to set
up a private cluster for it. Instead, they use commercial storage solutions for the
cluster [Brantner et al. 2008] [Palankar et al. 2008] and lower the monthly payment
by adding additional storage layers (disk and MemCache).
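The effect of an intermediate cache layer on the number of backend reads can be shown with a small cache-aside sketch. Both classes are hypothetical stand-ins: the backend plays the role of a paid storage service and a dictionary plays the role of the cache cluster; no MemCache or Amazon S3 client is used.

```python
class CountingBackend:
    """Stands in for a slow or pay-per-request store; counts every read."""

    def __init__(self, data):
        self._data = dict(data)
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self._data[key]


class CacheAside:
    """Looks in the cache first and falls back to the backend on a miss."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}                        # stands in for the cache cluster

    def get(self, key):
        if key in self.cache:
            return self.cache[key]             # cache hit: backend untouched
        value = self.backend.read(key)         # cache miss: one backend read
        self.cache[key] = value
        return value

    def invalidate(self, key):
        # Call after the backend copy changes, so stale data is not served.
        self.cache.pop(key, None)


backend = CountingBackend({"video:1": "lesson.mp4"})
portal = CacheAside(backend)
for _ in range(100):
    portal.get("video:1")
print(backend.reads)    # 1: the other 99 requests were served from the cache
```

When the backend charges per request, the saving is proportional to the cache hit rate, which is the cost argument made above.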
5.9. Others
There are many more distributed database projects with different approaches. Most of
them seem to be at an early stage of development, without enough documentation and
robustness, or without a persistence layer (in-memory only). Some of them are:
• ThruDB
• MemcachedDB
• Disco
• KeySpace
• Ringo
• LightCloud
• Scalaris
• Riak
• Dynomite
• Hypertable
• Kai
• NMDB
• Hazelcast
• Mnesia
6. Solution’s Benchmark
So far there is no accepted benchmark for these new systems [Binnig et al. 2009]
[Cryans et al. 2008], so the decision whether or not to use a system is based on its
properties. Table 2 shows a comparative listing of some properties of each system.
The table is a snapshot of the systems taken in December 2009.

Name          Language  Fault-Tolerance            Persistence             Client              Data Model              Documentation  Production
------------  --------  -------------------------  ----------------------  ------------------  ----------------------  -------------  ----------------
Voldemort     Java      Partitioned, Replicated    Berkeley DB, MySQL      Java API            Structured, Blob, Text  Good           LinkedIn
HBase         Java      Replication, Partitioning  Custom on-disk          Custom API, Thrift  BigTable                Good           Yahoo
Cassandra     Java      Replication, Partitioning  Custom on-disk          Thrift              BigTable, Dynamo        Poor           Facebook
CouchDB       Erlang    Replication                Custom on-disk          HTTP, JSON          Document-Oriented       Good           UbuntuOne
MongoDB       C++       Replication                Custom on-disk, GridFS  Java, C++ Drivers   Document-Oriented       Good           SourceForge
Hypertable    C++       Replication, Partitioning  Custom on-disk          Java, Thrift        BigTable                Good           Baidu
ThruDB        C++       Replication                Custom on-disk          Thrift              Document-Oriented       Medium         -
Ringo         Erlang    Replication, Partitioning  Custom on-disk          HTTP                Blob                    Medium         Nokia
Tokyo Tyrant  C         -                          B-Tree, Hash            ANSI C              Document-Oriented       Poor           Mixi.jp
Scalaris      Erlang    Replication, Partitioning  In-Memory               Java, Erlang, HTTP  Blob                    Medium         OnScale
MemCache      C         Partitioning               In-Memory               Python, Java, Ruby  -                       Good           Several Projects
Dynomite      Erlang    Replication, Partitioning  -                       Custom, Thrift      Blob                    Poor           PowerSet
Kai           Erlang    Partitioning               -                       -                   Blob                    Poor           -

Table 2. Comparative List of Distributed Databases Properties
7. Adopted Solution
For the Portal do Aluno, some requirements need to be met: storage that is easily
scalable, highly available and fast at content retrieval.
Among the presented solutions, HBase (Hadoop Project) and Voldemort have shown
themselves to be the most robust solutions available at the moment that completely
meet the goal of easy scalability proposed by the NoSQL movement. HBase had the
problems of high latency and a SPOF on the master, which made it inappropriate for
serving web pages in real time. In the current version (≥ 0.20), those problems seem
to be solved. Voldemort is a really robust approach, but it is not optimized for the
large data sets the Portal do Aluno needs to store (videos, photos). This limitation
exists because it uses MySQL or BDB as its storage backend.
MongoDB has some problems with the scalability it can provide and also with the SPOFs
it has. It does have, however, strong support for queries, which makes it eligible to
be tested as a backend for Portal do Aluno. It also has its own binary format (BSON),
which makes it relatively fast.
As a result, HBase could be used for future tests, based on its good documentation,
ease of use and robustness. Considering that some intricate and complex joins have to
be made at Portal do Aluno, MongoDB could also be used for tests due to its query
support. Together with one of those solutions, MemCache could be used to speed up the
Portal.
8. Conclusion
It cannot be denied that a new set of databases is being developed. They started as
commercial competitive advantages inside private companies, built to solve
internet-related problems. Assuming that they will replace the old-fashioned
relational database model is naive, because only a few special applications require
their power. RDBMSs also have more features, which are well known and well understood,
not to mention that they can organize data as it exists in the real world, with strong
integrity guarantees that make them independent of the application.
The new kind of database, however, has shown good promise in terms of scalability and
availability, using ordinary hardware as a cheap solution. Businesses that rely
completely on a single access point with the client will see those new tools as
mandatory in order to have more availability in their applications.
The requirement of a highly available system is mandatory for some applications.
Until now this property was satisfied through huge investments in backup systems with
RAID and other devices and software, which have just increased the number of SPOFs. Of
course these new systems are not completely trustworthy at the moment, because they
are relatively new and may still fail, but they can be considered as an option to
fulfill this requirement.
The use of those systems should be analyzed for each application, since the entire
storage logic will have to be tied to the application's code. The possibility of
dealing with complicated queries (search, insert, update, delete) is, at the moment,
very limited or nonexistent. Also, the normalization rules usually applied in
relational databases to avoid data duplication do not apply to them at all. This new
approach keeps several replicas of the data in several places, which need to be
synchronized and kept up to date entirely by the application's code.
As a rule of thumb, those new tools are recommended when the requirements of the
application match at least one of the following statements: there is a huge amount of
data that needs to be stored; the data set has a simple representation that does not
require complex joins or queries and naturally fits the key-value model; the
application will face high-demand access in the future, which would lead to
performance problems without clustering.
References
Amza, C., Cox, A. L., and Zwaenepoel, W. (2003). Distributed versioning: consistent
replication for scaling back-end databases of dynamic content web sites. In
Middleware ’03: Proceedings of the ACM/IFIP/USENIX 2003 International Conference on
Middleware, pages 282–304, New York, NY, USA. Springer-Verlag New York, Inc.
Binnig, C., Kossmann, D., Kraska, T., and Loesing, S. (2009). How is the weather
tomorrow?: towards a benchmark for the cloud. In DBTest ’09: Proceedings of the
Second International Workshop on Testing Database Systems, pages 1–6, New York,
NY, USA. ACM.
Bortnikov, E. (2009). Open-source grid technologies for web-scale computing. SIGACT
News, 40(2):87–93.
Brantner, M., Florescu, D., Graf, D., Kossmann, D., and Kraska, T. (2008). Building
a database on S3. In SIGMOD ’08: Proceedings of the 2008 ACM SIGMOD international
conference on Management of data, pages 251–264, New York, NY, USA. ACM.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra,
T., Fikes, A., and Gruber, R. E. (2008). Bigtable: A distributed storage system for
structured data. ACM Trans. Comput. Syst., 26(2):1–26.
Cryans, J.-D., April, A., and Abran, A. (2008). Criteria to compare cloud computing
with current database technology. In IWSM/Metrikon/Mensura ’08: Proceedings of
the International Conferences on Software Process and Product Measurement, pages
114–126, Berlin, Heidelberg. Springer-Verlag.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A.,
Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: Amazon's highly
available key-value store. In SOSP ’07: Proceedings of twenty-first ACM SIGOPS
symposium on Operating systems principles, pages 205–220, New York, NY, USA. ACM.
Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP
’03: Proceedings of the nineteenth ACM symposium on Operating systems principles,
pages 29–43, New York, NY, USA. ACM.
Gilbert, S. and Lynch, N. (2002). Brewer's conjecture and the feasibility of consistent,
available, partition-tolerant web services. SIGACT News, 33(2):51–59.
Manassiev, K. and Amza, C. (2005). Scalable database replication through dynamic
multiversioning. In CASCON ’05: Proceedings of the 2005 conference of the Centre for
Advanced Studies on Collaborative research, pages 141–154. IBM Press.
Palankar, M. R., Iamnitchi, A., Ripeanu, M., and Garfinkel, S. (2008). Amazon S3 for
science grids: a viable solution? In DADC ’08: Proceedings of the 2008 international
workshop on Data-aware distributed computing, pages 55–64, New York, NY, USA. ACM.
Vogels, W. (2008). Eventually consistent - revisited.
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html,
Visited in December 2009.