The NoSQL Movement
Raluca Gheorghita
Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi
Abstract. As the amount and pace of data generation keep growing,
businesses are stepping away from traditional RDBMS solutions and moving
to highly scalable data stores. Numerous papers are being published
on this topic, for instance by well-known companies like Google, Facebook
and Amazon, and open-source projects are coming into existence.
Next-generation databases mostly address some of the following points: being
non-relational, distributed, open source and horizontally scalable. The movement,
named with the misleading term NoSQL (the community now translates it as
“not only SQL”), began in early 2009 and is growing rapidly. Often more
characteristics apply, such as: schema-free, replication support, easy API,
eventual consistency, and more.
1 Introduction
There is an interesting transition taking place in the world of Web-scale data
stores, as an entirely new type of scalable data store is gaining popularity very quickly.
The traditional LAMP (Linux, Apache HTTP Server, MySQL, and PHP, Python
or Perl) stack is starting to look like a thing of the past. For a few years now,
Memcached (free and open source, high-performance, distributed memory object
caching system, generic in nature, but intended for use in speeding up dynamic
web applications by alleviating database load) has often appeared right next to
MySQL, and now the whole data tier is being shaken up. While some might see
it as a move away from MySQL and PostgreSQL, the traditional open source
relational data stores, it is actually a higher-level change. Much of this change
is the result of a few revelations [1]:
– a relational database isn’t always the right model or system for every piece of data
– relational databases are tricky to scale; normalization often hurts perfor-
mance
– in many applications, primary key lookups are all you need
The new data stores vary quite a bit in their specific features, but in general
they derive from a similar set of high-level characteristics. Not all of them meet
all of these, of course, but just looking at the list gives you a sense of what they
are trying to accomplish.
– de-normalized, often schema-free, document storage
– key/value based, supporting lookups by key
– horizontal scaling (ability of a software or hardware system or a network to
grow without breaking down or requiring an expensive redesign)
– built-in replication
– HTTP/REST or easy to program APIs
– support for Map/Reduce style programming (a programming model and an
associated implementation for processing and generating large data sets)
– Eventually Consistent [2] (when no updates occur for a long period of time,
eventually all updates will propagate through the system and all the replicas
will be consistent).
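The last item in the list can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Replica` class, the integer version numbers, the last-writer-wins rule and the ring-shaped synchronization order are all hypothetical choices made for the example, not a description of any particular data store.

```python
class Replica:
    """One copy of the data; stores key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, version, value):
        # Last-writer-wins (an assumption): keep the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)


def synchronize(a, b):
    """Exchange entries so both replicas hold the newest version of each key."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, version, value)
    for key, (version, value) in list(b.data.items()):
        a.write(key, version, value)


replicas = [Replica() for _ in range(3)]
replicas[0].write("user:1", 1, "alice")
replicas[2].write("user:1", 2, "alice-renamed")  # a newer update lands on another replica

# Right after the writes, the three replicas give three different answers.
stale = {r.data.get("user:1") for r in replicas}

# With no further updates, background synchronization spreads the newest version.
for _ in range(2):
    for i in range(3):
        synchronize(replicas[i], replicas[(i + 1) % 3])

converged = {r.data.get("user:1") for r in replicas}
```

After the synchronization rounds every replica returns the same (newest) entry, which is the property the definition above describes.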
The movement to these distributed schema-free data stores has begun to use
the name NoSQL.
2 What is NoSQL?
First of all, the current disadvantages of relational databases need to be
addressed. A relational database like Microsoft SQL Server can be most easily
described as a table-based data system where there is minimal data duplication
and where sets of data can be accessed through a series of relational operators
like joins and unions. The problem with such relations is that complex operations
on large data sets quickly become big consumers of resources, although the
benefits are generally collected at the application level.
Adam Wiggins (Heroku) discusses ways of getting around these limitations
in his article SQL Databases Don’t Scale [3]. He presents the well-known
tactics for propping up relational databases behind huge applications (vertical
scaling, sharding or partitioning, and read slaves; see footnote 1 on sharding),
and also enumerates their drawbacks.
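To make the sharding tactic concrete, here is a minimal sketch of key-based sharding and of one commonly cited cost of the naive scheme. The function name `shard_for`, the `user:N` key format and the choice of MD5 as a stable hash are assumptions made for the illustration; they are not taken from the cited article.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key and taking the remainder."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


keys = [f"user:{i}" for i in range(1000)]

# Placement with four database servers, then with a fifth one added.
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}

# Count how many keys now live on a different server and must be migrated.
moved = sum(1 for k in keys if before[k] != after[k])
```

With modulo placement, most keys change shard when the number of servers changes, so growing the cluster implies a large data migration. This is one reason sharding a relational database by hand is considered operationally expensive.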
So why are relational databases just now becoming a problem? Eric Florenzano
puts it best: “As the web has grown more social, however, more and more it’s the
people themselves who have become the publishers. And with that fundamental
shift away from read-heavy architectures to read/write and write-heavy
architectures, a lot of the way that we think about storing and retrieving data needed
to change.” [4] The solution to this problem seems to be NoSQL: non-relational
data stores that “provide for web-scale data storage and retrieval especially in
web based applications because it views the data more closely to how web apps
view data - a key/value hash in the sky.” NoSQL is the name used for the data
stores behind the growing class of web applications that need to scale effectively.
Such applications can scale horizontally on clusters of commodity hardware
without being subject to complicated sharding techniques.
Many Web and Java developers built their own data storage solutions,
following the example of those built by Google Inc. and Amazon.com Inc., so that
they could manage without Oracle at the beginning. They released them as open
source afterwards. Now that their open source data stores manage hundreds of
terabytes or even petabytes of data for thriving Web 2.0 and cloud computing
vendors, switching back is not feasible technically, economically or even
ideologically.
Footnote 1: Database sharding can be simply defined as a “shared-nothing”
partitioning scheme for large databases across a number of servers, enabling new
levels of database performance and scalability.
Johan Oskarsson, a Web developer at Last.fm and the organizer of the
NoSQL meeting that took place in San Francisco in June 2009, stresses that
“Web 2.0 companies can take chances and they need scalability”. He points
out that having these two combined is what makes NoSQL so compelling. He
also says that many developers have even stopped using the open source MySQL
database, a long-time Web 2.0 favorite, in favor of a NoSQL alternative, because
the advantages were too compelling to ignore.
What are these advantages that are too compelling to ignore?
3 What are the benefits NoSQL provides?
First of all, they aren’t simply databases. Amazon.com’s CTO, Werner Vogels,
refers to the company’s Dynamo system as a “highly available key-value store” [5].
Google calls its BigTable, the other role model for many NoSQL enthusiasts,
a “distributed storage system for managing structured data” [6].
Second of all, they can easily handle considerable amounts of data. Hypertable,
an open source column-based database modeled upon BigTable, is used
by local search engine Zvents Inc. to write 1 billion cells of data per day,
according to a presentation by Doug Judd [7], a Zvents engineer. Meanwhile BigTable,
in conjunction with its sister technology, Map/Reduce, processes as much as 20
petabytes of data per day [8]. Map/Reduce is a programming model and an
associated implementation for processing and generating large data sets. Users
specify a map function that processes a key/value pair to generate a set of
intermediate key/value pairs, and a reduce function that merges all interme-
diate values associated with the same intermediate key. Programs written in
this functional style are automatically parallelized and executed on a large clus-
ter of commodity machines. The run-time system takes care of the details of
partitioning the input data, scheduling the program’s execution across a set of
machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with paral-
lel and distributed systems to easily utilize the resources of a large distributed
system.
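The map and reduce functions described above can be sketched with the classic word-count example. This is a single-process illustration only: the driver `map_reduce` is a hypothetical stand-in for the run-time system and does none of the partitioning, scheduling or failure handling that a real cluster implementation provides.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: turn one (key, value) input pair into intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def map_reduce(inputs, map_fn, reduce_fn):
    """Toy sequential driver: group intermediate pairs by key, then reduce each group."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())


documents = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog and the fox"),
]
word_counts = map_reduce(documents, map_fn, reduce_fn)
```

Because each call to `map_fn` depends only on its own input pair, and each call to `reduce_fn` only on one key's values, both phases can be spread over many machines, which is what makes the model easy to parallelize.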
“Definitely, the volume of data is getting so huge that people are looking at
other technologies,” said Jon Travis from SpringSource, whose VPork (Voldemort
Performance Testing Framework) technology helps NoSQL users benchmark the
performance of their database alternative. Travis, who is Principal Engineer at
Hyperic, which was acquired by SpringSource, put together a basic performance
testing framework to prove out Voldemort for use in his company.
Another benefit worth mentioning is that these NoSQL databases run on
clusters of cheap PC servers. PC clusters can be easily and cheaply expanded
without the complexity and cost of sharding, which involves cutting up databases
into multiple tables to run on large clusters or grids. Google has said that one
of BigTable’s bigger clusters manages as much as 6 petabytes of data [9] across
thousands of servers. “Oracle would tell you that with the right degree of
hardware and the right configuration of Oracle RAC (Real Application Clusters)
and other associated magic software, you can achieve the same scalability. But
at what cost?” asks Javier Soltero, CTO of SpringSource.
The NoSQL systems also address performance issues. NoSQL architectures can
perform much faster by avoiding the time-consuming task of translating Web or
Java applications and data into a SQL-friendly format. “SQL is an awkward fit
for procedural code, and almost all code is procedural,” said Curt Monash, a
leading analyst of and strategic advisor to the software industry. For data upon
which users expect to do heavy, repeated manipulations, the cost of mapping
data into SQL is “well worth paying. But when your database structure is very,
very simple, SQL may not seem that beneficial.”
Raffaele Sena, Senior Computer Scientist in Adobe’s Business Productivity
Unit, when asked about Adobe ConnectNow (a Web collaboration service), its
Terracotta integration and how it addressed their web site scalability
requirements, said that Adobe decided against using a relational database for just the
reason raised by Monash. Adobe uses Java clustering software from Terracotta
Inc. to manage data in Java formats, which Sena says is key to boosting
ConnectNow’s performance two to three times over the prior version. “The system would
have been more complex and harder to develop using a relational database,” he
said. Another project, MongoDB, calls itself a “document-oriented” database
because of its native storage of object-style data.
But it is important to note that, being open source, NoSQL alternatives often
lack vendors offering formal support. This fact isn’t seen as a problem
by most supporters of the movement, as they are closely in touch with the
community. But some admitted that working without a formal “throat to choke”
[10] when things go wrong was scary, at least for their managers.
“We did have to do some selling,” admitted Adobe’s Sena. “But basically
after they saw our first prototype was working, we were able to convince the
higher-ups that this was the right way to go.” Despite the huge promise of these
systems, most enterprises needn’t worry that they are missing out just yet, said
Monash. “Most large enterprises have an established way of doing OLTP [online
transaction processing], probably via relational database management systems.
Why change?” he said. Map/Reduce and similar BI-oriented projects “may be useful
for enterprises. But where it is, it probably should be integrated into an analytic
DBMS [database management system].” Even the NoSQL meeting’s organizer,
Oskarsson, admits that his company, Last.fm, has yet to move to a NoSQL
alternative for production, instead relying on open-source databases. He agrees
that a revolution, for now, remains on hold. “It’s true that [NoSQL] aren’t relevant
right now to mainstream enterprises,” Oskarsson said, “but that might change one
to two years down the line.”
But one thing that has to be underlined is that NoSQL is not, and never was,
intended to be a replacement for more mainstream SQL databases. There is no
war between relational and non-relational databases. There’s nothing stopping
people from splitting up data in their web application and using both types of
data stores where it makes sense. As Brad Anderson of Cloudant says, NoSQL
is about “right tools for the job” as opposed to being anti-relational or replacing
traditional solutions.
4 NoSQL Databases
The need to look at NoSQL systems arises out of scalability issues with
relational databases, which are a function of the fact that relational databases were
not designed to be distributed (which is key to write scalability), and could thus
afford to provide abstractions like ACID transactions and a rich high-level query
model. All NoSQL databases try to address the scalability issue in one or more
ways: by being distributed, by providing a simpler data / query model, by relaxing
consistency requirements, etc.
4.1 Project Voldemort
Voldemort is a distributed key-value storage system where automatically parti-
tioned data is replicated over multiple servers. It is used at LinkedIn for certain
high-scalability storage problems where simple functional partitioning is not suf-
ficient [11].
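The idea of automatically partitioned data replicated over multiple servers can be sketched as follows. The text above does not say which partitioning scheme Voldemort uses, so this example assumes a consistent-hash ring with virtual nodes, a common choice for this kind of store. The class name `Ring`, the node names and the parameters `replicas` and `vnodes` are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable hash so that placement does not change between runs."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Consistent-hash ring: each key is stored on `replicas` distinct nodes."""

    def __init__(self, nodes, replicas=2, vnodes=64):
        self.nodes = set(nodes)
        self.replicas = replicas
        # Every node owns several points ("virtual nodes") on the ring.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def nodes_for(self, key):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect(self._hashes, _hash(key))
        wanted = min(self.replicas, len(self.nodes))
        chosen = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == wanted:
                break
        return chosen


ring = Ring(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("member:42")  # two different servers hold this key

# Adding a server only moves the keys that the new server takes over.
bigger = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=2)
keys = [f"member:{i}" for i in range(1000)]
moved = sum(1 for k in keys if ring.nodes_for(k)[0] != bigger.nodes_for(k)[0])
```

In this scheme, adding a fourth server changes the primary location of only the keys the new server takes over, so the cluster can grow without reshuffling most of the data.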
Many of LinkedIn’s products, like the modules People You May Know and
Viewers of This Profile Also Viewed, and much of the job matching functionality
that LinkedIn gives to people who post jobs on the site, rely heavily on
computationally intensive data mining algorithms. The difficulty in these
systems comes from the fact that large amounts of data need to be moved around
every day. Thus, although hundreds of gigabytes or terabytes of data are not
too difficult to handle when sitting still in a storage system, the problem becomes
much, much harder when the data must be transformed to support quick lookups
and moved between systems on a daily basis.
To solve this problem they spent some time thinking about how to build
support for large daily data cycles. Voldemort was designed to support fast,
scalable read/write loads, and is already used in a number of systems at LinkedIn.
It was not designed specifically with batch computation in mind, but it has
a pluggable architecture which allows multiple storage engines to be supported
in the same framework. This makes it possible to integrate a fast, failure-resistant
online storage system with the heavy offline data crunching running on Hadoop.
4.2 CouchDB
CouchDB is one of the most popular and mature document-oriented databases,
and it is written in Erlang. Its primary focus is robustness, high concurrency, and
fault tolerance. One key distinction compared to other systems is its bi-directional
incremental replication. CouchDB now has over 100 production users, three books
are being written, and the community is vibrant.
CouchDB’s documents are JSON based and they can have binary attachments.
Each document has a revision which is deterministically generated from
the document content. It is very robust since it never overwrites previously
written data. There is therefore no repair step after a server crash, and one can
take backups with cp. Concurrency is another one of the benefits of CouchDB’s
design. It uses the Erlang approach with lightweight processes, which means one
process per TCP connection. The architecture is also lock free. The API is REST
based, using standard verbs: GET, PUT, POST, and DELETE.
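The two properties described above, revisions derived deterministically from the content and storage that never overwrites earlier data, can be sketched as follows. This is a toy in-memory model, not CouchDB's implementation: the class `AppendOnlyStore`, the `generation-hash` revision format and the rule that a stale revision is rejected are assumptions made for the illustration.

```python
import hashlib
import json


class RevisionConflict(Exception):
    """Raised when a writer did not start from the latest revision."""


class AppendOnlyStore:
    def __init__(self):
        self._log = []      # every revision ever written, never modified in place
        self._latest = {}   # doc_id -> position of the newest revision in the log

    @staticmethod
    def _rev(generation, body):
        # Same content always yields the same revision id (keys are sorted first).
        canonical = json.dumps(body, sort_keys=True).encode("utf-8")
        return f"{generation}-{hashlib.md5(canonical).hexdigest()}"

    def get(self, doc_id):
        index = self._latest.get(doc_id)
        return self._log[index] if index is not None else None

    def put(self, doc_id, body, rev=None):
        current = self.get(doc_id)
        current_rev = current["_rev"] if current else None
        if rev != current_rev:
            raise RevisionConflict(f"expected {current_rev}, got {rev}")
        generation = int(current_rev.split("-")[0]) + 1 if current_rev else 1
        new_rev = self._rev(generation, body)
        # Append a new record; the old revision stays untouched in the log.
        self._log.append({"_id": doc_id, "_rev": new_rev, **body})
        self._latest[doc_id] = len(self._log) - 1
        return new_rev


store = AppendOnlyStore()
rev1 = store.put("post-1", {"title": "Hello"})
rev2 = store.put("post-1", {"title": "Hello, world"}, rev=rev1)

try:
    store.put("post-1", {"title": "stale edit"}, rev=rev1)  # based on an old revision
    conflict = False
except RevisionConflict:
    conflict = True
```

Because existing records are never changed, a crash in the middle of a write can at worst leave an incomplete record at the end of the log, which is the intuition behind the claim that no repair step is needed.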
Map/Reduce views are used for generating persistent representations of
document data. These are generally written in JavaScript. A really interesting feature
of the views is that they are generated incrementally. The views are stored in a
B-tree and kept up-to-date when new data is added.
The bi-directional replication is peer based (two nodes). One can replicate a
subset of documents meeting certain criteria. The replication happens over HTTP,
which makes replication across datacenters easy and secure. In a multi-master
replication setup CouchDB can deterministically choose which revision is the
winner (with the losing revision saved as well).
One of the first adopters of CouchDB at scale was the BBC.
They needed flexibility in schema and robustness. They used CouchDB as a simple
key/value store for their existing application infrastructure. It has proven to
be robust in production for several years and continues to scale to their
demands of data and concurrency. Scoopler [12] is a real-time aggregation service
with a large and rapidly growing data volume. The schema flexibility was crucial
when they selected CouchDB.
An unnamed real-time analytics service migrated from a 40+ table PostgreSQL
setup to a single CouchDB document type with only two views. Ubuntu
9.10 includes the Ubuntu One system, which stores users’ address books in
CouchDB. Replication is the killer feature in this scenario.
4.3 Cassandra
Digg is probably the only large site which has Cassandra in production (Facebook
runs a forked version). Digg has been researching ways to scale its database
infrastructure for some time now. Up until now it has used a normal LAMP
stack. Step one was to adopt a traditional vertically partitioned master-slave
configuration with MySQL, and the team also investigated sharding MySQL with
IDDB (a way to partition both indexes, that is integer sequences and unique
character indexes, and actual tables across multiple storage servers).
They then went looking for alternatives. After considering HBase, Hypertable,
Cassandra, Tokyo Cabinet, Voldemort and Dynomite, they settled on
Cassandra, because it offers column-oriented data storage on a highly available,
peer-to-peer cluster. Even if it currently lacks some core features, it was
the optimal solution for Digg. They wanted something open source, scalable,
efficient, and easily administrable. They picked Cassandra because of the promise
of easier administration, no single point of failure, more flexibility than a simple
key/value store, very fast writes, a growing community, and because it is Java
based (three out of four people on the team were comfortable with Java) [13].
Digg implemented the green flag feature in Cassandra as a proof of concept.
These flags appear on the Digg icon for a story when one of your friends has
dug it. They did a dark launch with MySQL running alongside. First they just
wrote data to Cassandra, then they enabled reading from Cassandra. Based on
the results of the proof of concept, Digg is going to port the entire application
to Cassandra. Digg is going to continue to use MySQL in some places, according
to the saying “Use the right tool for the job”.
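The dark-launch sequence described above (write to both stores first, enable reads from the new store later) can be sketched with a small wrapper. Everything here is hypothetical: the class `DarkLaunchStore`, the key format and the use of plain dictionaries as stand-ins for MySQL and Cassandra are assumptions for the illustration, not Digg's code.

```python
class DarkLaunchStore:
    """Keeps the old store authoritative while the new one is exercised in the dark."""

    def __init__(self, primary, shadow, read_from_shadow=False):
        self.primary = primary            # existing store, source of truth
        self.shadow = shadow              # new store under evaluation
        self.read_from_shadow = read_from_shadow
        self.shadow_errors = 0
        self.mismatches = 0

    def put(self, key, value):
        self.primary[key] = value
        try:
            self.shadow[key] = value      # phase 1: dual writes
        except Exception:
            self.shadow_errors += 1       # a shadow failure must not affect users

    def get(self, key):
        expected = self.primary.get(key)
        if not self.read_from_shadow:
            return expected
        try:
            actual = self.shadow.get(key)  # phase 2: reads from the new store
        except Exception:
            self.shadow_errors += 1
            return expected
        if actual != expected:
            self.mismatches += 1           # record the difference, serve the old value
            return expected
        return actual


mysql_like, cassandra_like = {}, {}
store = DarkLaunchStore(mysql_like, cassandra_like)
store.put("story:1:friend_diggs", ["alice"])

store.read_from_shadow = True              # switch on reads once writes look healthy
flags = store.get("story:1:friend_diggs")
```

The counters give the team evidence about the new store's behavior under production traffic before any user depends on it.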
Ian Eure, Senior Core Infrastructure Software Engineer at Digg, declares
their interest in NoSQL in general and Cassandra specifically. He states that they
believe in this technology, and they are contributing to its ongoing development,
both by submitting patches and by funding development of features necessary
to support wide scale deployment.
4.4 MongoDB
MongoDB is an open source, non-relational database that combines three key
qualities: scalable, schema-less, and queryable. It has native drivers for pretty
much every major language, and a small but growing community. Mongo’s design
trades off a few traditional features of databases (notably joins and transactions)
in order to achieve much better performance. It is perhaps most comparable to
CouchDB for its JSON document-oriented approach, but has much better query-
ing capabilities: you can do dynamic queries without pre-generating expensive
views. So Mongo occupies a sweet spot for powering web apps.
BusinessInsider.com, a business news site launched in February 2009, runs on
a LAMP platform: Linux, Apache, Mongo, PHP. The M comes from Mongo, not
from MySQL, as it usually does. They use MongoDB for different reasons. First
of all, it’s scalable. Next, it’s document-oriented, not relational. RDBMSs were
invented in the 1970s, long before object-oriented programming and dynamic
scripting languages became popular. By now, we’re all accustomed to the process
of translating our code’s data structures back and forth to the tables in our
database, but it doesn’t have to be that way. Rather than rows in a table, Mongo
stores documents in collections. Documents are slightly enhanced JSON objects,
so you can stash much more complex structured data in a single document
than you can store in a table row. Data modeling becomes a much more natural
process: instead of using multiple tables and joining them together with foreign
keys, objects can be embedded within a single document.
For example, each post on their site is a document, much as a post would be
a row in a table in a MySQL-based system. But comments are different.
Comments are embedded directly within the post document as an array of objects.
All of the comment data, including the text of each comment, information
on who posted it, and the thumbs up/thumbs down voting, is stored directly
within the post document. When the code pulls up a post, the
database doesn’t have to query a separate comments table. The comments
are right there as part of the post object, ready to be displayed. This is faster,
and makes intuitive sense [13].
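The embedded-comments layout described above might look like the following document, written here as a plain Python dictionary. All field names and values (`_id`, `title`, `thumbs_up` and so on) are made up for the illustration; the actual schema used by the site is not given in the text.

```python
import json

# One post, with its comments embedded as an array of objects.
post = {
    "_id": "example-post",
    "title": "Example post",
    "body": "Post text goes here",
    "comments": [
        {"author": "reader1", "text": "Nice post", "thumbs_up": 3, "thumbs_down": 0},
        # An older comment stored before voting fields existed.
        {"author": "reader2", "text": "I disagree"},
    ],
}


def comment_score(comment):
    """Missing vote fields default to zero, so old and new comments coexist."""
    return comment.get("thumbs_up", 0) - comment.get("thumbs_down", 0)


# Rendering the page needs only this one document: no join, no second query.
scores = [comment_score(c) for c in post["comments"]]
serialized = json.dumps(post)  # the whole post round-trips as a single JSON object
```

The trade-off is that comments cannot be queried as easily on their own, which is acceptable when they are almost always read together with their post.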
Another benefit when using MongoDB is that there is no database-enforced
schema, so when a notable change is made (like adding thumbs-ups to the
comments), it can be easily done in a backwards-compatible way. Regarding caching,
BusinessInsider does a lot less caching than they would on a MySQL database.
Mongo is very fast at retrieving individual objects, so there is no need to cache
individual posts. It is usually going to be as fast as Memcached for retrieving
individual documents. And Mongo itself can be used as an effective caching layer. If
your collection is small, Mongo will keep it entirely in memory and performance
will be comparable to a cache.
Another plus for Mongo is that it can store binary data in the database, so
that they don’t have to deal with the common hassle of having files in the file
system and metadata in the database. Using its GridFS API, all of the site’s images
can easily be stashed in Mongo.
SourceForge.net had a large redesign last summer in which they moved to
MongoDB. Their goal was to store the front pages, project pages, and download
pages in a single document. It’s deployed with one master and 5-6 read-only
slaves (scaled for reads and reliability).
4.5 Amazon S3 - Simple Storage Service
Amazon has announced Amazon S3 - Simple Storage Service. It is not intended
for the general public, but rather for software developers who want to
work with the Amazon Web Services system. The Amazon Web Services Newsletter
describes some specific details: “Amazon S3 is storage for the Internet. It is
designed to make web-scale computing easier for developers. Amazon S3 provides a
simple web services interface that can be used to store and retrieve any amount
of data, at any time, from anywhere on the web. It gives any developer access to
the same highly scalable, reliable, fast, inexpensive data storage infrastructure
that Amazon uses to run its own global network of web sites. The service aims
to maximize benefits of scale and to pass those benefits on to developers.”
Amazon S3 Functionality. Amazon S3 is intentionally built with a minimal
feature set.
– “Write, read, and delete objects containing from 1 byte to 5 gigabytes of
data each. The number of objects you can store is unlimited.”
– “Each object is stored and retrieved via a unique, developer-assigned key.”
– “Authentication mechanisms are provided to ensure that data is kept secure
from unauthorized access. Objects can be made private or public, and rights
can be granted to specific users.”
– “Uses standards-based REST and SOAP interfaces designed to work with
any Internet-development toolkit.”
– “Built to be flexible so that protocol or functional layers can easily be added.
Default download protocol is HTTP. A BitTorrent (TM) protocol interface
is provided to lower costs for high-scale distribution. Additional interfaces
will be added in the future.”
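The first three items of the feature list (key-addressed write, read and delete, the size limits, and private or public objects) can be modeled with a toy in-memory class. This is not the Amazon S3 API: the class `ObjectStore`, its method names, the boolean `authenticated` flag and the interpretation of 5 gigabytes as 5 * 1024^3 bytes are all assumptions made for the illustration.

```python
class ObjectStore:
    """Toy model of a key-addressed object store with a minimal feature set."""

    MIN_SIZE = 1                 # bytes
    MAX_SIZE = 5 * 1024 ** 3     # "5 gigabytes", read here as binary gigabytes

    def __init__(self):
        self._objects = {}

    def put(self, key, data: bytes, public=False):
        """Store an object under a unique, developer-assigned key."""
        if not (self.MIN_SIZE <= len(data) <= self.MAX_SIZE):
            raise ValueError("object size must be between 1 byte and 5 gigabytes")
        self._objects[key] = {"data": data, "public": public}

    def get(self, key, authenticated=False):
        """Return the object; private objects require an authenticated caller."""
        obj = self._objects[key]
        if not obj["public"] and not authenticated:
            raise PermissionError("object is private")
        return obj["data"]

    def delete(self, key):
        self._objects.pop(key, None)


bucket = ObjectStore()
bucket.put("images/logo.png", b"\x89PNG...", public=True)
bucket.put("reports/q3.txt", b"confidential")

logo = bucket.get("images/logo.png")                       # public: anyone may read
report = bucket.get("reports/q3.txt", authenticated=True)  # private: needs credentials
bucket.delete("images/logo.png")
```

The small surface (put, get, delete by key) is what the text means by a minimal feature set: there is no query language and no relationship between objects.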
5 Conclusions
It is obvious that relational database systems are no longer the main keepers of
the data, and that is especially true of some of the large companies that have
risen during the Internet era: Amazon, Google, Facebook, LinkedIn, and others.
But it is also true that many have invested heavily in Oracle, DB2 or MS SQL,
and the truth is those databases are still serving their needs. It is highly
unlikely that relational databases will disappear any time soon, but we may see
a gradual move towards open source non-SQL data stores for reasons of cost,
simplicity and scalability.
References
1. Jeremy Zawodny: NoSQL: Distributed and Scalable Non-Relational Database
Systems. Linux Magazine, October, 2009
2. Werner Vogels: Eventually Consistent. ACM Queue Magazine, December 4, 2008
3. Adam Wiggins: SQL Databases Don’t Scale, Jul 06
4. Eric Florenzano: My thoughts on NoSQL, July 21, 2009
5. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall
and Werner Vogels: Dynamo: Amazon’s Highly Available Key-value Store
6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,
Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber: Bigtable:
A Distributed Storage System for Structured Data. OSDI’06: Seventh Symposium
on Operating System Design and Implementation, Seattle, WA, November, 2006
7. Doug Judd: Hypertable, June 2009
8. Eric Lai: Researchers: Databases still beat Google’s MapReduce, April, 2009
9. Stephen Shankland: Google spotlights data center inner workings, May, 2008
10. Eric Lai: Red Hat Puts the Heat on Oracle. Computerworld, May 2007
11. Jay Kreps: Project Voldemort: Scaling Simple Storage at LinkedIn. LinkedIn blog,
March, 2009
12. http://www.scoopler.com/
13. Ian Eure: Looking to the future with Cassandra. Digg blog, September, 2009