The NoSQL movement

                                Raluca Gheorghita

         Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi



      Abstract. As the amount and pace of data generation keep growing,
      businesses are stepping away from traditional RDBMS solutions and
      turning to highly scalable data stores. Numerous papers on this topic
      are being published, for instance by well-known companies like Google,
      Facebook and Amazon, and open-source projects are coming into existence.
      Next-generation databases mostly address some of these points: being
      non-relational, distributed, open source and horizontally scalable. The
      movement, named with the misleading term NoSQL (which the community now
      reads as “not only SQL”), began in early 2009 and is growing rapidly.
      Often more characteristics apply, such as schema-free design,
      replication support, an easy API, eventual consistency, and more.


1   Introduction

There is an interesting transition taking place in the world of Web-scale data
stores, as an entirely new type of scalable data store is gaining popularity very quickly.
The traditional LAMP (Linux, Apache HTTP Server, MySQL, and PHP, Python
or Perl) stack is starting to look like a thing of the past. For a few years now,
Memcached (free and open source, high-performance, distributed memory object
caching system, generic in nature, but intended for use in speeding up dynamic
web applications by alleviating database load) has often appeared right next to
MySQL, and now the whole data tier is being shaken up. While some might see
it as a move away from MySQL and PostgreSQL, the traditional open source
relational data stores, it is actually a higher-level change. Much of this change
is the result of a few revelations [1]:

 – a relational database isn’t always the best model or system for every piece
   of data
 – relational databases are tricky to scale; normalization often hurts
   performance
 – in many applications, primary key lookups are all you need

    The new data stores vary quite a bit in their specific features, but in general
they derive from a similar set of high-level characteristics. Not all of them meet
all of these, of course, but just looking at the list gives you a sense of what they
are trying to accomplish.

 – de-normalized, often schema-free, document storage
 – key/value based, supporting lookups by key
 – horizontal scaling (the ability of a software or hardware system or a
   network to grow without breaking down or requiring an expensive redesign)
 – built-in replication
 – HTTP/REST or easy-to-program APIs
 – support for Map/Reduce-style programming (a programming model and an
   associated implementation for processing and generating large data sets)
 – eventual consistency [2] (when no updates occur for a long period of time,
   eventually all updates will propagate through the system and all the
   replicas will be consistent).

   The movement to these distributed schema-free data stores has begun to use
the name NoSQL.


2     What is NoSQL?

First of all, the current disadvantages of relational databases need to be
addressed. A relational database like Microsoft SQL Server can be most easily
described as a table-based data system where there is minimal data duplication
and where sets of data can be accessed through a series of relational operators
like joins and unions. The problem with such relations is that complex
operations on large data sets quickly become big consumers of resources,
although generally the benefits are collected at the application level.
    Adam Wiggins (Heroku) points out ways of getting around these limitations
in his article SQL Databases Don’t Scale [3]. He presents the well-known
tactics for scaling relational databases for huge applications (vertical
scaling, sharding¹ or partitioning, and read slaves), and also enumerates their
drawbacks. So why are relational databases just now becoming a problem? Eric
Florenzano puts it best: “As the web has grown more social, however, more and
more it’s the people themselves who have become the publishers. And with that
fundamental shift away from read-heavy architectures to read/write and
write-heavy architectures, a lot of the way that we think about storing and
retrieving data needed to change.” [4] The solution to this problem seems to be
NoSQL: non-relational data stores that “provide for web-scale data storage and
retrieval especially in web based applications because it views the data more
closely to how web apps view data - a key/value hash in the sky.” NoSQL is used
to describe the data stores behind the currently growing class of web
applications that need to scale effectively. Applications can scale
horizontally on clusters of commodity hardware without being subject to
complicated sharding techniques.
    Many Web and Java developers built their own data storage solutions,
following the example of those built by Google Inc. and Amazon.com Inc., so that
they could manage without Oracle at the beginning. They released them as open
source afterwards. Now that their open source data stores manage hundreds of
terabytes or even petabytes of data for thriving Web 2.0 and cloud computing
vendors, switching back is not technically, economically or even ideologically
feasible.
    Johan Oskarsson, a Web developer at Last.fm and the organizer of the
NoSQL meeting that took place in San Francisco in June 2009, stresses that
“Web 2.0 companies can take chances and they need scalability”. He points
out that having these two combined is what makes NoSQL so compelling. He
also says that many developers had even stopped using the open source MySQL
database, a long-time Web 2.0 favorite, in favor of a NoSQL alternative,
because the advantages were too compelling to ignore.
    What are these advantages that are so compelling they can’t be ignored?

¹ Database sharding can be simply defined as a “shared-nothing” partitioning
  scheme for large databases across a number of servers, enabling new levels of
  database performance and scalability.


3    What are the benefits NoSQL provides?

First of all, they aren’t simply databases. Amazon.com’s CTO, Werner Vogels,
refers to the company’s Dynamo system as a ”highly available key-value store.”
[5] Google calls its BigTable, the other role model for many NoSQL enthusiasts,
a ”distributed storage system for managing structured data.” [6]
    Second of all, they can easily handle considerable amounts of data.
Hypertable, an open source column-based database modeled upon BigTable, is used
by local search engine Zvents Inc. to write 1 billion cells of data per day,
according to a presentation by Doug Judd [7], a Zvents engineer. Meanwhile BigTable,
in conjunction with its sister technology, Map/Reduce, processes as much as 20
petabytes of data per day [8]. Map/Reduce is a programming model and an
associated implementation for processing and generating large data sets. Users
specify a map function that processes a key/value pair to generate a set of
intermediate key/value pairs, and a reduce function that merges all interme-
diate values associated with the same intermediate key. Programs written in
this functional style are automatically parallelized and executed on a large clus-
ter of commodity machines. The run-time system takes care of the details of
partitioning the input data, scheduling the program’s execution across a set of
machines, handling machine failures, and managing the required inter-machine
communication. This allows programmers without any experience with paral-
lel and distributed systems to easily utilize the resources of a large distributed
system.
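    The map and reduce functions described above can be illustrated with a minimal, single-process sketch. It is only an approximation of the programming model: there is no parallelism, partitioning or failure handling, and the names `map_fn`, `reduce_fn` and `map_reduce`, as well as the word-count task itself, are assumptions chosen for illustration rather than anything taken from Google's implementation.

```python
from collections import defaultdict
from itertools import chain


def map_fn(doc_id, text):
    """Map: turn one input key/value pair into intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def map_reduce(inputs, mapper, reducer):
    # Map phase: every input pair is processed independently of the others,
    # which is what would let a real runtime spread the work over machines.
    intermediate = chain.from_iterable(mapper(k, v) for k, v in inputs.items())

    # Shuffle phase: group intermediate values by their intermediate key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per distinct intermediate key.
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input: two tiny "documents" keyed by an identifier.
docs = {"d1": "NoSQL stores scale", "d2": "relational stores scale too"}
word_counts = map_reduce(docs, map_fn, reduce_fn)
```

    Because each mapper call and each reducer call depends only on its own arguments, a runtime is free to schedule them on different machines, which is the property the paragraph above relies on.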
    “Definitely, the volume of data is getting so huge that people are looking at
other technologies,” said Jon Travis from SpringSource, whose VPork (Voldemort
Performance Testing Framework) technology helps NoSQL users benchmark the
performance of their database alternative. Travis, a Principal Engineer at
Hyperic, which was acquired by SpringSource, put together a basic performance
testing framework to evaluate Voldemort for use in his company.
    Another benefit worth mentioning is that these NoSQL databases run on
clusters of cheap PC servers. PC clusters can be easily and cheaply expanded
without the complexity and cost of sharding, which involves cutting up databases
into multiple tables to run on large clusters or grids. Google has said that one
of BigTable’s bigger clusters manages as much as 6 petabytes of data [9] across
thousands of servers. ”Oracle would tell you that with the right degree of hard-
ware and the right configuration of Oracle RAC (Real Application Clusters)
and other associated magic software, you can achieve the same scalability. But
at what cost?” asks Javier Soltero, CTO of SpringSource.
    The NoSQL systems also solve performance issues. NoSQL architectures per-
form much faster by avoiding the time-consuming task of translating Web or
Java applications and data into a SQL-friendly format. ”SQL is an awkward fit
for procedural code, and almost all code is procedural,” said Curt Monash, a
leading analyst of and strategic advisor to the software industry. For data upon
which users expect to do heavy, repeated manipulations, the cost of mapping
data into SQL is ”well worth paying. But when your database structure is very,
very simple, SQL may not seem that beneficial.”
    Raffaele Sena, Senior Computer Scientist in Adobe’s Business Productivity
Unit, was asked about Adobe ConnectNow (a Web collaboration service), its
Terracotta integration, and how this addressed the site’s scalability
requirements. He said that Adobe decided against using a relational database
for just the reason raised by Monash. Adobe uses Java clustering software from
Terracotta Inc. to manage data in Java formats, which Sena says is key to
boosting ConnectNow’s performance two to three times over the prior version.
“The system would have been more complex and harder to develop using a
relational database,” he said. Another project, MongoDB, calls itself a
“document-oriented” database because of its native storage of object-style data.
    But it is important to note that NoSQL alternatives lack vendors offering
formal support because they are open source. This fact isn’t seen as a problem
by most supporters of the movement as they are closely in touch with the com-
munity. But some admitted that working without a formal ”throat to choke”
[10] when things go wrong was scary, at least for their managers.
    ”We did have to do some selling,” admitted Adobe’s Sena. ”But basically
after they saw our first prototype was working, we were able to convince the
higher-ups that this was the right way to go.” Despite their huge promise, most
enterprises needn’t worry that they are missing out just yet, said Monash. ”Most
large enterprises have an established way of doing OLTP [online transaction pro-
cessing], probably via relational database management systems. Why change?”
he said. Map/Reduce and similar BI-oriented projects ”may be useful for enter-
prises. But where it is, it probably should be integrated into an analytic DBMS
[database management system.]” Even NoSQL’s organizer, Oskarsson, admits
that his company, Last.fm, has yet to move to a NoSQL alternative for produc-
tion, instead relying on open-source databases. He agrees that a revolution, for
now, remains on hold. ”It’s true that [NoSQL] aren’t relevant right now to main-
stream enterprises,” Oskarsson said, ”but that might change one to two years
down the line.”
    But one thing that has to be underlined is that NoSQL is not, and never was,
intended to be a replacement for more mainstream SQL databases. There is no
war between relational and non-relational databases. There’s nothing stopping
people from splitting up data in their web application and using both types of
data stores where it makes sense. As Brad Anderson of Cloudant says: NoSQL
is about ’right tools for the job’ as opposed to anti-relational or replacing tradi-
tional solutions.


4     NoSQL Databases
The need to look at NoSQL systems arises out of scalability issues with
relational databases. These issues stem from the fact that relational databases
were not designed to be distributed (which is key to write scalability), and
could thus afford to provide abstractions like ACID transactions and a rich
high-level query model. NoSQL databases try to address the scalability issue in
several ways: by being distributed, by providing a simpler data and query
model, by relaxing consistency requirements, and so on.

4.1   Project Voldemort
Voldemort is a distributed key-value storage system where automatically parti-
tioned data is replicated over multiple servers. It is used at LinkedIn for certain
high-scalability storage problems where simple functional partitioning is not suf-
ficient [11].
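    The idea of automatically partitioning keys and replicating each one over several servers can be sketched as follows. This is a toy model, not Voldemort's actual placement scheme: the modulo-based placement, the class `ToyCluster`, the helper names and the node names are all assumptions made for illustration.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def replicas_for(key, nodes, replication_factor):
    """Pick distinct nodes for a key by walking the node list as a ring."""
    start = partition_for(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


class ToyCluster:
    """In-memory stand-in for a partitioned, replicated key-value store."""

    def __init__(self, nodes, replication_factor=2):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds number of nodes")
        self.nodes = list(nodes)
        self.replication_factor = replication_factor
        self.storage = {node: {} for node in self.nodes}

    def put(self, key, value):
        # Write the value to every replica responsible for this key.
        for node in replicas_for(key, self.nodes, self.replication_factor):
            self.storage[node][key] = value

    def get(self, key, down=()):
        # Read from the first responsible replica that is not marked as down.
        for node in replicas_for(key, self.nodes, self.replication_factor):
            if node not in down:
                return self.storage[node].get(key)
        raise RuntimeError("all replicas for this key are down")


# Hypothetical three-node cluster storing one member profile.
cluster = ToyCluster(["node-a", "node-b", "node-c"], replication_factor=2)
cluster.put("member:42", "profile")
owners = replicas_for("member:42", cluster.nodes, 2)
```

    With a replication factor of two, a read still succeeds when the first owner of a key is unavailable, for example `cluster.get("member:42", down={owners[0]})`.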
    Many of LinkedIn’s products, like the modules People You May Know and
Viewers of This Profile Also Viewed, and much of the job matching functionality
that LinkedIn gives to people who post jobs on the site, rely heavily on
computationally intensive data mining algorithms. The difficulty in these
systems comes from the fact that large amounts of data need to be moved around
every day. Although hundreds of gigabytes or terabytes of data are not too
difficult to handle when sitting still in a storage system, the problem becomes
much, much harder when the data must be transformed to support quick lookups and
moved between systems on a daily basis.
    To solve this problem, the LinkedIn team spent some time thinking about how
to build support for large daily data cycles. Voldemort was designed to support
fast, scalable read/write loads, and is already used in a number of systems at
LinkedIn. It was not designed specifically with batch computation in mind, but
it has a pluggable architecture which supports multiple storage engines in the
same framework. This allows a fast, failure-resistant online storage system to
be integrated with the heavy offline data crunching running on Hadoop.

4.2   CouchDB
CouchDB is one of the most popular and mature document-oriented databases, and
it is written in Erlang. Its primary focus is robustness, high concurrency, and
fault tolerance. One key distinguishing feature compared to other systems is its
bi-directional incremental replication. CouchDB now has over 100 production
users, three books are being written, and the community is vibrant.
    CouchDB’s documents are JSON based and can have binary attachments. Each
document has a revision which is deterministically generated from the document
content. CouchDB is very robust since it never overwrites previously written
data. There is therefore no repair step after a server crash, and one can take
backups with cp. Concurrency is another benefit of CouchDB’s design. It uses
the Erlang approach of lightweight processes, with one process per TCP
connection. The architecture is also lock free. The API is REST based, using
the standard verbs GET, PUT, POST, and DELETE.
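    Two of the properties above, revisions derived deterministically from content and storage that never overwrites earlier data, can be sketched in a few lines. This is an illustrative model only: the hashing recipe, the `generation-hash` revision format and the class `AppendOnlyStore` are assumptions and do not reproduce CouchDB's internal algorithm or file format.

```python
import hashlib
import json


def revision_id(generation, body):
    """Derive a revision string from the document content alone."""
    # Canonical JSON (sorted keys, fixed separators) makes the hash independent
    # of the order in which the fields were written.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{generation}-{digest}"


class AppendOnlyStore:
    """Toy document store that only ever appends new revisions."""

    def __init__(self):
        self.log = []  # entries are (doc_id, revision, body); never rewritten

    def put(self, doc_id, body):
        generation = sum(1 for d, _, _ in self.log if d == doc_id) + 1
        rev = revision_id(generation, body)
        self.log.append((doc_id, rev, dict(body)))
        return rev

    def get(self, doc_id):
        # The latest appended entry for a document is its current revision.
        for d, rev, body in reversed(self.log):
            if d == doc_id:
                return rev, body
        raise KeyError(doc_id)


# Hypothetical document updated once; both revisions stay in the log.
store = AppendOnlyStore()
rev1 = store.put("doc", {"a": 1, "b": 2})
rev2 = store.put("doc", {"a": 1, "b": 3})
```

    Since old entries are never modified, a crash can at worst leave an incomplete entry at the end of the log, which is the intuition behind needing no repair step.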
    Map/Reduce views are used for generating persistent representations of
document data. These are generally written in JavaScript. A really interesting
feature of the views is that they are generated incrementally. The views are
stored in a B-tree and kept up to date when new data is added. The
bi-directional replication is peer based (two nodes). One can replicate a
subset of documents meeting certain criteria. The replication happens over
HTTP, which makes replication across datacenters easy and secure. In a
multi-master replication setup CouchDB can deterministically choose which
revision is the winner (with the losing revision saved as well). One of the
first adopters of CouchDB at scale was the BBC. They needed flexibility in
schema and robustness. They used CouchDB as a simple key/value store for their
existing application infrastructure. It has proven to be robust in production
for several years and continues to scale to their demands of data and
concurrency. Scoopler [12] is a real-time aggregation service with a large and
rapidly growing data volume. The schema flexibility was crucial when they
selected CouchDB.
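    The deterministic choice of a winning revision in a multi-master setup, mentioned earlier in this section, can be sketched as a pure function that every node evaluates identically. The specific rule used here (highest generation number first, then the lexicographically larger hash) and the function name `pick_winner` are assumptions for illustration, not a statement of CouchDB's exact rule.

```python
def pick_winner(revisions):
    """Return (winner, losers) for a set of conflicting revision strings.

    Revisions are assumed to look like "<generation>-<hash>". Because the
    ordering depends only on the revision strings, every node that sees the
    same conflicting revisions picks the same winner without coordination.
    """
    def sort_key(rev):
        generation, _, digest = rev.partition("-")
        return (int(generation), digest)

    ordered = sorted(revisions, key=sort_key, reverse=True)
    # Losing revisions are returned as well, so they can be kept for later
    # inspection or merging rather than discarded.
    return ordered[0], ordered[1:]


# Two hypothetical nodes receive the same conflict in a different order.
winner_a, losers_a = pick_winner(["2-aaa", "2-bbb", "1-zzz"])
winner_b, losers_b = pick_winner(["1-zzz", "2-bbb", "2-aaa"])
```

    Both nodes arrive at the same winner even though they saw the revisions in a different order, so no extra round of coordination is needed.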
    An unnamed real-time analytics service migrated from a PostgreSQL setup
with more than 40 tables to a single CouchDB document type with only two views.
Ubuntu 9.10 includes the Ubuntu One system, which stores users’ address books
in CouchDB. Replication is the killer feature in this scenario.

4.3   Cassandra
Digg is probably the only large site which has Cassandra in production (Facebook
runs a forked version). Digg has been researching ways to scale its database
infrastructure for some time now. Up until now Digg has used a normal LAMP
stack. Step one was to adopt a traditional vertically partitioned master-slave
configuration with MySQL, and they also investigated sharding MySQL with
IDDB (a way to partition both indexes - integer sequences and unique character
indexes - and actual tables across multiple storage servers).
    They then went out and looked for alternatives. After considering HBase,
Hypertable, Cassandra, Tokyo Cabinet, Voldemort and Dynomite, they settled on
Cassandra, because it offers column-oriented data storage in a highly available,
peer-to-peer cluster. Even if it currently lacks some core features, it was
the optimal solution for Digg. They wanted something open source, scalable,
efficient, and easily administrable. They picked Cassandra because of the
promise of easier administration, no single point of failure, more flexibility
than a simple key/value store, and very fast writes, and because the community
was growing and it was Java based (3 out of 4 people on the team were
comfortable with Java) [13].
    Digg implemented the green flag feature in Cassandra as a proof of concept.
These flags appear on the Digg icon for a story when one of your friends has
dug it. They did a dark launch with MySQL running alongside. First they just
wrote data to Cassandra, then they enabled reading from Cassandra. Based on
the results of the proof of concept, Digg is going to port the entire
application to Cassandra. Digg is going to continue to use MySQL in some
places, according to the saying “Use the right tool for the job”.
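    The two-step dark launch described above (write to both stores first, switch reads later) can be sketched as a thin wrapper around two back ends. Nothing here reflects Digg's actual code: the class `DarkLaunchStore`, its attribute names, the example keys and the use of plain dictionaries in place of MySQL and Cassandra are all assumptions made for illustration.

```python
class DarkLaunchStore:
    """Wrap an established primary store and a new shadow store."""

    def __init__(self, primary, shadow, read_from_shadow=False):
        self.primary = primary            # stands in for the existing MySQL
        self.shadow = shadow              # stands in for the new Cassandra
        self.read_from_shadow = read_from_shadow
        self.shadow_errors = 0

    def write(self, key, value):
        # The primary write is authoritative and must succeed.
        self.primary[key] = value
        try:
            # The shadow write runs alongside; its failures are only counted,
            # so problems in the new system never reach users.
            self.shadow[key] = value
        except Exception:
            self.shadow_errors += 1

    def read(self, key):
        # Step 1 of the launch reads from the primary; step 2 flips the flag.
        source = self.shadow if self.read_from_shadow else self.primary
        return source[key]


# Hypothetical green-flag data: which friends dug a story.
mysql_like, cassandra_like = {}, {}
store = DarkLaunchStore(mysql_like, cassandra_like)
store.write("story:1:friends_dug", ["alice"])
value_before_switch = store.read("story:1:friends_dug")

store.read_from_shadow = True   # enable reads from the new store
value_after_switch = store.read("story:1:friends_dug")
```

    Keeping the switch as a single flag means reads can be moved back to the primary at once if the new store misbehaves.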
    Ian Eure, Senior Core Infrastructure Software Engineer at Digg, declares
their interest in NoSQL in general and Cassandra specifically. He states that they
believe in this technology, and they are contributing to its ongoing development,
both by submitting patches and by funding development of features necessary
to support wide scale deployment.

4.4   MongoDB
MongoDB is an open source, non-relational database that combines three key
qualities: scalable, schema-less, and queryable. It has native drivers for pretty
much every major language, and a small but growing community. Mongo’s design
trades off a few traditional features of databases (notably joins and transactions)
in order to achieve much better performance. It is perhaps most comparable to
CouchDB for its JSON document-oriented approach, but has much better query-
ing capabilities: you can do dynamic queries without pre-generating expensive
views. So Mongo occupies a sweet spot for powering web apps.
    BusinessInsider.com, a business news site launched in February 2009, runs on
a LAMP platform: Linux, Apache, Mongo, PHP. The M comes from Mongo, not
from MySQL, as it usually does. They use MongoDB for several reasons. First
of all, it’s scalable. Next, it’s document-oriented, not relational. RDBMSs were
invented in the 1970s, long before object-oriented programming and dynamic
scripting languages became popular. By now, we’re all accustomed to the process
of translating our code’s data structures back and forth to the tables in our
database, but it doesn’t have to be that way. Rather than rows in a table, Mongo
stores documents in collections. Documents are slightly enhanced JSON objects,
so you can stash much more complex structured data in a single document
than you can store in a table row. Data modeling becomes a much more natural
process: instead of using multiple tables and joining them together with
foreign keys, objects can be embedded within a single document.
    For example, each post on their site is a document. Similarly, in a MySQL-
based system, a post would be a row in a table. But comments are different.
Comments are embedded directly within the post document as an array of ob-
jects. All of the comment data, including the text of each comment, information
on who posted it, and the thumbs up/thumbs down voting, is stored directly
within the post document. When the code pulls up a post like this one, the
database doesn’t have to query over a separate comments table. The comments
are right there as part of the post object, ready to be displayed. This is faster,
and makes intuitive sense [13].
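    The embedded-comments model described above can be pictured with a plain Python dictionary standing in for a post document. All field names and values here (`title`, `comments`, `thumbs_up`, and so on) are hypothetical; the source does not give the site's actual schema, and no database driver is involved.

```python
# A post document with its comments embedded as an array of objects,
# instead of living in a separate comments table joined by a foreign key.
post = {
    "_id": "post-1",
    "title": "Example headline",
    "body": "Example article text.",
    "comments": [
        {"author": "alice", "text": "Great read.",
         "thumbs_up": 3, "thumbs_down": 0},
        {"author": "bob", "text": "Not convinced.",
         "thumbs_up": 1, "thumbs_down": 2},
    ],
}


def render_comments(document):
    """Everything needed to display the comments is already in the document."""
    return [f"{c['author']}: {c['text']}" for c in document["comments"]]


def net_votes(document):
    """Sum thumbs up minus thumbs down over all embedded comments."""
    return sum(c["thumbs_up"] - c["thumbs_down"] for c in document["comments"])


rendered = render_comments(post)
score = net_votes(post)
```

    Fetching the post by its key returns the comments with it, so displaying the page needs one lookup rather than a query against a second table.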
    Another benefit of using MongoDB is that there is no database-enforced
schema, so when a notable change is made (like adding thumbs-ups to the
comments), it can easily be done in a backwards-compatible way. Regarding
caching, BusinessInsider does a lot less caching than they would on a MySQL
database. Mongo is very fast at retrieving individual objects, so there is no
need to cache individual posts. It is usually going to be as fast as Memcached
for retrieving individual documents. And Mongo itself can be used as an
effective caching layer. If your collection is small, Mongo will keep it
entirely in memory and performance will be comparable to a cache.
    Another plus for Mongo is that it can store binary data in the database, so
that they don’t have to deal with the common hassle of having files in the file
system and metadata in the database. Using its GridFS API, all the site’s
images can easily be stashed in Mongo. SourceForge.net had a large redesign last
summer in which they moved to MongoDB. Their goal was to store the front pages,
project pages, and download pages in a single document. It is deployed with one
master and 5-6 read-only slaves (scaled for reads and reliability).


4.5   Amazon S3 - Simple Storage Service

Amazon has announced Amazon S3 - Simple Storage Service. It is not intended
for the general public, but rather for software developers who want to
work with the Amazon Web Services system. The Amazon Web Services Newsletter
describes some specific details: “Amazon S3 is storage for the Internet. It is
designed to make web-scale computing easier for developers. Amazon S3 provides a
simple web services interface that can be used to store and retrieve any amount
of data, at any time, from anywhere on the web. It gives any developer access to
the same highly scalable, reliable, fast, inexpensive data storage infrastructure
that Amazon uses to run its own global network of web sites. The service aims
to maximize benefits of scale and to pass those benefits on to developers.”

Amazon S3 Functionality. Amazon S3 is intentionally built with a minimal
feature set.

 – “Write, read, and delete objects containing from 1 byte to 5 gigabytes of
   data each. The number of objects you can store is unlimited.”
 – “Each object is stored and retrieved via a unique, developer-assigned key.”
 – “Authentication mechanisms are provided to ensure that data is kept secure
   from unauthorized access. Objects can be made private or public, and rights
   can be granted to specific users.”
 – “Uses standards-based REST and SOAP interfaces designed to work with
   any Internet-development toolkit.”
 – “Built to be flexible so that protocol or functional layers can easily be added.
   Default download protocol is HTTP. A BitTorrent (TM) protocol interface
   is provided to lower costs for high-scale distribution. Additional interfaces
   will be added in the future.”


5   Conclusions

It is obvious that relational database systems are no longer the main keepers of
the data, and that is especially true of some of the large companies that have
risen during the Internet era: Amazon, Google, Facebook, LinkedIn, and others.
But it is also true that many have invested heavily in Oracle, DB2 or MS SQL,
and the truth is those databases are still serving their needs. It is highly
unlikely that relational databases will disappear any time soon, but it is
possible to see a gradual move towards open source non-SQL data stores for
reasons of cost, simplicity and scalability.


References
 1. Jeremy Zawodny: NoSQL: Distributed and Scalable Non-Relational Database Sys-
    tems. Linux Magazine, October, 2009
 2. Werner Vogels: Eventually Consistent. ACM Queue Magazine, December 4, 2008
 3. Adam Wiggins: SQL Databases Don’t Scale, Jul 06
 4. Eric Florenzano: My thoughts on NoSQL, July 21, 2009
 5. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
    Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall
    and Werner Vogels: Dynamo: Amazon’s Highly Available Key-value Store
 6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,
    Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber: Bigtable:
    A Distributed Storage System for Structured Data. OSDI’06: Seventh Symposium
    on Operating System Design and Implementation, Seattle, WA, November, 2006.
 7. Doug Judd: Hypertable, June 2009
 8. Eric Lai: Researchers: Databases still beat Google’s MapReduce, April, 2009
 9. Stephen Shankland: Google spotlights data center inner workings, May, 2008
10. Eric Lai: Red Hat Puts the Heat on Oracle. Computerworld, May 2007
11. Jay Kreps: Project Voldemort: Scaling Simple Storage at LinkedIn. LinkedIn blog,
    March, 2009
12. http://www.scoopler.com/
13. Ian Eure: Looking to the future with Cassandra. Digg blog, September, 2009

AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...IRJET Journal
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQLbalwinders
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabasesAdi Challa
 
Evaluation of graph databases
Evaluation of graph databasesEvaluation of graph databases
Evaluation of graph databasesijaia
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLRamakant Soni
 
NOSQL in big data is the not only structure langua.pdf
NOSQL in big data is the not only structure langua.pdfNOSQL in big data is the not only structure langua.pdf
NOSQL in big data is the not only structure langua.pdfajajkhan16
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.Navdeep Charan
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataDebajani Mohanty
 

Ähnlich wie The NoSQL Movement (20)

NOSQL
NOSQLNOSQL
NOSQL
 
No Sql On Social And Sematic Web
No Sql On Social And Sematic WebNo Sql On Social And Sematic Web
No Sql On Social And Sematic Web
 
NoSQL On Social And Sematic Web
NoSQL On Social And Sematic WebNoSQL On Social And Sematic Web
NoSQL On Social And Sematic Web
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
 
NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013NoSQL Databases Introduction - UTN 2013
NoSQL Databases Introduction - UTN 2013
 
Erciyes university
Erciyes universityErciyes university
Erciyes university
 
The Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) HadThe Recent Pronouncement Of The World Wide Web (Www) Had
The Recent Pronouncement Of The World Wide Web (Www) Had
 
Report 1.0.docx
Report 1.0.docxReport 1.0.docx
Report 1.0.docx
 
locotalk-whitepaper-2016
locotalk-whitepaper-2016locotalk-whitepaper-2016
locotalk-whitepaper-2016
 
1. introduction to no sql
1. introduction to no sql1. introduction to no sql
1. introduction to no sql
 
AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...
AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...
AUTOMATIC TRANSFER OF DATA USING SERVICE-ORIENTED ARCHITECTURE TO NoSQL DATAB...
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Report 2.0.docx
Report 2.0.docxReport 2.0.docx
Report 2.0.docx
 
NoSQLDatabases
NoSQLDatabasesNoSQLDatabases
NoSQLDatabases
 
Evaluation of graph databases
Evaluation of graph databasesEvaluation of graph databases
Evaluation of graph databases
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
 
NOSQL in big data is the not only structure langua.pdf
NOSQL in big data is the not only structure langua.pdfNOSQL in big data is the not only structure langua.pdf
NOSQL in big data is the not only structure langua.pdf
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
the rising no sql technology
the rising no sql technologythe rising no sql technology
the rising no sql technology
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big Data
 

Kürzlich hochgeladen

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Kürzlich hochgeladen (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

The NoSQL Movement

Raluca Gheorghita

Faculty of Computer Science, Alexandru Ioan Cuza University, Iasi

Abstract. As the amount and pace of data generation keep growing, businesses are stepping away from traditional RDBMS solutions and turning to highly scalable data stores. Numerous papers are being published on this topic, for instance by well-known companies like Google, Facebook and Amazon, and open-source projects are coming into existence. Next-generation databases mostly address some of these points: being non-relational, distributed, open source and horizontally scalable. The movement, named with the misleading term NoSQL (which the community now reads as “not only SQL”), began in early 2009 and is growing rapidly. Often further characteristics apply, such as being schema-free, replication support, an easy API, eventual consistency, and more.

1   Introduction

There is an interesting transition taking place in the world of Web-scale data stores, as an entirely new type of scalable data store is gaining popularity very quickly. The traditional LAMP (Linux, Apache HTTP Server, MySQL, and PHP, Python or Perl) stack is starting to look like a thing of the past. For a few years now, Memcached (a free and open source, high-performance, distributed memory object caching system, generic in nature, but intended for speeding up dynamic web applications by alleviating database load) has often appeared right next to MySQL, and now the whole data tier is being shaken up. While some might see it as a move away from MySQL and PostgreSQL, the traditional open source relational data stores, it is actually a higher-level change.
Much of this change is the result of a few revelations [1]:

 – a relational database isn’t always the right model or system for every piece of data
 – relational databases are tricky to scale; normalization often hurts performance
 – in many applications, primary key lookups are all you need

The new data stores vary quite a bit in their specific features, but in general they derive from a similar set of high-level characteristics. Not all of them meet all of these, of course, but just looking at the list gives you a sense of what they are trying to accomplish.

 – de-normalized, often schema-free, document storage
 – key/value based, supporting lookups by key
 – horizontal scaling (the ability of a software or hardware system or a network to grow without breaking down or requiring an expensive redesign)
 – built-in replication
 – HTTP/REST or easy-to-program APIs
 – support for Map/Reduce style programming (a programming model and an associated implementation for processing and generating large data sets)
 – eventually consistent [2] (when no updates occur for a long period of time, eventually all updates will propagate through the system and all the replicas will be consistent)

The movement to these distributed schema-free data stores has begun to use the name NoSQL.

2   What is NoSQL?

First of all, the current disadvantages of relational databases need to be addressed. A relational database like Microsoft SQL Server can be most easily described as a table-based data system where there is minimal data duplication and where sets of data can be accessed through a series of relational operators like joins and unions. The problem with such relations is that complex operations on large data sets quickly become big consumers of resources, although generally the benefits are collected at the application level.

Adam Wiggins (Heroku) points out ways of getting around these limitations in his article SQL Databases Don’t Scale [3]. He presents the well-known tactics for scaling relational databases for huge applications (vertical scaling, sharding¹ or partitioning, and read slaves), also enumerating their drawbacks.

So why are relational databases just now becoming a problem? Eric Florenzano puts it best: “As the web has grown more social, however, more and more it’s the people themselves who have become the publishers.
And with that fundamental shift away from read-heavy architectures to read/write and write-heavy architectures, a lot of the way that we think about storing and retrieving data needed to change.” [4]

The solution to this problem seems to be NoSQL: non-relational data stores that “provide for web-scale data storage and retrieval especially in web based applications because it views the data more closely to how web apps view data - a key/value hash in the sky.” NoSQL is used to describe the currently growing class of web applications that need to scale effectively. Applications can scale horizontally on clusters of commodity hardware without being subject to complicated sharding techniques.

Many Web and Java developers built their own data storage solutions, following the example of those built by Google Inc. and Amazon.com Inc., so that they could manage without Oracle at the beginning. They released them as open source afterwards. Now that their open source data stores manage hundreds of terabytes or even petabytes of data for thriving Web 2.0 and cloud computing vendors, switching back is not technically, economically or even ideologically feasible.

¹ Database sharding can be simply defined as a “shared-nothing” partitioning scheme for large databases across a number of servers, enabling new levels of database performance and scalability.

Johan Oskarsson, a Web developer at Last.fm and the organizer of the NoSQL meeting that took place in San Francisco in June 2009, stresses that “Web 2.0 companies can take chances and they need scalability”. He points out that having these two combined is what makes NoSQL so compelling. He also says that many developers had even stopped using the open source MySQL database, a long-time Web 2.0 favorite, in favor of a NoSQL alternative, because the advantages were too compelling to ignore.

What are these advantages, so compelling that they can’t be ignored?

3   What are the benefits NoSQL provides?

First of all, they aren’t simply databases. Amazon.com’s CTO, Werner Vogels, refers to the company’s Dynamo system as a “highly available key-value store.” [5] Google calls its BigTable, the other role model for many NoSQL enthusiasts, a “distributed storage system for managing structured data.” [6]

Second, they can easily handle considerable amounts of data. Hypertable, an open source column-based database modeled upon BigTable, is used by local search engine Zvents Inc. to write 1 billion cells of data per day, according to a presentation by Doug Judd [7], a Zvents engineer. Meanwhile BigTable, in conjunction with its sister technology, Map/Reduce, processes as much as 20 petabytes of data per day [8].

Map/Reduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
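
The map and reduce functions described above can be sketched in a few lines. This is a minimal single-process illustration, not Google's implementation: the helper `run_mapreduce`, the word-count task and the sample documents are all assumptions made for the example, and nothing here is parallelized or fault tolerant.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map step: emit one intermediate (word, 1) pair per word in the text."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce step: merge all intermediate values that share the same key."""
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical sequential driver standing in for the distributed run-time."""
    intermediate = defaultdict(list)
    # Apply map to every input pair and group the results by intermediate key.
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Apply reduce once per intermediate key.
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in intermediate.items()}


# Made-up input: (document name, document text) pairs.
documents = [
    ("doc1", "NoSQL stores scale"),
    ("doc2", "relational stores scale too"),
]
counts = run_mapreduce(documents, map_fn, reduce_fn)
```

In a real deployment the grouping step is what the run-time distributes: map calls run on many machines, and all values for one intermediate key are routed to the same reduce call.
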
The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

“Definitely, the volume of data is getting so huge that people are looking at other technologies,” said Jon Travis from SpringSource, whose VPork (Voldemort Performance Testing Framework) technology helps NoSQL users benchmark the performance of their database alternative. Travis, who is Principal Engineer at Hyperic, which was acquired by SpringSource, put together a basic performance testing framework to prove out Voldemort for use in his company.

Another benefit worth mentioning is that these NoSQL databases run on clusters of cheap PC servers. PC clusters can be easily and cheaply expanded without the complexity and cost of sharding, which involves cutting up databases into multiple tables to run on large clusters or grids. Google has said that one of BigTable’s bigger clusters manages as much as 6 petabytes of data [9] across thousands of servers. “Oracle would tell you that with the right degree of hardware and the right configuration of Oracle RAC (Real Application Clusters) and other associated magic software, you can achieve the same scalability. But at what cost?” asks Javier Soltero, CTO of SpringSource.

The NoSQL systems also solve performance issues. NoSQL architectures perform much faster by avoiding the time-consuming task of translating Web or Java applications and data into a SQL-friendly format. “SQL is an awkward fit for procedural code, and almost all code is procedural,” said Curt Monash, a leading analyst of and strategic advisor to the software industry. For data upon which users expect to do heavy, repeated manipulations, the cost of mapping data into SQL is “well worth paying. But when your database structure is very, very simple, SQL may not seem that beneficial.”

Raffaele Sena, Senior Computer Scientist in Adobe’s Business Productivity Unit, when asked about the integration of Adobe ConnectNow, a Web collaboration service, with Terracotta and how it addressed their web site scalability requirements, said that Adobe decided against using a relational database for just the reason raised by Monash. Adobe uses Java clustering software from Terracotta Inc. to manage data in Java formats, which Sena says is key to boosting ConnectNow’s performance two to three times over the prior version. “The system would have been more complex and harder to develop using a relational database,” he said. Another project, MongoDB, calls itself a “document-oriented” database because of its native storage of object-style data.

But it is important to note that NoSQL alternatives lack vendors offering formal support because they are open source. This fact isn’t seen as a problem by most supporters of the movement, as they are closely in touch with the community.
But some admitted that working without a formal “throat to choke” [10] when things go wrong was scary, at least for their managers. “We did have to do some selling,” admitted Adobe’s Sena. “But basically after they saw our first prototype was working, we were able to convince the higher-ups that this was the right way to go.”

Despite their huge promise, most enterprises needn’t worry that they are missing out just yet, said Monash. “Most large enterprises have an established way of doing OLTP [online transaction processing], probably via relational database management systems. Why change?” he said. Map/Reduce and similar BI-oriented projects “may be useful for enterprises. But where it is, it probably should be integrated into an analytic DBMS [database management system].” Even the NoSQL meeting’s organizer, Oskarsson, admits that his company, Last.fm, has yet to move to a NoSQL alternative for production, instead relying on open-source databases. He agrees that a revolution, for now, remains on hold. “It’s true that [NoSQL] aren’t relevant right now to mainstream enterprises,” Oskarsson said, “but that might change one to two years down the line.”

One thing that has to be underlined, though, is that NoSQL is not, and never was, intended to be a replacement for more mainstream SQL databases. There is no war between relational and non-relational databases. There’s nothing stopping people from splitting up data in their web application and using both types of data stores where it makes sense. As Brad Anderson of Cloudant says, NoSQL is about “right tools for the job” as opposed to being anti-relational or replacing traditional solutions.

4   NoSQL Databases

The need to look at non-SQL systems arises out of scalability issues with relational databases, which are a function of the fact that relational databases were not designed to be distributed (which is key to write scalability), and could thus afford to provide abstractions like ACID transactions and a rich high-level query model. All NoSQL databases try to address the scalability issue in various ways: by being distributed, by providing a simpler data and query model, by relaxing consistency requirements, and so on.

4.1   Project Voldemort

Voldemort is a distributed key-value storage system where automatically partitioned data is replicated over multiple servers. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient [11].

Many of LinkedIn’s products, like the modules People You May Know and Viewers of This Profile Also Viewed, and much of the job matching functionality that LinkedIn gives to people who post jobs on the site, rely heavily on computationally intensive data mining algorithms. The difficulty in these systems comes from the fact that large amounts of data need to be moved around every day. Thus, although hundreds of gigabytes or terabytes of data are not too difficult to handle when sitting still in a storage system, the problem becomes much, much harder when the data must be transformed to support quick lookups and moved between systems on a daily basis.

To solve this problem they spent some time thinking about how to build support for large daily data cycles. Voldemort was designed to support fast, scalable read/write loads, and is already used in a number of systems at LinkedIn.
It was not designed specifically with batch computation in mind, but it has a pluggable architecture which supports multiple storage engines in the same framework. This allows a fast, failure-resistant online storage system to be integrated with the heavy offline data crunching running on Hadoop.

4.2   CouchDB

CouchDB, written in Erlang, is one of the most popular and mature document-oriented databases. Its primary focus is robustness, high concurrency, and fault tolerance. One key distinction compared to other systems is its bi-directional incremental replication. CouchDB now has over 100 production users, 3 books are being written, and the community is vibrant.
CouchDB’s documents are JSON based and they can have binary attachments. Each document has a revision which is deterministically generated from the document content. CouchDB is very robust since it never overwrites previously written data. There is therefore no repair step after a server crash, and one can take backups with a plain file copy (cp). Concurrency is another benefit of CouchDB’s design. It uses the Erlang approach of lightweight processes, which means one process per TCP connection. The architecture is also lock free.

The API is REST based, using the standard verbs GET, PUT, POST, and DELETE. Map/Reduce views are used for generating persistent representations of document data. These are generally written in JavaScript. A really interesting feature of the views is that they are generated incrementally. The views are stored in a B-tree and kept up to date when new data is added.

The bi-directional replication is peer based (two nodes). One can replicate a subset of documents meeting certain criteria. The replication happens over HTTP, which makes replication across datacenters easy and secure. In a multi-master replication setup CouchDB can deterministically choose which revision is the winner (with the losing revision saved as well).

One of the first adopters of CouchDB at scale was the BBC. They needed flexibility in schema and robustness. They used CouchDB as a simple key/value store for their existing application infrastructure. It has proven to be robust in production for several years and continues to scale to their demands of data and concurrency. Scoopler [12] is a real-time aggregation service with large and rapidly growing data volume. The schema flexibility was crucial when they selected CouchDB.

An unnamed real-time analytics service migrated from a PostgreSQL setup with more than 40 tables to a single CouchDB document type with only two views. Ubuntu 9.10 includes the Ubuntu One system, which stores users’ address books in CouchDB.
Replication is the killer feature in this scenario.

4.3   Cassandra

Digg is probably the only large site which has Cassandra in production (Facebook runs a forked version). Digg has been researching ways to scale its database infrastructure for some time now. Up until now Digg has used a normal LAMP stack. Step one was to adopt a traditional vertically partitioned master-slave configuration with MySQL, and they also investigated sharding MySQL with IDDB (a way to partition both indexes, that is integer sequences and unique character indexes, and actual tables across multiple storage servers).

They then went out and looked for alternatives. After considering HBase, Hypertable, Cassandra, Tokyo Cabinet, Voldemort and Dynomite, they settled on Cassandra, because it offers column-oriented data storage in a highly available, peer-to-peer cluster. Even if it is currently lacking some core features, it was the optimal solution for Digg. They wanted something open source, scalable, efficient, and easily administrable. They picked Cassandra because of the promise of easier administration, no single point of failure, more flexibility than a simple key/value store, and very fast writes, and because the community was growing and it was Java based (3 out of 4 people on the team were comfortable with Java) [13].
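
To make “more flexible than a simple key/value store” concrete, here is a minimal in-memory sketch of a column-oriented layout, in which each row key maps to a sorted set of named columns. It is only loosely modeled on that idea and is not Cassandra's API: the class `ColumnStore`, its methods and the sample keys are hypothetical, and the sketch ignores distribution, timestamps and persistence.

```python
from collections import defaultdict


class ColumnStore:
    """Toy store (hypothetical): row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        """Write a single column; other columns of the row are untouched."""
        self._rows[row_key][column] = value

    def get(self, row_key, column):
        """Read one column, or None if the row or column does not exist."""
        return self._rows.get(row_key, {}).get(column)

    def get_slice(self, row_key, start, end):
        """Return (name, value) pairs whose names fall in [start, end], sorted by name."""
        row = self._rows.get(row_key, {})
        return [(name, row[name]) for name in sorted(row) if start <= name <= end]


store = ColumnStore()
store.insert("user:42", "email", "user42@example.com")
store.insert("user:42", "friend:alice", "2009-09-01")
store.insert("user:42", "friend:bob", "2009-09-02")

# A range read over column names; "~" sorts after letters and digits in ASCII.
friends = store.get_slice("user:42", "friend:", "friend:~")
```

A plain key/value store would have to read and rewrite the whole value stored under `user:42` to add one friend; here a single column is written or a contiguous range of columns is read.
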
Digg implemented the green flag feature in Cassandra as a proof of concept. These flags appear on the Digg icon for a story when one of your friends has dug it. They did a dark launch with MySQL running alongside: first they only wrote data to Cassandra, then they enabled reading from Cassandra. Based on the results of the proof of concept, Digg plans to port the entire application to Cassandra. Digg will continue to use MySQL in some places, following the saying "Use the right tool for the job".

Ian Eure, Senior Core Infrastructure Software Engineer at Digg, has stated Digg's interest in NoSQL in general and Cassandra specifically. He says that they believe in this technology and are contributing to its ongoing development, both by submitting patches and by funding development of features necessary to support wide-scale deployment.

4.4   MongoDB

MongoDB is an open source, non-relational database that combines three key qualities: it is scalable, schema-less, and queryable. It has native drivers for most major languages, and a small but growing community. Mongo's design trades off a few traditional database features (notably joins and transactions) in order to achieve much better performance. It is perhaps most comparable to CouchDB in its JSON document-oriented approach, but it has richer querying capabilities: dynamic queries can be run without pre-generating expensive views. Mongo therefore occupies a sweet spot for powering web apps.

BusinessInsider.com, a business news site launched in February 2009, runs on a LAMP platform: Linux, Apache, Mongo, PHP. Here the M stands for Mongo, not for MySQL as it usually does. They use MongoDB for several reasons. First of all, it is scalable. Next, it is document-oriented, not relational. RDBMSs were invented in the 1970s, long before object-oriented programming and dynamic scripting languages became popular.
By now, we are all accustomed to translating our code's data structures back and forth to the tables in our database, but it doesn't have to be that way. Rather than rows in a table, Mongo stores documents in collections. Documents are slightly enhanced JSON objects, so much more complex structured data can be stashed in a single document than in a table row. Data modeling becomes a much more natural process: instead of using multiple tables and joining them together with foreign keys, objects can be embedded within a single document.

For example, each post on their site is a document, just as a post would be a row in a table in a MySQL-based system. But comments are different. Comments are embedded directly within the post document as an array of objects. All of the comment data, including the text of each comment, information on who posted it, and the thumbs up/thumbs down voting, is stored directly within the post document. When the code pulls up a post, the database doesn't have to query a separate comments table. The comments are right there as part of the post object, ready to be displayed. This is faster, and makes intuitive sense [13].
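
The embedded-comments layout described above can be sketched with plain Python dictionaries standing in for a stored document. This is only an illustration: the field names (`title`, `comments`, `thumbs_up`, and so on) and the helper functions are hypothetical and are not taken from Business Insider's actual schema or from any MongoDB driver.

```python
import json

# Hypothetical post document: comments live inside the post as an array of objects.
post = {
    "_id": "post-1",
    "title": "Example post",
    "body": "Post text goes here.",
    "comments": [
        {"author": "alice", "text": "Nice article.", "thumbs_up": 3, "thumbs_down": 0},
        {"author": "bob", "text": "I disagree.", "thumbs_up": 1, "thumbs_down": 2},
    ],
}


def add_comment(post, author, text):
    """Append a new comment to the array embedded in the post document."""
    post["comments"].append(
        {"author": author, "text": text, "thumbs_up": 0, "thumbs_down": 0}
    )
    return post


def render_comments(post):
    """Build display lines from the post alone; no separate comments lookup is needed."""
    return [
        # .get() with a default lets comments that lack vote fields still render.
        f'{c["author"]}: {c["text"]} (+{c.get("thumbs_up", 0)}/-{c.get("thumbs_down", 0)})'
        for c in post["comments"]
    ]


add_comment(post, "carol", "Thanks for sharing.")
lines = render_comments(post)
print(json.dumps(post, indent=2))
```

Everything needed to display the post arrives in one object, which is the property the paragraph above attributes to the document model. The trade-off, under this sketch, is that a comment cannot be fetched without going through its post.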
Another benefit of MongoDB is that there is no database-enforced schema, so a notable change (like adding thumbs-ups to the comments) can easily be made in a backwards-compatible way.

Regarding caching, Business Insider does a lot less caching than they would on a MySQL database. Mongo is very fast at retrieving individual objects, so there is no need to cache individual posts. It is usually as fast as Memcached for retrieving individual documents. Mongo itself can also be used as an effective caching layer: if a collection is small, Mongo will keep it entirely in memory and performance will be comparable to a cache.

Another plus for Mongo is that it can store binary data in the database, so they don't have to deal with the common hassle of having files in the file system and metadata in the database. Using its GridFS API, all of the site's images can easily be stashed in Mongo.

SourceForge.net had a large redesign last summer in which they moved to MongoDB. Their goal was to store the front pages, project pages, and download pages in a single document. It is deployed with one master and 5-6 read-only slaves (a setup scaled for reads and reliability).

4.5   Amazon S3 - Simple Storage Service

Amazon has announced Amazon S3 (Simple Storage Service). It is intended not for the general public but for software developers who want to work with the Amazon Web Services system. The Amazon Web Services Newsletter describes some specific details: “Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites.
The service aims to maximize benefits of scale and to pass those benefits on to developers.”

Amazon S3 Functionality. Amazon S3 is intentionally built with a minimal feature set.

 – “Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each. The number of objects you can store is unlimited.”
 – “Each object is stored and retrieved via a unique, developer-assigned key.”
 – “Authentication mechanisms are provided to ensure that data is kept secure from unauthorized access. Objects can be made private or public, and rights can be granted to specific users.”
 – “Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.”
 – “Built to be flexible so that protocol or functional layers can easily be added. Default download protocol is HTTP. A BitTorrent (TM) protocol interface is provided to lower costs for high-scale distribution. Additional interfaces will be added in the future.”
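
The feature list above amounts to a small contract: objects are addressed by a developer-assigned key, bounded in size, and readable only by permitted users. The following in-memory sketch illustrates that contract only. It is not the Amazon S3 API; the class `ToyObjectStore`, its method names, and the reading of "5 gigabytes" as binary gigabytes are all assumptions made for illustration.

```python
MAX_OBJECT_BYTES = 5 * 1024**3  # assumption: "5 gigabytes" read as binary gigabytes


class ToyObjectStore:
    """In-memory stand-in for a key-addressed object store (hypothetical, not S3)."""

    def __init__(self):
        self._objects = {}  # key -> record with data, owner, public flag, grants

    def put(self, key, data, owner, public=False):
        """Write an object under a developer-assigned key, enforcing the size bounds."""
        if not 1 <= len(data) <= MAX_OBJECT_BYTES:
            raise ValueError("object size must be between 1 byte and 5 gigabytes")
        self._objects[key] = {
            "data": bytes(data),
            "owner": owner,
            "public": public,
            "grants": set(),
        }

    def grant(self, key, owner, user):
        """Let the owner grant read rights on one object to a specific user."""
        record = self._objects[key]
        if record["owner"] != owner:
            raise PermissionError("only the owner can grant access")
        record["grants"].add(user)

    def get(self, key, user=None):
        """Read an object if it is public or the caller is the owner or a grantee."""
        record = self._objects[key]
        if record["public"] or user == record["owner"] or user in record["grants"]:
            return record["data"]
        raise PermissionError("access denied")

    def delete(self, key, owner):
        """Delete an object; only its owner may do so in this sketch."""
        if self._objects[key]["owner"] != owner:
            raise PermissionError("only the owner can delete")
        del self._objects[key]


store = ToyObjectStore()
store.put("images/logo.png", b"\x89PNG", owner="dev1")
store.grant("images/logo.png", owner="dev1", user="dev2")
print(store.get("images/logo.png", user="dev2"))
```

The real service exposes these operations over REST and SOAP rather than as method calls, and handles authentication with its own mechanisms; the sketch only mirrors the write/read/delete and access-control behavior listed in the quoted bullets.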

5   Conclusions

It is clear that relational database systems are no longer the main keepers of the data, and that is especially true of some of the large companies that have risen during the Internet era: Amazon, Google, Facebook, LinkedIn, and others. But it is also true that many have invested heavily in Oracle, DB2 or MS SQL, and those databases are still serving their needs. It is highly unlikely that relational databases will disappear any time soon, but we may well see a gradual move towards open source non-SQL data stores for reasons of cost, simplicity and scalability.

References

1. Jeremy Zawodny: NoSQL: Distributed and Scalable Non-Relational Database Systems. Linux Magazine, October, 2009
2. Werner Vogels: Eventually Consistent. ACM Queue Magazine, December 4, 2008
3. Adam Wiggins: SQL Databases Don't Scale, Jul 06
4. Eric Florenzano: My thoughts on NoSQL, July 21, 2009
5. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels: Dynamo: Amazon's Highly Available Key-value Store
6. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006
7. Doug Judd: Hypertable, June 2009
8. Eric Lai: Researchers: Databases still beat Google's MapReduce, April, 2009
9. Stephen Shankland: Google spotlights data center inner workings, May, 2008
10. Eric Lai: Red Hat Puts the Heat on Oracle. Computerworld, May 2007
11. Jay Kreps: Project Voldemort: Scaling Simple Storage at LinkedIn. LinkedIn blog, March, 2009
12. http://www.scoopler.com/
13. Ian Eure: Looking to the future with Cassandra. Digg blog, September, 2009