Relational databases have ruled the IT world for the last 30 years. However, Web 2.0 and the incipient Internet of Things (IoT) are among the sources of a data explosion that, in a growing number of cases, exceeds what modern relational databases can handle. As a result, new technologies had to be developed for these use cases, generally grouped under the umbrella of Big Data. In this second installment of a two-part presentation, we will look at how non-relational databases are tackling the Big Data problem to scale beyond what relational databases can provide today.
ITI016En-The evolution of databases (II)
1. The evolution of database technology (II)
Huibert Aalbers
Senior Certified Executive IT Architect
2. IT Insight podcast
• This podcast belongs to the IT Insight series
• You can subscribe to the podcast through iTunes.
• Additional material, such as presentations in PDF format or white
papers mentioned in the podcast, can be downloaded from the IT
Insight section of my site at http://www.huibert-aalbers.com
• You can send questions or suggestions regarding this podcast to my
personal email, huibert_aalbers@mac.com
3. A brave new world
• With Web 2.0, came the need for a new
set of tools that could handle an
explosive growth of data
• Data willfully shared by the users
• Data collected on users and
customers, sometimes without
their knowledge
• Sensors, IoT, etc.
• Big Data requires a new kind of data
repository
4. Do I need a different solution?
There are basically two ways to determine that you need a new
type of database solution instead of a traditional relational database
• The architect designs a new system from the ground up on
a Big Data solution because they know it will be required
• The team has tried every strategy to scale the
existing relational database and it is still not enough
• Upgrading the hardware / use of SSDs / Networking, etc.
• Query optimization
• Using a data caching scheme
• Partitioning the data
• Building new indices
• Denormalizing the data
• Using stored procedures, etc.
5. In order to solve the issue, we have to give up
something
• What can we give up?
• ACID properties
• Data normalization
• Transaction support
6. NoSQL Repositories
From my point of view, the name “NoSQL” is
not the right one to describe non-relational databases
• The success of NoSQL databases is not
due to developers disliking SQL. It is due
to the following reasons:
• They scale linearly
• They are more flexible (schema-less)
• They are easier to manage at extremely
high volumes of data
I think it is better to call them distributed non-
relational databases
7. Key-Value pair databases
These data stores are also known as distributed
hash tables
• Pros
• Extremely fast lookups; hash tables are a well-understood CS problem
• Scale almost linearly
• Cons
• Performing complex queries against the values
can be slow and complex
• Key-value data stores in which the product also
keeps a timestamp on the data, for versioning, are
a particular case of key-value pair databases
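To make the model concrete, here is a minimal in-memory sketch (the class and method names are my own, not any product's API) of a key-value store that keeps a timestamp with every write so older versions can be read back:

```python
import time

class VersionedKVStore:
    """In-memory sketch of a key-value store that records a
    timestamp with every write, keeping older versions readable."""

    def __init__(self):
        self._data = {}  # key -> list of (timestamp, value), newest last

    def put(self, key, value):
        self._data.setdefault(key, []).append((time.time(), value))

    def get(self, key):
        # O(1) lookup by key: the core strength of this model
        return self._data[key][-1][1]

    def history(self, key):
        # All stored versions, oldest first
        return list(self._data[key])

store = VersionedKVStore()
store.put("user:42", {"name": "Ana"})
store.put("user:42", {"name": "Ana", "city": "Madrid"})
print(store.get("user:42"))  # the latest version wins
```

Note that querying *by value* (e.g. "all users in Madrid") would require scanning every entry, which is exactly the weakness listed under Cons.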
8. Document based databases
This is a large category of data stores that let you work with data stored in a particular document
format. Popular document formats used to store data include:
• XML
• JSON
• YAML
In this kind of data stores, documents are identified by a unique key, which allows for quick retrieval of the
information.
Although all data stores in this category are conceptually similar, there are still important
differences from one product to another
• Query methods (SQL like, Map/Reduce, etc.)
• Replication
• Data consistency
Document based databases are schema-less
9. MongoDB vs CouchDB
• MongoDB
• Very high volumes of somewhat mutable data
• Dynamic flexible queries, somewhat similar to SQL
• Very quick queries
• CouchDB
• Very high volumes of mostly immutable data
• Pre-defined queries, based on MapReduce, implemented in JavaScript
• Master-Master replication
• Neither MongoDB nor CouchDB natively work with XML data, both work with JSON documents
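The difference in query styles can be sketched in plain Python (this is neither product's actual API, just the two models side by side): an ad-hoc filter built at request time, versus a view defined up front as map and reduce functions:

```python
# Toy collection of JSON-like documents, each identified by a unique key.
docs = [
    {"_id": "a1", "type": "order", "total": 120},
    {"_id": "a2", "type": "order", "total": 80},
    {"_id": "a3", "type": "refund", "total": 30},
]

# MongoDB style: a dynamic query, assembled at request time
def find(collection, predicate):
    return [d for d in collection if predicate(d)]

big_orders = find(docs, lambda d: d["type"] == "order" and d["total"] > 100)

# CouchDB style: a view defined in advance as map + reduce
def map_totals(doc):
    if doc["type"] == "order":
        yield (doc["type"], doc["total"])

def reduce_sum(values):
    return sum(values)

mapped = [v for d in docs for (_, v) in map_totals(d)]
order_total = reduce_sum(mapped)
print(big_orders, order_total)
```

The dynamic style is convenient for exploratory, SQL-like querying; the pre-defined view suits mostly immutable data, where results can be computed incrementally and cached.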
10. Document based databases
• Among the many “Document based databases”, MongoDB is currently the
most popular, closely followed by CouchDB
• The MongoDB API is currently supported by both DB2 and Informix
• That means it is now very easy to migrate from MongoDB to either of
those databases and store both structured data and JSON documents
in a single repository
11. Hosted Document databases
• Both MongoDB and CouchDB are popular databases, which explains why
there are many options to use both hosted and managed versions of these
products
• Cloudant is a fully managed version of BigCouch, which is in turn a high
availability, fault tolerant version of CouchDB
• Migrating from CouchDB or BigCouch to Cloudant is totally transparent
• Both MongoDB and CouchDB scale very well by implementing sharding,
which makes them very well suited for born-on-cloud applications
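A minimal sketch of hash-based sharding, the technique mentioned above (the helper names are hypothetical): each key is hashed to pick a shard, so documents and load spread evenly across nodes:

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for four server nodes

def shard_for(key):
    # Hash the key so documents spread evenly across the shards
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

def put(key, doc):
    shards[shard_for(key)][key] = doc

def get(key):
    # Any single-key read touches exactly one shard
    return shards[shard_for(key)][key]

put("user:1", {"name": "Bo"})
print(get("user:1"))
```

Real products add replication and shard rebalancing on top of this idea, but the routing principle is the same: the key alone determines which node holds the document.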
12. Graph databases
Social networks have become one of the
most representative applications of what is
known as Web 2.0
• Storing and processing social graphs in a
relational database is both complex and inefficient
Unlike relational databases, this new kind of
data store focuses more on relationships
than on data. For social-network style
projects this results in:
• Increased performance
• Simpler and more natural development
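A small Python sketch shows why graph-shaped queries feel natural here: with the graph held as adjacency sets, a "friends of friends" query is a simple two-hop traversal, whereas a relational schema would need a self-join on a large friendship table:

```python
# Toy social graph as adjacency sets (names are made up).
graph = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee"},
    "cy": {"ana", "dee"},
    "dee": {"bo", "cy"},
}

def friends_of_friends(g, person):
    # Two-hop traversal: follow each friend's edges once
    direct = g[person]
    result = set()
    for friend in direct:
        result |= g[friend]
    # Exclude the person and their direct friends
    return result - direct - {person}

print(friends_of_friends(graph, "ana"))  # {'dee'}
```

Each extra hop is just another traversal step; in SQL, each hop would be another join, which is where the complexity and inefficiency come from.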
13. Hadoop
Hadoop is a framework designed to process tasks that can
be parallelized on extremely high volumes of data
distributed over a large number of server nodes belonging
to a cluster. It has four main components:
• Hadoop common
• Hadoop Distributed File System (HDFS)
• Designed primarily to handle extremely high
volumes of immutable data
• Loading and deleting data is efficient, updating
data is not
• Hadoop YARN
• Hadoop MapReduce
Managing a complete Hadoop system is currently not for
the faint of heart
14. MapReduce
MapReduce is the data processing algorithm that
sits at the very core of Hadoop
Developers need to implement two functions for
each query:
• Map: In this phase the overall problem is
divided into smaller tasks (which can be
further broken down) that are distributed
to run on different server nodes
• Reduce: In this second phase, the master
node combines the answers received from
the different nodes and processes them to
produce a reply to the query
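The two phases can be illustrated with the classic word-count example (a local, single-process sketch; a real Hadoop job would distribute the map tasks across the cluster's nodes):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: the framework groups pairs by key, then sums the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

splits = ["big data is big", "data at scale"]
pairs = [kv for doc in splits for kv in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'scale': 1}
```

Because each map call only sees its own split, the map work can run on as many nodes as there are splits, which is what makes the approach scale.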
15. MapReduce
Hadoop allows storing any kind of data
• Structured
• Unstructured
When using Hadoop to store structured data, in
a data-warehouse-like environment, it is possible
to use languages that automatically generate
the code for the Map/Reduce functions
• Apache Pig (pig latin)
• Apache Hive (HiveQL, similar to SQL)
• IBM Big SQL
16. Analyzing streams of data
Sometimes the amount of stored data is so large
that real-time analysis over it simply becomes
impossible
• In those cases, the best alternative is to
analyze the stream of data before it is stored
in the database
• The main idea is that the data is kept outside
the database (generally in RAM) during a
certain window of time in order to detect a
combination of events in a short period of
time
• Fraud detection
• Digital marketing
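A minimal sketch of the windowed approach described above (the fraud rule and all names are hypothetical): events are evaluated in memory against a time window before ever reaching a database:

```python
from collections import deque

# Hypothetical rule: flag a card if it produces 3 or more
# charges within a 60-second sliding window.
WINDOW_SECONDS = 60
THRESHOLD = 3
recent = {}  # card_id -> deque of event timestamps, kept in RAM

def process(card_id, timestamp):
    q = recent.setdefault(card_id, deque())
    q.append(timestamp)
    # Expire events that fell out of the time window
    while q and timestamp - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) >= THRESHOLD  # True -> possible fraud

events = [("c1", 0), ("c1", 10), ("c2", 15), ("c1", 20), ("c1", 200)]
flags = [process(card, t) for card, t in events]
print(flags)  # [False, False, False, True, False]
```

Only the flagged events (or aggregates) need to be persisted; the raw stream can be discarded, which is the whole point of analyzing before storing.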
17. Polyglot Persistence
When working with applications that require extreme scaling, no single
solution fits all challenges. It is likely that, after careful analysis of the
problem, more than one data store will be required to obtain the best
performance.
• This is known as “Polyglot Persistence”
18. Contact information
On Twitter: @huibert (English), @huibert2 (Spanish)
Web site: http://www.huibert-aalbers.com
Blog: http://www.huibert-aalbers.com/blog