WWW.OSTUSA.COM
DATABASES FOR BIG DATA
EVOLUTION OF NoSQL DATABASES and CONCEPTS
Bhaskar Gunda,
Open Systems Technologies
Agenda
• About Me
• Introduction to DBMS – History and Evolution
• RDBMS concepts
• Overview of Big Data
• Boundaries of RDBMS—Need for DBMS beyond RDBMS
• Paradigm Shift in DBMS
• NoSQL Databases – Definition, Advantages and breaking boundaries
• Types of NoSQL Databases and their Usage
• Future of RDBMS
About Me
• Bhaskar Gunda – Principal Consultant at Open Systems Technologies
• 28 years of IT experience
• Electrical engineer with an MBA
• Started working with computers in college, building microprocessor-based
systems such as logic controllers on Intel 8085 and Z-80 systems in assembly
language.
• Started my career with databases –
– The first databases I worked with were dBase III and dBase IV.
– The first commercial database I worked with was Sybase.
– I soon transitioned to Oracle –
• trained on 4.0, and used it from 5.0 onwards.
• I still work with Oracle and many other databases – SQL Server, Informix, PostgreSQL, MySQL
• Started working with NoSQL databases a couple of years ago.
• I specialize in building HA and DR systems, end-to-end infrastructure design,
implementations and migrations.
About Today’s Presentation
• NoSQL databases are gaining momentum.
• But there is some confusion over their concepts and the different types of NoSQL
databases.
• Originally I planned to focus only on NoSQL concepts in this presentation.
• But keeping a broader audience in mind, I have also included some Database 101
concepts.
• I have tried my best to put everything together in a format that flows logically.
• As this is not an interactive presentation, I welcome your feedback and any
questions by email, and I will do my best to answer them.
• My contact info is provided at the end of the presentation.
Data and Information
• Data can be defined as discrete elements describing a person, thing or activity.
• Information is putting this data together to form a meaningful inference –
– Querying what is there – a simple way of displaying the data, for example in a spreadsheet or tabular
format
– Visualizing data in a format that can be understood easily – dashboards, graphs, charts etc.
– Making some meaningful analysis – historical analysis, incident analysis, post-mortem analysis, predictive
analysis.
• The terms data and information are often used interchangeably, which is not correct.
– Data is a discrete element; information is a simple or complex compound of these elements.
– Data is generated, sourced, gathered or acquired on its own.
– Information is generated from data.
• Database Management System (DBMS) –
– A database is a location where data is stored in a certain format.
– A DBMS is a collection of programs that allows users to specify the structure of a database, create, query and
modify the data in the database, and control access to it.
Data and Information
• A simple and easy way to understand is to use a Lego Analogy.
– Data is like Lego blocks.
– Information is putting these Lego Blocks together to form a thing.
– And a person who puts everything together is a Data Scientist
POWER OF DATA
• Old saying
– The PEN is MIGHTIER than the SWORD.
• Modern saying
– DATA is MIGHTIER than the PEN and the SWORD.
• Companies like Yahoo, Google, Facebook, Twitter, LinkedIn and many others are
built on using data in a meaningful way – doing business with data and
information. They have completely changed the relationships among people, how
they communicate and how they interact with each other. The term
"social networks" was coined for this.
• Companies like Amazon and Alibaba (among the largest e-commerce portals) are successful
because they mine data to understand consumer behavior.
History of DBMS and Evolution
• Databases have a long history and have evolved through different models from the early 1960s
until now.
– Minimal or no-format databases (no frills) – pre-1960s. These were like writing a transaction on
paper, except that it was stored on a computer.
– Hierarchical database model – early 1960s. Data is stored in different units with
hierarchical relationships.
– Network database model – late 1960s. Multiple relationships can be created between records.
– Relational Database Management Systems (RDBMS) – early 1970s. Based on the relational model
proposed by E. F. Codd in 1970; Codd's 12 rules, published later, define what a relational system must provide.
– NoSQL databases – 2009. Deviate from the relational model and introduce new methods of
storing data.
[Figures: paper/shared files; hierarchical database model; network database model]
Relational Database Management System (RDBMS)
• Most popular database system
• The relational model was proposed by E. F. Codd in the early 1970s.
• Codd's 12 rules define what a relational database must provide.
• Data is modeled as entities and the relationships between them.
• The data is arranged in databases consisting of tables, in row and column format.
• Data storage is optimized with normalization.
• Data in tables is bound by rules called constraints, which enforce the
integrity of data across the database.
• The tables are organized into schemas with access controls.
• RDBMS is ACID compliant.
ACID - Defined
• ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are
processed reliably.
• Atomicity -- Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire
transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and
every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears
(by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.
• Consistency -- Consistency property ensures that any transaction will bring the database from one valid state to
another. Any data written to the database must be valid according to all defined rules, including constraints,
cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways
the application programmer might have wanted (that is the responsibility of application-level code) but merely that
any programming errors cannot result in the violation of any defined rules.
• Isolation -- Isolation property ensures that the concurrent execution of transactions results in a system state that
would be obtained if transactions were executed serially, i.e., one after the other. Providing isolation is the main goal
of concurrency control. Depending on concurrency control method (i.e. if it uses strict - as opposed to relaxed -
serializability), the effects of an incomplete transaction might not even be visible to another transaction.
• Durability -- Durability property ensures that once a transaction has been committed, it will remain so, even in the
event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute,
the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against
power loss, transactions (or their effects) must be recorded in a non-volatile memory.
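The atomicity and consistency properties above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database; the accounts table, the names and the amounts are hypothetical and chosen only for illustration. A transfer that would break a defined rule (a CHECK constraint) is rolled back as a whole, so the credit already applied inside the same transaction does not survive.

```python
import sqlite3

# In-memory SQLite database; the accounts table and its values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This statement violates the CHECK constraint when src is overdrawn.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # fails and is rolled back
print(ok, failed, balances(conn))
```

After the failed transfer, both balances are exactly what they were after the first, successful one: the partial credit to the destination account was undone together with the rejected debit.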
Structured Query Language (SQL)
• Special-purpose programming language designed for managing data in an RDBMS
• Developed by IBM in the 1970s.
• SQL is a 4th-generation language (4GL).
• SQL is based on relational algebra and tuple relational calculus.
• It consists of DML (data manipulation), DCL (data control) and DDL (data definition).
• RDBMS and SQL are closely tied to each other.
DBMS ARCHITECTURE
• VIEW layer (multiple views) – represents how data is portrayed, using interface languages
such as SQL
• LOGICAL LAYER – represents how data is accessed by the users (schema, tables)
• PHYSICAL LAYER – represents how data is stored on the storage devices
RDBMS Concept – Row Format Storage
ROWID     ID        Last     First    Bonus
(unique)  (unique)  (these columns may hold duplicate contents)
001       1         Doe      John     3000
002       2         Smith    Jane     3500
006       6         Brown    Dan      3500
005       5         Doak     Richard  4000
004       4         Smith    Mike     2500
003       3         Taylor   John     2800
Stored row by row:
001,1,Doe,John,3000;
002,2,Smith,Jane,3500;
003,3,Taylor,John,2800;
004,4,Smith,Mike,2500;
005,5,Doak,Richard,4000;
006,6,Brown,Dan,3500
RDBMS Advantages
• Very popular: almost all ERPs and many mainstream applications run
on an RDBMS.
• Integrity and consistency of data, and a simple representation of the data layout – tables
and constraints at the schema level
• Physical independence – users do not need to worry about the physical layer and
interact only with the logical layer.
• Logical independence – makes the database portable across physical layers;
applications and users are not impacted most of the time.
• Support for SQL
• Better backup and restore capabilities
RDBMS Disadvantages
• Expensive and complex software
• Expensive hardware
• Highly skilled resources are required for setup and management.
• Difficult to recover data if lost
• Horizontal scalability is limited; systems mostly scale vertically.
• Very difficult to utilize many complex data types
• Does not completely represent real-world conditions
• Data processing becomes slow as the data size increases, and sometimes even with smaller
data sizes when data handling algorithms change.
• Very limited support for 3GLs, so procedural handling of data is not easy.
EXPLOSION OF DATA
• With the advent of social networks, increased use of computers and widespread
use of the Internet, the data in the world is growing at a tremendous pace.
• Oracle has done a study to estimate data growth and the current volume of data in
the world from all sources, and found the following –
– Data is growing at a very fast pace – at an annually compounded rate of 40%.
– At that rate it almost doubles every two years, and growth may be even faster in the next few years.
– At the current rate of growth it will reach about 45 zettabytes (ZB) by 2020
(1 zettabyte = 10^21 bytes, or 1 trillion GB).
– The amount of data that exists today is 2 times what it was 2 years ago.
• With the increase in data sources such as social networks, the Internet of Things
(IoT) and healthcare, different data types are being generated.
• All the above factors have started to limit the use of RDBMS.
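The growth figures quoted on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a constant 40% annual rate; the function names are illustrative only.

```python
import math

ANNUAL_GROWTH = 0.40  # 40% compounded annually, as quoted on the slide


def growth_factor(years, rate=ANNUAL_GROWTH):
    """Factor by which the data volume grows after the given number of years."""
    return (1 + rate) ** years


def doubling_time(rate=ANNUAL_GROWTH):
    """Years needed for the data volume to double at a constant rate."""
    return math.log(2) / math.log(1 + rate)


ZETTABYTE = 10 ** 21  # bytes
GIGABYTE = 10 ** 9    # bytes

print(round(growth_factor(2), 2))   # 1.96
print(round(doubling_time(), 2))    # 2.06
print(ZETTABYTE // GIGABYTE)        # 1000000000000 (one trillion GB)
```

At 40% per year the volume grows by a factor of about 1.96 in two years, which matches the statement that today's data is about twice what it was two years ago, and the doubling time comes out at roughly two years.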
BIG Data Challenges and RDBMS Limitations
Each Big Data challenge is listed with the corresponding RDBMS limitation.
• High Velocity – data is generated at a very high speed and has to be ingested at that speed.
– RDBMS limitation: it is not easy to configure an RDBMS for a high rate of data ingestion. It requires many resources and hence high-cost software and hardware.
• High Variety – data is generated in different data types (structured, semi-structured and unstructured); no particular format or data type can be defined for certain data sources, such as social networks.
– RDBMS limitation: an RDBMS has only certain data types. Others have to be defined, but defining and maintaining them to meet current requirements is very expensive and they still do not blend in properly.
• High Volume – data is often generated in high volume.
– RDBMS limitation: an RDBMS is limited in ingesting large amounts of data. Handling more data needs more resources and more licenses, and so costs more.
• High Veracity – uncertain and uncleansed data.
– RDBMS limitation: an RDBMS has to be designed to handle peak loads even if they are not always present, and prior cleansing is required, which makes it difficult to handle and cost-prohibitive.
• Continuous Data and Availability
– RDBMS limitation: an RDBMS requires a huge investment to achieve very high HA and DR capabilities, and RTO and RPO targets are still not always met.
• Location Independence – the ability to read and write to a database regardless of where that I/O operation physically occurs, and to have any write propagated out from that location so that it is available to users and machines at other sites.
– RDBMS limitation: an RDBMS hits the limit of this functionality. We cannot have multiple nodes writing to multiple places and still have data concurrency. Oracle RAC provides distributed computing, but not distributed copies of the database at the same time.
• Flexible Data Models – not tied to any principles or schema.
– RDBMS limitation: an RDBMS hits the wall if any of its principles are deviated from, and it cannot create a schemaless, dependency-free model.
• Faster Analytics and Business Intelligence
– RDBMS limitation: an RDBMS again hits the limit of performance and scalability when it comes to real-time analytics and business intelligence.
Paradigm Shift in Database Management
• Organizations increasingly accept that exploiting their big
data will be a major factor in competitiveness over the next decade.
• We are trying to solve today's problems with yesterday's solutions.
• RDBMS is not the solution for everything.
• Big data analytics does not need RDBMS methodology. To a certain extent ACID
can either be compromised or taken care of at the source, and hence need not
be additionally enforced in the database.
• A highly scalable, low-cost solution should be the option, which often rules RDBMS
out. Many mainstream RDBMS products are proprietary, with high software costs.
• SQL is not always the method to extract data, yet RDBMS and SQL are inseparable.
• Many organizations have started to cross the chasm from RDBMS to NoSQL
databases.
NoSQL Databases
• NoSQL is a buzzword in the modern database technology world.
• The term was coined by Carlo Strozzi in 1998 as the name of his lightweight, open-source
relational database, Strozzi NoSQL, which did not expose the standard SQL interface but
was still relational.
• The term has since changed its meaning, or rather added to Strozzi's original
concept of a relational database without a SQL interface.
• Today's concept is that decoupling SQL from the RDBMS means changing the RDBMS methodology
itself.
• Hence NoSQL now means "Not Only SQL", or in other words using a
concept beyond RDBMS.
• NoSQL databases are sometimes called "non-relational" or "non-SQL", but in my
opinion that is not completely true. The term means going beyond the use of SQL alone: a shift in
the way data is stored and managed, and a new breed of DBMS.
Birth of NoSQL
• Johan Oskarsson of Last.fm reintroduced the term NoSQL in early 2009 when he
organized an event to discuss "open source distributed, non relational databases".
• Hadoop and open source opened the door to
innovation in database management systems beyond RDBMS.
• One of the early NoSQL databases was Google BigTable.
• The key ideas in developing the NoSQL concept were distributed
processing, horizontal scalability, use of cheap commodity hardware, and speed
of analytics using 3GLs and other languages, not just a 4GL such as SQL.
Benefits of NoSQL Database
• NoSQL databases have different models and are purpose-built.
• Compared to RDBMS, NoSQL databases are more scalable and can provide better
performance for the workloads they are built for.
• Large volumes of rapidly changing, semi-structured and unstructured data can
be handled easily.
• Helps with agile sprints, quick schema iteration and frequent code pushes
• Fits object-oriented programming, which is easy to use and flexible
• Geographically distributed scale-out architecture
• The Big Data challenges described earlier are addressed by NoSQL databases.
NoSQL Database Concepts
• Open Source
• Schemaless
• Scalability with Scale Out with Commodity Class Hardware
• Distribution and Sharding – Parallel Query with Engines such as MapReduce &
Spark, Distributed Caches
• Data ingestion and extraction using multiple methods.
• Eventual Consistency
• High Availability
NoSQL Concepts – Open Source
• Most NoSQL databases are open source – HBase, CouchDB
• Many vendors today offer commercial databases with support –
MongoDB, Vertica, Couchbase Server
• Some vendors have built their offering on top of open source, like Splice
Machine, which is built on HBase and Derby.
• Almost all of these databases are integrated with many open-source tools.
• They layer on top of some of the Big Data environments or utilize the tools and
concepts already in place in the Big Data ecosystem.
• They do not require a SQL engine. However, many vendors have developed
SQL-like interfaces that translate queries into the built-in distributed
processes.
NoSQL Concepts – Schemaless
• This is very hard to conceptualize when coming from the RDBMS world.
• NoSQL solutions do not require, and often do not accept, a pre-planned data model in which every
record has the same fields and each field of a table has to be accounted for in each
record.
• They support a flexible data model. Though there can be strong similarities from record
to record, there is no "carry-over" from one record to the next.
• Each field is encoded in JavaScript Object Notation (JSON) or Extensible Markup
Language (XML), according to the solution's architecture.
• The result is that developers have the agility they need to meet evolving business
requirements.
• Because of this model, data can be loaded without transformation. Transformation
occurs while extracting the data: ELT, versus ETL in RDBMS. This is very useful
in building data warehouse systems.
• The schema is built at query time (schema on read).
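The idea that the schema is built at query time can be sketched in a few lines. This is an illustration only, assuming JSON-encoded records; the sample records and the query helper are hypothetical and do not represent any particular product's API.

```python
import json

# Hypothetical raw records, loaded as-is (the "EL" of ELT): no shared schema.
raw_records = [
    '{"id": 1, "first": "John", "last": "Doe", "bonus": 3000}',
    '{"id": 2, "first": "Jane", "last": "Smith", "email": "jane@example.com"}',
    '{"id": 3, "last": "Taylor", "tags": ["sales", "east"]}',
]

# Loading involves no transformation: each record keeps whatever fields it has.
store = [json.loads(line) for line in raw_records]


def query(records, fields):
    """Project each record onto the requested fields at read time.

    The "schema" is simply the list of fields the query asks for;
    fields that a record does not have come back as None.
    """
    return [{name: rec.get(name) for name in fields} for rec in records]


rows = query(store, ["id", "last", "bonus"])
for row in rows:
    print(row)
```

The transformation (the "T" of ELT) happens inside the query, so a new field can appear in later records without any change to the records already stored.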
NoSQL Database Architecture
• VIEW layer (multiple views) – represents how data is portrayed, using interface languages such as SQL or Python, or tools
like Tableau or Qlik
• LOGICAL LAYER – represents how data is accessed by the users (schema)
• PHYSICAL LAYER – represents how data is stored on the storage devices
• The view and logical layers are merged: the logical layer becomes part of data visualization, or in
other words a schema is built upon query.
NoSQL Concepts – Scalability with Scale Out
• NoSQL databases scale with a scale-out model.
• NoSQL solutions support growth by dividing the
processing of a single data set across many machines.
• While relational databases are engineered to scale up by adding
resources to the server, NoSQL databases are engineered to scale out by adding
servers or nodes – the distributed processing model.
• This concept is taken from Hadoop, but NoSQL databases do not necessarily
require Hadoop infrastructure in the background.
• Like Hadoop, NoSQL databases can run on commodity-class hardware and do
not require the high-end infrastructure of an RDBMS.
• In principle there is no fixed limit to the number of servers that a NoSQL database can run on.
NoSQL Concepts – Distribution with Sharding
• These databases are engineered to run across multiple servers.
• NoSQL solutions use a partitioning pattern known as SHARDING, which places
each partition on a potentially separate server, possibly in a physically
different location.
• The result is that each server is responsible for operating on its own data instead of all of
the data.
• This helps scalability through scale-out, as discussed.
• This model helps in running parallel query operations using Big Data engines
such as MapReduce or Spark.
• Sharding is implemented using a distributed cache model.
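A minimal sketch of sharding by key follows. It assumes simple hash-based placement, which is only one of several possible strategies (range-based and directory-based placement are others); the shard count and helper names are hypothetical, and each shard is modelled as an in-memory dict rather than a real server.

```python
import hashlib

NUM_SHARDS = 3  # hypothetical cluster size


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a record key to a shard using a stable hash of the key."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Each shard is modelled as a plain dict; in practice it would be a separate server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key, value):
    """Store the value on the one shard that owns the key."""
    shards[shard_for(key)][key] = value


def get(key):
    """Read the value from the one shard that owns the key."""
    return shards[shard_for(key)].get(key)


for rowid, last in [("001", "Doe"), ("002", "Smith"), ("003", "Taylor"),
                    ("004", "Smith"), ("005", "Doak"), ("006", "Brown")]:
    put(rowid, last)

print(get("005"), [len(s) for s in shards])
```

Every read or write touches only the shard that owns the key, which is why each server is responsible for its own data rather than all of it. Note that a plain modulo scheme like this one moves many keys when the number of shards changes; that trade-off is outside the scope of this sketch.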
Distributed Processing between RDBMS & NoSQL
Distributed Processing in RDBMS
1. Single copy of the database.
2. Possible block-level contention.
3. If the same block is accessed, the entire record or
page will be locked.
Distributed Processing in NoSQL DB
1. Multiple copies of the database.
2. Blocks are distributed across machines and hence do not lock
each other.
3. Only the block level is locked, so the entire record is not locked.
4. An added benefit is higher availability.
NoSQL Concepts – Data Ingestion and Extraction
• Most NoSQL databases support many data ingestion tools in the Big Data
ecosystem, such as Flume, Sqoop and Spark Streaming.
• Data is extracted using many methods, not necessarily SQL. Some
mainstream vendors have built their own implementations of SQL to jump-start
the process, but the real power lies in using general-purpose programming languages
such as Java, Python, Scala and R.
• If SQL is used, the SQL jobs are split in the background into
multiple processes spread across different nodes, much like MapReduce or Spark.
Some databases are built on top of MapReduce or Spark, and their queries are
submitted as MapReduce or Spark jobs.
• Data visualization tools such as Tableau or Qlik support most NoSQL databases.
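How a SQL-style aggregate can be split across nodes, in the spirit of MapReduce, can be sketched as follows. This is a single-process illustration under stated assumptions: the partitions stand in for nodes, the data is hypothetical, and no real MapReduce or Spark API is used.

```python
from collections import defaultdict

# Hypothetical partitions of (last name, bonus) records, one list per node.
partitions = [
    [("Doe", 3000), ("Smith", 3500)],
    [("Taylor", 2800), ("Smith", 2500)],
    [("Doak", 4000), ("Brown", 3500)],
]


def map_partition(records):
    """Map step: runs independently on each node and pre-aggregates locally."""
    local = defaultdict(int)
    for last, bonus in records:
        local[last] += bonus
    return dict(local)


def reduce_partials(partials):
    """Reduce step: merge the per-node partial sums into the final answer."""
    total = defaultdict(int)
    for partial in partials:
        for last, subtotal in partial.items():
            total[last] += subtotal
    return dict(total)


# Equivalent in spirit to: SELECT last, SUM(bonus) FROM employees GROUP BY last
result = reduce_partials(map_partition(p) for p in partitions)
print(result)
```

Because each map step needs only its own partition, the steps can run in parallel on different nodes; only the small partial results travel to the reduce step.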
NoSQL Database Concepts – Eventual Consistency
• This is another concept that is very hard to visualize.
• In the RDBMS world we are used to data consistency based on ACID.
• But some NoSQL solutions do not have strong consistency the way a single
machine system does.
• Each record will be consistent, but transactions are usually guaranteed only to be
"eventually consistent": changes may take a short time to reach every copy of the data,
because a write is acknowledged before all copies are updated, which keeps write latency low.
• Sometimes consistency can be compromised, depending on the application
that uses the database – for example predictive analytics or running what-if
scenarios.
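Eventual consistency can be illustrated with a toy model. This sketch assumes that a write is acknowledged once one replica has it and that the other replicas are updated later by a background step; the class and method names are hypothetical, and real systems use far more elaborate protocols.

```python
class Replica:
    """One copy of the data, held as a simple key-value dict."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: writes hit one replica first and propagate later."""

    def __init__(self, num_replicas=3):
        self.replicas = [Replica() for _ in range(num_replicas)]
        self.pending = []  # (replica index, key, value) updates not yet applied

    def write(self, key, value):
        # Acknowledge as soon as the first replica has the value.
        self.replicas[0].data[key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def sync(self):
        # Background propagation: apply all queued updates in order.
        for i, key, value in self.pending:
            self.replicas[i].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("bonus:001", 3000)
stale = store.read("bonus:001", replica=2)   # None: not yet propagated
store.sync()
fresh = store.read("bonus:001", replica=2)   # 3000 after propagation
print(stale, fresh)
```

Between the write and the sync, a reader of the third replica sees the old state. For analytics or what-if scenarios that short window is often acceptable; for an OLTP workload it usually is not.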
NoSQL Database Concepts – High Availability
• High availability is built into NoSQL databases by design.
• No extra effort or software is required for this purpose.
• Data is distributed across multiple nodes with multiple copies, much like Hadoop
infrastructure.
• Failure of a single node in the cluster does not cause data loss or processing failure.
• Once the failed hardware is replaced or brought back online, the data on that node is
automatically synchronized with the changed blocks from the other nodes.
NoSQL DBMS Applications
• Given the questions about ACID compliance, schemaless design and support
for SQL, it is natural to ask where exactly a NoSQL database can be
used and what types of applications it supports.
• NoSQL databases are mostly deployed for ad-hoc query purposes, not
for OLTP. (Some vendors are adding ACID compliance and OLTP support, but largely they
are not used for OLTP.)
• Primary applications – data warehouse, BI, predictive analytics, Big Data
applications.
• Data warehouse and BI applications benefit most from NoSQL databases, which reduce
the cost of hardware and software and increase processing throughput; best of all, they use ELT
rather than ETL.
NoSQL Database Types
• Not all NoSQL databases are designed alike.
• There are different types of NoSQL databases, based on how they
store data.
• Types of NoSQL databases –
– Columnar database stores
– Key-value database stores
– Document database stores
– Graph database stores
– Multi-model database stores
COLUMNAR DATABASE Store
• The columnar model is the most popular NoSQL model, as it is closest to RDBMS.
• It is a DBMS that stores data tables as sections of columns rather than as rows
(unlike RDBMS, where data is stored in rows). See the Column Format Storage slide.
• Data is compressed by eliminating duplicate values in the columns. On top of that, popular
compression methods such as the LZW (Lempel-Ziv-Welch) algorithm and run-length encoding are applied.
• Compression is further enhanced by sorting the data in the columns.
• Some of the most popular databases of this model are –
– HP Vertica, HBase, Cassandra, Accumulo, BigTable, Splice Machine
• SAP HANA is a popular columnar database store, but it is designed mainly to support SAP
applications and is very expensive. In early 2015 SAP announced that its entire ERP (OLTP and batch
processing), SAP S/4HANA, would be supported on HANA.
• The most common uses of this model are clinical data processing, data warehouse and BI, library
card catalogs, and ad-hoc queries that aggregate a small set of columns over a large number of
rows.
Column Format Storage
ROWID     ID        Last     First    Bonus
(unique)  (unique)  (these columns may hold duplicate contents)
001       1         Doe      John     3000
002       2         Smith    Jane     3500
006       6         Brown    Dan      3500
005       5         Doak     Richard  4000
004       4         Smith    Mike     2500
003       3         Taylor   John     2800
Stored column by column:
1,2,3,4,5,6;
Doe,Smith,Taylor,Smith,Doak,Brown;
John,Jane,John,Mike,Richard,Dan;
3000,3500,2800,2500,4000,3500
With duplicates eliminated, each unique value points to the ROWIDs that contain it:
1:001;2:002;3:003;4:004;5:005;6:006;
Doe:001;Smith:002,004;Taylor:003;Doak:005;Brown:006;
John:001,003;Jane:002;Mike:004;Richard:005;Dan:006;
3000:001;3500:002,006;2800:003;2500:004;4000:005;
RDBMS Vs Columnar stores
RDBMS or ROW format storage:
• 001,1,Doe,John,3000;
• 002,2,Smith,Jane,3500;
• 003,3,Taylor,John,2800;
• 004,4,Smith,Mike,2500;
• 005,5,Doak,Richard,4000;
• 006,6,Brown,Dan,3500
Columnar format storage:
1:001;2:002;3:003;4:004;5:005;6:006;
Doe:001;Smith:002,004;Taylor:003;Doak:005;Brown:006;
John:001,003;Jane:002;Mike:004;Richard:005;Dan:006;
3000:001;3500:002,006;2800:003;2500:004;4000:005;
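The conversion from the row format to the columnar format can be sketched in code. This is an illustration only, using the sample data from this slide; the dictionary layout and function name are assumptions and do not reflect the on-disk format of any real columnar database.

```python
from collections import defaultdict

COLUMNS = ["id", "last", "first", "bonus"]

# Row format: each ROWID maps to one complete record (the sample data on this slide).
rows = {
    "001": (1, "Doe", "John", 3000),
    "002": (2, "Smith", "Jane", 3500),
    "003": (3, "Taylor", "John", 2800),
    "004": (4, "Smith", "Mike", 2500),
    "005": (5, "Doak", "Richard", 4000),
    "006": (6, "Brown", "Dan", 3500),
}


def to_column_store(rows):
    """Build one value -> [ROWIDs] index per column, storing each value once."""
    store = {name: defaultdict(list) for name in COLUMNS}
    for rowid, record in rows.items():
        for name, value in zip(COLUMNS, record):
            store[name][value].append(rowid)
    return store


columns = to_column_store(rows)

# Aggregating one column touches only that column's data.
total_bonus = sum(value * len(rowids) for value, rowids in columns["bonus"].items())

print(dict(columns["last"]))
print(total_bonus)
```

Repeated values such as "Smith" or 3500 are stored once with a list of ROWIDs, which is where the compression comes from, and summing the bonus column never reads the name columns.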
Pros and Cons of Columnar Database
• Pros –
– Very useful and efficient when an aggregate needs to be computed over many rows
but only for a small subset of the data.
– Efficient when new values of a column are supplied for all rows at once.
– High compression reduces storage requirements and disk reads.
• Cons –
– If many columns of a single row or of multiple rows have to be queried or fetched, this may be less
efficient, although it often still outperforms RDBMS.
– If an entire row has to be updated or replaced, the operation takes longer because every column must be touched.
Key-Value Database Store
• This is a method for storing, retrieving and managing associative arrays of data, in which
a key (metadata) is defined for each value.
• The store consists of a collection of objects or records of similar type that may have
different fields.
• Each record may differ from the others.
• This differs from RDBMS, where every record follows a pre-defined model of keys and
values.
• The document and graph models are derived from this model.
• It follows modern concepts such as object-oriented
programming (OOP) more closely.
• The most popular databases in this format are –
– Redis, Oracle NoSQL DB, Berkeley DB, DynamoDB
Key-Value Database Store -- Storage
• The following XML (a JSON form is equally possible) represents how data is stored in a key-value store:
<contact>
<firstname>Bhaskar</firstname>
<lastname>Gunda</lastname>
<street1>605 Seward Ave. NW</street1>
<city>Grand Rapids</city>
<state>MI</state>
<zip>49504</zip>
<country>USA</country>
</contact>
– This record is of type – Contact/Address.
– Each field has metadata (key) defining the value.
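The put and get pattern of a key-value store can be sketched as follows. This is a minimal in-memory illustration; the KeyValueStore class, the key naming scheme ("contact:1") and the second record are hypothetical and do not correspond to the API of Redis, DynamoDB or any other product.

```python
import json


class KeyValueStore:
    """Minimal in-memory key-value store: opaque values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is serialized; the store itself never inspects its fields.
        self._data[key] = json.dumps(value)

    def get(self, key):
        raw = self._data.get(key)
        return None if raw is None else json.loads(raw)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("contact:1", {
    "firstname": "Bhaskar",
    "lastname": "Gunda",
    "city": "Grand Rapids",
    "state": "MI",
    "zip": "49504",
})
# A second record of the same type with a different set of fields.
store.put("contact:2", {"firstname": "Jane", "email": "jane@example.com"})

contact = store.get("contact:1")
print(contact["city"])
```

The only access path is the key, which is what makes lookups fast and the model easy to distribute; any search by field value has to be built on top of it.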
Document Database Store
• This is another popular method of storing data. In fact, adoption of NoSQL has
increased because of this model.
• It is designed for storing, retrieving and managing document-oriented,
semi-structured data.
• This model is a subset of the key-value store, but differs from it in that the keys are not
pre-defined.
• Metadata is generated for each document separately.
• The data is stored in free form.
• This differs from RDBMS, where a fixed record structure is created for acquiring and
storing the data.
• Programmers build the intelligence for parsing the data.
• Each document is a record of its own and may differ from the others. Records
are of the same type but do not necessarily have the same number of fields.
Document Data Store – Contd.
• Each document is retrieved using a unique key, usually a URI.
• The database maintains an index on the keys to speed up retrieval.
• This makes the document store popular in web applications.
• Free-form data storage and automatic suggestions are the primary applications of
this data store.
• For retrieval, an administrator adds hints to the database to look for certain types of
information.
• Any document format that carries metadata, such as JSON or XML, can be used to store
data in this store.
• The most popular databases are –
– Couchbase Server
– CouchDB
– MongoDB
– Elasticsearch
Document Data Store – Storage
Document 1:
Bhaskar Gunda
OST,
605 Seward Ave NW,
Grand Rapids, MI 49504
Document 2 (does not contain the company name that the first document has):
Bhaskar Gunda
605 Seward Ave,
Grand Rapids, MI 49504
Document 3 (contains an additional PO Box field compared with the first document):
Bhaskar Gunda
OST,
PO.Box. 456
605 Seward Ave NW,
Grand Rapids, MI 49504
• Each of the above represents one document.
• All three are of the same type – address-type documents.
• But they differ in content and number of fields.
• Each document is stored with a unique key, and metadata is generated for
each document.
• The programmer writes hints such as "find all my <contact>s with a <zip code>".
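The three address documents and the "find all my contacts with a zip code" hint can be sketched as follows. This is an illustration only: the field names, the document keys and the find helper are hypothetical, and real document databases provide their own query languages and indexes.

```python
# Three address documents of the same type but with different fields,
# mirroring the three documents on this slide. Field names are hypothetical.
documents = {
    "contact/1": {"name": "Bhaskar Gunda", "company": "OST",
                  "street": "605 Seward Ave NW", "city": "Grand Rapids",
                  "state": "MI", "zip": "49504"},
    "contact/2": {"name": "Bhaskar Gunda",
                  "street": "605 Seward Ave", "city": "Grand Rapids",
                  "state": "MI", "zip": "49504"},
    "contact/3": {"name": "Bhaskar Gunda", "company": "OST",
                  "po_box": "456", "street": "605 Seward Ave NW",
                  "city": "Grand Rapids", "state": "MI", "zip": "49504"},
}


def find(docs, **criteria):
    """Return the keys of documents whose fields match all criteria.

    Documents that lack a requested field simply do not match.
    """
    return [key for key, doc in docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())]


with_zip = find(documents, zip="49504")
with_company = find(documents, company="OST")
with_po_box = find(documents, po_box="456")
print(with_zip, with_company, with_po_box)
```

A query on a field that only some documents carry (company, po_box) returns just those documents, with no schema change and no NULL-padded columns.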
Document Database Store – Applications
• This type of data store is more popular in Web applications.
• Largely used for semi-structured data.
• Implementations offer a variety of ways of organizing documents, including
notions of:
– Collections
– Tags
– Non-visible Metadata
– Directory hierarchies
– Buckets
GRAPHICAL DATABASE STORE
• This model uses a graph structure consisting of nodes and relationships.
– Each node is an entity: a person, place, thing or activity.
– Each relationship describes how two nodes are connected to each other.
• A graph database is a DBMS that stores, retrieves and manipulates data using a graph
data model.
• Relationships take first priority in this model, so applications don't have to infer data
connections using foreign keys. This is the key difference between an RDBMS and this model.
• For highly connected data, this is simpler and more expressive than other models.
• This model is most useful where relationships must be traversed, as in social networks.
• Graph databases can serve as OLTP databases, and some (Neo4j, for example) are fully
ACID compliant.
• Some graph databases implement a key-value store internally for building the
relationships (pointers) between records.
• Popular options include Neo4j (a graph database) and Apache Giraph (a graph processing
framework rather than a database).
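The idea that relationships are stored with the nodes, rather than inferred through foreign keys, can be sketched in a few lines. This is a toy in-memory illustration and not how any specific graph database is implemented; the node names, the relationship types (FRIEND, LIVES_IN) and the function reachable are all hypothetical.

```python
from collections import deque

# Nodes are entities; each relationship is a (type, target) pair stored
# directly on the source node, so following it needs no foreign-key join.
# All names and relationship types here are hypothetical.
graph = {
    "Alice": [("FRIEND", "Bob"), ("LIVES_IN", "Grand Rapids")],
    "Bob": [("FRIEND", "Carol")],
    "Carol": [("FRIEND", "Dave")],
    "Dave": [],
    "Grand Rapids": [],
}


def reachable(graph, start, rel_type, max_hops):
    """Breadth-first traversal that follows only `rel_type` relationships.

    Returns a dict mapping each reached node to its hop distance from `start`,
    limited to `max_hops` hops.
    """
    distances = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for rel, target in graph.get(node, []):
            if rel == rel_type and target not in seen:
                seen.add(target)
                distances[target] = hops + 1
                queue.append((target, hops + 1))
    return distances


if __name__ == "__main__":
    # Friends and friends-of-friends of Alice.
    print(reachable(graph, "Alice", "FRIEND", 2))
```

A "friends of friends" question becomes a short traversal here, whereas a relational schema would typically express it as repeated self-joins on a friendship table.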
Graph Database Store - Storage
Multi-model Database Store
• Each of the database types (columnar, key-value, document, graph) is organized around a
single data model that determines how data is stored, retrieved and manipulated.
• If an organization has two applications, each best served by a different data model,
it has to implement two different databases, one per application (called polyglot
persistence), which defeats the purpose of using a NoSQL database.
• A multi-model database resolves this by combining two or more models in a single database.
• This offers the advantages of polyglot persistence without running multiple database systems.
• Multi-model databases can also be ACID compliant (OrientDB, for example, supports ACID
transactions).
• One of the first and most widely used databases is OrientDB (supporting graph, document,
key-value and object models).
• Another popular database is Couchbase Server.
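To make the multi-model idea concrete, the sketch below exposes key-value, document and graph access over the same set of records. It is a toy in-memory class written for this illustration; the class name MultiModelStore, its methods and the sample keys are assumptions and do not correspond to the API of OrientDB, Couchbase Server or any other product.

```python
class MultiModelStore:
    """Toy in-memory store offering key-value, document and graph access
    to the same records. Hypothetical; not any real product's API."""

    def __init__(self):
        self._records = {}  # key -> document (dict)
        self._edges = {}    # key -> set of (relationship, target key)

    # Key-value access: store and fetch a whole record by its key.
    def put(self, key, document):
        self._records[key] = dict(document)
        self._edges.setdefault(key, set())

    def get(self, key):
        return self._records.get(key)

    # Document access: query by a field inside the stored documents.
    def find(self, field, value):
        return sorted(k for k, d in self._records.items() if d.get(field) == value)

    # Graph access: connect records and follow relationships.
    def link(self, source, relationship, target):
        if source not in self._records or target not in self._records:
            raise KeyError("both endpoints must exist before linking")
        self._edges[source].add((relationship, target))

    def neighbors(self, source, relationship):
        return sorted(t for r, t in self._edges.get(source, ()) if r == relationship)


if __name__ == "__main__":
    store = MultiModelStore()
    store.put("c1", {"type": "contact", "name": "Bhaskar Gunda", "zip": "49504"})
    store.put("o1", {"type": "company", "name": "OST"})
    store.link("c1", "WORKS_AT", "o1")

    print(store.get("o1"))                    # key-value style
    print(store.find("zip", "49504"))         # document style
    print(store.neighbors("c1", "WORKS_AT"))  # graph style
```

The design choice being illustrated is that one copy of the data serves all three access styles, instead of one database per style.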
Selecting a NoSQL Database
• Which database model is suitable depends largely on the intended business use of
the data.
• Key factors to consider are:
• Model of the database store required by the business need
• Scalability
• Whether ACID compliance is required
• Sharding capability
• Whether in-memory transactions are needed
• Data ingestion, extraction and visualization support
• Support for the Hadoop ecosystem
• Cost to support
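One common way to compare shortlisted candidates against the factors above is a weighted scoring matrix. The sketch below is only an illustration of that technique: the weights, the 1-to-5 scores and the candidate names candidate_a and candidate_b are placeholders made up for the example, not measurements of any real database.

```python
# Weights express how much each factor matters to the business (they sum to 1.0).
# Factor names follow the list above; all numbers are hypothetical placeholders.
weights = {
    "model_fit": 0.30,
    "scalability": 0.20,
    "acid": 0.15,
    "sharding": 0.10,
    "in_memory": 0.05,
    "ingestion_visualization": 0.10,
    "hadoop_support": 0.05,
    "cost": 0.05,
}

# Scores from 1 (poor) to 5 (excellent) for two made-up candidates.
candidates = {
    "candidate_a": {"model_fit": 5, "scalability": 3, "acid": 2, "sharding": 3,
                    "in_memory": 2, "ingestion_visualization": 4,
                    "hadoop_support": 3, "cost": 4},
    "candidate_b": {"model_fit": 3, "scalability": 5, "acid": 4, "sharding": 5,
                    "in_memory": 4, "ingestion_visualization": 3,
                    "hadoop_support": 4, "cost": 2},
}


def weighted_score(scores, weights):
    """Return the weighted sum of a candidate's factor scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[factor] * scores[factor] for factor in weights)


def rank(candidates, weights):
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(candidates[name], weights),
                  reverse=True)


if __name__ == "__main__":
    for name in rank(candidates, weights):
        print(name, round(weighted_score(candidates[name], weights), 2))
```

In practice the weights would come from the business requirements, and the scores from a proof of concept rather than from guesses.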
NoSQL Database Challenges
• NoSQL databases are mostly used for ad-hoc queries and predictive analytics, and
increasingly in data warehouse (DW) and BI applications. They are generally not intended
for OLTP or for supporting mainstream applications such as ERPs.
• Security is one of the concerns in these models. However, vendor-provided NoSQL
databases are, to a certain extent, implementing more rigorous security models.
• Selecting the right model to suit the business need requires in-depth analysis and
understanding of each of the models. This requires a highly skilled resource
(usually an outside resource) to identify the right type.
• Risk in selection can be mitigated by conducting a proof of concept (POC) on the
shortlisted models. Cloud infrastructure can usually be used for this purpose.
Future of RDBMS
• With all this discussion, we may feel that the RDBMS is going to die.
• Is the RDBMS really going to die?
– Not in reality. An RDBMS provides ACID compliance, a general-purpose model and a mature
state of data storage, all of which mainstream applications require.
– Many applications, such as ERPs and transactional systems, are designed for an RDBMS.
– For OLTP workloads, an RDBMS remains the database of choice.
• In reality, RDBMS and NoSQL databases will co-exist in any organization for many years
to come. Some NoSQL databases are also closing the gap between the two by adding
RDBMS-like capabilities.
• Replacing the RDBMS behind its business operations would be a very expensive proposition
for any organization.
• However, it is easier, cheaper and more beneficial to use NoSQL databases instead of an
RDBMS for applications like data warehouses, BI or any new analytics
platform.
References
• To make this presentation more concise and precise, some of the information is
taken from other presentations. I could not find the references to authors of those
presentations. However, I would like to thank them for making the material
available for reference.
My Info
Bhaskar Gunda
Principal Consultant,
OST
Phone: 616-574-3504
Email: bgunda@ostusa.com
NoSQL-Database-Concepts

  • 1. WWW.OSTUSA.COM DATABASES FOR BIG DATA EVOLUTION OF NoSQL DATABASES and CONCEPTS Bhaskar Gunda, Open Systems Tecnologies
  • 2. Agenda • About Me • Introduction to DBMS – History and Evolution • RDBMS concepts • Overview of Big Data • Boundaries of RDBMS—Need for DBMS beyond RDBMS • Paradigm Shift in DBMS • NoSQL Databases – Definition, Advantages and breaking boundaries • Types of NoSQL Databases and their Usage • Future of RDBMS
  • 3. Agenda • About Me • Introduction to DBMS – History and Evolution • RDBMS concepts • Overview of Big Data • Boundaries of RDBMS—Need for DBMS beyond RDBMS • Paradigm Shift in DBMS • NoSQL Databases – Definition, Advantages and breaking boundaries • Types of NoSQL Databases and their Usage • Future of RDBMS
  • 4. About Me • Bhaskar Gunda – Working as Principal Consultant at Open Systems Technologies • Has 28 years of IT experience • I am an Electrical Engineer with MBA • Started working with Computers while in college building Microprocessor based systems such as Logic controllers on Intel 8085 and Z-80 systems using Assembly language. • Started Career with Databases – – First ever database that I worked was – dBase III & dBase IV. – First Commercial database to workd was Sybase . – But immediately transitioned into Oracle – • was trained in 4.0, but started using 5.0 onwards. • Still continuing to work with Oracle and many other databases – SQL Server, Informix, PostgreSQL, MySQL • Started working NoSQL DBs couple of years back. • I specialize in building HA and DR systems, End-to-End Infrastructure design, implementations, migrations.
  • 5. About Today’s Presentation • NoSQL databases are gaining momentum • But there is some confusion over their concepts and different types of NoSQL Databases. • Originally I thought of only focusing on NoSQL Concepts in this presentation. • But in keeping broader audience in mind, I have included some Database 101 Concepts also in this presentation. • I tried my best to put everything together in a format that flows logically. • As this is not an interactive presentation, I welcome your feedback and any questions through email. • I will do my best to answer your questions through email. • My contact info is provided at the end of the presentation.
  • 6. Agenda • About Me • Introduction to DBMS – History and Evolution • RDBMS concepts • Overview of Big Data • Boundaries of RDBMS—Need for DBMS beyond RDBMS • Paradigm Shift in DBMS • NoSQL Databases – Definition, Advantages and breaking boundaries • Types of NoSQL Databases and their Usage • Future of RDBMS
  • 7. Data and Information • Data can be defined as Discrete elements describing a person, thing or an activity. • Information is putting this Data together to form a meaningful Inference – – Querying What is there – simple way of displaying the data – may be a spreadsheet format or a tabular format – Visualization of data in a format that can be understood easily – dashboards, graphs, charts etc – Making some meaningful analysis – historical analysis, Incident Analysis, Post-mortem Analysis, Predictive Analysis.. Often times Data and Information are used interchangeably, which is not correct. – Data is discrete element and Information is a simple or complex compound of these elements. – Data is generated, sourced, gathered, acquired on its own – Information is generated from Data • Database Management System (DBMS) -- – Database is a location where the data is stored in certain format – DBMS is a collection of programs that allows users to specify the structure of database, create, query and modify the data in the database and control access to it.
  • 8. Data and Information • A simple and easy way to understand is to use a Lego Analogy. – Data is like Lego blocks. – Information is putting these Lego Blocks together to form a thing. – And a person who puts everything together is a Data Scientist
  • 9. POWER OF DATA • Old Saying – PEN is MIGHTIER than SWORD. • Modern Saying is – DATA is MIGHTIER than PEN and a SWORD. • Companies like Yahoo, Google, Facebook, Twitter, LinkedIn and many others are based on Using Data in a meaningful way – doing business with Data and Information. They have completely changed the relationships among people, how they communicate and how they interact with each other. Because of this a term has been coined in – Social Networks. • Companies like Amazon, Alibaba (largest e-commerce portals) are successful because of mining of data to understand the consumer behavior.
  • 10. History of DBMS and Evolution • Databases have a long history and evolved different models from early 1960’s until now. – Minimal or no-format Databases (No Frills) – These databases were like writing a transaction on a paper except was stored in Computers – pre 1960’s. – Hierarchical Database Models – early 1960’s -- Data is stored into different Units with Hierarchical relationships – Network Database Model – Late 1960’s – Multiple relationships were created with transactions. – Relational Database Management Systems (RDBMS) -- Early 1970’s – Uses Entity-Relationship model based on E.F.Codd’s 12 Principles – NoSQL Database – 2009. Deviates away from Relational Model and introduces new method of storing the data
  • 14. Agenda • About Me • Introduction to DBMS – History and Evolution • RDBMS concepts • Overview of Big Data • Boundaries of RDBMS—Need for DBMS beyond RDBMS • Paradigm Shift in DBMS • NoSQL Databases – Definition, Advantages and breaking boundaries • Types of NoSQL Databases and their Usage • Future of RDBMS
  • 15. Relational Database Management System (RDBMS) • Most Popular Database System • Developed by E.F.Codd in early 1970’s. • The database is based on 12 Principles developed by E.F.Codd • This is based on Entity and Relationships. • The data is arranged in Databases consisting of Tables – in Row & Column format. • Data storage is optimized with Normalization. • Data in tables are bound by relationships called Constraints – which enforces the integrity of data across the database. • The tables are arranged in Schema format with access controls. • RDBMS is ACID Complaint.
  • 16. ACID - Defined • ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. • Atomicity -- Atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen. • Consistency -- Consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code) but merely that any programming errors cannot result in the violation of any defined rules. • Isolation -- Isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e., one after the other. Providing isolation is the main goal of concurrency control. Depending on concurrency control method (i.e. if it uses strict - as opposed to relaxed - serializability), the effects of an incomplete transaction might not even be visible to another transaction. • Durability -- Durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements execute, the results need to be stored permanently (even if the database crashes immediately thereafter). 
To defend against power loss, transactions (or their effects) must be recorded in a non-volatile memory.
  • 17. Structured Query Language (SQL) • Special Purpose Programming Language designed for managing data in RDBMS • Developed by IBM in 1970’s. • SQL is 4th Generation Language. • SQL is based on relational algebra and tuple related Calculus. • It consists of DML, DCL and DDL. • RDBMS and SQL are closely tied to each other. Title
  • 18. DBMS ARCHITECTURE Title PHYSICAL LAYER (Represents how data is stored on the Storage Devices) LOGICAL LAYER (Represents how data is accessed by the users – Schema, Tables) VIEW VIEW VIEW Represents How Data has been portrayed- Using Interface Languages such as SQL
  • 19. RDBMS Concept Unique Values 001,1,Doe,John,3000; 002,2,Smith,Jane,3500; 003,3,Taylor,John,2800; 004,4,Smith,Mike,2500; 005,5,Doak,Richard,4000; 006,6,Brown,Dan,3500 Row Format Storage ID 1 2 6 5 4 3 Last Doe Smith Brown Doak Smith Taylor First John Jane Dan Richard Mike John Bonus 3000 3500 3500 4000 2500 2800 Possible duplicate contents Unique ROWID 001 002 006 004 003 005
  • 20. RDBMS Advantages • Very popular and almost all the ERPs and many mainstream applications are run on RDBMS. • Integrity and consistency of data and simple representation of data layout – tables & constraints in a schema level • Physical independence – Users are not worried about physical layer, but only interact with Logical layer. • Logical Independence – makes database portable across physical layers and applications and users are not impacted for most of the times • Support for SQL • Better backup and restore capabilities Title
  • 21. RDBMS Disadvantages • Expensive and complex Software • Expensive Hardware • Highly Skilled resources are required for setting up and managing. • Difficult to recover data if lost • Horizontal scalability is limited • Only Vertically scalable • Very difficult to utilize many complex data types • Does not completely represent real world conditions • Data processing becomes slow as the size increases or some times even simpler data sizes also due to changing data handling algorithms. • Very limited support for 3 GLs and hence Procedural handling of Data is not easy. Title
  • 22. Agenda • About Me • Introduction to DBMS – History and Evolution • RDBMS concepts • Overview of Big Data • Boundaries of RDBMS—Need for DBMS beyond RDBMS • Paradigm Shift in DBMS • NoSQL Databases – Definition, Advantages and breaking boundaries • Types of NoSQL Databases and their Usage • Future of RDBMS
  • 23. EXPLOSION OF DATA • With advent of Social networks, increases utilization of Computers and wide spread use of Internet, the data in the world is growing at tremendous pace. • Oracle has done a study to estimate the data growth and current data content in the world from all the sources and found the following – Data is growing at very faster pace – at an annually compounded rate of 40%. – It is almost doubling every year or may be even more in next few years. – At the current rate of growth it will reach about 45 Zetabytes (ZB) by 2020 (1 zettabyte = 1021 bytes or 1 trillion GB) – Amount of Data that exists today is 2 times of what it was 2 years back. • Due to increase in the data sources such as Social Networks, Internet of things (IoT), Healthcare – different data types are being generated • All the above factors have started to limit the use of RDBMS Title
  • 24. Agenda • About Me • Introduction to DBMS – History and Evolution • RDBMS concepts • Overview of Big Data • Boundaries of RDBMS—Need for DBMS beyond RDBMS • Paradigm Shift in DBMS • NoSQL Databases – Definition, Advantages and breaking boundaries • Types of NoSQL Databases and their Usage • Future of RDBMS
  • 25. Big Data Challenges and RDBMS Limitations (each Big Data challenge is followed by the corresponding RDBMS limitation)
    • High Velocity: data is generated at very high speed and must be ingested as it arrives. – RDBMS limitation: it is not easy to configure an RDBMS for a high rate of data ingestion; doing so requires many resources and hence costly software and hardware.
    • High Variety: generated data comes in different data types; no particular format or data type can be defined for certain data sources such as social networks (structured, semi-structured and unstructured). – RDBMS limitation: an RDBMS has only certain data types. Others have to be defined, but defining and maintaining them to meet current requirements is very expensive, and they still do not blend in properly.
    • High Volume: data is often generated in high volume. – RDBMS limitation: an RDBMS is limited in ingesting large amounts of data; going further means more resources, more licenses and more cost.
    • High Veracity: uncertain and uncleansed data. – RDBMS limitation: an RDBMS has to be designed to handle peak loads even when they are not the norm, and prior cleansing is required, which makes the data difficult to handle and the cost prohibitive.
    • Continuous Data and Availability. – RDBMS limitation: an RDBMS requires a huge investment to achieve very high HA and DR capabilities, and even then RTO and RPO targets are not fully met.
    • Location Independence: the ability to read and write to a database regardless of where that I/O operation physically occurs, and to have any write propagated out from that location so that it is available to users and machines at other sites. – RDBMS limitation: an RDBMS hits the limit of this functionality. We cannot have multiple nodes writing to multiple places and still have data concurrency. Oracle RAC provides distributed computing, but not distributed copies of the database at the same time.
    • Flexible Data Models: not tied to any principles or schema. – RDBMS limitation: an RDBMS hits the wall if any of its principles are deviated from; it cannot create a schema-less, dependency-less model.
    • Faster Analytics and Business Intelligence. – RDBMS limitation: an RDBMS again hits the limit of performance and scalability when it comes to real-time analytics and Business Intelligence.
  • 27. Paradigm Shift in Database Management • Organizations increasingly concede that exploiting their big data will be a major factor in competitiveness over the next decade. • We are trying to solve today’s problems with yesterday’s solutions. • RDBMS is not the solution for anything and everything. • Big Data analytics does not need the RDBMS methodology. To a certain extent ACID can either be compromised or taken care of at the source, and hence need not additionally be enforced in the database. • A highly scalable, low-cost solution should be the option, and here RDBMS falls short: commercial RDBMS products are typically proprietary systems with high software cost. • SQL is not always the method to extract data, yet RDBMS and SQL are inseparable. • Most organizations have started to cross the chasm from RDBMS to NoSQL databases.
  • 29. NoSQL Databases • NoSQL database is a buzzword in the modern database technology world. • The term NoSQL was coined by Carlo Strozzi in 1998 to name his lightweight open-source relational database, Strozzi NoSQL, which did not expose the standard SQL interface but was still relational. • The term has since changed its meaning, or rather extended Strozzi’s original concept of a database that is not accessed through SQL. • Today’s concept is that decoupling SQL from the RDBMS means changing the RDBMS methodology itself. • Hence a NoSQL database means a “Not Only SQL” database, in other words one that uses a concept beyond RDBMS. • NoSQL databases are sometimes called “non-relational” or “non-SQL”, but in my opinion that is not completely true: the term means going beyond the use of SQL only, a shift in the way data is stored and managed, and a new breed of DBMS.
  • 30. Birth of NoSQL • Johan Oskarsson of Last.fm reintroduced the term NoSQL in early 2009 when he organized an event to discuss "open source distributed, non relational databases". • Hadoop and open source opened the doors to a world of innovation in database management systems that looks beyond RDBMS. • One of the early NoSQL database entries was Google Bigtable. • The key ideas in developing the concept of the NoSQL database were distributed processing, horizontal scalability, use of cheap commodity hardware, and fast analytics using 3GL and other languages rather than only the 4GL, SQL.
  • 31. Benefits of NoSQL Database • NoSQL databases have different models and are purpose built. • Compared to RDBMS, NoSQL databases are more scalable and provide superior performance for the workloads they are built for. • Large volumes of rapidly changing, semi-structured and unstructured data can be handled easily. • They help with Agile sprints, quick schema iteration and frequent code pushes. • They fit object-oriented programming that is easy to use and flexible. • They offer a geographically distributed scale-out architecture. • All the challenges described for Big Data are addressed by NoSQL databases.
  • 32. NoSQL Database Concepts • Open Source • Schemaless • Scalability with Scale Out on Commodity Class Hardware • Distribution and Sharding – Parallel Query with Engines such as MapReduce & Spark, Distributed Caches • Data ingestion and extraction using multiple methods • Eventual Consistency • High Availability
  • 33. NoSQL Concepts – Open Source • Most NoSQL databases are open source, for example HBase and CouchDB. • Many vendors today offer commercial databases with support, for example MongoDB, Vertica and Couchbase Server. • Some vendors have built their offering on top of open source, like Splice Machine, which is built on HBase and Derby. • Almost all of these databases are integrated with many open source tools. • They layer on top of some of the Big Data environments or utilize the tools and concepts already in place in the Big Data ecosystem. • A SQL engine is not required; however, many vendors have developed SQL-like products that translate queries into the built-in distributed processes.
  • 34. NoSQL Concepts – Schemaless • This is very hard to conceptualize when coming from the RDBMS world. • NoSQL solutions do not require, or accept, a pre-planned data model in which every record has the same fields and each field of a table has to be accounted for in each record. • They support a flexible data model. Though there can be strong similarities from record to record, there is no “carry-over” from one record to the next. • Each field is encoded with JavaScript Object Notation (JSON) or Extensible Markup Language (XML), according to the solution’s architecture. • The result is that developers have the agility they need to meet evolving business requirements. • Because of this model, data can be loaded without transformation. Transformation occurs while extracting the data: ELT, versus ETL in RDBMS. This is very useful in building Data Warehouse systems. • The schema is built on query.
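The "schema is built on query" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names and the `load`/`project` helpers are hypothetical and do not come from any particular NoSQL product.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no shared schema).
raw_records = [
    '{"name": "Doe", "bonus": 3000}',
    '{"name": "Smith", "bonus": 3500, "city": "Grand Rapids"}',
    '{"name": "Taylor", "email": "taylor@example.com"}',
]


def load(records):
    """Parse each stored record independently; no field is mandatory."""
    return [json.loads(record) for record in records]


def project(docs, fields, default=None):
    """Apply a 'schema' at query time by picking only the requested fields."""
    return [{field: doc.get(field, default) for field in fields} for doc in docs]


docs = load(raw_records)
rows = project(docs, ["name", "bonus"])
for row in rows:
    print(row)
```

The transformation (choosing fields, filling in missing values) happens when the data is read, which is the ELT ordering described above; nothing had to be decided when the records were written.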
  • 35. NoSQL Database Architecture • PHYSICAL LAYER: represents how data is stored on the storage devices. • LOGICAL LAYER: represents how data is accessed by the users (the schema). • VIEW: represents how data is portrayed, using interface languages such as SQL or Python, or tools like Tableau or Qlik. • The View and Logical layers are merged: the Logical layer becomes part of data visualization, or in other words a schema is built upon query.
  • 36. NoSQL Concepts – Scalability with Scale Out • NoSQL databases scale with a scale-out model. • NoSQL solutions support a scale-out model for growth by dividing the processing of a single data set across many machines. • While relational databases are engineered to scale up by adding resources to the server, NoSQL databases are engineered to scale by adding servers or nodes: a distributed processing model. • This concept is taken from Hadoop, but NoSQL databases do not necessarily require Hadoop infrastructure in the background. • NoSQL databases, like Hadoop, can run on commodity-class hardware and do not require the high-end infrastructure an RDBMS does. • There is no fixed limit to the number of servers that NoSQL databases can run on.
  • 37. NoSQL Concepts – Distribution with Sharding • These databases are engineered to run on multiple servers. • NoSQL solutions use a partitioning pattern known as SHARDING, which places each partition on a potentially separate server that may also be physically distant from the others. • The result is that each server is responsible for operating on its own data instead of all of the data. • This helps in scalability with scale out, as discussed. • This model helps in running parallel query operations using Big Data engines such as MapReduce or Spark. • Sharding is implemented using a distributed cache model.
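A minimal sketch of how a key can be routed to a shard, assuming simple hash-based placement. The `ShardedStore` class is hypothetical; real products use their own placement schemes (for example range-based or consistent hashing), and each shard here is just an in-memory dictionary standing in for a separate server.

```python
import hashlib


class ShardedStore:
    """Toy sharded store: each shard stands in for a separate server."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        # Only the owning shard is consulted, never all of the data.
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(shard_count=3)
for name in ["Doe", "Smith", "Taylor", "Doak", "Brown"]:
    store.put(name, {"last": name})

print(store.get("Smith"))
print([len(shard) for shard in store.shards])
```

Each lookup touches exactly one shard, which is why adding servers adds capacity. Note that with plain modulo hashing, changing the shard count moves most keys, one reason production systems prefer more elaborate schemes.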
  • 38. Distributed Processing between RDBMS & NoSQL
    • Distributed Processing in RDBMS – 1. Single copy of the database. – 2. Possible block-level contention. – 3. If the same block is accessed, then the entire record or page will be locked.
    • Distributed Processing in NoSQL DB – 1. Multiple copies of the database. – 2. Blocks are distributed across machines and hence will not lock each other. – 3. Only the block level is locked, so the entire record is not locked. – 4. An added benefit is higher availability.
  • 39. NoSQL Concepts – Data Ingestion and Extraction • Most NoSQL databases support many data ingestion tools in the Big Data ecosystem, such as Flume, Sqoop and Spark Streaming. • Data is extracted using many methods, not necessarily SQL. Some mainstream vendors have built their own implementations of SQL to jump-start the process, but the actual power lies in using programming languages such as Java, Python, Scala and R. • If SQL is used, then in the background the SQL jobs are split into multiple processes spread across different nodes, much like MapReduce or Spark. Some of the databases are built on top of MapReduce or Spark, and their queries are submitted as MapReduce or Spark jobs. • Data visualization tools such as Tableau or Qlik support most of the NoSQL DBs.
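The way a query such as an average can be split across nodes and recombined can be sketched as follows. This is an illustration only: the partitions, the helper names and the use of a thread pool to stand in for separate nodes are all assumptions, not how MapReduce, Spark or any specific database is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of (last name, bonus) rows, one list per node.
partitions = [
    [("Doe", 3000), ("Smith", 3500)],
    [("Taylor", 2800), ("Smith", 2500)],
    [("Doak", 4000), ("Brown", 3500)],
]


def map_partition(rows):
    """Runs on one node: return a partial (sum, count) for its own rows."""
    return sum(bonus for _, bonus in rows), len(rows)


def reduce_partials(partials):
    """Runs once: combine the partial results into the final average."""
    total = sum(part_total for part_total, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


def average_bonus(parts):
    # Threads stand in for nodes working on their partitions in parallel.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(map_partition, parts))
    return reduce_partials(partials)


print(average_bonus(partitions))
```

The point of returning `(sum, count)` rather than a per-partition average is that partial averages cannot be combined correctly when partitions differ in size.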
  • 40. NoSQL Database Concepts – Eventual Consistency • This is another concept that is very hard to visualize. • In the RDBMS world we are used to having data consistency based on ACID. • Some NoSQL solutions still do not have strong consistency like a single-machine system does. • Each record will be consistent, but transactions are usually guaranteed to be “eventually consistent”, which means a change may take a short time to reach every copy of the data; this is the trade-off for lower latency in the write operation. • Sometimes consistency can be compromised, depending upon the application that is using the database, for example predictive analytics or running what-if scenarios.
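A toy model of eventual consistency, assuming a write is acknowledged after reaching one replica and copied to the others later. The `EventuallyConsistentStore` class and its explicit `propagate` step are hypothetical simplifications; real systems replicate in the background and offer tunable consistency levels.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes are acknowledged early and replicated later."""

    def __init__(self, replica_count):
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = deque()  # (replica index, key, value) not yet applied

    def write(self, key, value):
        # Acknowledge as soon as the first replica has the value (low latency).
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def propagate(self):
        # Stands in for background replication catching up.
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value


store = EventuallyConsistentStore(replica_count=3)
store.write("bonus:001", 3000)
stale = store.read("bonus:001", replica=2)   # replica 2 has not caught up yet
store.propagate()
fresh = store.read("bonus:001", replica=2)   # now every replica agrees
print(stale, fresh)
```

Between `write` and `propagate` a reader may see an old or missing value, which is acceptable for the analytics-style applications mentioned above but not for most transactional ones.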
  • 41. NoSQL Database Concepts – High Availability • High availability is built into NoSQL databases by design. • No extra effort or software is required for this purpose. • Data is distributed across multiple nodes with multiple copies, much like Hadoop infrastructure. • Failure of any node in the cluster will not cause data loss or processing failure. • Once the failed hardware is replaced or brought back online, the data on that node is automatically synchronized from the changed blocks on the other nodes.
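The replication and recovery behaviour described above can be sketched with a small in-memory cluster. Everything here is an assumption made for illustration: the `Cluster` class, the byte-sum placement rule and the full re-copy on recovery (real systems typically ship only the changed blocks).

```python
class Cluster:
    """Toy cluster keeping `replication_factor` copies of every key."""

    def __init__(self, node_count, replication_factor):
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replication_factor = replication_factor

    def owners(self, key):
        # Deterministic placement: a start node plus its successors.
        start = sum(key.encode("utf-8")) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replication_factor)]

    def put(self, key, value):
        for node in self.owners(key):
            if self.alive[node]:
                self.nodes[node][key] = value

    def get(self, key):
        # Any surviving copy can answer the read.
        for node in self.owners(key):
            if self.alive[node] and key in self.nodes[node]:
                return self.nodes[node][key]
        raise KeyError(key)

    def fail(self, node):
        self.alive[node] = False
        self.nodes[node].clear()  # simulate losing the node's storage

    def recover(self, node):
        # Re-synchronize the returning node from the surviving copies.
        self.alive[node] = True
        for other, data in enumerate(self.nodes):
            if other == node or not self.alive[other]:
                continue
            for key, value in data.items():
                if node in self.owners(key):
                    self.nodes[node][key] = value


cluster = Cluster(node_count=3, replication_factor=2)
for key in ["a", "b", "c", "d"]:
    cluster.put(key, key.upper())

cluster.fail(0)
survived = [cluster.get(key) for key in ["a", "b", "c", "d"]]
cluster.recover(0)
print(survived, sorted(cluster.nodes[0]))
```

With two copies of every key, the loss of any single node leaves at least one copy readable, and the recovered node is refilled without operator intervention.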
  • 42. NoSQL DBMS Applications • Given the questions about ACID compliance, schema-less options, support for SQL and so on, it is natural to ask where exactly a NoSQL database can be used. • What types of applications are supported on a NoSQL database? • NoSQL databases are mostly deployed for ad-hoc query purposes. They are not deployed for OLTP purposes. (Some vendors are coming out with ACID compliance and OLTP support, but largely these databases are not used for OLTP.) • Primary applications: Data Warehouse, BI, predictive analytics, Big Data applications. • Data Warehouse and BI applications benefit most from NoSQL DBs, as they reduce the cost of hardware and software and increase processing output; best of all, they use ELT and not ETL.
  • 44. NoSQL Database Types • Not all NoSQL databases are designed alike. • There are different types of NoSQL databases, based on how they store data. • The types of NoSQL databases are: – Columnar database stores – Key-Value database stores – Document database stores – Graph database stores – Multi-model database stores
  • 45. COLUMNAR DATABASE Store • The most popular model is the columnar database model, as it is closest to RDBMS. • It is a DBMS that stores data tables as sections of columns of data rather than as rows of data (unlike RDBMS, where data is stored in rows). This is explained in the next slide. • Data is compressed by eliminating duplicate data in the columns. On top of that, popular compression schemes such as the LZW (Lempel-Ziv-Welch) algorithm or run-length encoding are applied. • Compression is further enhanced by sorting the data in the columns. • Some of the most popular databases of this model are: – HP Vertica, HBase, Cassandra, Accumulo, Bigtable, Splice Machine • SAP HANA is one of the popular columnar database stores, but it is designed to support only SAP applications and is very expensive. At the beginning of last year (2015), SAP announced that its entire ERP (OLTP & batch processing), SAP S/4HANA, would be supported on HANA. • The most common uses of this model are clinical data processing, Data Warehouse & BI, library card catalogs, and ad-hoc queries that aggregate large amounts of data over a small set of columns.
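Why sorting a column helps compression can be shown with run-length encoding, one of the schemes named above. This is a minimal sketch, not the encoder of any particular product; the helper names are hypothetical and the sample values are the Bonus column from the example table used later in the deck.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


# Bonus column in row order (values taken from the deck's example table).
bonus = [3000, 3500, 2800, 2500, 4000, 3500]

unsorted_runs = run_length_encode(bonus)
sorted_runs = run_length_encode(sorted(bonus))

print(len(unsorted_runs), unsorted_runs)
print(len(sorted_runs), sorted_runs)
```

In row order the two 3500 values are not adjacent, so nothing collapses; after sorting they form a single run. On real columns with many repeated values the saving is far larger than in this six-row example.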
  • 46. Column Format Storage
    Source table (ID and ROWID hold unique values; Last, First and Bonus may contain duplicate contents):
    ROWID  ID  Last    First    Bonus
    001    1   Doe     John     3000
    002    2   Smith   Jane     3500
    006    6   Brown   Dan      3500
    005    5   Doak    Richard  4000
    004    4   Smith   Mike     2500
    003    3   Taylor  John     2800
    Values stored column by column:
    1,2,3,4,5,6; Doe,Smith,Taylor,Smith,Doak,Brown; John,Jane,John,Mike,Richard,Dan; 3000,3500,2800,2500,4000,3500
    Each distinct value with the ROWIDs that hold it:
    1:001;2:002;3:003;4:004;5:005;6:006;
    Doe:001;Smith:002,004;Taylor:003;Doak:005;Brown:006;
    John:001,003;Jane:002;Mike:004;Richard:005;Dan:006;
    3000:001;3500:002,006;2800:003;2500:004;4000:005;
  • 47. RDBMS Vs Columnar stores
    Source table:
    ROWID  ID  Last    First    Bonus
    001    1   Doe     John     3000
    002    2   Smith   Jane     3500
    006    6   Brown   Dan      3500
    005    5   Doak    Richard  4000
    004    4   Smith   Mike     2500
    003    3   Taylor  John     2800
    RDBMS or ROW format storage:
    • 001,1,Doe,John,3000;
    • 002,2,Smith,Jane,3500;
    • 003,3,Taylor,John,2800;
    • 004,4,Smith,Mike,2500;
    • 005,5,Doak,Richard,4000;
    • 006,6,Brown,Dan,3500
    Columnar format storage:
    1:001;2:002;3:003;4:004;5:005;6:006;
    Doe:001;Smith:002,004;Taylor:003;Doak:005;Brown:006;
    John:001,003;Jane:002;Mike:004;Richard:005;Dan:006;
    3000:001;3500:002,006;2800:003;2500:004;4000:005;
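The conversion from the row format to the columnar format shown on this slide can be reproduced with a short sketch. The function name `to_columnar` and the column labels are hypothetical; the data is the slide's own example table.

```python
# Row-format storage: (ROWID, ID, Last, First, Bonus), as on the slide.
rows = [
    ("001", 1, "Doe", "John", 3000),
    ("002", 2, "Smith", "Jane", 3500),
    ("003", 3, "Taylor", "John", 2800),
    ("004", 4, "Smith", "Mike", 2500),
    ("005", 5, "Doak", "Richard", 4000),
    ("006", 6, "Brown", "Dan", 3500),
]

COLUMNS = ("id", "last", "first", "bonus")


def to_columnar(table):
    """Build one value -> [ROWIDs] mapping per column."""
    store = {name: {} for name in COLUMNS}
    for rowid, *values in table:
        for name, value in zip(COLUMNS, values):
            # A repeated value is stored once, with every ROWID that holds it.
            store[name].setdefault(value, []).append(rowid)
    return store


columnar = to_columnar(rows)
for name in COLUMNS:
    print(";".join(f"{value}:{','.join(ids)}" for value, ids in columnar[name].items()))
```

Printing each column gives the same listing as the slide, for example `Smith:002,004` in the Last column and `3500:002,006` in the Bonus column, which is where the duplicate elimination comes from.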
  • 48. Pros and Cons of Columnar Database • Pros – This model is very useful and efficient when an aggregate needs to be computed over many rows but only for a small subset of the columns. – It is efficient when new values of a column are supplied for all rows at once. – High compression reduces storage requirements and disk reads. • Cons – If many columns of a single row or of multiple rows have to be queried or fetched, this model may be less efficient, although it can still outperform RDBMS. – If an entire row has to be updated or replaced, the operation will take some time.
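The trade-off above can be made concrete with a small positional column store. This is a sketch under assumptions: the layout (one Python list per column) and the helper names are hypothetical, and the values come from the deck's example table.

```python
# One list per column; position i in every list belongs to the same row.
columns = {
    "id": [1, 2, 3, 4, 5, 6],
    "last": ["Doe", "Smith", "Taylor", "Smith", "Doak", "Brown"],
    "first": ["John", "Jane", "John", "Mike", "Richard", "Dan"],
    "bonus": [3000, 3500, 2800, 2500, 4000, 3500],
}


def total(column_name):
    """Pro: an aggregate reads a single column and ignores the rest."""
    return sum(columns[column_name])


def replace_column(column_name, new_values):
    """Pro: new values for one column, for all rows, are one bulk write."""
    if len(new_values) != len(columns["id"]):
        raise ValueError("one value per row is required")
    columns[column_name] = list(new_values)


def fetch_row(position):
    """Con: rebuilding a whole row has to touch every column."""
    return {name: values[position] for name, values in columns.items()}


print(total("bonus"))
print(fetch_row(1))
```

`total` reads one list regardless of how many columns the table has, while `fetch_row` must visit all of them, which mirrors the pros and cons listed on the slide.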
  • 49. Key-Value Database Store • This is a method for storing, retrieving and managing arrays of data where metadata is defined for each value in the array. • The store consists of a collection of objects or records of similar type but with different fields. • Each record may differ from the others. • This differs from RDBMS, where each record follows a pre-defined model of key-values. • Document-based and graph-based models are derived from this model. • It aligns more closely with modern concepts like Object Oriented Programming (OOP). • The most popular databases in this format are: – Redis, Oracle NoSQL DB, Berkeley DB, DynamoDB
  • 50. Key-Value Database Store – Storage • An XML format (or JSON format) such as the following represents the data storage in a Key-Value store:
    <contact>
      <firstname>Bhaskar</firstname>
      <lastname>Gunda</lastname>
      <street1>605 Seward Ave. NW</street1>
      <city>Grand Rapids</city>
      <state>MI</state>
      <zip>49504</zip>
      <country>USA</country>
    </contact>
    – This record is of type Contact/Address. – Each field has metadata (a key) defining the value.
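The access pattern of a key-value store (put, get and delete by key) can be sketched as below. The `KeyValueStore` class and the `contact:1` / `contact:2` key naming are hypothetical and do not reflect the API of Redis, DynamoDB or any other product; the second record is invented to show that records may carry different fields.

```python
class KeyValueStore:
    """Toy key-value store: opaque values looked up by a unique key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()

# The contact record from the slide; each field name is the key for its value.
store.put("contact:1", {
    "firstname": "Bhaskar",
    "lastname": "Gunda",
    "street1": "605 Seward Ave. NW",
    "city": "Grand Rapids",
    "state": "MI",
    "zip": "49504",
    "country": "USA",
})

# A second, made-up record of the same type with a different set of fields.
store.put("contact:2", {"firstname": "Jane", "lastname": "Smith", "phone": "555-0100"})

print(store.get("contact:1")["city"])
print(sorted(store.get("contact:2")))
```

The store never inspects the value, so nothing forces the two contacts to share the same fields.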
  • 51. Document Database Store • This is another popular method of storing data. In fact, adoption of NoSQL has increased because of this model. • It is designed for storing, retrieving and managing document-oriented, semi-structured data. • This model is a subset of the Key-Value store but differs from it by not having the keys pre-defined. • Metadata is generated for each document separately. • The data is stored in a free form. • This differs from RDBMS, where a fixed record structure is created for acquiring and storing the data. • Programmers build the intelligence for parsing the data. • Each document is a record of its own and every record may differ from the others. The records are of the same type but do not necessarily have the same number of fields.
  • 52. Document Data Store – Contd. • Each document is retrieved using a unique key, usually a URI. • The database retains an index on the keys to speed up retrieval. • This makes this type of database popular in Web applications. • Free-form data storage and automatic suggestions of data are the primary applications of this data store. • For retrieval, the admin adds hints to the database to look for certain types of information. • Any document data containing metadata, such as JSON or XML, can be stored in this store. • The most popular databases are: – Couchbase Server – CouchDB – MongoDB – Elasticsearch
  • 53. Document Data Store – Storage
    Document 1: Bhaskar Gunda, OST, 605 Seward Ave NW, Grand Rapids, MI 49504
    Document 2: Bhaskar Gunda, 605 Seward Ave, Grand Rapids, MI 49504 (this document does not contain the Company Name that the first document has)
    Document 3: Bhaskar Gunda, OST, PO.Box. 456, 605 Seward Ave NW, Grand Rapids, MI 49504 (this document contains an additional PO Box field compared with the first document)
    • Each of the above represents one document (shown as a box on the original slide). • All three are of the same type: Address type documents. • But they differ in content and number of fields. • Each of these documents is stored with unique values, and the metadata is generated for each document. • The programmer writes hints such as “find all my <contact>s with a <zip code>”.
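The three address documents and a hint such as "find all my contacts with a zip code" can be sketched as follows. The field names, the `contacts/N` keys and the `find` helper are assumptions made for illustration; they are not the query API of MongoDB, Couchbase or any other product.

```python
# Each document has its own set of fields; the key plays the role of a URI.
documents = {
    "contacts/1": {
        "name": "Bhaskar Gunda", "company": "OST",
        "street": "605 Seward Ave NW", "city": "Grand Rapids",
        "state": "MI", "zip": "49504",
    },
    "contacts/2": {  # no company field
        "name": "Bhaskar Gunda",
        "street": "605 Seward Ave", "city": "Grand Rapids",
        "state": "MI", "zip": "49504",
    },
    "contacts/3": {  # additional po_box field
        "name": "Bhaskar Gunda", "company": "OST", "po_box": "456",
        "street": "605 Seward Ave NW", "city": "Grand Rapids",
        "state": "MI", "zip": "49504",
    },
}


def get(uri):
    """Retrieve one document directly by its unique key."""
    return documents[uri]


def find(**criteria):
    """Return the keys of documents whose fields match every criterion."""
    return sorted(
        uri for uri, doc in documents.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )


print(find(zip="49504"))
print(find(company="OST"))
```

A document that lacks a field simply does not match a query on that field; no schema change is needed to add the PO Box to the third document.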
  • 54. Document Database Store – Applications • This type of data store is more popular in Web applications. • Largely used for semi-structured data. • Implementations offer a variety of ways of organizing documents, including notions of: – Collections – Tags – Non-visible Metadata – Directory hierarchies – Buckets
  • 55. GRAPH DATABASE STORE • This model uses a graph compute model consisting of Nodes & Relationships. – Each Node is an entity: a person, place, thing or activity. – Each Relationship describes how two Nodes are connected to each other. • A graph database is a DBMS that stores, retrieves and manipulates data in a graph data model. • Relationships take first priority in this model: applications don’t have to infer data connections using foreign keys. This is the difference between RDBMS and this model. • It is simpler and more expressive than other models. • This model is most useful for traversing relationships, as in social networks. • Graph databases can be OLTP databases and can be fully ACID compliant. • Some graph databases implement a Key-Value store internally for building the relationships (pointers) between records. • The most popular databases are Neo4j and Giraph.
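Nodes, relationships and a traversal can be sketched with plain Python structures. This is an illustration only: the `Graph` class, the relationship names and the people are hypothetical, and real graph databases expose their own query languages rather than this interface.

```python
from collections import deque


class Graph:
    """Toy graph store: nodes with properties, relationships as first-class links."""

    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.edges = {}   # node id -> list of (relationship, target id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def relate(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def reachable(self, start, relationship, max_hops):
        """Breadth-first traversal along one relationship type."""
        seen = {start}
        frontier = deque([(start, 0)])
        found = []
        while frontier:
            node, hops = frontier.popleft()
            if hops == max_hops:
                continue
            for rel, target in self.edges[node]:
                if rel == relationship and target not in seen:
                    seen.add(target)
                    found.append(target)
                    frontier.append((target, hops + 1))
        return found


graph = Graph()
for person in ["alice", "bob", "carol", "dave"]:
    graph.add_node(person, kind="person")
graph.relate("alice", "FOLLOWS", "bob")
graph.relate("bob", "FOLLOWS", "carol")
graph.relate("carol", "FOLLOWS", "dave")
graph.relate("alice", "LIKES", "dave")

print(graph.reachable("alice", "FOLLOWS", max_hops=2))
```

The traversal follows stored links directly from node to node, which is the contrast with inferring connections through foreign-key joins.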
  • 56. Graph Database Store – Storage [figure]
  • 57. Multi-model Database Store • Each of the database types (columnar, key-value, document, graph) is organized around a single data model that determines how data is stored, retrieved and manipulated. • If an organization needs two different applications, each optimized by a different data model, it has to implement a separate database model for each type of application (called Polyglot Persistence), which defeats the purpose of using a NoSQL database. • This is resolved by combining different models in one database. • This offers the advantages of polyglot persistence in a single database. • This model is also ACID compliant. • One of the first and most widely used databases is OrientDB (supporting graph, document, key-value & object models). • Another popular database is Couchbase Server.
  • 58. Selecting a NoSQL Database • Selecting which model of database is suitable largely depends upon the intended business use of the data. • Key factors to be considered are: – Model of the database store as required by the business need – Scalability – ACID compliance required – Sharding capability – Ability to utilize in-memory transactions or not – Data ingestion, extraction and visualization support – Support for the Hadoop ecosystem – Cost to support
  • 59. NoSQL Database Challenges • NoSQL databases are mostly used for ad-hoc queries and predictive analytics, and recently increasingly in DW and BI applications. They are not intended for OLTP or for supporting mainstream applications such as ERPs. • Security is one of the concerns in these models. However, vendor-provided NoSQL databases are, to a certain extent, implementing rigid security models. • Selecting the right model to suit the business need requires in-depth analysis and understanding of each of the models; this requires a highly skilled resource (usually an outside resource) to identify the right type. • Risk in selection can be mitigated by conducting a POC on the shortlisted models. Usually the cloud can be used for this purpose.
  • 61. Future of RDBMS • With all this discussion we may feel that RDBMS is going to die. • Is RDBMS really going to die? – Not in reality. RDBMS enforces certain requirements, such as ACID compliance, a general model and a mature state of data storage, which are all required for mainstream applications. – Many applications, such as ERPs and transactional systems, are designed for RDBMS. – For all OLTP, RDBMS remains the database of choice. • In reality RDBMS and NoSQL databases will co-exist for many years to come in any organization. But some NoSQL databases are also closing the gap between RDBMS and NoSQL, making the NoSQL database an RDBMS as well. • It would be a very expensive proposition for any organization to replace RDBMS for its business operations. • However, it becomes easier, cheaper and most beneficial to replace RDBMS with NoSQL databases for applications like Data Warehouse, BI or any new analytics platform.
  • 62. References • To make this presentation more concise and precise, some of the information is taken from other presentations. I could not find the references to the authors of those presentations; however, I would like to thank them for making the material available for reference.
  • 63. My Info • Bhaskar Gunda, Principal Consultant, OST • Phone: 616-574-3504 • Email: bgunda@ostusa.com