NoSQL (Not Only SQL)

Dr. Pouria Amirian
June 2014
Dr. Pouria Amirian
Big Data Project Manager and Data Scientist
University of Oxford
Pouria.Amirian@ndm.ox.ac.uk; Pouria.Amirian@gmail.com
@pouriaamirian

 “By 2015, 4.4 million IT jobs globally will be created to support
Big Data.
 But there is a challenge. There is not enough talent in the
industry. Our public and private education systems are failing us.
Therefore only one-third of the IT jobs will be filled. These jobs are
the future of the new information economy.”
 Three major areas of demand in Computer Science and IT:
 Big Data, Mobile and Social Computing
(the foundation of these three topics is Cloud Computing)

 SQL
 Advantages and Disadvantages
 NoSQL
 History
 Common Traits
 Categories
 Examples
 Trends

[Figure: a relational table with its rows, columns, and key labeled]
 Keys
 Single/Multi-column Key
 Operations on tables:
 select, join (SQL)
 Relationship on key
 Primary Key
 Foreign Key

 Proven and well known, with available talent
 Many programmers are already familiar with it.
 Transactions and ACID make development easy.
 Lots of tools to use.
 Scalable
 Free and Commercial production support
 SQL (general and high-level query language)

 Create a database for posts of a weblog
 Each post is authored by a user
 Each post can have multiple comments from other
users
 Users can vote for a post (stars 0-5)
 Users can like comments
 Posts have date, comments have date

How Can I Cast an object to an Interface in C#?
I have to work with COM-based system and the only way to
work with the system is to work with interfaces. the problem
is when I worked in VB 6.0 the compiler could automatically
cast any object to an interface. However since C# is more
type-safe it is not provided automatically. So how can I
convert an Obj to an Interface in C#?
Joe “2011-07-26”
Tags: C#, Cast, Interface
James “2011-07-26”
use the cast operator of C#
Ana, “11-07-27”
you can use the ‘as’ keyword, look at the following code:
Iinterface myInterface= myObj as Iinterface

What are the posts by “Joe”? How many stars did they get?
What are the comments written by “James”?
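
To make the two questions above concrete, here is a minimal relational sketch using Python's built-in sqlite3 module. The table and column names are assumptions made for illustration (the slides do not give a schema), and likes on comments are left out for brevity. It is meant only to show that each question needs joins across several tables.

```python
import sqlite3

# Hypothetical relational schema for the weblog; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE posts    (id INTEGER PRIMARY KEY, title TEXT NOT NULL, posted TEXT NOT NULL,
                       author_id INTEGER NOT NULL REFERENCES users(id));
CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT NOT NULL, posted TEXT NOT NULL,
                       post_id INTEGER NOT NULL REFERENCES posts(id),
                       author_id INTEGER NOT NULL REFERENCES users(id));
CREATE TABLE votes    (post_id INTEGER NOT NULL REFERENCES posts(id),
                       voter_id INTEGER NOT NULL REFERENCES users(id),
                       stars INTEGER NOT NULL CHECK (stars BETWEEN 0 AND 5),
                       PRIMARY KEY (post_id, voter_id));
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "Joe"), (2, "James"), (3, "Ana"), (4, "John")])
conn.execute("INSERT INTO posts VALUES (?, ?, ?, ?)",
             (1, "How can I cast an object to an interface in C#?", "2011-07-26", 1))
conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?, ?)",
                 [(1, "use the cast operator of C#", "2011-07-26", 1, 2),
                  (2, "you can use the 'as' keyword", "2011-07-27", 1, 3)])
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)", [(1, 2, 4), (1, 4, 5)])
conn.commit()

# Posts by Joe with their total stars: one join to users, one to votes.
posts_by_joe = conn.execute("""
    SELECT p.title, COALESCE(SUM(v.stars), 0)
    FROM posts p
    JOIN users u ON u.id = p.author_id
    LEFT JOIN votes v ON v.post_id = p.id
    WHERE u.name = 'Joe'
    GROUP BY p.id
""").fetchall()

# Comments written by James: a join from comments to users.
comments_by_james = conn.execute("""
    SELECT c.body
    FROM comments c
    JOIN users u ON u.id = c.author_id
    WHERE u.name = 'James'
""").fetchall()
```

With the sample rows above, `posts_by_joe` holds the single post with a star total of 9, and `comments_by_james` holds the one comment by James.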

{
  "_id" : ObjectId("4e2e3f92268cdda473b628f6"),
  "title" : "How can I cast an Object to an Interface in C#?",
  "when" : new Date("2011-07-26"),
  "author" : "joe",
  "text" : "I have to work with COM-based system and the only way to work with the system is to work with interfaces. the problem is ….",
  "tags" : ["C#", "Cast", "Interface"],
  "voters" : [["James", "2011-07-26", 4], ["John", "2011-07-26", 5]],
  "comments" : [
    {"by" : "James", "text" : "use the cast operator of C#", "when" : "2011-07-26"},
    {"by" : "Ana", "text" : "you can use the 'as' keyword …", "when" : "2011-07-27"}]
}

db.posts.find({"author" : "joe"}).sort()
db.posts.find({"comments.by" : "James"})
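
As a rough illustration of what the two queries above do, the sketch below models the post as a plain Python dict and filters a list of such dicts. It is an in-memory analogue rather than a MongoDB client. The helper names are hypothetical, and it assumes each voter is stored as a [name, date, stars] triple.

```python
# The post as a plain Python dict, assuming voters are [name, date, stars] triples.
post = {
    "title": "How can I cast an Object to an Interface in C#?",
    "when": "2011-07-26",
    "author": "joe",
    "tags": ["C#", "Cast", "Interface"],
    "voters": [["James", "2011-07-26", 4], ["John", "2011-07-26", 5]],
    "comments": [
        {"by": "James", "text": "use the cast operator of C#", "when": "2011-07-26"},
        {"by": "Ana", "text": "you can use the 'as' keyword", "when": "2011-07-27"},
    ],
}
posts = [post]  # stand-in for the db.posts collection


def find_by_author(collection, author):
    """Rough analogue of db.posts.find({"author": author})."""
    return [doc for doc in collection if doc.get("author") == author]


def find_by_commenter(collection, name):
    """Rough analogue of db.posts.find({"comments.by": name}):
    a document matches if any embedded comment was written by `name`."""
    return [doc for doc in collection
            if any(c.get("by") == name for c in doc.get("comments", []))]


def total_stars(doc):
    """Sum the star ratings stored in the embedded voters array."""
    return sum(stars for _voter, _date, stars in doc.get("voters", []))
```

Because the comments and votes are embedded in the post, both questions are answered from one document with no join. For the sample post, `total_stars(post)` is 9.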

 Rigid schema design
 Hard to scale out (very limited horizontal scalability)
 Hard and complex joins across multiple nodes
 Hard to handle data growth (schema change, high
volume of data, high volume of transactions, …)
 Need for an interface for data access (another layer of complexity)
 Impedance mismatches
 Mapping between relational storage and object-based
computing (Object-Relational Mapping does not work very well)

 Relational Databases are no longer one-size-fits-all
 Examples
 Content Management Systems
 Network Data (Social Networking, Location-Based
Application)
 Spatial Data Management Systems
 High frequency of change (huge numbers of reads and
writes)

Data models and the database types that store them:
 Tuples (rows): Relational DBMS (SQL)
 Key/Value Pairs: Key/Value Databases (NoSQL)
 Documents: Document Data Stores (NoSQL)
 Columns: Column-Family Stores (NoSQL)
 Graphs: Graph Databases (NoSQL)

 The needs of modern applications do not always
match what relational databases provide.
 Success stories of Big Data management of
internet giants such as Google, Amazon,
Facebook, LinkedIn, …
 The mentioned companies faced unique
challenges and they developed custom
solutions

 The Google File System, October 2003
 MapReduce, December 2004
 BigTable, November 2006
 …
Google’s massively scalable infrastructure for:
 Google Search Engine
 Google Maps and Google Earth
 Gmail, …

 Open source developers have tried to replicate each
piece of Google’s Technology Stack
 Project Hadoop and its subprojects were born at Yahoo!
Google Infrastructure            Hadoop Universe
Google File System (GFS)         Hadoop Distributed File System (HDFS)
MapReduce                        Hadoop
BigTable                         HBase

 Dynamo: Amazon’s Highly Available Key/Value
Store, 2007
 Then use cases from eBay, Facebook, Netflix,
Yahoo, IBM and …

2004 BigTable (Google)
2007 Dynamo (Amazon)
2008 Cassandra (Facebook)
In 2009 in San Francisco, the name NoSQL was proposed by Eric Evans to
describe the growing non-relational movement
In 1998, Carlo Strozzi used the word “NoSQL” to describe a relational database
that did not expose a SQL interface

 Not based on the relational model
 Flexible Schema
 Supports distributed database architectures
 Provides high scalability, high availability, and fault
tolerance
 Supports very large amounts of sparse data
 Geared toward performance rather than consistency

 Memcached – Key/value store.
 Membase – Memcached with persistence and
improved consistent hashing.
 AppFabric Cache – Multi-region cache.
 Redis – Data structure server.
 Riak – Based on Amazon’s Dynamo.
 Project Voldemort – eventually consistent key/value
store, auto scaling.
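
Consistent hashing, mentioned above, is a common way for distributed key/value stores to spread keys over nodes so that adding a node relocates only a fraction of the keys. The following is a toy sketch of the general idea, not the implementation used by any product in the list. The class name, node names and the number of virtual points per node are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring; names and defaults are illustrative only."""

    def __init__(self, nodes, points_per_node=100):
        # Each node is placed on the ring at many pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"session:{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changed when node-d joined the cluster.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

Adding `node-d` only adds points to the ring, so every key in `moved` now belongs to `node-d` and all other keys stay where they were. A plain `hash(key) % number_of_nodes` scheme would instead reshuffle most keys.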

 Schema Free.
 Usually JSON like interchange model.
 Query Model: JavaScript or custom.
 Aggregations: Map/Reduce.
 Indexes are done via B-Trees.

{
  "_id" : ObjectId("4e2e3f92268cdda473b628f6"),
  "title" : "How can I cast an Object to an Interface in C#?",
  "when" : new Date("2011-07-26"),
  "author" : "joe",
  "text" : "I have to work with COM-based system and the only way to work with the system is to work with interfaces. the problem is ….",
  "tags" : ["C#", "Cast", "Interface"],
  "voters" : [["James", "2011-07-26", 4], ["John", "2011-07-26", 5]],
  "comments" : [
    {"by" : "James", "text" : "use the cast operator of C#", "when" : "2011-07-26"},
    {"by" : "Ana", "text" : "you can use the 'as' keyword …", "when" : "2011-07-27"}]
}
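
The Map/Reduce aggregation model listed among the document-store traits can be sketched in a few lines. The example below counts posts per tag over an in-memory list of documents. The second post and all function names are made up for illustration, and a real document database would run the map and reduce steps on the server, typically across many nodes.

```python
from collections import defaultdict
from functools import reduce

posts = [
    {"title": "How can I cast an Object to an Interface in C#?",
     "tags": ["C#", "Cast", "Interface"]},
    {"title": "Hypothetical second post", "tags": ["C#", "Generics"]},
]


def map_tags(doc):
    """Map step: emit a (tag, 1) pair for every tag in a document."""
    return [(tag, 1) for tag in doc.get("tags", [])]


def reduce_counts(values):
    """Reduce step: combine all values emitted for one key."""
    return reduce(lambda a, b: a + b, values, 0)


def map_reduce(collection, mapper, reducer):
    grouped = defaultdict(list)  # shuffle: group emitted values by key
    for doc in collection:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


tag_counts = map_reduce(posts, map_tags, reduce_counts)
```

Here `tag_counts` maps "C#" to 2 and each of the other tags to 1.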

Row oriented (Relational)
Id  Username  Email          Department
1   John      john@foo.com   Sales
2   Mary      mary@foo.com   Marketing
3   Yoda      yoda@foo.com   IT
Column oriented
Id:          1, 2, 3
Username:    John, Mary, Yoda
Email:       john@foo.com, mary@foo.com, yoda@foo.com
Department:  Sales, Marketing, IT
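
The difference between the two layouts can be shown with ordinary Python structures. This is only a sketch of the storage idea, using the sample users from the slide. Real column stores add compression, column families and on-disk formats that are not modeled here.

```python
# Row-oriented layout: one record per user, all attributes stored together.
rows = [
    {"id": 1, "username": "John", "email": "john@foo.com", "department": "Sales"},
    {"id": 2, "username": "Mary", "email": "mary@foo.com", "department": "Marketing"},
    {"id": 3, "username": "Yoda", "email": "yoda@foo.com", "department": "IT"},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-oriented layout: one list per attribute, values stored together.
columns = to_columns(rows)

# Reading one attribute for every user touches a single list in the column
# layout, but every record in the row layout.
departments_from_columns = columns["department"]
departments_from_rows = [record["department"] for record in rows]
```

Both reads return the same values. The column layout tends to suit scans and aggregates over a few attributes, while the row layout tends to suit fetching or updating whole records.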

 Based on graph theory.
 Scale vertically, no clustering.
 You can use graph algorithms easily.

 Relational Model
Social Network
 Who are Bob’s friends?

 Find all friends of Alice’s friends

 In a sample social network containing 1,000,000 nodes
(people), each with approximately 50 edges
(relationships)
Depth  RDBMS (s)   Graph (s)  Returned Records
2      0.016       0.01       ~2500
3      30.267      0.168      ~110,000
4      1543.505    1.359      ~600,000
5      Unfinished  2.132      ~800,000
Times are in seconds.
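
The kind of query measured above (friends of friends, down to a given depth) is a breadth-first traversal when the data is held as a graph. The sketch below uses a tiny, made-up friendship graph stored as an adjacency list. It illustrates the traversal only and makes no claim about how any graph database implements it.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
friends = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "Dave"},
    "Carol": {"Alice", "Dave", "Eve"},
    "Dave": {"Bob", "Carol"},
    "Eve": {"Carol"},
}


def people_within(graph, start, depth):
    """Breadth-first search: everyone reachable from `start` in at most
    `depth` hops, excluding `start` itself."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, dist + 1))
    return seen - {start}


def friends_of_friends(graph, start):
    """People exactly two hops away: within depth 2 but not direct friends."""
    return people_within(graph, start, 2) - people_within(graph, start, 1)
```

For this graph, `friends_of_friends(friends, "Alice")` gives Dave and Eve. Each extra level of depth follows more edges from the nodes already reached, whereas a relational design typically needs one more self-join of a friendship table per level.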

1- Non-relational
 No Tables
 No Joins
 No ACID Transaction *
 No support for SQL *
 *: a few NoSQL databases support ACID and SQL

2- Schema Free
 In a data collection:
 There can be records with completely different data
items (fields)
▪ Book 1 {name, publicationYear}
▪ Book 2 {author, publisher}
 The schema is in:
 the data itself (e.g. JSON), or
 usually the application, not the database
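
A small sketch of what "schema in the application" can look like, using the two books above. The field names follow the slide, while the values and helper functions are invented for illustration.

```python
# Two records in the same collection with completely different fields.
books = [
    {"name": "Book 1", "publicationYear": 2014},
    {"author": "A. Writer", "publisher": "Example Press"},
]


def describe(book):
    """The application, not the database, decides how to treat missing fields."""
    name = book.get("name", "(untitled)")
    author = book.get("author", "unknown author")
    year = book.get("publicationYear", "unknown year")
    return f"{name} by {author}, {year}"


def fields_in_use(collection):
    """The effective schema is the union of the fields found in the data."""
    return sorted({field for record in collection for field in record})
```

Neither record is rejected for lacking the other's fields, and `fields_in_use(books)` lists all four field names. The cost is that every reader of the collection has to handle absent fields itself.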

3- Horizontal Scalability
 Vertical (Scale up)
 Horizontal (Scale out)

4- Web Scale Applications:
 Simple requests (underlying database seems to be
unsophisticated)
 However:
 Sheer volume of data
 Huge number of users (millions of users)

5- Open Source but from large internet companies:
 Google
 Facebook
 Twitter
 LinkedIn
 Yahoo

Volume
• Huge amounts of data collected and generated by organizations or
individuals
• Need for huge amounts of storage and processing power
Velocity
• Frequency at which data is generated, captured, shared and processed
• Need for real-time retrieval and processing of data for large numbers of users
Variety
• Many formats, structures and sources
• Need for new types of storage and processing for structured and
unstructured data

 many different types of tools, techniques,
technologies, algorithms and computation models for
collection, generation, storage, management, analysis
and visualization of high-volume (of size), high-velocity
(of change) and high-variety (in nature) data sets.

 Management
 Processing

 Also known as Brewer’s Theorem, after Prof. Eric Brewer, who
presented it in 2000 (University of California, Berkeley).
 “Of three properties of a shared data system: data
consistency, system availability and tolerance to
network partitions, only two can be achieved at any
given moment.”
 Proven by Seth Gilbert and Nancy Lynch (MIT) in 2002.
 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

 Consistency: All clients have same view of data
 Availability: Each client can always read and write
data
 Partition tolerance: the system works well despite
physical network partitions
 The “CAP theorem” says a database may only excel at
two of the CAP attributes

 ACID (Atomicity, Consistency, Isolation, Durability)
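try {
    Transaction.Begin();      // start the unit of work
    insert(data1);
    update(data2);
    insert(data3);
    delete(data4);
    Transaction.Commit();     // make all four changes permanent together
}
catch (Exception) {
    Transaction.Rollback();   // undo everything if any step failed
}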
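
A runnable counterpart to the pseudocode above, as a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the amounts are assumptions chosen only to show the all-or-nothing behaviour of a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed; nothing was changed


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer of 30 from alice to bob succeeds and leaves 70 and 80. A transfer of 500 fails on the second update, and the rollback also undoes the first update, so the balances stay at 70 and 80.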

 Atomicity: All or nothing.
 Consistency: Consistent state of data
 Isolation: Transactions are isolated from each other.
 Durability: When the transaction is committed, state
will be durable.
Any data store can achieve Atomicity, Isolation and
Durability, but do you always need consistency? No.
By giving up ACID properties, one can achieve higher
performance and scalability.

 CAP in SQL databases >> CA (not distributed), CP (distributed,
but not always available)
 ACID is guaranteed
 DBMS keeps users waiting (in order to propagate all
the changes to all nodes)

 CAP in NoSQL databases >> AP, CP
 DBMS will guarantee the consistency eventually but
meanwhile the DBMS gives control back to the application
(no waiting for users)
 The NoSQL database doesn’t commit the changes
right away (buffers)
 The data will be eventually consistent

 Acronym contrived to be the opposite of ACID
 Basically Available,
 Soft state,
 Eventually Consistent

 Basically Available
 possibilities of faults but not a fault of the whole system
 Soft state
 copies of a data item may be inconsistent
 Eventual Consistency
 When no updates occur for a long period of time, eventually all
updates will propagate through the system and all the nodes will
be consistent
 Copies become consistent at some later time if there are no
more updates to that data item
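
Soft state and eventual consistency can be illustrated with a toy in-memory cluster. The sketch makes several simplifying assumptions that real systems do not: a single global counter stands in for version numbers, conflicts are resolved by last-writer-wins, and propagation is one explicit step. All class and method names are hypothetical.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        """Last-writer-wins: keep the update only if it is newer."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class Cluster:
    """Toy cluster: one replica acknowledges a write, the rest learn of it later."""

    def __init__(self, size):
        self.replicas = [Replica() for _ in range(size)]
        self.pending = []  # undelivered updates: (replica_index, key, version, value)
        self.clock = 0     # simplification: one global version counter

    def write(self, replica_index, key, value):
        self.clock += 1
        self.replicas[replica_index].apply(key, self.clock, value)
        for i in range(len(self.replicas)):
            if i != replica_index:
                self.pending.append((i, key, self.clock, value))

    def propagate(self):
        """Deliver every queued update to its target replica."""
        for i, key, version, value in self.pending:
            self.replicas[i].apply(key, version, value)
        self.pending.clear()

    def reads(self, key):
        return [replica.read(key) for replica in self.replicas]


cluster = Cluster(3)
cluster.write(0, "cart", ["book"])
stale = cluster.reads("cart")      # soft state: replicas disagree
cluster.propagate()
converged = cluster.reads("cart")  # no further updates, so replicas agree
```

Right after the write, only the first replica returns the new cart and the other two return nothing. Once the pending updates are delivered and no new writes arrive, all three replicas return the same value.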

ACID:
• Strong consistency.
• Less availability.
• Pessimistic concurrency.
• Complex.
BASE:
• Availability is the most important thing. Willing to
sacrifice consistency for this (CAP).
• Weaker consistency (Eventual).
• Simple and fast.
• Optimistic concurrency.

 Massive write performance
 Fast key value look ups
 No single point of failure
 Fast prototyping and development
 Out of the box scalability (Horizontally Scalable)
 Easy maintenance

 Simple APIs
 C# Example: db.collection.save(myDocument);
 Seamless language integration
 No impedance mismatch (look at the above C#
example)
 Designed to be horizontally scalable (elastic)
 Flexible data model and schema
 Majority free and/or Open Source

 There are more than 140 NoSQL Products
 Many are not proven
 Lack of SQL (the biggest missed feature)
 Proprietary Query Languages
 Lack of Skilled people
 Do you know a DBA for MarkLogic?
 Lack of tools for modeling, documenting, reporting, …
(usually there are no good visual tools)
 Lack of Standards (It is the biggest threat)

[Figure: an e-commerce application (web/application server) storing shopping cart data, orders and session data in a single SQL DB]

[Figure: the same application with orders in the SQL DB, shopping cart data in one key/value DB and session data in another key/value DB]

[Figure: the same application with a graph DB added for the customer social graph]

 It is not necessary for the application to use a single
data store for all of its needs, since different databases
are built for different purposes and not all problems
can be elegantly solved by a single database.
 Using Different Data Storage Technologies for
Varying Data Storage Needs
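
The idea of using several stores in one application can be sketched as follows. Plain dicts stand in for the key/value and graph databases, and an in-memory SQLite database plays the relational store for orders. The class, its methods and the sample data are all hypothetical and only illustrate routing each kind of data to a store that suits it.

```python
import sqlite3


class Shop:
    """Toy e-commerce back end that keeps each kind of data in a different store."""

    def __init__(self):
        self.sessions = {}  # key/value store stand-in: session id -> customer
        self.carts = {}     # key/value store stand-in: customer -> items
        self.social = {}    # graph store stand-in: customer -> set of friends
        self.orders = sqlite3.connect(":memory:")  # relational store for orders
        self.orders.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, item TEXT)")

    def start_session(self, session_id, customer):
        self.sessions[session_id] = customer

    def add_to_cart(self, customer, item):
        self.carts.setdefault(customer, []).append(item)

    def checkout(self, customer):
        """Move the cart into the relational store inside one transaction."""
        items = self.carts.pop(customer, [])
        with self.orders:
            self.orders.executemany(
                "INSERT INTO orders (customer, item) VALUES (?, ?)",
                [(customer, item) for item in items])
        return len(items)

    def befriend(self, a, b):
        self.social.setdefault(a, set()).add(b)
        self.social.setdefault(b, set()).add(a)

    def bought_by_friends(self, customer):
        """Combine the graph store (who) with the relational store (what)."""
        friends = sorted(self.social.get(customer, set()))
        if not friends:
            return []
        marks = ",".join("?" for _ in friends)
        rows = self.orders.execute(
            f"SELECT DISTINCT item FROM orders "
            f"WHERE customer IN ({marks}) ORDER BY item", friends)
        return [item for (item,) in rows]
```

Short-lived, high-churn data (sessions and carts) stays in the key/value stand-ins, orders go through a transaction in the relational store, and the recommendation-style query combines the social graph with the order history.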

 Key-value stores:
 Processing a constant stream of small reads and writes.
 Document databases:
 Natural data modeling. Programmer friendly. Rapid
development. Web friendly, CRUD.
 RDBMS:
 OLTP. SQL. Transactions. Relations.
 Columnar:
 Handles size well. Massive write loads. High availability.
Multiple-data centers, MapReduce.
 Graph:
 Graph algorithms and relations.
