NoSQL (Not Only SQL)
Dr. Pouria Amirian
June 2014
Big Data Project Manager and Data Scientist
University of Oxford
Pouria.Amirian@ndm.ox.ac.uk; Pouria.Amirian@gmail.com
@pouriaamirian
 “By 2015, 4.4 million IT jobs globally will be created to support
Big Data.
 But there is a challenge. There is not enough talent in the
industry. Our public and private education systems are failing us.
Therefore only one-third of the IT jobs will be filled. These jobs are
the future of the new information economy.”
 Three major areas of demand in Computer Science and IT:
 Big Data, Mobile, and Social Computing
(the foundation of these three topics is Cloud Computing)
 SQL
 Advantages and Disadvantages
 NoSQL
 History
 Common Traits
 Categories
 Examples
 Trends
Relational Databases
[Figure: a table with its rows, columns, and key labeled]
 Keys
 Single/Multi-column Key
 Operations on tables:
 select, join (SQL)
 Relationship on key
 Primary Key
 Foreign Key
 Proven and well known, with talent readily available
 Many programmers are already familiar with it.
 Transactions and ACID make development easy.
 Lots of tools to use.
 Scalable
 Free and commercial production support
 SQL (general and high-level query language)
 Create a database for posts of a weblog
 Each post is authored by a user
 Each post can have multiple comments from other
users
 Users can vote for a post (stars 0-5)
 Users can like comments
 Posts have date, comments have date
How can I cast an object to an interface in C#?
I have to work with a COM-based system, and the only way to
work with the system is through interfaces. The problem
is that when I worked in VB 6.0 the compiler could automatically
cast any object to an interface. However, since C# is more
type-safe, this is not done automatically. So how can I
convert an object to an interface in C#?
Joe “2011-07-26”
Tags: C#, Cast, Interface
James “2011-07-26”
use the cast operator of C#
Ana “2011-07-27”
you can use the ‘as’ keyword, look at the following code:
Iinterface myInterface = myObj as Iinterface;
What are the posts by “Joe”? How many stars did they get?
What are the comments written by “James”?
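The two questions above can be answered relationally with joins. The following is a minimal sketch, not the author's schema: it uses Python's built-in sqlite3 module as a stand-in for a relational DBMS, and every table and column name (users, posts, comments, votes, and so on) is an assumption made for illustration. Likes on comments are left out to keep it short.

```python
import sqlite3

TITLE = "How can I cast an object to an interface in C#?"

# Hypothetical normalized schema for the weblog example (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE posts (id INTEGER PRIMARY KEY,
                    author_id INTEGER REFERENCES users(id),
                    title TEXT, posted_on TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY,
                       post_id INTEGER REFERENCES posts(id),
                       author_id INTEGER REFERENCES users(id),
                       body TEXT, posted_on TEXT);
CREATE TABLE votes (post_id INTEGER REFERENCES posts(id),
                    voter_id INTEGER REFERENCES users(id),
                    stars INTEGER CHECK (stars BETWEEN 0 AND 5),
                    PRIMARY KEY (post_id, voter_id));
""")

# Sample data taken from the example post: one post, two comments, two votes.
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "Joe"), (2, "James"), (3, "Ana"), (4, "John")])
conn.execute("INSERT INTO posts VALUES (?, ?, ?, ?)", (1, 1, TITLE, "2011-07-26"))
conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 2, "use the cast operator of C#", "2011-07-26"),
                  (2, 1, 3, "you can use the 'as' keyword", "2011-07-27")])
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)", [(1, 2, 4), (1, 4, 5)])

# Posts by Joe with their total stars: two joins and a GROUP BY.
posts_by_joe = conn.execute("""
    SELECT p.title, COALESCE(SUM(v.stars), 0)
    FROM posts p
    JOIN users u ON u.id = p.author_id
    LEFT JOIN votes v ON v.post_id = p.id
    WHERE u.name = 'Joe'
    GROUP BY p.id
""").fetchall()

# Comments written by James: one join.
comments_by_james = conn.execute("""
    SELECT c.body
    FROM comments c
    JOIN users u ON u.id = c.author_id
    WHERE u.name = 'James'
""").fetchall()
```

Even this small example spreads one logical post over four tables, which is the cost the relational design pays for avoiding duplication.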
{
"_id" : ObjectId("4e2e3f92268cdda473b628f6"),
"title" : "How can I cast an Object to an Interface in C#?",
"when" : new Date("2011-07-26"),
"author" : "joe",
"text" : "I have to work with COM-based system and the only way to work with the system is to work with interfaces. the problem is ….",
"tags" : ["C#", "Cast", "Interface"],
"voters" : [["James", "11-07-26", 4], ["John", "11-07-26", 5]],
"comments" : [
{"by" : "James", "text" : "use the cast operator of C#", "when" : "11-07-26"},
{"by" : "Ana", "text" : "you can use the ‘as’ keyword …", "when" : "11-07-27"}]
}
// Posts by "joe", newest first
db.posts.find({"author" : "joe"}).sort({"when" : -1})
// Posts that contain a comment written by James
db.posts.find({"comments.by" : "James"})
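The same two lookups can be mimicked without a database to show why no join is needed. This is a minimal sketch in Python, assuming the post is held as a plain dict and that each `voters` entry is a [name, date, stars] triple; that reading of the `voters` field is an interpretation, not something the slides state.

```python
# One post as a nested document; field names follow the document above.
post = {
    "title": "How can I cast an Object to an Interface in C#?",
    "when": "2011-07-26",
    "author": "joe",
    "tags": ["C#", "Cast", "Interface"],
    # Assumed layout of each voter entry: [name, date, stars]
    "voters": [["James", "11-07-26", 4], ["John", "11-07-26", 5]],
    "comments": [
        {"by": "James", "text": "use the cast operator of C#", "when": "11-07-26"},
        {"by": "Ana", "text": "you can use the 'as' keyword …", "when": "11-07-27"},
    ],
}
posts = [post]  # stand-in for the "posts" collection


def posts_by(author):
    """Posts whose author field matches (like find({"author": ...}))."""
    return [p for p in posts if p["author"] == author]


def total_stars(p):
    """Sum the stars stored inside the post itself."""
    return sum(stars for _name, _date, stars in p["voters"])


def comments_by(name):
    """Comment texts written by `name`, across all posts."""
    return [c["text"] for p in posts for c in p["comments"] if c["by"] == name]
```

Everything needed to answer both questions travels with the post, so each lookup reads a single document.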
 Rigid schema design
 Hard to scale out (very limited horizontal scalability)
 Joins across multiple nodes are hard and complex
 Hard to handle data growth (schema changes, high
volume of data, high volume of transactions, …)
 Need for an interface for data access (another layer of complexity)
 Impedance mismatch
 Mapping between relational storage and object-based
computing (Object Relational Mapping doesn't work very well)
 Relational Databases are no longer one-size-fits-all
 Examples
 Content Management Systems
 Network Data (Social Networking, Location-Based
Applications)
 Spatial Data Management Systems
 High frequency of change (huge numbers of reads and
writes)
 Tuples (rows) → Relational DBMS
 Key/Value Pairs → Key/Value Databases
 Documents → Document Data Stores
 Columns → Column-Family Stores
 Graphs → Graph Databases
SQL
NoSQL
 The needs of modern applications do not always
match what relational databases provide.
 Success stories of Big Data management at
internet giants such as Google, Amazon,
Facebook, LinkedIn, …
 These companies faced unique challenges,
and each developed its own custom
solution
 The Google File System, October 2003
 MapReduce, December 2004
 BigTable, November 2006
 …
Google’s massively scalable infrastructure for:
 Google Search Engine
 Google Maps and Google Earth
 Gmail, …
 Open source developers have tried to replicate each
piece of Google’s technology stack
 Project Hadoop and its subprojects were born at Yahoo!
Google Infrastructure       Hadoop Universe
Google File System (GFS)    Hadoop Distributed File System (HDFS)
MapReduce                   Hadoop MapReduce
BigTable                    HBase
 Dynamo: Amazon’s Highly Available Key/Value
Store, 2007
 Then came use cases from eBay, Facebook, Netflix,
Yahoo, IBM and …
2004 BigTable (Google; the paper followed in 2006)
2007 Dynamo (Amazon)
2008 Cassandra (Facebook)
In 2009, in San Francisco, Eric Evans proposed the name “NoSQL” to
describe the growing non-relational movement.
In 1998, Carlo Strozzi had used the word “NoSQL” to describe a relational database
that did not expose a SQL interface.
 Not based on the relational model
 Flexible Schema
 Supports distributed database architectures
 Provides high scalability, high availability, and fault
tolerance
 Supports very large amounts of sparse data
 Geared toward performance rather than consistency
 Examples
[Figure: a key/value store, with keys K1, K2, K3 mapped to values V1, V2, V2]
 Memcached – Key/value store.
 Membase – Memcached with persistence and
improved consistent hashing.
 AppFabric Cache – Multi-region cache.
 Redis – Data structure server.
 Riak – Based on Amazon’s Dynamo.
 Project Voldemort – Eventually consistent key/value
store, auto scaling.
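Consistent hashing, mentioned above for Membase, is one common way for a distributed key/value store to decide which node owns a key. The following is a minimal sketch of the general technique, assuming hypothetical node names, an MD5-based ring, and 100 virtual points per node; it does not reproduce how Membase or any other listed product implements it.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=100):
        self._replicas = replicas  # virtual points per node, to smooth the load
        self._ring = []            # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key):
        """Owner of `key`: the first point at or after the key's position."""
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        if idx == len(self._ring):  # wrap around the ring
            idx = 0
        return self._ring[idx][1]


# Hypothetical three-node cluster and 1,000 sample keys.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

ring.remove("node-c")  # simulate losing one node
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Only the keys that lived on the removed node change owner, which is what lets such a store add or drop nodes without reshuffling all of its data.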
 Schema free.
 Usually a JSON-like interchange model.
 Query model: JavaScript or custom.
 Aggregations: Map/Reduce.
 Indexes are implemented with B-trees.
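The Map/Reduce style of aggregation listed above can be illustrated in plain Python. This is a minimal sketch, assuming a small in-memory list of post-like documents (the second post is invented for illustration) and counting posts per tag; a real document store would run the map and reduce phases inside the database, often across several nodes.

```python
from collections import defaultdict
from functools import reduce

# Stand-in collection; the second document is hypothetical.
posts = [
    {"title": "How can I cast an Object to an Interface in C#?",
     "tags": ["C#", "Cast", "Interface"]},
    {"title": "Hypothetical second post", "tags": ["C#", "Generics"]},
]


def map_phase(doc):
    """Emit one (key, value) pair per tag of a document."""
    return [(tag, 1) for tag in doc["tags"]]


def shuffle(pairs):
    """Group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Combine all values emitted for one key."""
    return reduce(lambda a, b: a + b, values)


emitted = [pair for doc in posts for pair in map_phase(doc)]
tag_counts = {tag: reduce_phase(values) for tag, values in shuffle(emitted).items()}
```

Because the map step looks at one document at a time and the reduce step at one key at a time, both can be spread over many machines.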
{
"_id" : ObjectId("4e2e3f92268cdda473b628f6"),
"title" : "How can I cast an Object to an Interface in C#?",
"when" : new Date("2011-07-26"),
"author" : "joe",
"text" : "I have to work with COM-based system and the only way to work with the system is to work with interfaces. the problem is ….",
"tags" : ["C#", "Cast", "Interface"],
"voters" : [["James", "11-07-26", 4], ["John", "11-07-26", 5]],
"comments" : [
{"by" : "James", "text" : "use the cast operator of C#", "when" : "11-07-26"},
{"by" : "Ana", "text" : "you can use the ‘as’ keyword …", "when" : "11-07-27"}]
}
Row oriented (Relational)
Id  Username  Email         Department
1   John      john@foo.com  Sales
2   Mary      mary@foo.com  Marketing
3   Yoda      yoda@foo.com  IT
Column oriented
Id          1             2             3
Username    John          Mary          Yoda
Email       john@foo.com  mary@foo.com  yoda@foo.com
Department  Sales         Marketing     IT
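The difference between the two layouts can be shown with the same three users. This is a minimal sketch using plain Python lists and dicts as stand-ins for on-disk storage; the variable and function names are assumptions made for illustration.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"id": 1, "username": "John", "email": "john@foo.com", "department": "Sales"},
    {"id": 2, "username": "Mary", "email": "mary@foo.com", "department": "Marketing"},
    {"id": 3, "username": "Yoda", "email": "yoda@foo.com", "department": "IT"},
]

# Column-oriented: one list per field, aligned by position.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Reading one field for every record touches a single list.
departments = columns["department"]


def row_at(index):
    """Rebuild a full record by taking the same position from every column."""
    return {field: values[index] for field, values in columns.items()}
```

Scanning one attribute across all records is cheap in the column layout, while fetching a whole record is cheap in the row layout.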
 Based on Graph Theory.
 Scale vertically, no clustering.
 You can use graph algorithms easily.
 Relational Model
Social Network
 Who are Bob’s friends?
 Find all friends of Alice’s friends
 In a sample social network containing 1,000,000 nodes
(people), each with approximately 50 edges
(relationships):
Depth  RDBMS       Graph   Returned records
2      0.016       0.01    ~2,500
3      30.267      0.168   ~110,000
4      1543.505    1.359   ~600,000
5      Unfinished  2.132   ~800,000
Times are in seconds.
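The "depth" in the table above is the number of friendship hops followed from a starting person. This is a minimal sketch of such a traversal as a breadth-first search over an in-memory adjacency list; the tiny friendship graph is hypothetical and the code says nothing about how the benchmarked systems work internally.

```python
from collections import deque

# Hypothetical undirected friendship graph as adjacency lists.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Alice", "Dave"],
    "Carol": ["Alice", "Dave", "Erin"],
    "Dave": ["Bob", "Carol", "Frank"],
    "Erin": ["Carol"],
    "Frank": ["Dave"],
}


def within_depth(graph, start, max_depth):
    """People reachable from `start` in at most `max_depth` hops (start excluded)."""
    seen = {start}
    found = set()
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph[person]:
            if friend not in seen:
                seen.add(friend)
                found.add(friend)
                queue.append((friend, depth + 1))
    return found


direct_friends = within_depth(friends, "Alice", 1)
friends_of_friends = within_depth(friends, "Alice", 2)
```

Each extra level only follows edges out of people already reached, whereas a relational design typically needs one more self-join per level.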
1- Non-relational
 No Tables
 No Joins
 No ACID Transactions *
 No support for SQL *
 *: a few NoSQL databases support ACID and SQL
2- Schema Free
 In a data collection:
 There can be records with completely different data
items (fields)
▪ Book 1 {name, publicationYear}
▪ Book 2 {author, publisher}
 The schema is in:
 the data itself (e.g. JSON), or
 usually the application, not the database
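The Book 1 / Book 2 example above can be written out directly. This is a minimal sketch, assuming a Python list as the collection; the field names come from the slide, while the values and the `describe` helper are invented for illustration.

```python
# Two records in one collection that share no fields (values are hypothetical).
books = [
    {"name": "Hypothetical Title", "publicationYear": 2014},
    {"author": "Hypothetical Author", "publisher": "Hypothetical Press"},
]


def describe(book):
    """The application decides which fields it needs and how to handle gaps."""
    name = book.get("name", "(untitled)")
    year = book.get("publicationYear", "year unknown")
    return f"{name}, {year}"


# The only "schema" is whatever fields happen to occur in the data.
fields_seen = sorted({field for book in books for field in book})
descriptions = [describe(book) for book in books]
```

The store accepts both records as they are; checking for missing fields becomes the application's job.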
3- Horizontal Scalability
 Vertical (scale up): add more resources to a single machine
 Horizontal (scale out): add more machines
4- Web Scale Applications:
 Simple requests (underlying database seems to be
unsophisticated)
 However:
 Sheer volume of data
 Huge number of users (millions of users)
5- Open Source but from large internet companies:
 Google
 Facebook
 Twitter
 LinkedIn
 Yahoo
Volume
• Huge amounts of data collected and generated by organizations or
individuals
• Need for huge amounts of storage and processing power
Velocity
• Frequency at which data is generated, captured, shared, and processed
• Need for real-time retrieval and processing of data for large numbers of users
Variety
• Many formats, structures, and sources
• Need for new types of storage and processing for structured and
unstructured data
 Big Data: many different types of tools, techniques,
technologies, algorithms, and computation models for the
collection, generation, storage, management, analysis,
and visualization of high-volume (in size), high-velocity
(in change), and high-variety (in nature) data sets.
 Management
 Processing
 Also known as Brewer’s Theorem, after Prof. Eric Brewer, who
presented it in 2000 (University of California, Berkeley).
 “Of three properties of a shared data system: data
consistency, system availability and tolerance to
network partitions, only two can be achieved at any
given moment.”
 Proven in 2002 by Seth Gilbert and Nancy Lynch (MIT).
 http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
 Consistency: all clients have the same view of the data
 Availability: each client can always read and write
data
 Partition tolerance: the system works well despite
physical network partitions
 The CAP theorem says a database can excel at only
two of the three CAP attributes
 ACID (Atomicity, Consistency, Isolation, Durability)
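try {
    Transaction.Begin();
    insert(data1);
    update(data2);
    insert(data3);
    delete(data4);
    Transaction.Commit();    // all four changes become visible together
}
catch (Exception) {
    Transaction.Rollback();  // any failure undoes every change made so far
}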
 Atomicity: All or nothing.
 Consistency: Consistent state of data.
 Isolation: Transactions are isolated from each other.
 Durability: When the transaction is committed, its state
will be durable.
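Atomicity and consistency can be observed with a small experiment. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an ACID database; the accounts table, the names, and the `transfer` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: balances may never go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(amount):
    """Move `amount` from alice to bob as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30)       # both updates are applied
failed = transfer(500)  # second update breaks the CHECK, so the first is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, bob's balance does not keep the 500 that was credited before the error, which is the "all or nothing" behavior described above.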
Any data store can achieve Atomicity, Isolation and
Durability but do you always need consistency? No.
By giving up ACID properties, one can achieve higher
performance and scalability.
 CAP in SQL databases >> CA (when not distributed) or CP
(when distributed, at the cost of availability)
 ACID is guaranteed
 The DBMS keeps users waiting (in order to propagate all
the changes to all nodes)
 CAP in NoSQL databases >> AP, CP
 The DBMS guarantees consistency eventually, but
meanwhile it gives control back to the application
(no waiting for users)
 The NoSQL database doesn’t commit the changes
right away (it buffers them)
 The data will be eventually consistent
 Acronym contrived to be the opposite of ACID
 Basically Available,
 Soft state,
 Eventually Consistent
 Basically Available
 parts of the system may fail, but the system as a whole does not
 Soft state
 copies of a data item may be inconsistent
 Eventual Consistency
 When no updates occur for a long period of time, eventually all
updates will propagate through the system and all the nodes will
be consistent
 copies become consistent at some later time if there are no
more updates to that data item
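Soft state and eventual consistency can be simulated in a few lines. This is a minimal sketch, assuming three in-memory replicas, a simple version counter, and last-write-wins conflict handling; the class and method names are invented, and real systems use more elaborate replication protocols.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def apply(self, key, value, version):
        """Last-write-wins: keep the value with the highest version."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(name) for name in names]
        self.pending = []  # updates not yet delivered to every replica
        self.clock = 0

    def write(self, key, value):
        """Acknowledge after one replica has the update; queue it for the rest."""
        self.clock += 1
        first, *others = self.replicas
        first.apply(key, value, self.clock)
        self.pending.extend((r, key, value, self.clock) for r in others)

    def propagate(self):
        """Deliver every queued update (what happens 'eventually')."""
        while self.pending:
            replica, key, value, version = self.pending.pop(0)
            replica.apply(key, value, version)


cluster = Cluster(["a", "b", "c"])
cluster.write("cart", ["book"])
stale = [r.read("cart") for r in cluster.replicas]      # replicas disagree
cluster.propagate()
converged = [r.read("cart") for r in cluster.replicas]  # replicas agree
```

The write returns before all copies are updated, so readers may briefly see different answers; once propagation finishes and no new writes arrive, every replica returns the same value.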
ACID:
• Strong consistency.
• Less availability.
• Pessimistic concurrency.
• Complex.
BASE:
• Availability is the most important thing. Willing to
sacrifice consistency for it (CAP).
• Weaker consistency (Eventual).
• Simple and fast.
• Optimistic concurrency.
 Massive write performance
 Fast key/value lookups
 No single point of failure
 Fast prototyping and development
 Out-of-the-box scalability (horizontally scalable)
 Easy maintenance
 Simple APIs
 C# Example: db.collection.save(myDocument);
 Seamless language integration
 No impedance mismatch (look at the above C#
example)
 Designed to be horizontally scalable (elastic)
 Flexible data model and schema
 Mostly free and/or open source
 There are more than 140 NoSQL Products
 Many are not proven
 Lack of SQL (the biggest missed feature)
 Proprietary Query Languages
 Lack of Skilled people
 Do you know a DBA for MarkLogic?
 Lack of tools for modeling, documenting, reporting, …
(usually there are no good visual tools)
 Lack of standards (the biggest threat)
[Figure: e-Commerce application. The web/application server keeps shopping cart data, orders, and session data in a single SQL DB.]
[Figure: e-Commerce application. Orders stay in the SQL DB; shopping cart data and session data move to key/value DBs.]
[Figure: e-Commerce application. Orders in the SQL DB, shopping cart data and session data in key/value DBs, and the customer social graph in a graph DB.]
 An application does not need to use a single
data store for all of its needs, since different databases
are built for different purposes and not all problems
can be elegantly solved by a single database.
 Using different data storage technologies for
varying data storage needs
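The e-commerce split described above can be sketched as one application object that sends each kind of data to a different store. This is a minimal sketch in which every store is an in-process stand-in (dicts for the key/value and graph stores, sqlite3 for the SQL DB); the `Shop` class and all of its names are assumptions made for illustration.

```python
import sqlite3


class Shop:
    """Routes each kind of data to a store suited to it (all stand-ins)."""

    def __init__(self):
        self.sessions = {}  # key/value stand-in: session id -> session data
        self.carts = {}     # key/value stand-in: customer -> list of items
        self.social = {}    # graph stand-in: customer -> set of friends
        self.orders = sqlite3.connect(":memory:")  # relational stand-in
        self.orders.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, item TEXT)"
        )

    def add_to_cart(self, customer, item):
        self.carts.setdefault(customer, []).append(item)

    def checkout(self, customer):
        """Move the cart into durable, transactional order rows."""
        items = self.carts.pop(customer, [])
        with self.orders:  # one transaction for the whole order
            self.orders.executemany(
                "INSERT INTO orders (customer, item) VALUES (?, ?)",
                [(customer, item) for item in items],
            )
        return len(items)

    def befriend(self, a, b):
        self.social.setdefault(a, set()).add(b)
        self.social.setdefault(b, set()).add(a)


shop = Shop()
shop.sessions["s-1"] = {"customer": "alice"}
shop.add_to_cart("alice", "book")
shop.add_to_cart("alice", "pen")
placed = shop.checkout("alice")
shop.befriend("alice", "bob")
order_rows = shop.orders.execute(
    "SELECT customer, item FROM orders ORDER BY id"
).fetchall()
```

Short-lived, high-churn data (sessions, carts) stays in the simple stores, while orders, which need transactions, go to the relational one.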
 Key-value stores:
 Processing a constant stream of small reads and writes.
 Document databases:
 Natural data modeling. Programmer friendly. Rapid
development. Web friendly, CRUD.
 RDBMS:
 OLTP. SQL. Transactions. Relations.
 Columnar:
 Handles size well. Massive write loads. High availability.
Multiple data centers. MapReduce.
 Graph:
 Graph algorithms and relations.
Thanks for your attention
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 

NoSQL (Not Only SQL)

  • 1. Dr. Pouria Amirian, June 2014. Big Data Project Manager and Data Scientist, University of Oxford. Pouria.Amirian@ndm.ox.ac.uk; Pouria.Amirian@gmail.com; @pouriaamirian
  • 3. “By 2015, 4.4 million IT jobs globally will be created to support Big Data. But there is a challenge. There is not enough talent in the industry. Our public and private education systems are failing us. Therefore only one-third of the IT jobs will be filled. These jobs are the future of the new information economy.”
      ◦ Three major areas of demand in Computer Science and IT: Big Data, Mobile, and Social Computing (the foundation of these three topics is Cloud Computing).
  • 4.
      ◦ SQL
      ◦ Advantages and Disadvantages
      ◦ NoSQL
      ◦ History
      ◦ Common Traits
      ◦ Categories
      ◦ Examples
      ◦ Trends
  • 6. Diagram labels: Table, Row, Column, Key
      ◦ Keys: single- or multi-column key
      ◦ Operations on tables: select, join (SQL)
      ◦ Relationship on key: primary key, foreign key
  • 7.
      ◦ Proven and well known, with available talent
      ◦ Many programmers are already familiar with it.
      ◦ Transactions and ACID make development easy.
      ◦ Lots of tools to use.
      ◦ Scalable
      ◦ Free and commercial production support
      ◦ SQL (a general, high-level query language)
  • 8. Create a database for the posts of a weblog:
      ◦ Each post is authored by a user.
      ◦ Each post can have multiple comments from other users.
      ◦ Users can vote for a post (0 to 5 stars).
      ◦ Users can like comments.
      ◦ Posts have a date; comments have a date.
  • 9. Sample post: “How Can I Cast an object to an Interface in C#?”
      ◦ Text: I have to work with a COM-based system and the only way to work with the system is to work with interfaces. The problem is when I worked in VB 6.0 the compiler could automatically cast any object to an interface. However, since C# is more type-safe, it is not provided automatically. So how can I convert an Obj to an Interface in C#?
      ◦ Author and date: Joe, “2011-07-26”
      ◦ Tags: C#, Cast, Interface
      ◦ Comment by James, “2011-07-26”: use the cast operator of C#
      ◦ Comment by Ana, “11-07-27”: you can use the ‘as’ keyword, look at the following code: Iinterface myInterface = myObj as Iinterface
  • 11. What are the posts by “Joe”? How many stars did they get? What are the comments written by “James”?
  • 12. The post as a single document:
        {
          "_id": ObjectId("4e2e3f92268cdda473b628f6"),
          "title": "How can I cast an Object to an Interface in C#?",
          "when": new Date("2011-07-26"),
          "author": "joe",
          "text": "I have to work with COM-based system and the only way to work with the system is to work with interfaces. the problem is …",
          "tags": ["C#", "Cast", "Interface"],
          "voters": [["James", "11-07-26", 4], ["John", "11-07-26", 5]],
          "comments": [
            {"by": "James", "text": "use the cast operator of C#", "when": "11-07-26"},
            {"by": "Ana", "text": "you can use the ‘as’ keyword …", "when": "11-07-27"}
          ]
        }
      Queries:
        db.posts.find({"author": "joe"}).sort()
        db.posts.find({"comments.by": "James"})
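The two queries on this slide can be mimicked without a database to show why the nested document answers the earlier questions without joins. This is a minimal in-memory sketch in Python, not MongoDB itself; the helper names `posts_by`, `total_stars`, and `posts_commented_by` are made up for illustration.

```python
# One post stored as a single nested document (fields taken from the slide).
posts = [
    {
        "title": "How can I cast an Object to an Interface in C#?",
        "when": "2011-07-26",
        "author": "joe",
        "tags": ["C#", "Cast", "Interface"],
        "voters": [["James", "11-07-26", 4], ["John", "11-07-26", 5]],
        "comments": [
            {"by": "James", "text": "use the cast operator of C#", "when": "11-07-26"},
            {"by": "Ana", "text": "you can use the 'as' keyword ...", "when": "11-07-27"},
        ],
    }
]


def posts_by(author):
    """Rough equivalent of db.posts.find({"author": author})."""
    return [p for p in posts if p["author"] == author]


def total_stars(post):
    """Sum the stars stored inside the post's own voters array."""
    return sum(stars for _name, _when, stars in post["voters"])


def posts_commented_by(name):
    """Rough equivalent of db.posts.find({"comments.by": name})."""
    return [p for p in posts if any(c["by"] == name for c in p["comments"])]


joes_posts = posts_by("joe")
joes_stars = [total_stars(p) for p in joes_posts]
james_posts = posts_commented_by("James")
```

Everything needed to answer the questions lives inside the post itself, which is the point of the document model: no second table is consulted.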
  • 13.
      ◦ Rigid schema design
      ◦ Hard to scale (very limited scalability)
      ◦ Hard and complex joins across multiple nodes
      ◦ Hard to handle data growth (schema change, high volume of data, high volume of transactions, …)
      ◦ Need for an interface for data access (another layer of complexity)
      ◦ Impedance mismatches: mapping between relational storage and object-based computing (Object-Relational Mapping doesn't work quite well)
  • 14. Relational databases are no longer one-size-fits-all. Examples:
      ◦ Content management systems
      ◦ Network data (social networking, location-based applications)
      ◦ Spatial data management systems
      ◦ High frequency of change (huge numbers of reads and writes)
  • 15. Data model and the matching kind of database:
        Data model      | Database type
        Tuples (rows)   | Relational DBMS
        Key/Value pairs | Key/Value databases
        Documents       | Document data stores
        Columns         | Column-family stores
        Graphs          | Graph databases
  • 16. The same mapping, labelled: the Relational DBMS is SQL; Key/Value databases, Document data stores, Column-family stores, and Graph databases are NoSQL.
  • 17.
      ◦ The needs of modern applications do not always match what relational databases provide.
      ◦ Success stories of Big Data management at internet giants such as Google, Amazon, Facebook, LinkedIn, …
      ◦ These companies faced unique challenges and each developed some sort of custom solution.
  • 18.
      ◦ The Google File System, October 2003
      ◦ MapReduce, December 2004
      ◦ BigTable, November 2006
      ◦ …
      Google's massively scalable infrastructure for:
      ◦ Google Search
      ◦ Google Maps and Google Earth
      ◦ Gmail, …
  • 19.
      ◦ Open source developers have tried to replicate each piece of Google's technology stack.
      ◦ Project Hadoop and its subprojects were born at Yahoo!
        Google infrastructure    | Hadoop universe
        Google File System (GFS) | Hadoop Distributed File System (HDFS)
        MapReduce                | Hadoop
        BigTable                 | HBase
  • 20.
      ◦ Dynamo: Amazon's Highly Available Key/Value Store, 2007
      ◦ Then use cases from eBay, Facebook, Netflix, Yahoo, IBM, and …
  • 21. Timeline:
      ◦ 1998: Carlos Strozzi used the word “NoSQL” to describe a relational database that did not expose a SQL interface.
      ◦ 2004: BigTable (Google)
      ◦ 2007: Dynamo (Amazon)
      ◦ 2008: Cassandra (Facebook)
      ◦ 2009: In San Francisco, the name NoSQL was proposed by Eric Evans to describe the growing non-relational movement.
  • 22.
      ◦ Not based on the relational model
      ◦ Flexible schema
      ◦ Supports distributed database architectures
      ◦ Provides high scalability, high availability, and fault tolerance
      ◦ Supports very large amounts of sparse data
      ◦ Geared toward performance rather than consistency
  • 25.
      ◦ Memcached: key-value store.
      ◦ Membase: Memcached with persistence and improved consistent hashing.
      ◦ AppFabric Cache: multi-region cache.
      ◦ Redis: data structure server.
      ◦ Riak: based on Amazon's Dynamo.
      ◦ Project Voldemort: eventually consistent key-value store, auto scaling.
  • 26.
      ◦ Schema free.
      ◦ Usually a JSON-like interchange model.
      ◦ Query model: JavaScript or custom.
      ◦ Aggregations: Map/Reduce.
      ◦ Indexes are done via B-Trees.
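The Map/Reduce aggregation mentioned above can be illustrated with a small in-memory sketch. It is written in plain Python and is not the API of any particular document store; the sample documents and the names `map_tags`, `reduce_counts`, and `map_reduce` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical documents; only the "tags" field matters here.
docs = [
    {"title": "post 1", "tags": ["C#", "Cast", "Interface"]},
    {"title": "post 2", "tags": ["C#", "COM"]},
]


def map_tags(doc):
    """Map step: emit a (key, value) pair for every tag in a document."""
    for tag in doc["tags"]:
        yield tag, 1


def reduce_counts(tag, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def map_reduce(documents, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


tag_counts = map_reduce(docs, map_tags, reduce_counts)
```

In a distributed store the map step would typically run on the nodes that hold the documents and only the grouped values would travel over the network; the single-process version above keeps just the shape of the computation.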
  • 27. Example document:
        {
          "_id": ObjectId("4e2e3f92268cdda473b628f6"),
          "title": "How can I cast an Object to an Interface in C#?",
          "when": new Date("2011-07-26"),
          "author": "joe",
          "text": "I have to work with COM-based system and the only way to work with the system is to work with interfaces. the problem is …",
          "tags": ["C#", "Cast", "Interface"],
          "voters": [["James", "11-07-26", 4], ["John", "11-07-26", 5]],
          "comments": [
            {"by": "James", "text": "use the cast operator of C#", "when": "11-07-26"},
            {"by": "Ana", "text": "you can use the ‘as’ keyword …", "when": "11-07-27"}
          ]
        }
  • 28. Row oriented (relational):
        Id | username | email        | Department
        1  | John     | john@foo.com | Sales
        2  | Mary     | mary@foo.com | Marketing
        3  | Yoda     | yoda@foo.com | IT
      Column oriented:
        Id         | 1            | 2            | 3
        Username   | John         | Mary         | Yoda
        email      | john@foo.com | mary@foo.com | yoda@foo.com
        Department | Sales        | Marketing    | IT
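The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not how any column-family store is implemented; the helper `row_at` is a made-up name.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"Id": 1, "username": "John", "email": "john@foo.com", "Department": "Sales"},
    {"Id": 2, "username": "Mary", "email": "mary@foo.com", "Department": "Marketing"},
    {"Id": 3, "username": "Yoda", "email": "yoda@foo.com", "Department": "IT"},
]

# Column-oriented: each field keeps the values of all records together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Reading one attribute for every record touches a single list.
departments = columns["Department"]


def row_at(cols, index):
    """Rebuild one record by picking the same position from every column."""
    return {name: values[index] for name, values in cols.items()}
```

Scanning one attribute across all records is cheap in the column layout, while reassembling a full record requires touching every column.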
  • 30.
      ◦ Based on graph theory.
      ◦ Scale vertically, no clustering.
      ◦ You can use graph algorithms easily.
  • 32. Relational model of a social network: who are Bob's friends?
  • 33. Find all friends of Alice's friends.
  • 34. In a sample social network containing 1,000,000 nodes (people), each with approximately 50 edges (relationships), query times in seconds:
        Depth | RDBMS      | Graph | Returned records
        2     | 0.016      | 0.01  | ~2,500
        3     | 30.267     | 0.168 | ~110,000
        4     | 1543.505   | 1.359 | ~600,000
        5     | Unfinished | 2.132 | ~800,000
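The friends-of-friends question is a depth-limited graph traversal. The following Python sketch uses a breadth-first search over an adjacency list; the tiny graph and the function name `friends_within` are invented for illustration, and the code says nothing about how a real graph database stores or indexes its edges.

```python
from collections import deque

# Hypothetical social graph as an adjacency list.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Alice", "Dave"],
    "Carol": ["Alice", "Dave", "Eve"],
    "Dave": ["Bob", "Carol"],
    "Eve": ["Carol"],
}


def friends_within(graph, start, max_depth):
    """Return {person: depth} for everyone reachable in at most max_depth hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if depth[person] == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, []):
            if friend not in depth:
                depth[friend] = depth[person] + 1
                queue.append(friend)
    del depth[start]  # the starting person is not their own friend
    return depth


friends_of_friends = friends_within(friends, "Alice", 2)
```

Each extra level of depth only follows the edges of the people already found. Expressing the same question over a friendship table needs one more self-join per level, and the table above shows that approach becoming much slower as depth grows.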
  • 36. 1- Non-relational
      ◦ No tables
      ◦ No joins
      ◦ No ACID transactions *
      ◦ No support for SQL *
      ◦ *: a few NoSQL databases support ACID and SQL
  • 37. 2- Schema free
      ◦ In a data collection, there can be records with completely different data items (fields):
          ▪ Book 1 {name, publicationYear}
          ▪ Book 2 {author, publisher}
      ◦ The schema is in the data itself (JSON) or, usually, in the application rather than in the database.
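The two book records above can sit side by side in one collection, and the application decides how to interpret missing fields. This is a small Python sketch of that idea; the field values and the helper names `fields_in_use` and `describe` are hypothetical.

```python
# Two records in the same collection with entirely different fields.
books = [
    {"name": "Book 1", "publicationYear": 2014},             # hypothetical values
    {"author": "A. Author", "publisher": "Example Press"},   # hypothetical values
]


def fields_in_use(records):
    """The effective schema is whatever fields the stored records contain."""
    return sorted({field for record in records for field in record})


def describe(book):
    """Application-side schema: the code, not the database, supplies defaults."""
    return {
        "name": book.get("name", "(untitled)"),
        "author": book.get("author", "(unknown)"),
    }
```

Nothing rejects the second record for lacking a name; handling that case is the application's job, which is the trade-off the slide describes.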
  • 38. 3- Horizontal scalability
      ◦ Vertical (scale up)
      ◦ Horizontal (scale out)
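Scaling out usually means spreading keys over several machines. A minimal sketch of hash-based sharding follows, with each "node" simulated by a Python dictionary; `shard_for` and `ShardedStore` are illustrative names, and real systems often use consistent hashing instead of a plain modulo so that adding a node moves fewer keys.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard from a stable hash of the key (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Key-value store whose data is split across several simulated nodes."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(4)
for i in range(100):
    store.put(f"user:{i}", i)
```

Each read or write touches exactly one shard, so capacity grows by adding shards rather than by buying a bigger single machine.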
  • 39. 4- Web-scale applications
      ◦ Simple requests (the underlying database seems to be unsophisticated)
      ◦ However: sheer volume of data and a huge number of users (millions of users)
  • 40. 5- Open source, but from large internet companies:
      ◦ Google
      ◦ Facebook
      ◦ Twitter
      ◦ LinkedIn
      ◦ Yahoo
  • 42.
      ◦ Volume: huge amounts of data collected and generated by organizations or individuals; need for huge amounts of storage and processing power.
      ◦ Velocity: the frequency at which data is generated, captured, shared, and processed; need for real-time retrieval and processing of data for a large number of users.
      ◦ Variety: many formats, structures, and sources; need for new types of storage and processing for structured and unstructured data.
  • 44. Many different types of tools, techniques, technologies, algorithms, and computation models for the collection, generation, storage, management, analysis, and visualization of high-volume (size), high-velocity (change), and high-variety (nature) data sets.
  • 48.
      ◦ The CAP theorem is also known as Brewer's Theorem, after Prof. Eric Brewer of the University of California, Berkeley, who presented it in 2000.
      ◦ “Of three properties of a shared data system: data consistency, system availability and tolerance to network partitions, only two can be achieved at any given moment.”
      ◦ Later proven formally by Seth Gilbert and Nancy Lynch of MIT (2002).
      ◦ http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
  • 49.
      ◦ Consistency: all clients have the same view of the data.
      ◦ Availability: each client can always read and write data.
      ◦ Partition tolerance: the system works well despite physical network partitions.
      ◦ The CAP theorem says a database can excel at only two of the three CAP attributes.
  • 50. ACID (Atomicity, Consistency, Isolation, Durability)
        try {
            Transaction.begin();
            insert(data1);
            update(data2);
            insert(data3);
            delete(data4);
            Transaction.Commit();
        } catch () {
            Transaction.Rollback();
        }
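The begin/commit/rollback pattern in the pseudocode above can be run for real with Python's built-in `sqlite3` module. The sketch below assumes a made-up `accounts` table with a CHECK constraint and a hypothetical `transfer` helper; it stands in for the slide's generic `insert`/`update`/`delete` calls and is not the author's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(connection, src, dst, amount):
    """Move money atomically: both updates are applied, or neither is."""
    try:
        with connection:  # commits on success, rolls back if an exception escapes
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed; the credit was undone too


def balances(connection):
    return dict(connection.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account fails on the second statement, and the first statement's credit is rolled back along with it, which is the "all or nothing" behaviour the slide calls atomicity.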
  • 51.
      ◦ Atomicity: all or nothing.
      ◦ Consistency: consistent state of data.
      ◦ Isolation: transactions are isolated from each other.
      ◦ Durability: when the transaction is committed, the state will be durable.
      Any data store can achieve atomicity, isolation, and durability, but do you always need consistency? No. By giving up ACID properties, one can achieve higher performance and scalability.
  • 52.
      ◦ CAP in SQL databases: CA (not distributed) or CP (distributed, but availability is sacrificed)
      ◦ ACID is guaranteed.
      ◦ The DBMS keeps users waiting (in order to propagate all the changes to all nodes).
  • 53.
      ◦ CAP in NoSQL databases: AP or CP
      ◦ The DBMS guarantees consistency eventually, but in the meantime it gives control back to the application (no waiting for users).
      ◦ The NoSQL database doesn't commit the changes right away (it buffers them).
      ◦ The data will be eventually consistent.
  • 54. BASE, an acronym contrived to be the opposite of ACID:
      ◦ Basically Available,
      ◦ Soft state,
      ◦ Eventually consistent
  • 55.
      ◦ Basically available: faults are possible, but not a fault of the whole system.
      ◦ Soft state: copies of a data item may be inconsistent.
      ◦ Eventual consistency: when no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent. Copies become consistent at some later time if there are no more updates to that data item.
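Soft state and eventual consistency can be seen in a toy simulation with two replicas that exchange their data only when `sync` is called. The Python sketch below assumes a simple last-write-wins rule based on a caller-supplied version number; the names `Replica` and `sync` are illustrative, and real systems use more careful conflict resolution.

```python
class Replica:
    """One copy of the data; it may lag behind the other copies."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)  # newer version wins

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions; the higher version survives."""
    for key, (version, value) in list(a.data.items()):
        b.write(key, value, version)
    for key, (version, value) in list(b.data.items()):
        a.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", ["book"], version=1)
stale = r2.read("cart")   # soft state: r2 has not seen the write yet
sync(r1, r2)
fresh = r2.read("cart")   # after propagation both replicas agree
```

The write returns immediately on `r1` without waiting for `r2`, so a reader of `r2` can briefly see old data; once updates stop and `sync` has run, every replica returns the same value.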
  • 56. ACID:
      ◦ Strong consistency.
      ◦ Less availability.
      ◦ Pessimistic concurrency.
      ◦ Complex.
      BASE:
      ◦ Availability is the most important thing. Willing to sacrifice for this (CAP).
      ◦ Weaker consistency (eventual).
      ◦ Simple and fast.
      ◦ Optimistic concurrency.
  • 59.
      ◦ Massive write performance
      ◦ Fast key-value lookups
      ◦ No single point of failure
      ◦ Fast prototyping and development
      ◦ Out-of-the-box scalability (horizontally scalable)
      ◦ Easy maintenance
  • 60.
      ◦ Simple APIs. C# example: db.collection.save(myDocument);
      ◦ Seamless language integration
      ◦ No impedance mismatch (look at the above C# example)
      ◦ Designed to be horizontally scalable (elastic)
      ◦ Flexible data model and schema
      ◦ The majority are free and/or open source
  • 61.
      ◦ There are more than 140 NoSQL products; many are not proven.
      ◦ Lack of SQL (the biggest missing feature); proprietary query languages
      ◦ Lack of skilled people. Do you know a DBA for MarkLogic?
      ◦ Lack of tools for modeling, documenting, reporting, … (usually there are no good visual tools)
      ◦ Lack of standards (the biggest threat)
  • 63. Diagram: an e-commerce application on a web/application server keeps shopping cart data, orders, and session data in a single SQL DB.
  • 64. Diagram: the e-commerce application with shopping cart data, orders, and session data all in one SQL DB.
  • 66. Diagram: the e-commerce application using several data stores:
        Data                  | Store
        Orders                | SQL DB
        Shopping cart data    | Key/Value DB
        Session data          | Key/Value DB
        Customer social graph | Graph DB
  • 67.
      ◦ It is not necessary for the application to use a single data store for all of its needs, since different databases are built for different purposes and not all problems can be elegantly solved by a single database.
      ◦ Using different data storage technologies for varying data storage needs.
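The idea of one application talking to several kinds of store can be sketched as a thin storage facade. In the Python example below an in-memory SQLite database stands in for the SQL DB, plain dictionaries stand in for the key/value stores, and a dictionary of sets stands in for the graph database; the class name `ShopStorage` and all of its methods are hypothetical.

```python
import sqlite3


class ShopStorage:
    """Routes each kind of data to the store that suits it best."""

    def __init__(self):
        # Orders need transactions and relations: SQL.
        self.orders_db = sqlite3.connect(":memory:")
        self.orders_db.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
        )
        # Carts and sessions are small, frequent reads and writes: key/value.
        self.carts = {}
        self.sessions = {}
        # Who follows whom is a graph.
        self.social_graph = {}

    def place_order(self, customer, total):
        with self.orders_db:  # commit on success, roll back on error
            cursor = self.orders_db.execute(
                "INSERT INTO orders (customer, total) VALUES (?, ?)",
                (customer, total),
            )
        return cursor.lastrowid

    def add_to_cart(self, customer, item):
        self.carts.setdefault(customer, []).append(item)

    def follow(self, customer, other):
        self.social_graph.setdefault(customer, set()).add(other)


storage = ShopStorage()
storage.add_to_cart("alice", "book")
storage.follow("alice", "bob")
order_id = storage.place_order("alice", 12.5)
```

The application code calls one facade, while each kind of data lands in a store chosen for its access pattern.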
  • 68.
      ◦ Key-value stores: processing a constant stream of small reads and writes.
      ◦ Document databases: natural data modeling. Programmer friendly. Rapid development. Web friendly, CRUD.
      ◦ RDBMS: OLTP. SQL. Transactions. Relations.
      ◦ Columnar: handles size well. Massive write loads. High availability. Multiple data centers, MapReduce.
      ◦ Graph: graph algorithms and relations.
  • 69. Thanks for your attention