SlideShare ist ein Scribd-Unternehmen logo
1 von 37
Downloaden Sie, um offline zu lesen
Physical Design for
Non-relational Data Systems
Michael Mior • University of Waterloo
Proper design and configuration of
data systems is critical for achieving
good performance
2
3
Many tools exist for relational
database design optimization
Source: https://www.databasejournal.com/features/mssql/article.php/10894_3523616_2/Index-Tuning-Wizard.htm
https://dev.mysql.com/doc/mysql-monitor/4.0/en/mem-qanal-using-ui.html
Microsoft AutoAdmin (1998)
DB2 Design Advisor (2004)
Oracle SQL Tuning (2004)
We want applications to be up 24/7
We're frequently dealing with changing
data or with unstructured data
We require sub-second responses to queries
4 Source: Mike Loukides, VP Content Strategy, O’Reilly Media
Relational databases are not
always sufficient for these uses
“Over 30 years, we've learned how to
write business intelligence
applications on top of relational
databases -- there are patterns. With
NoSQL today, we have no cookie
cutters. We don't have any blueprints.”
--Ravi Krishnappa, NetApp solutions architect
5 Source: TechTarget, 2015
• NoSQL Database Design Optimization
• Understanding Existing NoSQL Designs
• Optimizing Big Data Applications
• NoSQL Database Design Optimization
• Understanding Existing NoSQL Designs
• Optimizing Big Data Applications
Model column families around query patterns
But start your design with entities and relationships, if you can
De-normalize and duplicate for read performance
But don’t de-normalize if you don’t need to
Leverage wide rows for ordering, grouping, and filtering
But don’t go too wide
Schema Design Best Practices
Source: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
But
But
But
?
?
?
8
NoSQL Application Development
Requirements ImplementationData Model
App LogicDB Access
NoSE[MSAL, ICDE ‘16] [MSAL, TKDE ‘17]9
Database
Design
Example
Comment
com_id
com_date
text
User
user_id
nickname
Post
user_id
post_date
title
10
Database
Design
Example
11
SELECT post_id, post_title
FROM users u JOIN comments
c ON u.user_id = c.user_id
JOIN posts p
ON p.post_id = c.post_id
ORDER BY p.post_date
Query
Find information on all posts a user has
commented on in order by post date
Database
Design
Example
user_id
↓
nickname
comment_id
post_id
↓
title
post_date
comment_id
↓
post_id
post_date
nickname
nickname
↓
title
post_date
Execution A
Execution B
12
NoSE
Workload
Query Plans
Data Model
1. Candidate Enumeration
13
2. Query Planning
3. Design Optimization
4. Plan Recommendation
Database Design
14
3
4 5
Database Design Optimization
NoSE considers all
possible query plans
and picks the one
with minimum
expected cost
Evaluation
15
Overall workload performance
improves by 5x
• NoSQL Database Design Optimization
• Understanding Existing NoSQL Designs
• Optimizing Big Data Applications
Physical
Logical
17
{user_id: 1, post_date: "2017-04-05",
com_id: 3, …}
{user_id: 2, post_date: "2017-04-05",
com_id: 7, …}
{post_id: 6, com_date: "2017-04-03",
com_id: 3, user_id: 1, …}
{post_id: 6, com_date: "2017-04-01",
com_id: 7, user_id: 2, …}
?
Existing NoSQL designs are a black box
?!?
JSON!
Removes redundancy implied by both
functional and inclusion dependencies
Recovering
Logical
Schemas
Extract the structure of existing data
Discover dependencies
Produce a logical model of the database
18
user_comments
{░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, …}
{░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, …}
comments_by_date
{░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, ░░░░░░░: ░, …}
{░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░",
░░░░░░: ░, ░░░░░░░: ░, …}
We want to go from raw data
to a logical model
Comment
User
Post
19 [MS, ER ‘18] (to appear)
20
user_comments
user_id post_date com_id post_id title
1 2017-04-05 3 6 Stargate
2 2017-04-05 7 6 Stargate
Data on the same logical entity
appears multiple times
user_comments
user_id com_id post_id
1 3 6
2 7 6
posts
post_date post_id title
2017-04-05 6 Stargate
21
Post data can be
(logically) extracted
to normalize
22
user_comments_user
user_id
user_comments_post
post_id
post_date,
title
comments_by_date_post
post_id
comments_by_date_com
com_id
com_date, text
comments_by_date_user
user_id, nickname22
2323
posts
post_id
post_date,
title
comments
com_id
com_date, text
users
user_id, nickname
This is the original logical model!
Comment
User
Post
• NoSQL Database Design Optimization
• Understanding Existing NoSQL Designs
• Optimizing Big Data Applications
Apache
Spark
Model
▸ Series of lazy transformations which
are followed by actions that force
evaluation of all transformations
▸ Each step produces a resilient
distributed dataset (RDD)
▸ Intermediate results can be cached on
memory or disk, optionally serialized
25
Caching is very useful for applications that re-use an RDD multiple times.
Caching all of the generated RDDs is not a good strategy…
Caching is very useful for applications that re-use an RDD multiple times.
Caching all of the generated RDDs is not a good strategy…
…deciding which ones to cache may be challenging.
Spark Caching Best Practices
Source: https://unraveldata.com/to-cache-or-not-to-cache/26
PageRank Example
var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
rankGraph.persist()
val rankUpdates = rankGraph.aggregateMessages(...)
prevRankGraph = rankGraph
rankGraph = rankGraph.outerJoinVertices(rankUpdates)
.persist()
rankGraph.edges.foreachPartition(...)
prevRankGraph.unpersist()
}
rankGraph.vertices.values.sum()
27
Transformations
var rankGraph = graph
var iteration = 0
while (iteration < numIter) {
rankGraph.persist()
val rankUpdates = rankGraph
prevRankGraph = rankGraph
rankGraph = rankGraph
.persist()
rankGraph.edges.foreachPartition(...)
prevRankGraph.unpersist()
}
rankGraph.vertices.values.sum()
.outerJoinVertices(...).map(...)
.aggregateMessages(...)
.outerJoinVertices(rankUpdates)
28
Actions
var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
rankGraph.persist()
val rankUpdates = rankGraph.aggregateMessages(...)
prevRankGraph = rankGraph
rankGraph = rankGraph.outerJoinVertices(rankUpdates)
.persist()
prevRankGraph.unpersist()
}
rankGraph.edges.foreachPartition(...)
rankGraph.vertices.values.sum()
29
30
PageRank RDDs
Some RDDs are used more than once
Spark
Model
Caching
var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
val rankUpdates = rankGraph.aggregateMessages(...)
prevRankGraph = rankGraph
rankGraph = rankGraph.outerJoinVertices(rankUpdates)
rankGraph.edges.foreachPartition(...)
}
rankGraph.vertices.values.sum()
rankGraph.persist()
.persist()
prevRankGraph.unpersist()
31
ReSpark
var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
whileLoop (sc, iteration < numIter {
rankGraph.persist()
val rankUpdates = rankGraph.aggregateMessages(...)
prevRankGraph = rankGraph
rankGraph = rankGraph.outerJoinVertices(rankUpdates)
.persist()
rankGraph.edges.foreachPartition(...)
prevRankGraph.unpersist()
})
rankGraph.vertices.values.sum()
32
ReSpark
var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
whileLoop (sc, iteration < numIter {
val rankUpdates = rankGraph.aggregateMessages(...)
prevRankGraph = rankGraph
rankGraph = rankGraph.outerJoinVertices(rankUpdates)
})
rankGraph.vertices.values.sum()
33
rankGraph: 0
ReSpark
var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
whileLoop (sc, iteration < numIter {
val rankUpdates = rankGraph.aggregateMessages(...)
prevRankGraph = rankGraph
rankGraph = rankGraph.outerJoinVertices(rankUpdates)
})
rankGraph.vertices.values.sum()
34
rankGraph: 2 Persist!
PageRank
on
ReSpark
35
Without any caching,
many jobs take hours!
36
Questions?
xkcd.com

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersBig Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersEdureka!
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseA Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseRidwan Fadjar
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solrboorad
 
WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...
WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...
WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...WSO2
 
The Very ^ 2 Basics of R
The Very ^ 2 Basics of RThe Very ^ 2 Basics of R
The Very ^ 2 Basics of RWinston Chen
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics Farheen Nilofer
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationGreg Goltsov
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizITJobZone.biz
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP vinoth kumar
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo dbMongoDB
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...IJSRD
 
JDD 2016 - Michal Matloka - Small Intro To Big Data
JDD 2016 - Michal Matloka - Small Intro To Big DataJDD 2016 - Michal Matloka - Small Intro To Big Data
JDD 2016 - Michal Matloka - Small Intro To Big DataPROIDEA
 
Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project ReportTushar Dalvi
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...oj08
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 

Was ist angesagt? (20)

Big Data Analytics for Non-Programmers
Big Data Analytics for Non-ProgrammersBig Data Analytics for Non-Programmers
Big Data Analytics for Non-Programmers
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseA Study Review of Common Big Data Architecture for Small-Medium Enterprise
A Study Review of Common Big Data Architecture for Small-Medium Enterprise
 
Big Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and SolrBig Data Analysis Patterns with Hadoop, Mahout and Solr
Big Data Analysis Patterns with Hadoop, Mahout and Solr
 
Real-time analytics with HBase
Real-time analytics with HBaseReal-time analytics with HBase
Real-time analytics with HBase
 
WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...
WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...
WSO2Con Asia 2014 - Simultaneous Analysis of Massive Data Streams in real-tim...
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 
The Very ^ 2 Basics of R
The Very ^ 2 Basics of RThe Very ^ 2 Basics of R
The Very ^ 2 Basics of R
 
Topic modeling using big data analytics
Topic modeling using big data analytics Topic modeling using big data analytics
Topic modeling using big data analytics
 
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data ExplorationSeeing at the Speed of Thought: Empowering Others Through Data Exploration
Seeing at the Speed of Thought: Empowering Others Through Data Exploration
 
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.bizIntroduction to Big Data Hadoop Training Online by www.itjobzone.biz
Introduction to Big Data Hadoop Training Online by www.itjobzone.biz
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo db
 
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
Big Data Mining, Techniques, Handling Technologies and Some Related Issues: A...
 
JDD 2016 - Michal Matloka - Small Intro To Big Data
JDD 2016 - Michal Matloka - Small Intro To Big DataJDD 2016 - Michal Matloka - Small Intro To Big Data
JDD 2016 - Michal Matloka - Small Intro To Big Data
 
Data Storage and Management project Report
Data Storage and Management project ReportData Storage and Management project Report
Data Storage and Management project Report
 
U0 vqmtq3m tc=
U0 vqmtq3m tc=U0 vqmtq3m tc=
U0 vqmtq3m tc=
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 

Ähnlich wie Physical Design for Non-Relational Data Systems

Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
NoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & PrinciplesNoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & PrinciplesScyllaDB
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDBMongoDB
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBMongoDB
 
Microsoft sql-server-2016 Tutorial & Overview
Microsoft sql-server-2016 Tutorial & OverviewMicrosoft sql-server-2016 Tutorial & Overview
Microsoft sql-server-2016 Tutorial & OverviewQA TrainingHub
 
Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...
Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...
Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...Mydbops
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101MongoDB
 
MongoDB Tips and Tricks
MongoDB Tips and TricksMongoDB Tips and Tricks
MongoDB Tips and TricksM Malai
 
MongoDB performance
MongoDB performanceMongoDB performance
MongoDB performanceMydbops
 
Financial, Retail And Shopping Domains
Financial, Retail And Shopping DomainsFinancial, Retail And Shopping Domains
Financial, Retail And Shopping DomainsSonia Sanchez
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Mark Tabladillo
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessAnant Corporation
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Jim Czuprynski
 

Ähnlich wie Physical Design for Non-Relational Data Systems (20)

Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
NoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & PrinciplesNoSQL Data Modeling Foundations — Introducing Concepts & Principles
NoSQL Data Modeling Foundations — Introducing Concepts & Principles
 
Webinar: Scaling MongoDB
Webinar: Scaling MongoDBWebinar: Scaling MongoDB
Webinar: Scaling MongoDB
 
How to Achieve Scale with MongoDB
How to Achieve Scale with MongoDBHow to Achieve Scale with MongoDB
How to Achieve Scale with MongoDB
 
Microsoft sql-server-2016 Tutorial & Overview
Microsoft sql-server-2016 Tutorial & OverviewMicrosoft sql-server-2016 Tutorial & Overview
Microsoft sql-server-2016 Tutorial & Overview
 
Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You Neo4j: What's Under the Hood & How Knowing This Can Help You
Neo4j: What's Under the Hood & How Knowing This Can Help You
 
Role of ML engineer
Role of ML engineerRole of ML engineer
Role of ML engineer
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...
Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...
Use Performance Insights To Enhance MongoDB Performance - (Manosh Malai - Myd...
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
 
MongoDB Tips and Tricks
MongoDB Tips and TricksMongoDB Tips and Tricks
MongoDB Tips and Tricks
 
MongoDB performance
MongoDB performanceMongoDB performance
MongoDB performance
 
Financial, Retail And Shopping Domains
Financial, Retail And Shopping DomainsFinancial, Retail And Shopping Domains
Financial, Retail And Shopping Domains
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Data engineering design patterns
Data engineering design patternsData engineering design patterns
Data engineering design patterns
 
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
Secrets of Enterprise Data Mining: SQL Saturday Oregon 201411
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
Vote Early, Vote Often: From Napkin to Canvassing Application in a Single Wee...
 

Mehr von Michael Mior

A view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academiaA view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academiaMichael Mior
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Locomotor: transparent migration of client-side database code
Locomotor: transparent migration of client-side database codeLocomotor: transparent migration of client-side database code
Locomotor: transparent migration of client-side database codeMichael Mior
 
Automated Schema Design for NoSQL Databases
Automated Schema Design for NoSQL DatabasesAutomated Schema Design for NoSQL Databases
Automated Schema Design for NoSQL DatabasesMichael Mior
 
NoSE: Schema Design for NoSQL Applications
NoSE: Schema Design for NoSQL ApplicationsNoSE: Schema Design for NoSQL Applications
NoSE: Schema Design for NoSQL ApplicationsMichael Mior
 
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...Michael Mior
 

Mehr von Michael Mior (6)

A view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academiaA view from the ivory tower: Participating in Apache as a member of academia
A view from the ivory tower: Participating in Apache as a member of academia
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Locomotor: transparent migration of client-side database code
Locomotor: transparent migration of client-side database codeLocomotor: transparent migration of client-side database code
Locomotor: transparent migration of client-side database code
 
Automated Schema Design for NoSQL Databases
Automated Schema Design for NoSQL DatabasesAutomated Schema Design for NoSQL Databases
Automated Schema Design for NoSQL Databases
 
NoSE: Schema Design for NoSQL Applications
NoSE: Schema Design for NoSQL ApplicationsNoSE: Schema Design for NoSQL Applications
NoSE: Schema Design for NoSQL Applications
 
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Clo...
 

Kürzlich hochgeladen

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Kürzlich hochgeladen (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Physical Design for Non-Relational Data Systems

  • 1. Physical Design for Non-relational Data Systems Michael Mior • University of Waterloo
  • 2. Proper design and configuration of data systems is critical for achieving good performance 2
  • 3. 3 Many tools exist for relational database design optimization Source: https://www.databasejournal.com/features/mssql/article.php/10894_3523616_2/Index-Tuning-Wizard.htm https://dev.mysql.com/doc/mysql-monitor/4.0/en/mem-qanal-using-ui.html Microsoft AutoAdmin (1998) DB2 Design Advisor (2004) Oracle SQL Tuning (2004)
  • 4. We want applications to be up 24/7 We're frequently dealing with changing data or with unstructured data We require sub-second responses to queries 4 Source: Mike Loukides, VP Content Strategy, O’Reilly Media Relational databases are not always sufficient for these uses
  • 5. “Over 30 years, we've learned how to write business intelligence applications on top of relational databases -- there are patterns. With NoSQL today, we have no cookie cutters. We don't have any blueprints.” --Ravi Krishnappa, NetApp solutions architect 5 Source: TechTarget, 2015
  • 6. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  • 7. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  • 8. Model column families around query patterns But start your design with entities and relationships, if you can De-normalize and duplicate for read performance But don’t de-normalize if you don’t need to Leverage wide rows for ordering, grouping, and filtering But don’t go too wide Schema Design Best Practices Source: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/ But But But ? ? ? 8
  • 9. NoSQL Application Development Requirements ImplementationData Model App LogicDB Access NoSE[MSAL, ICDE ‘16] [MSAL, TKDE ‘17]9
  • 11. Database Design Example 11 SELECT post_id, post_title FROM users u JOIN comments c ON u.user_id = c.user_id JOIN posts p ON p.post_id = c.post_id ORDER BY p.post_date Query Find information on all posts a user has commented on in order by post date
  • 13. NoSE Workload Query Plans Data Model 1. Candidate Enumeration 13 2. Query Planning 3. Design Optimization 4. Plan Recommendation Database Design
  • 14. 14 3 4 5 Database Design Optimization NoSE considers all possible query plans and picks the one with minimum expected cost
  • 16. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  • 17. Physical Logical 17 {user_id: 1, post_date: "2017-04-05", com_id: 3, …} {user_id: 2, post_date: "2017-04-05", com_id: 7, …} {post_id: 6, com_date: "2017-04-03", com_id: 3, user_id: 1, …} {post_id: 6, com_date: "2017-04-01", com_id: 7, user_id: 2, …} ? Existing NoSQL designs are a black box ?!? JSON!
  • 18. Removes redundancy implied by both functional and inclusion dependencies Recovering Logical Schemas Extract the structure of existing data Discover dependencies Produce a logical model of the database 18
  • 19. user_comments {░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …} {░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …} comments_by_date {░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …} {░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …} We want to go from raw data to a logical model Comment User Post 19 [MS, ER ‘18] (to appear)
  • 20. 20 user_comments user_id post_date com_id post_id title 1 2017-04-05 3 6 Stargate 2 2017-04-05 7 6 Stargate Data on the same logical entity appears multiple times
  • 21. user_comments user_id com_id post_id 1 3 6 2 7 6 posts post_date post_id title 2017-04-05 6 Stargate 21 Post data can be (logically) extracted to normalize
  • 24. • NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications
  • 25. Apache Spark Model ▸ Series of lazy transformations which are followed by actions that force evaluation of all transformations ▸ Each step produces a resilient distributed dataset (RDD) ▸ Intermediate results can be cached on memory or disk, optionally serialized 25
  • 26. Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… …deciding which ones to cache may be challenging. Spark Caching Best Practices Source: https://unraveldata.com/to-cache-or-not-to-cache/26
  • 27. PageRank Example var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() } rankGraph.vertices.values.sum() 27
  • 28. Transformations var rankGraph = graph var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph prevRankGraph = rankGraph rankGraph = rankGraph .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() } rankGraph.vertices.values.sum() .outerJoinVertices(...).map(...) .aggregateMessages(...) .outerJoinVertices(rankUpdates) 28
  • 29. Actions var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() prevRankGraph.unpersist() } rankGraph.edges.foreachPartition(...) rankGraph.vertices.values.sum() 29
  • 30. 30 PageRank RDDs Some RDDs are used more than once
  • 31. Spark Model Caching var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) rankGraph.edges.foreachPartition(...) } rankGraph.vertices.values.sum() rankGraph.persist() .persist() prevRankGraph.unpersist() 31
  • 32. ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() }) rankGraph.vertices.values.sum() 32
  • 33. ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) }) rankGraph.vertices.values.sum() 33 rankGraph: 0
  • 34. ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) }) rankGraph.vertices.values.sum() 34 rankGraph: 2 Persist!
  • 36. 36