Proprietary & Confidential
Proliferation of New Database Technologies and
Implications for Data Science Workflows
November 2017
Manny Bernabe | James Lamb
Section 1
Intro to Uptake
Copyright © 2017 Uptake – CONFIDENTIAL | 13-Nov-17 | Collaboration Portfolio
Uptake at a glance
● 4MM+ predictions/week
● Founded in Chicago in 2014
● 800+ employees, 75% across Data Science & Engineering
● Uptake has developed partnerships in: Aviation, Construction, Energy, Manufacturing, Mining, Oil & Gas, Rail, Retail
● Ranked #5 on CNBC’s 2017 Disruptor 50 list – May 2017
● Recognized as World Economic Forum 2017 Technology Pioneer – June 2017
● Uptake’s Industry Thought Leaders featured in: [publication logos not reproduced]
Rail Uptime: Predictive events & conditions – actual screenshot
Real time alerts are too late. In this case we are predicting 2 weeks into the future.
Our strength lies in data science.
1. Cutting edge tech: Built from scratch for quality
2. Top tier talent: Over 60 data scientists
3. Fast deployment: Core platform built to scale out
4. Industry knowledge: Our data scientists train in your field
5. Applied experience: We work in many industries
Our core machine learning engines can be deployed in any industry.
● Failure Prediction
● Event/Alert Filtering
● Anomaly Detection
● Image Analytics
● Suggestion
● Label Correction
Section 2
Emergence of NoSQL Databases
To be clear: Relational DBs are awesome and they’re here to stay
Relational databases are popular because they’re intuitive to reason
about, easy to query, and come with some nice guarantees
● Normalized data model
○ Entities, relationships that look like the real world
● Declarative code
○ “I want this”
● Query Planning
○ “I know how to get this for you”
● Strong correctness guarantees
○ ACID principles (see next slide)
When your data are big and/or coming in fast, the guarantees made by relational DBs can be very difficult to maintain
● Atomicity → transactions cannot “partially succeed”
○ What if a node writes data to disk and then dies before it tells you it’s done?
● Consistency → transactions cannot produce an invalid state (all reads see the same data)
○ Are you willing to wait for every node in your cluster to respond to a write?
● Isolation → executing transactions concurrently results in the same state as executing them sequentially
○ Are you willing to forgo some forms of parallelization?
● Durability → once a transaction happens, the only way to reverse its effect is with another transaction
○ If you lose a block of data, are you ok with your application being down until it’s all restored?
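To make the atomicity guarantee concrete, here is a minimal single-node sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical and exist only for illustration. The second UPDATE violates a CHECK constraint, so the whole transaction is rolled back and the first UPDATE does not "partially succeed".

```python
import sqlite3

# Hypothetical two-account ledger, used only to illustrate atomicity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            # Credit first, so a partial failure would be visible without rollback.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit above was undone too.
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

On one machine this is cheap. The questions on this slide are about what it costs to keep the same promise when the two rows live on different nodes.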
NoSQL DBs exist to give your business the flexibility to make
tradeoffs between accuracy, speed, and reliability
Once you distribute your data, you have to pick one of these strategies:
Consistent & Available
“I’d rather my app be down than wrong”
Examples:
● mobile payments
● ticketing
Tech: Oracle, Postgres, MySQL
Consistent & Partition-Tolerant
“whatever data is up needs to be right”
Examples:
● sports apps
● Slack
Tech: MongoDB, Memcache
Available & Partition-Tolerant
“all data is available even if nodes fail”
Examples:
● social media
● news aggregators
Tech: Cassandra, CouchDB
Relational DBs are (rightfully) still king, but NoSQL alternatives
have been on the rise in recent years
Image credit: db-engines
NoSQL (“not only SQL”) DBs come in many shapes and sizes
Document Stores Key-Value Stores Column Stores
Section 3
NoSQL Case Study: Elasticsearch
To make this concrete, we’ll cover a document database called
Elasticsearch
Elasticsearch is a document-based, non-relational, schema-optional,
distributed, highly-available data store
● Document-based → Single “record” is a JSON object which follows some schema (called a “mapping”) but is extensible and whose content varies within an index
● Non-relational → Documents are stored in indices and keyed by unique IDs, but explicit definition of relationships between fields is not required
● Schema-optional → You can enforce schema-on-write restrictions on incoming data but don’t have to
● Distributed → Data in ES are distributed across multiple shards stored on multiple physical nodes (at least in production ES clusters)
● Available → Query load is distributed across the cluster without the need for a master node. No single point of failure
Let’s go through each of these points...
Document stores are databases that store unstructured or
semi-structured text
Each “record” in Elasticsearch is a JSON document.
Callouts on the example search response:
● Information on how the cluster responded. In this case, 4 shards participated in responding to the request.
● This tells you how many documents matched your query.
● The “hits.hits” portion of the response contains an array of documents. Each document in this array is equivalent to one “record” (think 1 row in a relational DB)
● The fields starting with “_” are default ES fields, not data we indexed into the cluster
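Since the annotated screenshot is not reproduced here, the sketch below stands in for it. The response dictionary is made-up data in the general shape of an Elasticsearch search response; the index name, document type, and field values are assumptions. It also assumes "hits.total" is a plain integer, as it was in Elasticsearch versions of this era (newer versions return an object there).

```python
# Made-up search response in the general shape Elasticsearch returns.
# All values are illustrative; nothing here comes from a real cluster.
response = {
    "took": 5,
    "timed_out": False,
    "_shards": {"total": 4, "successful": 4, "failed": 0},
    "hits": {
        "total": 2,
        "max_score": 1.0,
        "hits": [
            {
                "_index": "customer", "_type": "doc", "_id": "1", "_score": 1.0,
                "_source": {"name": "Ada", "firstContactDate": "2017-01-15"},
            },
            {
                "_index": "customer", "_type": "doc", "_id": "2", "_score": 1.0,
                "_source": {"name": "Grace", "tags": ["vip"]},
            },
        ],
    },
}


def summarize(response):
    """Pull out the three pieces the slide's callouts point at."""
    hits = response["hits"]["hits"]
    return {
        # How the cluster responded: number of shards that answered.
        "shards_responded": response["_shards"]["successful"],
        # How many documents matched the query.
        "matched": response["hits"]["total"],
        # The indexed data itself lives under "_source"; the other "_" keys are ES metadata.
        "records": [hit["_source"] for hit in hits],
    }


summary = summarize(response)
```

Note that the two records do not share the same fields, which is what "content varies within an index" means in practice.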
Schemas are optional but strongly encouraged in Elasticsearch
Elasticsearch is “schema-optional” because you can enforce type restrictions on certain fields, but the database will not reject documents that have additional fields not present in your mapping.
Example mapping for a field called firstContactDate:
● store: true = tells Elasticsearch to store the raw values of this field, not just references in an index
● fields: {} = additional alternative fields to create from raw values passed to this one. In this case, a field called firstContactDate.search will exist that users can query with the “dateOptionalTime” format
● This block tells ES to index a timestamp with every new document passed to this index. Can be user-generated or auto-generated by ES
● This applies to the customer index. For now, just think of: index in ES = table in RDBMS
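The mapping screenshot is not reproduced, so the following is a rough reconstruction from the callouts, written as a Python dictionary. The document type name "doc" is hypothetical, the nesting follows the older index → mappings → type layout, and the "_timestamp" setting is a legacy option that was removed in later Elasticsearch versions. The helper unmapped_fields is not an Elasticsearch API; it is only here to show which fields of a document the mapping says nothing about.

```python
# Rough reconstruction of the mapping described by the callouts (not the original).
customer_mapping = {
    "customer": {                               # index name, from the slide
        "mappings": {
            "doc": {                            # hypothetical document type name
                "_timestamp": {"enabled": True},  # legacy setting, assumed from the callout
                "properties": {
                    "firstContactDate": {
                        "type": "date",
                        "store": True,          # keep the raw value, not just index references
                        "fields": {
                            # creates firstContactDate.search
                            "search": {"type": "date", "format": "dateOptionalTime"}
                        },
                    }
                },
            }
        }
    }
}


def unmapped_fields(document, mapping, index="customer", doc_type="doc"):
    """Illustrative helper: list document fields that the mapping does not mention."""
    properties = mapping[index]["mappings"][doc_type]["properties"]
    return sorted(field for field in document if field not in properties)


extra = unmapped_fields(
    {"firstContactDate": "2017-01-15", "nickname": "Al"}, customer_mapping
)
```

Per the slide, a document carrying the extra "nickname" field would still be accepted; only the mapped field is type-checked.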
Non-relational = No Joins!
Elasticsearch has no support for query-time joins.
Data that need to be used together by applications must be stored together. This is called
“denormalization”.
Image credit:
Contactually
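A small sketch of what denormalization means in practice, using plain Python and made-up customer and order records (all names and values are hypothetical). Instead of joining at query time, the join is done once at write time and each customer document carries its own orders.

```python
# Hypothetical normalized "tables", as a relational database might hold them.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 1, "total": 40.0},
    {"order_id": 12, "customer_id": 2, "total": 15.0},
]


def denormalize(customers, orders):
    """Embed each customer's orders inside the customer document."""
    orders_by_customer = {}
    for order in orders:
        orders_by_customer.setdefault(order["customer_id"], []).append(
            {"order_id": order["order_id"], "total": order["total"]}
        )
    return [
        {**customer, "orders": orders_by_customer.get(customer["customer_id"], [])}
        for customer in customers
    ]


docs = denormalize(customers, orders)
```

Each element of docs can now answer "what did this customer order?" with a single lookup, at the cost of duplicating data and having to rewrite the document when an order changes.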
Elasticsearch presents as a single logical data store, but it stores data
distributed across multiple physical machines
This is not specific to ES. Lots of distributed databases do this. Commit this image to memory:
Image credit: LIIP
A cool trick called “consistent hashing” allows ES to tolerate node
failures, stay available, distribute load evenly, and scale up and down
smoothly (if done correctly)
Each document has a unique id that gets hashed to a shard, and each shard maps to a physical location in the cluster. Because you only need the id to identify where a document lives, and all nodes know the hashing scheme, any node can respond to any request without routing it through a “master” or “namenode”
Image credit: Parse.ly
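A toy sketch of id-based routing follows. It uses the simple "hash of the id, modulo the number of primary shards" form that Elasticsearch's routing documentation describes, rather than a full consistent-hashing ring. The MD5 hash from Python's standard library is a stand-in chosen for determinism, not the hash function Elasticsearch actually uses, and the document ids are made up.

```python
import hashlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a shard number using a stable hash (MD5 as a stand-in)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_primary_shards


# Any node can compute this locally: same id and shard count, same answer.
placement = {doc_id: shard_for(doc_id, 4) for doc_id in ("customer-1", "customer-2", "customer-3")}
```

One consequence of the modulo form is that changing the number of primary shards changes where almost every id lands, which is why that number is normally fixed when an index is created.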
Section 4
Data Science Workflows with NoSQL
Databases
NoSQL involves “denormalizing” your data. This makes these
databases very efficient for serving certain queries, but inefficient
for arbitrary questions
RDBMS Workflow: Execute Query (DB handles joins) → Train Model
NoSQL Workflow: Execute several queries → (join results) → (Make a rectangle) → Train Model
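The extra steps in the NoSQL workflow can be sketched in plain Python. The two result lists below are made-up stand-ins for the output of two separate queries, and make_rectangle is a hypothetical helper: it joins the results on a shared key in the client, then pads every row to the same columns so a model can consume it.

```python
# Hypothetical results of two separate queries; values are made up.
asset_hits = [
    {"asset_id": "A1", "model": "X100"},
    {"asset_id": "A2", "model": "X200"},
]
reading_hits = [
    {"asset_id": "A1", "temp": 71.5},
    {"asset_id": "A1", "temp": 73.0, "vibration": 0.2},
    {"asset_id": "A2", "temp": 68.1},
]


def make_rectangle(left, right, key):
    """Client-side inner join on `key`, then pad every row to the same columns."""
    left_by_key = {row[key]: row for row in left}
    joined = [
        {**left_by_key[row[key]], **row}
        for row in right
        if row[key] in left_by_key
    ]
    # Documents may have different fields, so collect the union of all column names.
    columns = sorted({column for row in joined for column in row})
    # Missing values become None, giving a rectangular table.
    table = [[row.get(column) for column in columns] for row in joined]
    return columns, table


columns, table = make_rectangle(asset_hits, reading_hits, "asset_id")
```

In a relational database the join and the fixed set of columns come for free from the query; here both are the data scientist's job.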
Section 5
Introducing: uptasticsearch
We wrote an R package called “uptasticsearch” to reduce friction
between data scientists and data in Elasticsearch. We wanted data
scientists to say “give me data” and get it
uptasticsearch ropensci/elastic:
uptasticsearch’s API is intentionally less expressive than the
Elasticsearch HTTP API. We wanted to narrow the focus to make it
easy to use for people who are not sys admins or engineers
We open-sourced uptasticsearch to give back to the R community
and to hopefully get bright developers like you to help us make it
better!
How you can get involved:
● Submit a PR addressing one of the open issues (https://github.com/UptakeOpenSource/uptasticsearch/issues)
● Download from CRAN and report any issues you encounter!
● Open issues on GitHub with feature requests and proposals
James Lamb (james.lamb@uptake.com) | Manny Bernabe (manny.bernabe@uptake.com)
Appendix: Notes on Eventual Consistency
Eventual Consistency
Some databases (like Cassandra) implement “tunable” consistency
Consistency strategies involve setting two parameters dictating how
your cluster responds to actions:
R = “min number of nodes that have to ack a successful read”
W = “min number of nodes that have to ack a successful write”
To determine appropriate values for these, you need to also know
how big your cluster is:
N = “total number of available nodes in your cluster”
Eventual Consistency
“Go fast”:
R + W < N
- This strategy will give you a fast response because fewer nodes are involved in the decision to acknowledge a new action
- However, it is possible to get some incorrect responses: writes could go to one group of nodes and reads could hit a totally separate set of nodes (none of which have the correct value)
- Example with R = 1, W = 1, N = 3:
[Diagram: nodes box1, box2, box3 with one read (R) and one write (W)]
Eventual Consistency
“Majority Rules”:
R + W > N
- This strategy is faster than total consistency but can still give
good guarantees about correctness
- With this strategy, you are guaranteed to have at least one
node that has the most recent write and acknowledges the
new read
- Example with R = 2, W = 2, N = 3:
[Diagram: nodes box1, box2, box3 with two reads (R) and two writes (W)]
Eventual Consistency
“Total Certainty”:
R + W = 2N
- This strategy is equivalent to consistency in an RDBMS
- Every node has to participate in every read / write
- Response latency will be controlled by the slowest node
[Diagram: nodes box1, box2, box3 with three reads (R) and three writes (W), one of each per node]
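The three strategies can be checked with a short brute-force sketch. The function names are hypothetical, and the classification simply follows the slide headings. The slides do not name the R + W = N case; this sketch groups it with "go fast", because overlap between readers and writers is still not guaranteed there.

```python
from itertools import combinations


def strategy(r, w, n):
    """Classify (R, W, N) using the slide headings."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must each be between 1 and N")
    if r + w == 2 * n:
        return "total certainty"   # every node takes part in every read and write
    if r + w > n:
        return "majority rules"    # read and write sets must share at least one node
    return "go fast"               # includes R + W == N (an assumption, see lead-in)


def reads_always_see_latest_write(r, w, n):
    """Brute force: does every possible read set overlap every possible write set?"""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

With N = 3, the pair R = 1, W = 1 can miss the latest write, while R = 2, W = 2 cannot: two sets of two nodes drawn from three must share a node.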
Eventual Consistency
Try this demo to get a hands-on look at different consistency strategies
Demo + awesome resource to learn more:
http://pbs.cs.berkeley.edu/#demo

The Proliferation of New Database Technologies and Implications for Data Science Workflows

  • 1. Proprietary & Confidential Proliferation of New Database Technologies and Implications for Data Science Workflows November 2017 Manny Bernabe | James Lamb
  • 3. 3Copyright © 2017 Uptake – CONFIDENTIAL13-Nov-17Collaboration Portfolio Uptake at a glance AVIATION CONSTRUCTION ENERGY MANUFACTURING 4MM+ Predictions/week 2014 founded in Chicago 75% across Data Science & Engineering 800+ Employees Uptake has developed partnerships in: MINING OIL & GAS RAIL RETAIL Ranked #5 on CNBC’s 2017 Disruptor 50 list – May 2017 Uptake’s Industry Thought Leaders featured in: Recognized as World Economic Forum 2017 Technology Pioneer – June 2017
  • 4. 4Copyright © 2017 Uptake – CONFIDENTIAL13-Nov-17Collaboration Portfolio Rail Uptime: Predictive events & conditions – actual screenshot Real time alerts are too late. In this case we are predicting 2 weeks into the future.
  • 5. 5Copyright © 2017 Uptake – CONFIDENTIAL13-Nov-17Collaboration Portfolio Our strength lies in data science. 1 2 3 4 5 Cutting edge tech Top tier talent Fast deployment Industry knowledge Applied experience Built from scratch for quality Over 60 data scientists Core platform built to scale out Our data scientists train in your field We work in many industries Failure Prediction Event/Alert Filtering Anomaly Detection Image Analytics Suggestion Our core machine learning engines can be deployed in any industry. Label Correction
  • 6. Section 2 Emergence of NoSQL Databases
  • 7. To be clear: Relational DBs are awesome and they’re here to stay
  • 8. Relational databases are popular because they’re intuitive to reason about, easy to query, and come with some nice guarantees
      ● Normalized data model
          ○ Entities, relationships that look like the real world
      ● Declarative code
          ○ “I want this”
      ● Query Planning
          ○ “I know how to get this for you”
      ● Strong correctness guarantees
          ○ ACID principles (see next slide)
  • 9. When your data are big and/or coming in fast, the guarantees made by relational DBs can be very difficult to maintain
      ● Atomicity → transactions cannot “partially succeed”
          ○ What if a node writes data to disk and then dies before it tells you it’s done?
      ● Consistency → transactions cannot produce an invalid state (all reads see the same data)
          ○ Are you willing to wait for every node in your cluster to respond to a write?
      ● Isolation → executing transactions concurrently results in the same state as executing them sequentially
          ○ Are you willing to forgo some forms of parallelization?
      ● Durability → once a transaction happens, the only way to reverse its effect is with another transaction
          ○ If you lose a block of data, are you ok with your application being down until it’s all restored?
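
To make the atomicity guarantee concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names, and the balances are hypothetical and are not from the talk. The transfer credits one account and then debits another; when the debit violates a constraint, the credit is rolled back too, so the transaction never "partially succeeds".

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("alice", 100), ("bob", 50)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK constraint, so the credit was undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` fails because alice cannot go negative, and `balances(conn)` afterwards shows that bob was not credited either. On a single machine this is cheap; the slide's point is that keeping the same promise across many nodes is much harder.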
  • 10. NoSQL DBs exist to give your business the flexibility to make tradeoffs between accuracy, speed, and reliability. Once you distribute your data, you have to pick one of these strategies:
      ● Consistent & Available: “I’d rather my app be down than wrong”
          ○ Examples: mobile payments, ticketing
          ○ Tech: Oracle, Postgres, MySQL
      ● Consistent & Partition-Tolerant: “whatever data is up needs to be right”
          ○ Examples: sports apps, Slack
          ○ Tech: MongoDB, Memcache
      ● Available & Partition-Tolerant: “all data is available even if nodes fail”
          ○ Examples: social media, news aggregators
          ○ Tech: Cassandra, CouchDB
  • 11. Relational DBs are (rightfully) still king, but NoSQL alternatives have been on the rise in recent years. Image credit: db-engines
  • 12. NoSQL (“not only SQL”) DBs come in many shapes and sizes: Document Stores, Key-Value Stores, Column Stores
  • 13. Section 3 NoSQL Case Study: Elasticsearch
  • 14. To make this concrete, we’ll cover a document database called Elasticsearch
  • 15. Elasticsearch is a document-based, non-relational, schema-optional, distributed, highly-available data store
      ● Document-based → A single “record” is a JSON object which follows some schema (called a “mapping”) but is extensible and whose content varies within an index
      ● Non-relational → Documents are stored in indices and keyed by unique IDs, but explicit definition of relationships between fields is not required
      ● Schema-optional → You can enforce schema-on-write restrictions on incoming data but don’t have to
      ● Distributed → Data in ES are distributed across multiple shards stored on multiple physical nodes (at least in production ES clusters)
      ● Available → Query load is distributed across the cluster without the need for a master node. No single point of failure
      Let’s go through each of these points...
  • 16. Document stores are databases that store unstructured or semi-structured text. Each “record” in Elasticsearch is a JSON document. Annotations on the example query response:
      ● Information on how the cluster responded. In this case, 4 shards participated in responding to the request.
      ● This tells you how many documents matched your query.
      ● The “hits.hits” portion of the response contains an array of documents. Each document in this array is equivalent to one “record” (think 1 row in a relational DB).
      ● The fields starting with “_” are default ES fields, not data we indexed into the cluster.
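
The response anatomy described on this slide can be sketched in a few lines of Python. The `sample_response` below is a hand-written, hypothetical stand-in for a search response (the index name, IDs, and field values are invented), and `summarize_response` is an illustrative helper, not part of any Elasticsearch client.

```python
# Hypothetical search response, trimmed to the parts the slide annotates.
sample_response = {
    "took": 3,
    "timed_out": False,
    "_shards": {"total": 4, "successful": 4, "failed": 0},
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "customer", "_id": "1", "_score": 1.0,
             "_source": {"name": "Acme Rail", "state": "IL"}},
            {"_index": "customer", "_id": "2", "_score": 1.0,
             "_source": {"name": "Blue Mining", "state": "NV"}},
        ],
    },
}


def summarize_response(response):
    """Pull out shard info, the match count, and the indexed documents."""
    total = response["hits"]["total"]
    # Newer Elasticsearch versions report the total as an object
    # with a "value" key rather than a bare integer.
    if isinstance(total, dict):
        total = total["value"]
    # "_source" holds the data that was indexed; the other "_"-prefixed
    # keys are metadata added by Elasticsearch.
    records = [hit["_source"] for hit in response["hits"]["hits"]]
    return {
        "shards_responded": response["_shards"]["successful"],
        "matched": total,
        "records": records,
    }


summary = summarize_response(sample_response)
```

Each entry in `summary["records"]` corresponds to one "row" in relational terms.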
  • 17. Schemas are optional but strongly encouraged in Elasticsearch. Elasticsearch is “schema-optional” because you can enforce type restrictions on certain fields, but the database will not reject documents that have additional fields not present in your mapping. Annotations on the example mapping for a field called firstContactDate:
      ● This applies to the customer index. For now, just think of: index in ES = table in RDBMS
      ● store: true = tells Elasticsearch to store the raw values of this field, not just references in an index
      ● fields: {} = additional alternative fields to create from raw values passed to this one. In this case, a field called firstContactDate.search will exist that users can query with the “dateOptionalTime” format
      ● This block tells ES to index a timestamp with every new document passed to this index. It can be user-generated or auto-generated by ES
  • 18. Non-relational = No Joins! Elasticsearch has no support for query-time joins. Data that need to be used together by applications must be stored together. This is called “denormalization”. Image credit: Contactually
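
As a rough illustration of denormalization, the sketch below takes two hypothetical "tables" (customers and their contacts, with invented fields) and embeds each customer's contacts inside the customer record, so that a single document carries everything an application needs without a query-time join.

```python
from collections import defaultdict

# Hypothetical normalized data: two "tables" linked by customer_id.
customers = [
    {"customer_id": 1, "name": "Acme Rail"},
    {"customer_id": 2, "name": "Blue Mining"},
]
contacts = [
    {"customer_id": 1, "email": "ops@acme.example"},
    {"customer_id": 1, "email": "cfo@acme.example"},
]


def denormalize(customers, contacts):
    """Embed each customer's contacts in the customer document."""
    by_customer = defaultdict(list)
    for contact in contacts:
        # Drop the foreign key; it is implied once the contact is embedded.
        embedded = {k: v for k, v in contact.items() if k != "customer_id"}
        by_customer[contact["customer_id"]].append(embedded)
    # The join happens once, at write time, instead of on every query.
    return [
        dict(customer, contacts=by_customer.get(customer["customer_id"], []))
        for customer in customers
    ]


documents = denormalize(customers, contacts)
```

The tradeoff is duplication: if a contact belonged to several customers, its data would be copied into each document and every copy would need updating.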
  • 19. Elasticsearch presents as a single logical data store, but it stores data distributed across multiple physical machines. This is not specific to ES. Lots of distributed databases do this. Commit this image to memory: Image credit: LIIP
  • 20. A cool trick called “consistent hashing” allows ES to tolerate node failures, stay available, distribute load evenly, and scale up and down smoothly (if done correctly). Each document has a unique id that gets hashed to a physical location in the cluster. Because you only need the id to identify where a document lives, and all nodes know the hashing scheme, there is no need for a “master” or “namenode” and any node can respond to any request. Image credit: Parse.ly
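
The idea of hashing a document id to a location can be sketched with a small hash ring. This is a generic illustration of consistent hashing, not Elasticsearch's own routing code; the `HashRing` class, the node names, and the choice of 100 virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (position, node) tuples
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key):
        # Any stable hash works; md5 is used here only for determinism.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Several points per node spread its share of the ring more evenly.
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, doc_id):
        """Return the node owning the first ring point at or after the id."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._ring, (self._position(doc_id), ""))
        # Wrap around to the start of the ring when past the last point.
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
doc_ids = [f"doc-{i}" for i in range(1000)]
before = {doc_id: ring.node_for(doc_id) for doc_id in doc_ids}

# Simulate a node failure: only documents that lived on node-c move.
ring.remove_node("node-c")
after = {doc_id: ring.node_for(doc_id) for doc_id in doc_ids}
```

Because every participant can compute `node_for` from the id alone, no central lookup table is needed, and removing a node relocates only the documents that node held.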
  • 21. Section 4 Data Science Workflows with NoSQL Databases
  • 22. NoSQL involves “denormalizing” your data. This makes these databases very efficient for serving certain queries, but inefficient for arbitrary questions.
      ● RDBMS Workflow: Execute query (DB handles joins) → Train model
      ● NoSQL Workflow: Execute several queries (join results) → Make a rectangle → Train model
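
The "make a rectangle" step can be sketched as flattening nested JSON documents into rows with a shared set of columns. The documents and field names below are hypothetical, and `flatten` and `to_rectangle` are illustrative helpers rather than functions from any particular library.

```python
def flatten(doc, prefix=""):
    """Turn nested dicts into one flat dict with dotted column names."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row


def to_rectangle(docs):
    """Return (columns, rows) where every row has a value for every column."""
    flat = [flatten(doc) for doc in docs]
    columns = sorted({column for row in flat for column in row})
    # Documents may lack fields that others have; fill the gaps with None.
    rows = [[row.get(column) for column in columns] for row in flat]
    return columns, rows


# Hypothetical documents with differing fields, as a document store allows.
docs = [
    {"id": 1, "asset": {"type": "locomotive", "site": "Chicago"}},
    {"id": 2, "asset": {"type": "truck"}, "alert": True},
]
columns, rows = to_rectangle(docs)
```

The resulting columns and rows can be handed to whatever tabular structure the modeling code expects.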
  • 24. We wrote an R package called “uptasticsearch” to reduce friction between data scientists and data in Elasticsearch. We wanted data scientists to say “give me data” and get it.
  • 25. uptasticsearch vs. ropensci/elastic: uptasticsearch’s API is intentionally less expressive than the Elasticsearch HTTP API. We wanted to narrow the focus to make it easy to use for people who are not sys admins or engineers.
  • 26. We open-sourced uptasticsearch to give back to the R community and to hopefully get bright developers like you to help us make it better! How you can get involved:
      ● Submit a PR addressing one of the open issues (https://github.com/UptakeOpenSource/uptasticsearch/issues)
      ● Download from CRAN and report any issues you encounter!
      ● Open issues on GitHub with feature requests and proposals
  • 27. James Lamb (james.lamb@uptake.com), Manny Bernabe (manny.bernabe@uptake.com)
  • 28. Appendix: Notes on Eventual Consistency
  • 29. Eventual Consistency. Some databases (like Cassandra) implement “tunable” consistency. Consistency strategies involve setting two parameters dictating how your cluster responds to actions:
      ● R = “min number of nodes that have to ack a successful read”
      ● W = “min number of nodes that have to ack a successful write”
      To determine appropriate values for these, you also need to know how big your cluster is:
      ● N = “total number of available nodes in your cluster”
  • 30. Eventual Consistency, “Go fast”: R + W < N
      - This strategy will give you a fast response because fewer nodes are involved in the decision to acknowledge a new action
      - However, it is possible to get some incorrect responses: writes could go to one group of nodes and reads could hit a totally separate set of nodes (none of which have the correct value)
      - Example with R = 1, W = 1, N = 3 (diagram: box1, box2, box3)
  • 31. Eventual Consistency, “Majority Rules”: R + W > N
      - This strategy is faster than total consistency but can still give good guarantees about correctness
      - With this strategy, you are guaranteed to have at least one node that has the most recent write and acknowledges the new read
      - Example with R = 2, W = 2, N = 3 (diagram: box1, box2, box3)
  • 32. Eventual Consistency, “Total Certainty”: R + W = 2N
      - This strategy is equivalent to consistency in an RDBMS
      - Every node has to participate in every read / write
      - Response latency will be controlled by the slowest node
      - (diagram: box1, box2, box3, each acknowledging both reads and writes)
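
The three strategies above can be checked with a short sketch. The function names are hypothetical, and treating the boundary case R + W = N as "go fast" is an assumption (the slides only cover R + W < N); it is grouped there because reads and writes can still land on disjoint sets of nodes. `read_always_sees_write` brute-forces every possible pair of write set and read set to confirm whether they are guaranteed to overlap.

```python
from itertools import combinations


def consistency_strategy(r, w, n):
    """Classify (R, W, N) using the names from the slides."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must each be between 1 and N")
    if r + w == 2 * n:
        return "total certainty"  # every node acks every read and write
    if r + w > n:
        return "majority rules"  # read and write sets must share a node
    # Assumption: R + W == N is treated like R + W < N, since the
    # read set and the write set can still be completely disjoint.
    return "go fast"


def read_always_sees_write(r, w, n):
    """Check by enumeration that any R readers overlap any W writers."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

For the slide examples, `consistency_strategy(1, 1, 3)` is "go fast" and a read can miss the latest write, while `consistency_strategy(2, 2, 3)` is "majority rules" and every read set shares at least one node with every write set.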
  • 33. Eventual Consistency. Try this demo to get a hands-on look at different consistency strategies. Demo + awesome resource to learn more: http://pbs.cs.berkeley.edu/#demo