Aditya Parameswaran
Assistant Professor
University of Illinois
http://data-people.cs.illinois.edu
Three Tools for
“Human-in-the-loop”
Data Science
Many many contributors!
• PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol
Deshpande (spanning Illinois, UMD, MIT, Chicago)
• PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim,
Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke
• MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui
• Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean
Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
2
Scale is a Solved Problem
Most work in the database community is myopically focused on
scale: the ability to pose SQL queries on larger and larger datasets.
My claim:
Scale is a solved problem.
Findings:
– Median job size at Microsoft and Yahoo is 16GB;
– >90% of the jobs within Facebook are <100GB
The bottleneck is no longer our ability to pose SQL queries on
large datasets!
Of course, exceptions exist: the “1%” of data analysis needs
3
What about the Needs of the 99%?
The bottleneck is actually the “humans-in-the-loop”
As our data size has grown, what has stayed constant is
• the time for analysis,
• the human cognitive load,
• the skills to extract value from data
There is a severe need for tools that can help analysts extract
value from even moderately sized datasets
From “Big Data and Its Technical Challenges”, CACM 2014:
For big data to fully reach its potential, we need to consider scale not just for the system but
also from the perspective of humans. We have to make sure that the end points—humans—
can properly “absorb” the results of the analysis and not get lost in a sea of data.
4
Need of the hour: Human-In-the-Loop
Data Analytics Tools
HILDA tools:
• treat both humans and data
as first-class citizens
• reduce human labor
• minimize complexity
[Venn diagram of three overlapping areas: Interaction, Data Mining, Databases. “Magic happens here” marks the intersection.]
• Taking the human perspective into account
• Go beyond SQL
• Scalability/interactivity is still important
5
A Maslow’s Hierarchy for HILDA
Background: Maslow developed a theory for what motivates
individuals in 1943; highly influential
Complex Needs
Basic Needs
6
A Maslow’s Hierarchy for HILDA
Share &
Collaborate
Play &
View
Touch &
Feel
Increasing sophistication of analysis
7
Touch and Feel:
DataSpread is a spreadsheet-database hybrid:
Goal: Marrying the flexibility and ease of use of
spreadsheets with the scalability and power of databases
Enables the “99%” with large datasets but limited prog.
skills to open, touch, and examine their datasets
http://dataspread.github.io
[VLDB’15,VLDB’15,ICDE’16]
8
Play and View:
Zenvisage is an effortless visual exploration tool.
Goal: “fast-forward” to visual patterns and trends, without
having the analyst step through each one individually
Enables individuals to play with, and extract insights
from, large datasets in a fraction of the time.
http://zenvisage.github.io
[TR’16,VLDB’16,VLDB’15,DSIA’15,VLDB’14,VLDB’14]
9
Collaborate and Share:
OrpheusDB is a tool for managing dataset versions with a database
Goal: building a versioned database system to reduce the burden of
recording datasets in various stages of analysis
Enables individuals to collaborate on data analysis, and share, keep
track of, and retrieve dataset versions.
http://orpheus-db.github.io
[VLDB’16,VLDB’15,VLDB’15,TAPP’15,CIDR’15]
(also part of DataHub: a collab. analysis system w/ MIT & UMD)
10
This talk
About 10 minutes per system:
overview + architecture + one key technical challenge
Common theme: if you torture databases enough, you can get them to do
what you want!
Share &
Collaborate
Play &
View
Touch &
Feel
Increasing sophistication of analysis
11
12
Motivation
Most of the people doing ad-hoc data
manipulation and analysis use spreadsheets,
e.g., Excel
Why?
• Easy to use: direct manipulation
• Built-in visualization capabilities
• Flexible: no need for a schema
13
But Spreadsheets are Terrible!
– Slow
• single change -> wait minutes on a 10,000 x 10 spreadsheet
• can’t even open a spreadsheet with >1M cells
• speed by itself can prevent analysis
– Tedious + not Powerful
• filters via copy-paste
• only FK joins via VLOOKUPs; others impossible
• even simple operations are cumbersome
– Brittle
• sharing Excel sheets around, no collab/recovery
• using spreadsheets for collaboration is painful and error-prone
14
Let’s turn to Databases
Databases are:
• Scalable (not slow)
• Powerful and expressive via SQL (not tedious)
• Built for collaboration and recovery, and succinct (not brittle)
So why not use databases?
Well, they lack exactly what makes spreadsheets so useful:
• Not easy to use
• No built-in visualization
• Not flexible
15
Combining the benefits of
spreadsheets and databases
Spreadsheet as a frontend interface
Databases as a backend engine
Result: retain the benefits of both!
But it’s not that simple…
16
Different Ideologies
Databases and spreadsheets have different
ideologies that need to be reconciled…
Due to this, the integration is not trivial…
Feature        | Databases                     | Spreadsheets
Data Model     | Schema-first                  | Dynamic/No Schema
Addressing     | Tuples with PK                | Cells, using Row/Col
Presentation   | Set-oriented, no such notion  | Notion of current window, order
Modifications  | Must correspond to queries    | Can be done at any granularity
Computation    | Query at a time               | Value at a time
17
First Problem: Representation
Q: how do we represent spreadsheet data?
Dense spreadsheets: represent as tables
(Row #, Col1 val, Col2 val, …)
Sparse spreadsheets: represent as triples
(Row #, Column #, Value)
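A minimal sketch of the two layouts, assuming a sheet is given as a hypothetical mapping from (row, column) to value. The function names and sample cells are illustrative only and are not taken from DataSpread.

```python
# Hypothetical sheet contents: (row, col) -> value, filled cells only.
cells = {(1, 1): "id", (1, 2): "name", (2, 1): 1, (2, 2): "Ann", (50, 7): "note"}

def as_table(cells, rows, cols):
    """Dense layout: one tuple per row, (row #, col1 val, col2 val, ...)."""
    return [(r, *[cells.get((r, c)) for c in cols]) for r in rows]

def as_triples(cells):
    """Sparse layout: one (row #, column #, value) triple per filled cell."""
    return sorted((r, c, v) for (r, c), v in cells.items())

# The dense 2 x 2 corner fits a table; the stray cell far away fits a triple.
dense_part = as_table(cells, rows=range(1, 3), cols=range(1, 3))
sparse_part = as_triples({k: v for k, v in cells.items() if k[0] > 2})
```

The table layout pays for every slot in the rectangle, filled or not, while the triple layout pays a per-cell overhead for the row and column numbers.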
18
First Problem: Representation
Q: how do we represent spreadsheet data?
Can we do even better than the two
extremes? Yes!
Carve out
dense areas -> store as tables,
sparse areas -> store as triples
19
First Problem: Representation
However, even if we only use “tables”, carving out
the ideal # partitions (min. storage, modif., access)
is NP-Hard
Reduction from min. edge-length partition of
rectilinear polygons
Thankfully, we have a way out…
20
Solution: Constrain the Problem
A new class of partitionings: recursive decomp.
A very natural class of partitionings!
21
Solution: Constrain the Problem
The optimal recursive decomp. partitioning can be found in PTIME using DP
• Problem: still quadratic in # rows, columns
• Fix: merge rows/columns with identical signatures
• Cost: ~ the time for a single scan
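To make the recursive decomposition idea concrete, here is a naive memoized sketch. It assumes a made-up cost model (a fixed overhead per table plus one unit per cell slot in the rectangle) and does not include the row/column merging step; the cost function used in DataSpread is not given on the slides.

```python
from functools import lru_cache

TABLE_OVERHEAD = 4  # hypothetical fixed cost of creating one table

def best_cost(filled, n_rows, n_cols):
    """Cheapest recursive decomposition of an n_rows x n_cols sheet.

    filled: iterable of (row, col) positions holding a value (0-based).
    A rectangle is dropped if empty, stored as one table, or cut in two
    by a single horizontal or vertical line and solved recursively.
    """
    filled = frozenset(filled)

    @lru_cache(maxsize=None)
    def solve(r0, r1, c0, c1):  # half-open rectangle [r0, r1) x [c0, c1)
        if not any(r0 <= r < r1 and c0 <= c < c1 for r, c in filled):
            return 0  # nothing to store
        best = TABLE_OVERHEAD + (r1 - r0) * (c1 - c0)  # one table for all of it
        for r in range(r0 + 1, r1):  # horizontal cuts
            best = min(best, solve(r0, r, c0, c1) + solve(r, r1, c0, c1))
        for c in range(c0 + 1, c1):  # vertical cuts
            best = min(best, solve(r0, r1, c0, c) + solve(r0, r1, c, c1))
        return best

    return solve(0, n_rows, 0, n_cols)

# Two dense 2 x 2 blocks in opposite corners of a 6 x 6 sheet.
blocks = [(r, c) for r in (0, 1) for c in (0, 1)] + [(r, c) for r in (4, 5) for c in (4, 5)]
cost = best_cost(blocks, 6, 6)
```

Under this cost model the sketch stores each block as its own small table instead of one 6 x 6 table.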
22
Initial Progress and Architecture
• Postgres backend
• ZK spreadsheet: open-source web frontend
• Comfortably scales to arbitrarily many rows + handles SQL queries
• Hopefully brings spreadsheets to the big data age!
[Architecture diagram. Users and clients: Sally, Bob, Sue, other applications. Inputs: Spreadsheet SQL, Spreadsheet Formulae, New Interface Algebra, …, Vanilla SQL. Components and labels: Interface Query Processor, Interface Storage Manager, Interface Transaction Manager, Interface-Embedded Queries, Interface-Aware Indexes, Underlying Data.]
23
Standard Visual Data Analysis Recipe:
1. Load dataset into viz tool
2. Select viz to be generated
3. See if it matches desired visual
pattern or insight
4. Repeat until you find a match
25
Tedious and Time-consuming!
26
Key Issue:
Visualizations can be generated by
• varying subsets of data, and
• varying attributes being visualized
Too many visualizations to look at to find
desired visual patterns!
27
Motivation
This is a real problem!
• Advertisers at Turn
– find keywords with similar CTRs to a specific one
• Bioinformaticians at an NIH genomics center
– find aspects on which two sets of genes differ
• Battery scientists at CMU
– find solvents with desired properties
Common theme: finding the “right” visualization can take
several hours of combing through visualizations manually.
28
Key Insight
We can automate that!
• instead of combing through visualizations manually
• tell us what you want, and we can “fast-forward” to desired insights
Desiderata for automation:
• Expressive – the ability to specify what you want
• Interactive – interact with the results, catering to non-programmers
• Scalable – get interesting results quickly
Enter Zenvisage:
(zen + envisage: to effortlessly visualize)
29
Effortless Visual Exploration
of Large Datasets with Zenvisage
Ingredients
• Drag-and-drop and sketch based interactions
• to find specific patterns
• Sophisticated visual exploration language, ZQL
• to ask more elaborate questions
• Scalable visualization generation engine
• preprocess, batch and parallel eval. for interactive results
• Rapid pattern matching algorithms
• sampling-based techniques
30
Screenshots
[Screenshot of the Zenvisage interface. Labeled areas: Attribute Selection, Sketching Canvas, Matches, Typical Trends and Outliers, ZQL: Advanced Exploration Interface]
31
Screenshots
32
Challenge: One Specific Instance
Find visualizations on which two groups of data differ most.
Examples:
• find visualizations where solvent x differs from solvent y
• find visualizations where product x differs from product y
We represent a visualization using [d, m, f]
• dimension = x axis
• measure = y axis
• function = aggregate applied to y
Each [d,m,f] on a specific subset of data can be computed using a
single SQL query.
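As a rough illustration of the [d, m, f] idea, the sketch below builds the single SQL query for one visualization and runs it on an in-memory SQLite table. The table name, column names, and the `viz_query` helper are hypothetical; Zenvisage's actual query generation is not shown on the slides.

```python
import sqlite3

def viz_query(table, d, m, f, where="1=1"):
    """SQL for one visualization: aggregate f over measure m, grouped by dimension d.

    Identifiers are interpolated directly, so they must come from trusted code.
    """
    return f"SELECT {d}, {f}({m}) FROM {table} WHERE {where} GROUP BY {d} ORDER BY {d}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("x", 2014, 10.0), ("x", 2014, 20.0), ("x", 2015, 5.0), ("y", 2014, 7.0)],
)

# [d, m, f] = [year, revenue, AVG] on the subset of data for product x.
sql = viz_query("sales", d="year", m="revenue", f="AVG", where="product = 'x'")
points = conn.execute(sql).fetchall()  # (x value, y value) pairs to plot
```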
33
Challenge: One Specific Instance
Find visualizations on which two groups of data differ most.
Naïve approach:
For each [d, m, f]:
Compute visualization for both products (two SQL queries),
then compare
Pick k best (“highest utility”) [d, m, f]
Utility Metric: We ignore how to compare for now, but there are
many standard distance metrics
Scale: 10s of dimensions, 10s of measures, handful of
aggregates -> 100s of queries for a single user task!
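The naive approach above can be sketched as follows. The schema, the sample rows, and the choice of L1 distance as the utility metric are all assumptions made for illustration; the slide deliberately leaves the metric open.

```python
import sqlite3
from itertools import product

def series(conn, d, m, f, group):
    """One visualization: f(m) grouped by d, restricted to one product."""
    sql = f"SELECT {d}, {f}({m}) FROM sales WHERE product = ? GROUP BY {d}"
    return dict(conn.execute(sql, (group,)).fetchall())

def utility(a, b):
    """Assumed metric: L1 distance between two series (a missing x counts as 0)."""
    return sum(abs(a.get(x, 0) - b.get(x, 0)) for x in set(a) | set(b))

def naive_top_k(conn, dims, measures, funcs, k, g1="x", g2="y"):
    scored = []
    for d, m, f in product(dims, measures, funcs):
        # Two SQL queries per candidate view, one per group being compared.
        u = utility(series(conn, d, m, f, g1), series(conn, d, m, f, g2))
        scored.append((u, (d, m, f)))
    return sorted(scored, reverse=True)[:k]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, region TEXT, year INTEGER, revenue REAL, units INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    ("x", "east", 2014, 10.0, 1), ("x", "west", 2014, 30.0, 3),
    ("y", "east", 2014, 10.0, 5), ("y", "west", 2015, 12.0, 5),
])
top = naive_top_k(conn, ["region", "year"], ["revenue", "units"], ["SUM"], k=2)
```

Even this toy run issues 8 queries for 4 candidate views, which is the cost the sharing and pruning optimizations aim to cut.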
34
Issues w/ Naïve Approach
• Repeated processing of same
data in sequence across queries
• Computation wasted on low-
utility visualizations
Sharing
Pruning
35
Sharing Optimizations
1. Minimize # of queries: Group queries together
• Combine multiple aggregates:
(d1, m1, f1), (d1, m2, f1) -> (d1, [m1, m2], f1)
• Combine multiple group-bys:
(d1, m1, f1), (d2, m1, f1) -> ([d1, d2], m1, f1)
2. Minimize sequential execution: Parallel query evaluation
A bit tricky!
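A small sketch of the first rewrite only (combining multiple aggregates that share a dimension), assuming views are plain (d, m, f) tuples. The helper name and table name are hypothetical, and combining group-bys or running queries in parallel is not covered here.

```python
from collections import defaultdict

def combine_aggregates(views, table):
    """Merge views that share a dimension into one GROUP BY query per dimension."""
    by_dim = defaultdict(list)
    for d, m, f in views:
        by_dim[d].append(f"{f}({m})")
    return [
        f"SELECT {d}, {', '.join(aggs)} FROM {table} GROUP BY {d}"
        for d, aggs in by_dim.items()
    ]

views = [("d1", "m1", "SUM"), ("d1", "m2", "SUM"), ("d2", "m1", "SUM")]
queries = combine_aggregates(views, "t")  # 3 views, 2 queries
```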
36
Pruning Optimizations
• Keep running estimates of utility
• Prune visualizations based on estimates:
Two flavors
– Vanilla Confidence Interval based Pruning
– Multi-armed Bandit Pruning
Discard low-utility views early to avoid wasted computation
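One way to picture confidence-interval pruning is sketched below. It assumes utilities are normalized to [0, 1], that every view has been estimated from the same number of samples, and it uses a Hoeffding bound for the interval width; the slides do not say which interval the system uses, so treat all three as assumptions.

```python
import math

def prune(estimates, n_seen, k, delta=0.05):
    """Return the views that could still be among the top k.

    estimates: view -> running mean utility, assumed to lie in [0, 1]
    n_seen: number of samples behind each estimate
    delta: allowed failure probability per interval
    """
    eps = math.sqrt(math.log(2 / delta) / (2 * n_seen))  # Hoeffding half-width
    lower_bounds = sorted((u - eps for u in estimates.values()), reverse=True)
    threshold = lower_bounds[k - 1]  # k-th best lower bound
    # A view survives only if its upper bound reaches the threshold.
    return {view for view, u in estimates.items() if u + eps >= threshold}

estimates = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.75}
early = prune(estimates, n_seen=10, k=2)   # intervals are wide, nothing is discarded
later = prune(estimates, n_seen=100, k=2)  # intervals are tighter, "c" is discarded
```

As more data is scanned the intervals shrink, so low-utility views drop out before their visualizations are fully computed.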
37
[Architecture diagram: the viz interface sends visualizations as queries (100s) to a middleware layer, whose optimizer applies sharing and pruning before issuing them to the DBMS.]
Experimental Findings
Up to 300X speedup: <1s for SM, 4s for L
39
Effortless Visual Exploration of Large Datasets with Zenvisage
Ingredients
• Drag-and-drop and sketch based interactions
• Sophisticated visual exploration language, ZQL
• Scalable visualization generation engine
• Rapid pattern matching algorithms
41
Motivation
Collaborative data science is
ubiquitous
• Many users, many versions of the
same dataset stored at many
stages of analysis
• Status quo:
– Stored in a file system, relationships
unknown
Challenge: can we build a versioned
data store?
– Support efficient access, retrieval,
querying, and modification of
versions
42
Motivation: Starting Points
• VCS: Git/svn is inefficient and unsuitable
– Ordered semantics
– No data manipulation API
– No efficient multi-version queries
– Poor support for massive files
• DBMS: Relational databases don’t support
versioning, but are efficient and scalable
43
OrpheusDB: Current Focus
PostgreSQL + Versioning Commands
44
Challenge: Storing Versions Compactly/Retrieving Versions Quickly
1000s of versions, spanning millions of records.
• Store all versions independently: huge storage, but version access time is very small
• Store one version, all others via chains of “deltas”: very small storage, but version access time is high
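The trade-off between the two extremes can be sketched with versions modeled as sets of record ids. The data layout and the `materialize` helper are hypothetical and only illustrate why long delta chains make access slow; they are not OrpheusDB's storage format.

```python
def materialize(version, full, deltas):
    """Rebuild a version by walking its delta chain back to a fully stored version.

    full: version -> set of record ids stored in full
    deltas: version -> (parent version, added ids, removed ids)
    Returns the record ids and the number of deltas that had to be applied.
    """
    chain = []
    while version not in full:
        parent, added, removed = deltas[version]
        chain.append((added, removed))
        version = parent
    records = set(full[version])
    for added, removed in reversed(chain):  # apply oldest delta first
        records = (records - removed) | added
    return records, len(chain)

full = {"v1": {1, 2, 3}}                       # only v1 is stored in full
deltas = {"v2": ("v1", {4}, {1}),              # v2 = v1 + {4} - {1}
          "v3": ("v2", {5}, set())}            # v3 = v2 + {5}
v3_records, v3_hops = materialize("v3", full, deltas)
```

Storing v2 and v3 in full would make every lookup a zero-hop read at the price of repeating the shared records in each copy.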
45
And Answer Queries…
• Retrieve the first version that contains this tuple
• Find versions where the average(salary) is
greater than 1000
• Find all pairs of versions where over 100 new
tuples were added
• Show the history of the tuple with record id 34.
For more examples, see [TAPP’15]
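A few of these queries can be written in plain SQL if versions are exposed as a relation. The sketch below assumes a hypothetical two-table layout (a record table plus a version-membership table) in SQLite; it is not OrpheusDB's schema or command syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (rid INTEGER PRIMARY KEY, name TEXT, salary REAL);
    CREATE TABLE membership (vid INTEGER, rid INTEGER);  -- version vid contains record rid
    INSERT INTO employee VALUES (34, 'ann', 900), (35, 'bob', 800), (36, 'cy', 2000);
    INSERT INTO membership VALUES (1, 35), (2, 34), (2, 35), (3, 34), (3, 36);
""")

# Retrieve the first version that contains record 34.
first_version = conn.execute(
    "SELECT MIN(vid) FROM membership WHERE rid = ?", (34,)
).fetchone()[0]

# Find versions where the average salary is greater than 1000.
high_paid = [v for (v,) in conn.execute(
    "SELECT vid FROM membership JOIN employee USING (rid) "
    "GROUP BY vid HAVING AVG(salary) > 1000 ORDER BY vid"
)]

# Show the history of the record with id 34: every version it appears in.
history = [v for (v,) in conn.execute(
    "SELECT vid FROM membership WHERE rid = ? ORDER BY vid", (34,)
)]
```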
46
Framework
[Architecture diagram with three layers:]
• User Interface Layer: git commands, or SQL (versions as relations)
• “Versioning” Layer (translation/bookkeeping): Parser & Translator, Layout Optimizer
• DBMS: unmodified Postgres backend (not aware of versions)
47
Summary:
Make Data Analytics Great Again!
• Share & Collaborate: orpheus-db.github.io
• Play & View: zenvisage.github.io
• Touch & Feel: dataspread.github.io
Increasing sophistication of analysis
My website: http://data-people.cs.illinois.edu
Twitter: @adityagp
48

 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Three Tools for "Human-in-the-loop" Data Science

  • 1. Aditya Parameswaran, Assistant Professor, University of Illinois (http://data-people.cs.illinois.edu): Three Tools for “Human-in-the-loop” Data Science
  • 2. Many many contributors! • PIs: Kevin Chang, Karrie Karahalios, Aaron Elmore, Sam Madden, Amol Deshpande (spanning Illinois, UMD, MIT, Chicago) • PhD Students: Mangesh Bendre, Himel Dev, John Lee, Albert Kim, Manasi Vartak, Liqi Xu, Silu Huang, Sajjadur Rahman, Stephen Macke • MS Students: Vipul Venkataraman, Tarique Siddiqui, Chao Wang, Sili Hui • Undergrads: Paul Zhou, Ding Zhang, Kejia Jiang, Bofan Sun, Ed Xue, Sean Zou, Jialin Liu, Changfeng Liu, Xiaofo Yu
  • 3. Scale is a Solved Problem. Most work in the database community is myopically focused on scale: the ability to pose SQL queries on larger and larger datasets. My claim: scale is a solved problem. Findings: – Median job size at Microsoft and Yahoo is 16GB; – >90% of the jobs within Facebook are <100GB. The bottleneck is no longer our ability to pose SQL queries on large datasets! Of course, exceptions exist: the “1%” of data analysis needs.
  • 4. What about the Needs of the 99%? The bottleneck is actually the “humans-in-the-loop”. As our data size has grown, what has stayed constant is • the time for analysis, • the human cognitive load, • the skills to extract value from data. There is a severe need for tools that can help analysts extract value from even moderately sized datasets. From “Big Data and Its Technical Challenges”, CACM 2014: For big data to fully reach its potential, we need to consider scale not just for the system but also from the perspective of humans. We have to make sure that the end points, humans, can properly “absorb” the results of the analysis and not get lost in a sea of data.
  • 5. Need of the hour: Human-In-the-Loop Data Analytics Tools. HILDA tools: • treat both humans and data as first-class citizens • reduce human labor • minimize complexity. [Diagram labels: Interaction; Data Mining; Databases. Annotations: taking the human perspective into account; go beyond SQL; scalability/interactivity is still important; magic happens here.]
  • 6. A Maslow’s Hierarchy for HILDA. Background: Maslow developed a highly influential theory of what motivates individuals in 1943. [Pyramid figure: basic needs at the bottom, complex needs at the top.]
  • 7. A Maslow’s Hierarchy for HILDA. Levels, in order of increasing sophistication of analysis: Touch & Feel; Play & View; Share & Collaborate.
  • 8. Touch and Feel: DataSpread is a spreadsheet-database hybrid. Goal: marrying the flexibility and ease of use of spreadsheets with the scalability and power of databases. Enables the “99%” with large datasets but limited programming skills to open, touch, and examine their datasets. http://dataspread.github.io [VLDB’15, VLDB’15, ICDE’16]
  • 9. Play and View: Zenvisage is an effortless visual exploration tool. Goal: “fast-forward” to visual patterns and trends without having the analyst step through each one individually. Enables individuals to play with, and extract insights from, large datasets in a fraction of the time. http://zenvisage.github.io [TR’16, VLDB’16, VLDB’15, DSIA’15, VLDB’14, VLDB’14]
  • 10. Collaborate and Share: OrpheusDB is a tool for managing dataset versions with a database. Goal: building a versioned database system to reduce the burden of recording datasets in various stages of analysis. Enables individuals to collaborate on data analysis, and to share, keep track of, and retrieve dataset versions. http://orpheus-db.github.io [VLDB’16, VLDB’15, VLDB’15, TAPP’15, CIDR’15] (also part of DataHub, a collaborative analysis system with MIT and UMD)
  • 11. This talk. About 10 minutes per system: overview + architecture + one key technical challenge. Common theme: if you torture databases enough, you can get them to do what you want! Order of the talk, by increasing sophistication of analysis: Touch & Feel; Play & View; Share & Collaborate.
  • 12.
  • 13. Motivation. Most people doing ad hoc data manipulation and analysis use spreadsheets, e.g., Excel. Why? • Easy to use: direct manipulation • Built-in visualization capabilities • Flexible: no need for a schema
  • 14. But Spreadsheets are Terrible! – Slow • a single change can mean waiting minutes on a 10,000 x 10 spreadsheet • can’t even open a spreadsheet with >1M cells • speed by itself can prevent analysis – Tedious + not Powerful • filters via copy-paste • only FK joins via VLOOKUPs; others impossible • even simple operations are cumbersome – Brittle • sharing Excel sheets around, no collaboration/recovery • using spreadsheets for collaboration is painful and error-prone
  • 15. Let’s turn to Databases. Databases are: • Scalable (not slow) • Powerful and expressive via SQL (not tedious) • Built for collaboration and recovery, and succinct (not brittle). So why not use databases? Well, they lack exactly what makes spreadsheets so useful: • Not easy to use • No built-in visualization • Not flexible
  • 16. Combining the benefits of spreadsheets and databases. Spreadsheet as a frontend interface, databases as a backend engine. Result: retain the benefits of both! But it’s not that simple…
  • 17. Different Ideologies. Databases and spreadsheets have different ideologies that need to be reconciled, so the integration is not trivial.
      Feature | Databases | Spreadsheets
      Data Model | Schema-first | Dynamic/No Schema
      Addressing | Tuples with PK | Cells, using Row/Col
      Presentation | Set-oriented, no such notion | Notion of current window, order
      Modifications | Must correspond to queries | Can be done at any granularity
      Computation | Query at a time | Value at a time
  • 18. First Problem: Representation. Q: how do we represent spreadsheet data? Dense spreadsheets: represent as tables (Row #, Col1 val, Col2 val, …). Sparse spreadsheets: represent as triples (Row #, Column #, Value).
  • 19. First Problem: Representation. Q: how do we represent spreadsheet data? Can we do even better than the two extremes? Yes! Carve out dense areas and store them as tables; store sparse areas as triples.
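To make the two layouts from slides 18 and 19 concrete, here is a minimal Python sketch. The function names and the cost model (count the values each layout would store) are assumptions made for illustration, not DataSpread's actual storage code.

```python
def as_table(cells, n_rows, n_cols):
    """Dense layout: one tuple per row, (row #, col1 val, ..., colN val)."""
    return [
        (r,) + tuple(cells.get((r, c)) for c in range(1, n_cols + 1))
        for r in range(1, n_rows + 1)
    ]


def as_triples(cells):
    """Sparse layout: one (row #, column #, value) triple per filled cell."""
    return sorted((r, c, v) for (r, c), v in cells.items())


def choose_representation(cells):
    """Pick the layout that stores fewer values (hypothetical cost model)."""
    n_rows = max(r for r, _ in cells)
    n_cols = max(c for _, c in cells)
    table_cost = n_rows * (n_cols + 1)  # every cell plus the row number
    triple_cost = 3 * len(cells)        # three values per filled cell
    if table_cost <= triple_cost:
        return "table", as_table(cells, n_rows, n_cols)
    return "triples", as_triples(cells)


# A fully filled 2 x 3 region and a nearly empty 100 x 100 region.
dense = {(r, c): r * c for r in (1, 2) for c in (1, 2, 3)}
sparse = {(1, 1): "a", (100, 100): "b"}
```

Under this toy cost model the dense region is cheaper as a table and the sparse one as triples, which is the intuition behind carving a sheet into regions and choosing per region.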
  • 20. First Problem: Representation. However, even if we only use “tables”, carving out the ideal partitioning (minimizing storage, modification, and access cost) is NP-hard. Reduction from minimum edge-length partition of rectilinear polygons. Thankfully, we have a way out…
  • 21. Solution: Constrain the Problem. A new class of partitionings: recursive decomposition. A very natural class of partitionings!
  • 22. Solution: Constrain the Problem. The optimal recursive decomposition partitioning can be found in polynomial time using dynamic programming. Issue: still quadratic in the number of rows and columns. Fix: merge rows/columns with identical signatures, bringing the cost to roughly the time for a single scan.
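The dynamic program over recursive decompositions can be sketched as follows. This is an illustrative version under an assumed cost model (a fixed `table_overhead` per table plus one unit per cell of the rectangle, and zero cost for empty rectangles); the talk's actual cost function and the row/column merging optimization are not reproduced here.

```python
from functools import lru_cache


def min_storage_cost(grid, table_overhead=4):
    """Cheapest way to cover all filled cells using recursive cuts.

    grid is a list of rows of 0/1 flags (1 = filled cell). A rectangle is
    either stored as one table or cut in two by a horizontal or vertical
    line, and each half is solved recursively.
    """
    rows, cols = len(grid), len(grid[0])

    # prefix[r][c] = number of filled cells in the first r rows and c columns
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            prefix[r + 1][c + 1] = (
                grid[r][c] + prefix[r][c + 1] + prefix[r + 1][c] - prefix[r][c]
            )

    def filled(r0, c0, r1, c1):
        return prefix[r1][c1] - prefix[r0][c1] - prefix[r1][c0] + prefix[r0][c0]

    @lru_cache(maxsize=None)
    def best(r0, c0, r1, c1):
        if filled(r0, c0, r1, c1) == 0:
            return 0  # nothing to store in an empty rectangle
        # Option 1: store the whole rectangle as a single table.
        cost = table_overhead + (r1 - r0) * (c1 - c0)
        # Option 2: cut horizontally or vertically and recurse.
        for r in range(r0 + 1, r1):
            cost = min(cost, best(r0, c0, r, c1) + best(r, c0, r1, c1))
        for c in range(c0 + 1, c1):
            cost = min(cost, best(r0, c0, r1, c) + best(r0, c, r1, c1))
        return cost

    return best(0, 0, rows, cols)


sheet = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
print(min_storage_cost(sheet))
```

For this sheet the cheapest plan stores the 2 x 2 block and the 1 x 2 block as two separate tables rather than one 4 x 4 table. The memoized recursion visits every sub-rectangle, which is why merging rows and columns with identical signatures matters at real spreadsheet sizes.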
  • 23. Initial Progress and Architecture. Postgres backend; ZK Spreadsheet, an open-source web frontend. Comfortably scales to arbitrarily many rows and handles SQL queries. Hopefully brings spreadsheets to the big data age! [Architecture diagram labels: Underlying Data; Interface-Embedded Queries; Interface-Aware Indexes; Interface Query Processor; Interface Storage Manager; Interface Transaction Manager; Spreadsheet SQL; Spreadsheet Formulae; New Interface Algebra; …; Vanilla SQL; Other Applications; users Sally, Bob, Sue.]
  • 24.
  • 25. Standard Visual Data Analysis Recipe: 1. Load dataset into viz tool 2. Select viz to be generated 3. See if it matches desired visual pattern or insight 4. Repeat until you find a match
  • 27. Key Issue: Visualizations can be generated by • varying subsets of data, and • varying attributes being visualized. Too many visualizations to look at to find desired visual patterns!
  • 28. Motivation. This is a real problem! • Advertisers at Turn: find keywords with similar CTRs to a specific one • Bioinformaticians at an NIH genomics center: find aspects on which two sets of genes differ • Battery scientists at CMU: find solvents with desired properties. Common theme: finding the “right” visualization can take several hours of combing through visualizations manually.
  • 29. Key Insight. We can automate that! • instead of combing through visualizations manually • tell us what you want, and we can “fast-forward” to desired insights. Desiderata for automation: • Expressive: the ability to specify what you want • Interactive: interact with the results, catering to non-programmers • Scalable: get interesting results quickly. Enter Zenvisage (zen + envisage: to effortlessly visualize)
  • 30. Effortless Visual Exploration of Large Datasets with Zenvisage. Ingredients: • Drag-and-drop and sketch-based interactions, to find specific patterns • Sophisticated visual exploration language, ZQL, to ask more elaborate questions • Scalable visualization generation engine: preprocessing, batching, and parallel evaluation for interactive results • Rapid pattern matching algorithms: sampling-based techniques
  • 31. Interface Screenshots. [Screenshot labels: Attribute Selection; Sketching Canvas; Matches; Typical Trends and Outliers; ZQL: Advanced Exploration.]
  • 33. Challenges: One Specific Instance. Find visualizations on which two groups of data differ most. Examples: • find visualizations where solvent x differs from solvent y • find visualizations where product x differs from product y. We represent a visualization using [d, m, f] • dimension = x axis • measure = y axis • function = aggregate applied to y. Each [d, m, f] on a specific subset of data can be computed using a single SQL query.
  • 34. Challenge: One Specific Instance. Find visualizations on which two groups of data differ most. Naïve approach: for each [d, m, f], compute the visualization for both products (two SQL queries), then compare; pick the k best (“highest utility”) [d, m, f]. Utility metric: we ignore how to compare for now, but there are many standard distance metrics. Scale: 10s of dimensions, 10s of measures, and a handful of aggregates lead to 100s of queries for a single user task!
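The naïve approach can be sketched in a few lines of Python over an in-memory SQLite database. The `sales` table, its columns, and the choice of Euclidean distance between normalized distributions as the utility metric are all assumptions for illustration; the talk leaves the metric open.

```python
import math
import sqlite3

# Hypothetical demo data: (product, region, quarter, revenue, units)
ROWS = [
    ("A", "east", "q1", 10, 1), ("A", "west", "q1", 10, 1),
    ("A", "east", "q2", 10, 1), ("A", "west", "q2", 10, 1),
    ("B", "east", "q1", 19, 1), ("B", "west", "q1", 1, 1),
    ("B", "east", "q2", 19, 1), ("B", "west", "q2", 1, 1),
]


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales "
        "(product TEXT, region TEXT, quarter TEXT, revenue REAL, units REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def visualization(conn, d, m, f, product):
    """One [d, m, f] visualization for one product = one SQL query."""
    # d, m, f are interpolated into the SQL text, so they must come from
    # a trusted list of column and aggregate names, never from user input.
    sql = f"SELECT {d}, {f}({m}) FROM sales WHERE product = ? GROUP BY {d}"
    return dict(conn.execute(sql, (product,)).fetchall())


def utility(vis_x, vis_y):
    """Euclidean distance between the two normalized distributions."""
    keys = sorted(set(vis_x) | set(vis_y))

    def normalize(vis):
        total = sum(vis.values())
        return [vis.get(key, 0.0) / total if total else 0.0 for key in keys]

    return math.dist(normalize(vis_x), normalize(vis_y))


def top_k_naive(conn, dims, measures, funcs, x, y, k):
    """Two queries per [d, m, f], then keep the k highest-utility ones."""
    scored = []
    for d in dims:
        for m in measures:
            for f in funcs:
                vis_x = visualization(conn, d, m, f, x)
                vis_y = visualization(conn, d, m, f, y)
                scored.append((utility(vis_x, vis_y), (d, m, f)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


conn = demo_connection()
best = top_k_naive(conn, ["region", "quarter"], ["revenue", "units"],
                   ["SUM"], "A", "B", k=1)
```

Even this toy task issues 2 x 2 x 1 x 2 = 8 queries, each scanning the same table, which is the cost the next slides attack.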
  • 35. Issues with the Naïve Approach. • Repeated processing of the same data in sequence across queries: addressed by sharing • Computation wasted on low-utility visualizations: addressed by pruning
  • 36. Sharing Optimizations. 1. Minimize # of queries: group queries together • Combine multiple aggregates: (d1, m1, f1), (d1, m2, f1) -> (d1, [m1, m2], f1) • Combine multiple group-bys: (d1, m1, f1), (d2, m1, f1) -> ([d1, d2], m1, f1) 2. Minimize sequential execution: parallel query evaluation. A bit tricky!
  • 37. Pruning Optimizations. • Keep running estimates of utility • Prune visualizations based on estimates. Two flavors: – Vanilla confidence-interval-based pruning – Multi-armed bandit pruning. Discard low-utility views early to avoid wasted computation.
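As a rough illustration of the "combine multiple aggregates" idea, the sketch below computes several measures in a single grouped query. It also folds both products into the same scan by adding `product` to the GROUP BY, which is an extra assumption of this sketch rather than something stated on the slide. The `sales` table and all names are hypothetical.

```python
import sqlite3

# Hypothetical demo data: (product, region, quarter, revenue, units)
ROWS = [
    ("A", "east", "q1", 10, 1), ("A", "west", "q1", 10, 1),
    ("A", "east", "q2", 10, 1), ("A", "west", "q2", 10, 1),
    ("B", "east", "q1", 19, 1), ("B", "west", "q1", 1, 1),
    ("B", "east", "q2", 19, 1), ("B", "west", "q2", 1, 1),
]


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales "
        "(product TEXT, region TEXT, quarter TEXT, revenue REAL, units REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def shared_visualizations(conn, d, measures, f, x, y):
    """(d, [m1, m2, ...], f) for both products in one scan of the table."""
    # Identifiers are interpolated, so they must come from a trusted list.
    select_list = ", ".join(f"{f}({m})" for m in measures)
    sql = (
        f"SELECT product, {d}, {select_list} FROM sales "
        f"WHERE product IN (?, ?) GROUP BY product, {d}"
    )
    result = {m: {x: {}, y: {}} for m in measures}
    for product, d_value, *aggregates in conn.execute(sql, (x, y)):
        for m, value in zip(measures, aggregates):
            result[m][product][d_value] = value
    return result


conn = demo_connection()
shared = shared_visualizations(conn, "region", ["revenue", "units"],
                               "SUM", "A", "B")
```

One query here replaces the four that the naïve loop would issue for the same dimension (two measures times two products).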
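A confidence-interval flavor of pruning might look like the sketch below. It assumes utilities are scaled to [0, 1], that each phase (for example, one batch of data) yields one utility estimate per view, and it uses a textbook Hoeffding bound for the interval width. These are illustrative assumptions, not the exact procedure behind the talk.

```python
import math


def prune_views(phases, k, delta=0.05):
    """Drop views that are confidently outside the top k.

    phases: iterable of dicts mapping view -> utility estimate in [0, 1],
            one dict per phase; the first dict defines the candidate views.
    Returns (surviving views, {pruned view: phase at which it was dropped}).
    """
    active, totals, pruned_at = None, {}, {}
    for t, batch in enumerate(phases, start=1):
        if active is None:
            active = set(batch)
            totals = {view: 0.0 for view in active}
        for view in active:
            totals[view] += batch[view]

        # Hoeffding-style half-width after t observations in [0, 1].
        eps = math.sqrt(math.log(2 / delta) / (2 * t))
        means = {view: totals[view] / t for view in active}

        # k-th largest lower bound among the views still in play.
        lowers = sorted((means[v] - eps for v in active), reverse=True)
        threshold = lowers[min(k, len(lowers)) - 1]

        for view in list(active):
            if means[view] + eps < threshold:  # cannot reach the top k
                active.discard(view)
                pruned_at[view] = t
    return active, pruned_at


# Hypothetical stream: view "c" is consistently far below "a" and "b".
stream = [{"a": 0.9, "b": 0.8, "c": 0.1} for _ in range(50)]
survivors, pruned_at = prune_views(stream, k=2)
```

Once a view is pruned, no further work is spent on it in later phases, which is where the savings come from. A bandit-style variant would choose which views to refine next instead of updating all of them in lockstep.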
  • 39. Experimental Findings. Up to 300X speedup: <1s for SM, 4s for L
  • 40. Effortless Visual Exploration of Large Datasets with Zenvisage. Ingredients: • Drag-and-drop and sketch-based interactions • Sophisticated visual exploration language, ZQL • Scalable visualization generation engine • Rapid pattern matching algorithms
  • 41.
  • 42. Motivation. Collaborative data science is ubiquitous • Many users, many versions of the same dataset stored at many stages of analysis • Status quo: stored in a file system, relationships unknown. Challenge: can we build a versioned data store that supports efficient access, retrieval, querying, and modification of versions?
  • 43. Motivation: Starting Points. • VCS: Git/svn is inefficient and unsuitable – Ordered semantics – No data manipulation API – No efficient multi-version queries – Poor support for massive files • DBMS: relational databases don’t support versioning, but are efficient and scalable
  • 44. OrpheusDB: Current Focus. PostgreSQL + Versioning Commands
  • 45. Challenge: Storing Versions Compactly / Retrieving Versions Quickly. 1000s of versions, spanning millions of records. Option 1: store all versions independently: huge storage, but version access time is very small. Option 2: store one version and all others via chains of “deltas”: very small storage, but version access time is high.
  • 46. And Answer Queries… • Retrieve the first version that contains this tuple • Find versions where the average(salary) is greater than 1000 • Find all pairs of versions where over 100 new tuples were added • Show the history of the tuple with record id 34. For more examples, see [TAPP’15]
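The storage/access trade-off between the two extremes can be seen in a small Python sketch. It models a version as a set of record ids and a delta as a pair (added, removed). This is a simplified illustration with hypothetical function names, not OrpheusDB's storage layout.

```python
def store_independently(versions):
    """Extreme 1: keep a full copy of every version."""
    return [set(version) for version in versions]


def store_as_deltas(versions):
    """Extreme 2: keep the first version plus a chain of deltas."""
    base = set(versions[0])
    deltas = []
    for prev, cur in zip(versions, versions[1:]):
        deltas.append((set(cur) - set(prev), set(prev) - set(cur)))
    return base, deltas


def checkout(base, deltas, index):
    """Rebuild version `index` by replaying the first `index` deltas."""
    records = set(base)
    for added, removed in deltas[:index]:
        records = (records - removed) | added
    return records


def full_copy_size(copies):
    return sum(len(copy) for copy in copies)


def delta_chain_size(base, deltas):
    return len(base) + sum(len(a) + len(r) for a, r in deltas)


versions = [{1, 2, 3}, {1, 2, 3, 4}, {1, 3, 4, 5}]
copies = store_independently(versions)
base, deltas = store_as_deltas(versions)
```

Here the full copies hold 11 record ids while the delta chain holds 6, but checking out the last version requires replaying every delta before it. Practical designs sit between the two extremes.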
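A few of these queries can be expressed in plain SQL once versions are modeled relationally. The sketch below uses SQLite and a hypothetical two-table schema (`records` for immutable record contents, `membership` for which records belong to which version) and assumes version ids increase in creation order. The schema and names are assumptions for illustration, not OrpheusDB's actual data model.

```python
import sqlite3


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE records (rid INTEGER PRIMARY KEY, name TEXT, salary REAL);
        CREATE TABLE membership (vid INTEGER, rid INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?)",
        [(1, "ann", 900), (2, "bob", 1000), (3, "cat", 1500), (4, "dan", 2000)],
    )
    # Version 1 = {1, 2}, version 2 = {1, 2, 3}, version 3 = {2, 3, 4}
    conn.executemany(
        "INSERT INTO membership VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 2), (3, 3), (3, 4)],
    )
    return conn


def first_version_containing(conn, rid):
    """Retrieve the first version that contains this tuple."""
    sql = "SELECT MIN(vid) FROM membership WHERE rid = ?"
    return conn.execute(sql, (rid,)).fetchone()[0]


def versions_with_avg_salary_above(conn, threshold):
    """Find versions where the average salary exceeds the threshold."""
    sql = """
        SELECT m.vid
        FROM membership AS m JOIN records AS r ON r.rid = m.rid
        GROUP BY m.vid
        HAVING AVG(r.salary) > ?
        ORDER BY m.vid
    """
    return [vid for (vid,) in conn.execute(sql, (threshold,))]


def history_of_record(conn, rid):
    """Show the versions in which a given record id appears."""
    sql = "SELECT vid FROM membership WHERE rid = ? ORDER BY vid"
    return [vid for (vid,) in conn.execute(sql, (rid,))]


conn = demo_connection()
```

With this layout the backend needs no knowledge of versions: every versioning query becomes an ordinary join or aggregate over the `membership` table.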
  • 47. Framework. [Architecture diagram labels: User Interface Layer; git commands, or SQL (versions as relations); “Versioning” Layer (translation/bookkeeping); Parser & Translator; Layout Optimizer; DBMS; unmodified Postgres backend (not aware of versions).]
  • 48. Summary: Make Data Analytics Great Again! Touch & Feel: dataspread.github.io; Play & View: zenvisage.github.io; Share & Collaborate: orpheus-db.github.io. My website: http://data-people.cs.illinois.edu Twitter: @adityagp