2. This morning
• UW eScience Institute
  - A "Data Science Environment"
• SQLShare and High Variety Data
• Myria and "Relational Algorithmics"
7/10/2014 Bill Howe, UW 2
3. "It's a great time to be a data geek."
-- Roger Barga, Microsoft Research
"The greatest minds of my generation are trying to figure out how to make people click on ads"
-- Jeff Hammerbacher, co-founder, Cloudera
4. The Fourth Paradigm
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive
Jim Gray
5. "All across our campus, the process of discovery will increasingly rely
on researchers' ability to extract knowledge from vast amounts of
data... In order to remain at the forefront, UW must be a leader in
advancing these techniques and technologies, and in making [them]
accessible to researchers in the broadest imaginable range of fields."
2005-2008
In other words:
• Data-driven discovery will be ubiquitous
• UW must be a leader in inventing the capabilities
• UW must be a leader in translational activities: putting these capabilities to work
• It's about intellectual infrastructure (human capital) and software infrastructure (shared tools and services, i.e. digital capital)
6. A 5-year, US$37.8 million cross-institutional
collaboration to create a data science environment
2014
7. Data Science Kickoff Session:
137 posters from 30+ departments and units
8. Establish a virtuous cycle
• 6 working groups, each with
  - 3-6 faculty from each institution
9. UW Data Science Education Efforts
Audiences:
• Students
  - CS/Informatics: undergrads, grads
  - Non-Major: undergrads, grads
• Non-Students: professionals, researchers
Programs:
• UWEO Data Science Certificate
• MOOC Intro to Data Science
• IGERT: Big Data PhD Track
• New CS Courses
• Bootcamps and workshops
• Intro to Data Programming
• Data Science Masters (planned)
• Incubator: hands-on training
10. Next Session begins June 30, 2014
https://www.coursera.org/course/datasci
11. MOOC Participation numbers
• "Registered": 119,517 (totally irrelevant)
• Clicked play in first 2 weeks: 78,589
• Turned in 1st homework: 10,663
• Completed all assignments: ~9,000 (typical attrition for a MOOC)
• "Passed": 7,022
• Forum threads: 4,661
• Forum posts: 22,900
Fairly consistent with Coursera data across "hard" courses
12. Educational transformation:
A new generation of "Pi-shaped" scientists
PhD → πhD
Magda Balazinska
13. Education and Research in Data Science
[Figure labels: Big Data access and management; Big Data modeling; Big Data analytics; Collaborative Big Data science; Data]
• Ultimate goal: A new PhD program
  - Initial goal: A new certificate based on Big Data tracks in all departments
  - Education highlights: data science courses, co-advising, and internships
• End-to-End Research Agenda
  - Big Data mgmt, analytics, modeling, & collaboration
• Cyberinfrastructure Development
  - Big Data analysis service
14. The Data Science Studio
• An open collaborative research space
• A resident data science team
  - Permanent staff of ~5 data scientists for applied research and development
  - ~15-20 data science fellows (research scientists, visitors, postdocs, students)
• How to Engage:
  - Drop-in open workspace
  - Studio "Office Hours"
  - Incubation Program
15. 6th floor, Physics Astronomy Building
A partnership among ...
• Provost
• UW Libraries
• Physics, Astronomy, Arts & Sciences
• eScience Institute
20. How much time do you spend "handling data" as opposed to "doing science"?
Mode answer: "90%"
Key question: How can we reduce this "data overhead"?
21. Simple Example
File: ANNOTATIONSUMMARY-COMBINEDORFANNOTATION16_Phaeo_genome
###query                    length  COG hit #1  e-value #1  identity #1  score #1  hit length #1  description #1
chr_4[480001-580000].287    4500
chr_4[560001-660000].1      3556
chr_9[400001-500000].503    4211    COG4547     2.00E-04    19           44.6      620            Cobalamin biosynthesis protein C
chr_9[320001-420000].548    2833    COG5406     2.00E-04    38           43.9      1001           Nucleosome binding factor SPN,
chr_27[320001-404298].20    3991    COG4547     5.00E-05    18           46.2      620            Cobalamin biosynthesis protein C
chr_26[320001-420000].378   3963    COG5099     5.00E-05    17           46.2      777            RNA-binding protein of the Puf f
chr_26[400001-441226].196   2949    COG5099     2.00E-04    17           43.9      777            RNA-binding protein of the Puf f
chr_24[160001-260000].65    3542
chr_5[720001-820000].339    3141    COG5099     4.00E-09    20           59.3      777            RNA-binding protein of the Puf f
chr_9[160001-260000].243    3002    COG5077     1.00E-25    26           114       1089           Ubiquitin carboxyl-terminal hydr
chr_12[720001-820000].86    2895    COG5032     2.00E-09    30           60.5      2105           Phosphatidylinositol kinase and p
chr_12[800001-900000].109   1463    COG5032     1.00E-09    30           60.1      2105           Phosphatidylinositol kinase and p
chr_11[1-100000].70         2886
chr_11[80001-180000].100    1523
File: COGAnnotation_coastal_sample.txt
id    query             hit      e_value   identity_  score  query_start  query_end  hit_start  hit_end  hit_length
1     FHJ7DRN01A0TND.1  COG0414  1.00E-08  28         51     1            74         180        257      285
2     FHJ7DRN01A1AD2.2  COG0092  3.00E-20  47         89.9   6            85         41         120      233
3     FHJ7DRN01A2HWZ.4  COG3889  0.0006    26         35.8   9            94         758        845      872
...
2853  FHJ7DRN02HXTBY.5  COG5077  7.00E-09  37         52.3   3            77         313        388      1089
2854  FHJ7DRN02HZO4J.2  COG0444  2.00E-31  67         127    1            73         135        207      316
...
3566  FHJ7DRN02FUJW3.1  COG5032  1.00E-09  32         54.7   1            75         1965       2038     2105
...
SELECT * FROM Phaeo_genome p, coastal_sample c WHERE p.COG_hit = c.hit
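To make the join on this slide concrete, here is a minimal sketch that runs the same query shape in SQLite from Python. The rows are a small subset copied from the two files shown on the slide. The reduced column lists and the column name COG_hit (for "COG hit #1") are assumptions made for illustration, not the schema SQLShare inferred.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schemas: only the columns needed for the join.
conn.executescript("""
    CREATE TABLE Phaeo_genome (query TEXT, length INTEGER, COG_hit TEXT);
    CREATE TABLE coastal_sample (id INTEGER, query TEXT, hit TEXT, e_value REAL);
""")

# A few rows from the slide; a row with no COG hit is stored as NULL.
phaeo_rows = [
    ("chr_4[480001-580000].287", 4500, None),
    ("chr_9[160001-260000].243", 3002, "COG5077"),
    ("chr_12[720001-820000].86", 2895, "COG5032"),
    ("chr_12[800001-900000].109", 1463, "COG5032"),
]
coastal_rows = [
    (1, "FHJ7DRN01A0TND.1", "COG0414", 1.00e-08),
    (2853, "FHJ7DRN02HXTBY.5", "COG5077", 7.00e-09),
    (3566, "FHJ7DRN02FUJW3.1", "COG5032", 1.00e-09),
]
conn.executemany("INSERT INTO Phaeo_genome VALUES (?, ?, ?)", phaeo_rows)
conn.executemany("INSERT INTO coastal_sample VALUES (?, ?, ?, ?)", coastal_rows)

# Same join condition as the slide: match genome ORFs to sample reads by COG.
shared_hits = conn.execute("""
    SELECT p.query, c.query, c.hit
    FROM Phaeo_genome p, coastal_sample c
    WHERE p.COG_hit = c.hit
    ORDER BY p.query
""").fetchall()

for row in shared_hits:
    print(row)
```

Rows without a COG hit drop out of the result, because NULL never compares equal in a join condition.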
22. Data Science Workflow:
1) Preparing to run a model
   Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
   "80% of the work" -- Aaron Kimball
2) Running the model
3) Interpreting the results
   "The other 80% of the work"
23. "[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project]."
-- Martin Kircher, Genome Sciences
At least 3 months on issues of scale, file handling, and feature engineering.
Why?
• 3k NSF postdocs in 2010
• $50k / postdoc
• at least 50% overhead
• maybe $75M annually at NSF alone? (3,000 x $50k x 50%)
28. 1) Upload data "as is"
Cloud-hosted, secure; no need to install or design a database; no pre-defined schema; schema inference; some integration
2) Write Queries
Right in your browser, writing views on top of views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them, share with specific colleagues; anyone with access can query
http://sqlshare.escience.washington.edu
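The "views on top of views" idea in step 2 can be sketched locally with SQLite. This is only an illustration of the pattern, not SQLShare's API. The two-column layout of tigrfam_surface, the sample rows, and the view names hit_counts and frequent_hits are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for an uploaded file; the columns are assumptions.
conn.execute("CREATE TABLE tigrfam_surface (query TEXT, hit TEXT)")
conn.executemany(
    "INSERT INTO tigrfam_surface VALUES (?, ?)",
    [("q1", "TIGR00001"), ("q2", "TIGR00001"), ("q3", "TIGR00002"),
     ("q4", "TIGR00001"), ("q5", "TIGR00003"), ("q6", "TIGR00002")],
)

# View 1: the counting query from the slide, saved under a name.
conn.execute("""
    CREATE VIEW hit_counts AS
    SELECT hit, COUNT(*) AS cnt
    FROM tigrfam_surface
    GROUP BY hit
""")

# View 2: defined on top of view 1 rather than on the raw table.
conn.execute("""
    CREATE VIEW frequent_hits AS
    SELECT hit, cnt
    FROM hit_counts
    WHERE cnt >= 2
""")

frequent = conn.execute(
    "SELECT hit, cnt FROM frequent_hits ORDER BY cnt DESC, hit"
).fetchall()
print(frequent)
```

Each view is just a named query, so a later step can build on an earlier one without copying or re-exporting any data.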
29. Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
     , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
     , w.category as nc_category
     , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
            THEN x.end_bp - x.start_bp + 1
            WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
            THEN x.end_bp - w.start_bp + 1
            WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
            THEN w.end_bp - x.start_bp + 1
       END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
   ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
   OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
   OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
We see thousands of queries written by non-programmers
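The overlap length that the CASE expression computes branch by branch can also be written as one formula over closed intervals. The sketch below is an alternative formulation, not the query author's method, and the function name overlap_length is hypothetical. One behavioural difference: when the second interval lies strictly inside the first, the CASE as written appears to fall into its second branch and report x.end_bp - w.start_bp + 1, whereas this formula returns the inner interval's length.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals, or 0 if disjoint."""
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    return max(shared, 0)


# The three situations the CASE expression distinguishes (x first, w second):
inside = overlap_length(10, 20, 5, 30)        # x inside w
tail = overlap_length(10, 20, 15, 30)         # w starts inside x
head = overlap_length(10, 20, 5, 12)          # w ends inside x

# Two further situations: w nested inside x, and no overlap at all.
nested = overlap_length(10, 20, 12, 15)
disjoint = overlap_length(10, 20, 30, 40)

print(inside, tail, head, nested, disjoint)
```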
31. SQL as a lab notebook: http://bit.ly/16Xj2JP
Steven Roberts
Popular service for Bioinformatics Workflows
[Figure: a methylation analysis workflow, drawn twice. Inputs in both versions: GFF of methylated CG locations, GFF of all CG locations, GFF of all genes, gene descriptions. First version: the stages "Calculate # methylated CGs", "Calculate # all CGs", "Calculate methylation ratio", and "Link methylation with gene description", built from Join, Reorder columns, Count, Compute, Trim, and Excel steps, with the annotation "misstep: join w/ wrong fill". Second version: the stages "Calculate # methylated CGs", "Calculate # all CGs", and "Calculate methylation ratio and link with gene description".]
33. Two Problems with SQLShare
• No help for truly big datasets
• No help for "algorithmics"
34. Relational Algorithmics-as-a-Service
Version 2
http://myria.cs.washington.edu
35. Myria is...
• MyriaQ: A compiler framework for multiple iterative RA-based languages and multiple big data back ends
• MyriaX: A parallel, shared-nothing, iterative execution engine
• MyriaWeb: A RESTful Analytics-as-a-Service platform and web-based interface
36. Myria Team
Magda Balazinska, Bill Howe, and Dan Suciu
Dan Halperin (technical lead)
Victor Almeida
Andrew Whitaker
PhD Students
Shumo Chu
Eric Gribkoff
Jeremy Hyrkas
Paris Koutris
Ryan Maas
Dominik Moritz
Laurel Orr
Jennifer Ortiz
Emad Soroush
Jingjing Wang
ShengLiang Xu
Undergraduate Students
Lee Lee Choo
Vaspol Ruamviboonsuk
37. Myria Architecture
[Figure: architecture diagram. Labels recovered from the figure; the grouping below is approximate.
• Front end: Web UI, REST
• MyriaQ (Python): Language Parser (Datalog, SQL, MyriaL), Logical Optimizer for RA+While, Myria Compiler
• Compiler targets: MyriaX (Java), C Compiler, Grappa, SciDB
• MyriaX (Java): Coordinator with REST Server and Catalog; json query plan; netty protocols; workers, each with a Worker Catalog and an RDBMS reached via jdbc; HDFS]
42. SeaFlow in Myria
• "That 5-line MyriaL program was 100x faster than my R cluster, and much simpler"
Dan Halperin, Sophie Clayton
43. Why a big data middleware?
1) Big data experiments are ridiculously labor-intensive
   - N systems x M real-world applications
   - Big clusters and big datasets
2) No "one size fits all" solution
   - Realistic environments will use more than one system
3) A return to distributed, federated databases
   - Erase the distinction between ETL and Analytics
45. What can we conclude?
Hadoop was probably just pretty bad
The rest of the story not so clear
46. Relational Algebra is the Calculus of Big Data
• Hadoopspawn: Pig, HIVE, blah
• Hadoop contemporaries: Cascalog, Flume, blah
• Post-Hadoop: Spark/Shark, Dremel, blah
• etc.
47. Google Big Data Systems
[Figure: timeline, 2004-2012. Systems shown: MapReduce, BigTable, Dremel, Tenzing, Pregel, Megastore, Spanner, BigQuery, Hadoop, HBase. Legend: non-Google open source implementation; direct influence / shared features; compatible implementation of SQL-like interface.]
48. Relational Algebra is the Calculus of Small Data
• Galaxy: "bioinformatics workflows"
  "...Operate on Genomics Intervals -> Join"
• Pandas (Python)
  merge(left, right, on='key')
• dplyr (R)
  filter(x), select(x), arrange(x), group_by(x), inner_join(x, y), left_join(x, y), ...
• Manimal, Pyxis/StatusQuo, others
  - Extract RA operators implemented manually in Java code
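The join behind calls such as merge(left, right, on='key') or inner_join(x, y) is the relational algebra join. The sketch below shows that operator in plain Python as a hash join over lists of dict rows. It is not how pandas or dplyr implement it; the helper name inner_join, the key column name, and the sample rows are hypothetical.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Hash join of two lists of dict rows on a shared column name."""
    # Build phase: index the right-hand rows by their key value.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    # Probe phase: for each left row, emit one output row per matching right row.
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            joined.append({**lrow, **rrow})
    return joined


genes = [{"key": "g1", "name": "geneA"}, {"key": "g2", "name": "geneB"}]
counts = [{"key": "g1", "reads": 12}, {"key": "g3", "reads": 7}, {"key": "g1", "reads": 3}]

joined_rows = inner_join(genes, counts, "key")
print(joined_rows)
```

Keys that appear on only one side (g2 and g3 here) produce no output rows, which is what distinguishes an inner join from a left join.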
49. Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
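As a rough illustration of rule-based rewriting, the sketch below represents the slide's expression as a nested tuple and applies the four laws in one bottom-up pass. The tuple encoding and the helper names (rewrite, count_ops, evaluate, has_variable) are assumptions made for this example. The laws fire in a different order than the "1, 3, 4, 2" listed on the slide, but the result is the same expression.

```python
def has_variable(expr):
    """True if the expression tree mentions a variable (a string leaf)."""
    if isinstance(expr, tuple):
        return has_variable(expr[1]) or has_variable(expr[2])
    return isinstance(expr, str)


def rewrite(expr):
    """One bottom-up pass applying the four algebraic laws from the slide."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = rewrite(left), rewrite(right)
    if op == "+" and right == 0:
        return left                                  # law 1: x+0 = x
    if op == "/" and right == 1:
        return left                                  # law 2: x/1 = x
    if op == "*" and has_variable(left) and not has_variable(right):
        return ("*", right, left)                    # law 4: x*y = y*x
    if (op == "+" and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "*" and right[0] == "*" and left[2] == right[2]):
        # law 3, read through law 4: x*n + y*n = (x+y)*n
        return ("*", ("+", left[1], right[1]), left[2])
    return (op, left, right)


def count_ops(expr):
    """Number of operator nodes in the expression tree."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + count_ops(expr[1]) + count_ops(expr[2])


def evaluate(expr, env):
    """Evaluate an expression tree given variable bindings."""
    if isinstance(expr, str):
        return env[expr]
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    a, b = evaluate(left, env), evaluate(right, env)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    return a / b


# N = ((z*2)+((z*3)+0))/1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = rewrite(original)

print(optimized, count_ops(original), count_ops(optimized))
```

A query optimizer works the same way in spirit: it rewrites an operator tree using laws that preserve the result while reducing the work.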
50. A closer look at an example
ROI(id, start, stop) is a set of "regions of interest"
Read(id, start, stop) is a set of "reads" from sequencer
Task: For each region of interest, count the number of reads it contains
51. As a query
SELECT roi.id, count(rd.id)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id
[Figure labels: "region of interest"; sequence "read"]
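To see what this query returns, here is a minimal sketch that runs it in SQLite on a handful of made-up intervals. The table contents are hypothetical, and SQLite stands in for whatever engine the slide used; it happens to accept the same [end] bracket quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions_of_interest (id INTEGER, start INTEGER, [end] INTEGER);
    CREATE TABLE reads (id INTEGER, start INTEGER, [end] INTEGER);
""")

# Hypothetical toy data: two regions and five reads.
conn.executemany("INSERT INTO regions_of_interest VALUES (?, ?, ?)",
                 [(1, 100, 200), (2, 150, 400)])
conn.executemany("INSERT INTO reads VALUES (?, ?, ?)",
                 [(10, 110, 140), (11, 160, 190), (12, 180, 230),
                  (13, 300, 350), (14, 390, 420)])

# The query from the slide, with an ORDER BY added for stable output.
counts = conn.execute("""
    SELECT roi.id, count(rd.id)
    FROM regions_of_interest roi, reads rd
    WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
    GROUP BY roi.id
    ORDER BY roi.id
""").fetchall()
print(counts)
```

A read is counted only when it lies entirely inside the region. A region with no contained reads would not appear in the output at all, since this is an inner join.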
52. Why databases get a bad reputation
SELECT roi.id, count(rd.start)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id
many minutes
SELECT roi.id, count(rd.start) as cnt
FROM regions_of_interest roi, indexed_reads rd
WHERE roi.start <= rd.start AND rd.start <= roi.[end]
AND roi.start <= rd.[end] AND rd.[end] <= roi.[end]
GROUP BY roi.id
3 seconds!
[Figure labels: roi; read; two-sided index scan; one-sided index scan, plus filter]
The broken promise of declarative query...
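One way to read the speedup: provided every read has start <= end, the extra predicates are implied by the original ones, but they bound rd.start on both sides, so an index on start can narrow the candidates before the end check runs. The sketch below imitates that with a sorted list and binary search. It is an analogy under those assumptions, not the database's actual plan, and the function names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_reads_indexed(rois, reads):
    """Count contained reads per ROI using a sorted 'index' on read start."""
    by_start = sorted(reads, key=lambda r: r[1])   # stands in for an index on start
    starts = [r[1] for r in by_start]
    counts = {}
    for roi_id, roi_start, roi_end in rois:
        lo = bisect_left(starts, roi_start)        # first read with start >= roi_start
        hi = bisect_right(starts, roi_end)         # one past last read with start <= roi_end
        # Only candidates inside the two-sided range need the end check.
        counts[roi_id] = sum(1 for r in by_start[lo:hi] if r[2] <= roi_end)
    return counts


def count_reads_nested_loop(rois, reads):
    """Reference answer: test every (ROI, read) pair against the original predicate."""
    return {
        roi_id: sum(1 for r in reads if roi_start <= r[1] and r[2] <= roi_end)
        for roi_id, roi_start, roi_end in rois
    }


# Rows are (id, start, end); values are made up for illustration.
rois = [(1, 100, 200), (2, 150, 400)]
reads = [(10, 110, 140), (11, 160, 190), (12, 180, 230), (13, 300, 350), (14, 390, 420)]

indexed = count_reads_indexed(rois, reads)
reference = count_reads_nested_loop(rois, reads)
print(indexed, reference)
```

Both functions give the same counts; the indexed version simply inspects fewer reads per region.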
60. "Relational Algorithmics"
• Hypothesis: Loops + RA covers everything anyone wants to do
  - and it scales, it's optimizable, and it's accessible
• We can smooth the ROI curve for novices
  - Start with simple queries...
  - ...end up working on advanced parallel algorithms
• "White Box Analytics"
  - Compose queries, inspect plans, monitoring, debugging, "UDRs" (user-defined optimization rules)
• Multiple languages, multiple backends, one data/query model
  - Ask me about graph data
  - Ask me about array data (or, rather, mesh data)
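A small example of what "Loops + RA" can express: graph reachability computed as a join and a union repeated until nothing new appears. This is a plain-Python sketch of the idea, not MyriaL syntax or Myria's execution model. The function names and the edge data are hypothetical.

```python
def join_step(paths, edges):
    """Join paths(a, b) with edges(b, c) and project the result to (a, c)."""
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, set()).add(dst)
    return {(a, c) for a, b in paths for c in by_source.get(b, ())}


def transitive_closure(edges):
    """Repeat join + union until no new tuples appear (a fixpoint)."""
    reachable = set(edges)
    while True:
        new_pairs = join_step(reachable, edges) - reachable
        if not new_pairs:
            return reachable
        reachable |= new_pairs


edges = {("a", "b"), ("b", "c"), ("c", "d")}
closure = transitive_closure(edges)
print(sorted(closure))
```

The loop body uses only relational operators (join, project, difference, union), which is what lets an optimizer reason about each iteration.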
61. Takeaways
• We hope to see "Data Science Environments" at universities worldwide
  - We try to make our programs and activities reusable
• Software-as-a-service to reach the "long tail" of science
• "Relational Algorithmics"
  - The relational algebra is the calculus of big data
  - "It's not just for databases anymore"
  - Learn it, use it, teach it
  - Myria is a platform for "relational algorithmics"
http://escience.washington.edu
@billghowe
billhowe@cs.washington.edu
63. Maslow's Needs Hierarchy
"As each need is satisfied, the next higher level in the hierarchy dominates conscious functioning."
-- Maslow 43
64. A "Needs Hierarchy" of Science Data Management
storage
sharing
query
integration
analytics
65. A "Needs Hierarchy" of Science Data Management
storage
sharing
integration
query
analytics