End-to-End eScience
1. End-to-End eScience
Integrating Query, Workflow,
Visualization, and Mashups
at an Ocean Observatory
Bill Howe,
University of Washington
Harrison Green-Fishback, PSU
David Maier, PSU
Erik Anderson, Utah
Emanuele Santos, Utah
Juliana Freire, Utah
Carlos Scheidegger, Utah
Claudio Silva, Utah
Antonio Baptista, OHSU
Peter Lawson, OSU
Renee Bellinger, OSU
http://dev.pacificfishtrax.org/
2. 01/30/15 Bill Howe, eScience Institute 2
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Mashups
8. 01/30/15 Bill Howe, eScience Institute 8
All Science is becoming eScience
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
But: Acquisition now outpaces analysis
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics
9. 01/30/15 Bill Howe, eScience Institute 9
The long tail is getting fatter:
notebooks become spreadsheets (MB),
spreadsheets become databases (GB),
databases become clusters (TB)
clusters become clouds (PB)
[Figure: The Long Tail. Vertical axis: data inventory; horizontal axis: ordinal position]
Researchers with growing data management challenges
but limited resources for cyberinfrastructure
• No dedicated IT staff
• Overreliance on simple tools (e.g., spreadsheets)
CERN (~15PB/year)
LSST (~100PB)
PanSTARRS (~40PB)
SDSS (~100TB)
CARMEN (~50TB)
Ocean Modelers
Seismologists
Microbiologists
<Spreadsheet users>
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
10. 01/30/15 Bill Howe, eScience Institute 10
eScience Institute at UW
Mission
Help position the University of Washington at the
forefront of research both in modern eScience
techniques and technologies, and in the fields that
depend upon these techniques and technologies
Strategy
Increase the sharing of expertise and facilities
Bootstrap a cadre of Research Scientists
Add faculty in key fields
Make the entire University more effective
Launched July 1 with $1 million in permanent
funding from the Washington State Legislature
Sought, and need, $2 million
11. 01/30/15 Bill Howe, eScience Institute 11
Facets of Database Research
Web Services
Query Languages
Storage Management
Visualization; Workflow
Data Integration
Knowledge Extraction, Crawlers
Access Methods
Data Mining, Parallel Programming Models, Provenance
complexity-hiding interfaces
My research: customize and optimize for science
12. 01/30/15 Bill Howe, eScience Institute 12
The eScience Elephant
eScience
Cloud/Cluster: “Massive data parallelism”
Workflow: “flexibility; web services; integration”
Databases: “query processing; data independence; algebraic optimization; needles in haystacks”
Visualization: “Exploratory science; mapping quantitative data to intuition”
Provenance: “Reproducibility; forensics; sharing/reuse”
Mashups: “Rapid Prototyping; Simplified web programming”
13. 01/30/15 Bill Howe, eScience Institute 13
Some eScience Research
Query Algebra for new Data Type
Scientific Workflow Systems
Science Mashups
“Dataspace” systems
[Howe, Freire, Silva, et al. 2008]
[Howe, Green-Fishback, Maier, 2009]
[Howe, Maier, Rayner, Rucker 2008]
[Howe, Maier. 2004, 2005, 2006]
this talk
14. 01/30/15 Bill Howe, eScience Institute 14
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Science Mashups
16. Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management
Peter Lawson¹, Lorenzo Cianelli², Bobby Ireland²
17. 01/30/15 Bill Howe, eScience Institute 17
Enabling Scientific Discourse between
Fishermen and Fisheries Managers
21. 01/30/15 Bill Howe, eScience Institute 21
VisTrails for Collaboration
Bill Howe @ CMOP
computes salt flux
using GridFields
Erik Anderson @ Utah
adds vector
streamlines and
adjusts opacity
Bill Howe @ CMOP
adds an isosurface of
salinity
Peter Lawson adds
discussion of the
scientific
interpretation
22. 01/30/15 Bill Howe, eScience Institute 22
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Mashups
24. 01/30/15 Bill Howe, eScience Institute 24
Columbia River Estuary
red = high salinity (~34psu)
blue = fresh water (~0 psu)
25. 01/30/15 Bill Howe, eScience Institute 25
Accessing Model Results
CMOP ocean circulation models run in forecast or
hindcast mode
Models run serially in ~1/5 real time
On MPICH2, about 10x speedup before overhead dominates
Forecasts kept for 10 days, hindcasts kept indefinitely
(40TB + 25TB/year)
Access via a GridFields Web Service
GFServer optimizes and evaluates GF expressions and returns
the result
26. 01/30/15 Bill Howe, eScience Institute 26
Unstructured Grids
“unstructured grids” model
complex domains at multiple
scales simultaneously
red = high salinity (~34psu)
blue = fresh water (~0 psu)
Columbia River Estuary
….but complicate processing
27. 01/30/15 Bill Howe, eScience Institute 27
“Structured” Grids
“structured grids” do a poor job of
modeling complex features and
complicate multi-scale analysis.
Simple: Isomorphic to multidimensional arrays
But: Coastlines are not rectilinear
1) Missing values = wasted effort
Higher resolution = wasted effort in areas of low dynamism
2) Data associated with cells at multiple dimensions
28. 01/30/15 Bill Howe, eScience Institute 28
Structured grids are easy
The data model
(Cartesian products of coordinate variables)
immediately implies a representation,
(multidimensional arrays)
an API,
(reading and writing subslabs)
and an efficient implementation
(address calculation using array “shape”)
29. 01/30/15 Bill Howe, eScience Institute 29
Structured grid example
f(i, j), x(i), y(j)
f[4:6, 1:4] =
    for i in [4:6]:
        for j in [1:4]:
            addr = &f + j*|x| + i
NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
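The address calculation on this slide can be sketched in Python. This is a minimal illustration, not code from NetCDF or any of the systems named above. The function name `subslab_offsets`, the 8 x 5 grid, and the half-open reading of the ranges `[4:6]` and `[1:4]` are assumptions made for the example; the only thing taken from the slide is the formula `addr = &f + j*|x| + i`, in which `i` varies fastest in memory.

```python
def subslab_offsets(nx, i_range, j_range):
    """Flat offsets of f(i, j) in an array stored with i varying fastest.

    nx plays the role of |x|, the length of the x coordinate variable.
    """
    return [j * nx + i for i in i_range for j in j_range]


# Hypothetical 8 x 5 grid stored as one flat list, with f(i, j) = 10*j + i
# so that each value reveals its own coordinates.
nx, ny = 8, 5
f = [10 * j + i for j in range(ny) for i in range(nx)]

# Subslab f[4:6, 1:4], read here (an assumption) as i in {4, 5}, j in {1, 2, 3}.
offsets = subslab_offsets(nx, range(4, 6), range(1, 4))
values = [f[k] for k in offsets]
print(offsets)
print(values)
```

No index structure is consulted: the array shape alone is enough to locate every value, which is the sense in which the data model "immediately implies" an efficient implementation.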
30. 01/30/15 Bill Howe, eScience Institute 30
Unstructured Grids
( E, I ) = [Figure: triangle A with nodes 2, 3, 4 and edges x, y, z]
E0 = {2, 3, 4}
E1 = {x, y, z}
E2 = {A}
I = {z2, z4, Az, x2, x3, Ax, Ay, y4, y3}
…plus the transitive closure
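The single-triangle grid on this slide can be written down directly as sets. The sketch below is an illustration only: it reads each entry of I such as `z2` as the pair (z, 2), which is an assumption about the slide's notation, and the names `cells`, `incidence`, and `transitive_closure` are made up for the example.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (cell, cell) incidence pairs."""
    closure = set(pairs)
    while True:
        # Chain (a, b) and (b, d) into (a, d).
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


# Cells grouped by dimension: E0 (nodes), E1 (edges), E2 (faces).
cells = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}

# Incidence pairs as listed on the slide.
incidence = {
    ("z", "2"), ("z", "4"), ("A", "z"),
    ("x", "2"), ("x", "3"), ("A", "x"),
    ("A", "y"), ("y", "4"), ("y", "3"),
}

closed = transitive_closure(incidence)
face_nodes = sorted(n for (c, n) in closed if c == "A" and n in cells[0])
print(face_nodes)
```

The closure adds the face-to-node pairs that follow from face-to-edge and edge-to-node incidence, so the nodes of face A can be looked up without walking through its edges.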
31. 01/30/15 Bill Howe, eScience Institute 31
Subsetting
Full grid: Eastern Pacific Subset: mouth of
Columbia River
color: bathymetry
Washington
Oregon
California
32. 01/30/15 Bill Howe, eScience Institute 32
Correctness properties preserved
Grid is well-supported
(no ragged edges)
33. 01/30/15 Bill Howe, eScience Institute 33
Subset semantics
[Figure: example grids whose cells are labeled 0, 1, or 2. Panel titles: Input Simple Drop “Exact”]
Cut everything labeled “0”. What should be kept?
34. 01/30/15 Bill Howe, eScience Institute 34
What about Visualization Libs?
Different C++ classes, each dependent on data characteristics.
Changes to data characteristics require changes to the program
Logical equivalences obscured
No data independence
vtkExtractGeometry
vtkThreshold
vtkExtractGrid
vtkExtractVOI
vtkThresholdPoints
We want:
in VTK:
35. 01/30/15 Bill Howe, eScience Institute 35
GridField Data Model
A GridField with two attributes bound to the 2-cells
and four attributes bound to the 0-cells
x y salt temp
13.8 10.6 29.4 12.1
13.9 9.4 29.8 12.5
14.3 9.0 28.0 12.0
13.4 9.0 30.1 13.2
flux area
11.5 3.3
13.9 5.5
13.1 4.5
36. 01/30/15 Bill Howe, eScience Institute 36
GridField Operations
Lifted set operations
Union, Intersection, Cross Product
Scan/Bind
Read a grid/attribute
Restrict
Remove cells that do not satisfy a predicate
Accrete
Grow a grid by adding neighbors of cells
Regrid
Map the data of one grid onto another
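To make the data model and the Restrict operator concrete, here is a toy sketch in Python. It is not the GridFields library API: the class `ToyGridField`, the function `restrict`, the use of a Python callable as the predicate, and the node-to-triangle incidence are all hypothetical. Only the attribute values are taken from the data model slide. The sketch also assumes that a 2-cell survives a restriction on 0-cells only if all of its nodes survive, which is one possible subset semantics among several.

```python
from dataclasses import dataclass


@dataclass
class ToyGridField:
    nodes: dict      # 0-cell id -> attribute dict
    faces: dict      # 2-cell id -> attribute dict
    incidence: dict  # 2-cell id -> set of 0-cell ids it touches


def restrict(predicate, k, gf):
    """Remove k-cells whose attributes do not satisfy predicate."""
    if k == 0:
        nodes = {c: a for c, a in gf.nodes.items() if predicate(a)}
        # Assumption: drop any face that lost a node, so no face is ragged.
        faces = {c: a for c, a in gf.faces.items()
                 if gf.incidence[c] <= set(nodes)}
    elif k == 2:
        nodes = dict(gf.nodes)
        faces = {c: a for c, a in gf.faces.items() if predicate(a)}
    else:
        raise ValueError("this toy only stores 0-cells and 2-cells")
    incidence = {c: gf.incidence[c] for c in faces}
    return ToyGridField(nodes, faces, incidence)


gf = ToyGridField(
    nodes={
        "n0": {"x": 13.8, "y": 10.6, "salt": 29.4, "temp": 12.1},
        "n1": {"x": 13.9, "y": 9.4, "salt": 29.8, "temp": 12.5},
        "n2": {"x": 14.3, "y": 9.0, "salt": 28.0, "temp": 12.0},
        "n3": {"x": 13.4, "y": 9.0, "salt": 30.1, "temp": 13.2},
    },
    faces={
        "t0": {"flux": 11.5, "area": 3.3},
        "t1": {"flux": 13.9, "area": 5.5},
        "t2": {"flux": 13.1, "area": 4.5},
    },
    # Hypothetical incidence; the slide does not say which nodes each cell uses.
    incidence={
        "t0": {"n0", "n1", "n3"},
        "t1": {"n1", "n2", "n3"},
        "t2": {"n0", "n1", "n2"},
    },
)

salty = restrict(lambda a: a["salt"] > 29.0, 0, gf)
print(sorted(salty.nodes), sorted(salty.faces))
```

Restricting on salinity removes node `n2`, and with it the two triangles that depended on that node.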
37. 01/30/15 Bill Howe, eScience Institute 37
Usage Example (1)
H = Scan(context, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
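[Figure: H and rH rendered side by side. Callouts on the Restrict call: predicate (first argument), dimension (second argument)]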
color: bathymetry
38. 01/30/15 Bill Howe, eScience Institute 38
Usage Example (2)
H = Scan(context, "H")
rH = Restrict("h<500", 0, H)
[Figure: H and rH rendered side by side]
color: bathymetry
39. 01/30/15 Bill Howe, eScience Institute 39
Longer Example
H : (x,y,b)
V : (z)
Operator pipeline, with the intermediate result after each step:
H ⊗ V        ->  (H × V)
r(z>b)       ->  r(H × V)
b(s)         ->  b(r(H × V))
r(region)    ->  r(b(r(H × V)))
render
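The pipeline on this slide can be mimicked on plain tuples. The following Python sketch ignores grid topology entirely and treats each grid as a list of attribute records, so it illustrates only how the operators compose, not how GridFields evaluates them. The functions `cross`, `restrict`, and `bind`, the sample values, the placeholder salinity function, and the choice of `x == 0` as the "region" are all assumptions.

```python
from itertools import product

# Hypothetical horizontal grid H : (x, y, b) and vertical grid V : (z).
H = [{"x": 0, "y": 0, "b": 5.0}, {"x": 1, "y": 0, "b": 12.0}]
V = [{"z": 2.0}, {"z": 8.0}, {"z": 20.0}]


def cross(g1, g2):
    """Pair every record of g1 with every record of g2."""
    return [{**a, **b} for a, b in product(g1, g2)]


def restrict(predicate, g):
    """Keep the records that satisfy the predicate."""
    return [c for c in g if predicate(c)]


def bind(name, fn, g):
    """Attach a new attribute computed by fn to every record."""
    return [{**c, name: fn(c)} for c in g]


hv = cross(H, V)                                    # (H × V)
r1 = restrict(lambda c: c["z"] > c["b"], hv)        # r(z>b), as on the slide
b1 = bind("s", lambda c: 2 * c["z"], r1)            # b(s), placeholder values
result = restrict(lambda c: c["x"] == 0, b1)        # r(region)
print(len(hv), len(r1), len(result))
```

Each step takes a grid and returns a grid, which is what allows the steps to be reordered by an optimizer when the rewrite preserves the result.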
42. 01/30/15 Bill Howe, eScience Institute 42
Transect: Bad Plan
[Plan diagram. Inputs: H(x,y,b), V(z), P. Operators: ⊗, r(z>b), b(s), regrid. Transect grid: P ⊗ V]
1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Spatial Join 1) with 2)
43. 01/30/15 Bill Howe, eScience Institute 43
Transect: Optimized Plan
[Plan diagram. Inputs: H(x,y,b), V(z), P. Operators: regrid, ⊗, b(s), regrid. Transect grid: P ⊗ V]
1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Spatial Join 2) with 3)
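The difference between the two plans can be illustrated with a simplified analogy in Python. The sketch assumes that the 2D cells containing the transect points are already known (`transect_cells`), and it replaces the spatial join with a plain membership test; interpolation is left out. The function names and sizes are hypothetical. The point is only that both plans return the same cells while the optimized plan never materializes the full 3D grid.

```python
def bad_plan(h_cells, v_levels, transect_cells):
    """Build the full 3D grid first, then keep the transect columns."""
    full_3d = [(h, z) for h in h_cells for z in v_levels]
    wanted = set(transect_cells)
    result = [(h, z) for (h, z) in full_3d if h in wanted]
    return result, len(full_3d)


def optimized_plan(h_cells, v_levels, transect_cells):
    """Select the 2D cells first, then stack only those cells vertically."""
    wanted = set(transect_cells)
    hit = [h for h in h_cells if h in wanted]
    stacks = [(h, z) for h in hit for z in v_levels]
    return stacks, len(stacks)


h_cells = range(1000)        # hypothetical 2D cells
v_levels = range(20)         # hypothetical vertical levels
transect_cells = [3, 17, 256]

slow, slow_size = bad_plan(h_cells, v_levels, transect_cells)
fast, fast_size = optimized_plan(h_cells, v_levels, transect_cells)
print(slow == fast, slow_size, fast_size)
```

In this toy setting the intermediate result shrinks from 20,000 cells to 60, which is the kind of saving an algebraic rewrite can deliver without changing the answer.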
44. 01/30/15 Bill Howe, eScience Institute 44
1) Find cells containing points in P
45. 01/30/15 Bill Howe, eScience Institute 45
1) Find cells containing points in P
2) Construct “stacks” of cells
4) Join 2) with 3)
46. 01/30/15 Bill Howe, eScience Institute 46
Transect: Results
[Bar chart: running time in secs (axis 0 to 45) on an 800 MB dataset, one bar each for vtk(3D), interpolate, simple, interp_o, simple_o]
simple = nearest neighbor interpolation
*_o = optimized by restricting
to the region of interest
47. 01/30/15 Bill Howe, eScience Institute 47
Ongoing work
NSF Cluster Exploratory Award:
Where the Ocean Meets the Cloud:
Ad Hoc Longitudinal Analysis of Massive Mesh Data
Partnership between NSF, IBM, Google
Data-intensive computing
massive queries, not massive simulations
To “Cloud-Enable” GridFields and VisTrails
Goal: 10+-year climatologies at interactive speeds
Parallel implementations of GridField operators
via Hadoop (and Dryad!)
Provenance, repeatability, visualization via VisTrails
Connect rich desktop experience
Co-PIs from University of Utah
Claudio Silva and Juliana Freire
48. 01/30/15 Bill Howe, eScience Institute 48
Outline
eScience
Brief Demo
A Domain-Specific Query Algebra
Scientific Mashups
49. 01/30/15 Bill Howe, eScience Institute 49
Why Mashups?
Jim Gray: # of datasets scales as N^2
Each pairwise comparison generates a new dataset
Corollary: # of apps scales as N^2
Every pairwise comparison motivates a new mashup
To keep up, we need to
entrain new programmers,
make existing programmers more productive,
or both
50. 01/30/15 Bill Howe, eScience Institute 50
Satellite Images + Crime Incidence Reports
52. 01/30/15 Bill Howe, eScience Institute 52
Mashup Frameworks
A bottom up approach
Start with a GPL, add
Visual programming
Interactive type checking
Exploit a corpus of
previous examples
bootstrapping a mashup
mashup “autocomplete”
emit warnings
56. 01/30/15 Bill Howe, eScience Institute 56
Scientific Mashup Characteristics
Turn over more data per operation
Involve subtle visualizations
Must serve a diverse audience
57. 01/30/15 Bill Howe, eScience Institute 57
A Model for Scientific Mashups
The “Data Product” is the currency of scientific
communication with the public
Scientists are already adept at crafting them
(consider powerpoint slides and figures)
We take a top down approach:
Take a static data product ensemble,
endow it with interactivity,
publish it online,
allow others to repurpose it at runtime
67. 01/30/15 Bill Howe, eScience Institute 67
Key Concepts
A mashup is a synchronized
ensemble of data products
A data product is a mashable that
has been adapted for a particular
purpose
A mashable is an arbitrarily-complex
computation that returns a relation
An adaptor displays the relation to
the user and returns a subset
All adapted mashables accept input
Hence, user controls are modeled
as adapted mashables just like
“visual” data products
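These definitions can be sketched in a few lines of Python. Everything named here is hypothetical (the mashables `stations` and `readings`, the adaptors `pick_first` and `show_all`, the helper `adapt`, and the sample rows); the sketch only follows the vocabulary of the slide: a mashable returns a relation, an adaptor displays a relation and returns a subset, and a user control is just another adapted mashable.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Relation = List[Row]


def stations(_params: Row) -> Relation:
    """Mashable behind a user control: the list of available stations."""
    return [{"station": "s1"}, {"station": "s2"}]


def readings(params: Row) -> Relation:
    """Mashable behind a 'visual' product: readings for one station."""
    data = [
        {"station": "s1", "salt": 29.4},
        {"station": "s2", "salt": 12.0},
        {"station": "s1", "salt": 30.1},
    ]
    return [r for r in data if r["station"] == params.get("station")]


def pick_first(relation: Relation) -> Relation:
    """Stand-in for a drop-down: show the options, return the selection."""
    for row in relation:
        print("option:", row)
    return relation[:1]


def show_all(relation: Relation) -> Relation:
    """Stand-in for a table or plot: show every row, return them all."""
    for row in relation:
        print("row:", row)
    return relation


def adapt(mashable: Callable[[Row], Relation],
          adaptor: Callable[[Relation], Relation]) -> Callable[[Row], Relation]:
    """A data product is a mashable adapted for a particular purpose."""
    return lambda params: adaptor(mashable(params))


station_picker = adapt(stations, pick_first)
salinity_table = adapt(readings, show_all)

selection = station_picker({})
shown = salinity_table(selection[0])
```

Because the control and the table share one interface, the subset returned by the picker can be passed straight to the table as its input, which is how an ensemble stays synchronized in this sketch.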
70. 01/30/15 Bill Howe, eScience Institute 70
Inferring Data Flow
provides: {ABC}
requires: {AB}
71. 01/30/15 Bill Howe, eScience Institute 71
Inferring Data Flow
provides: {AC}
requires: {AB}
provides: {B}
72. 01/30/15 Bill Howe, eScience Institute 72
Inferring Data Flow
provides: {AC}
requires: {AB}
underspecified mashup
Solution:
1) use defaults
2) root environment
3) hand-specified parameter
73. 01/30/15 Bill Howe, eScience Institute 73
Inferring Data Flow
provides: {AB}
requires: {AB}
provides: {B}
overspecified mashup
Solution: Break ties:
1) Prefer nodes on longer paths
2) Use layout information
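The four "Inferring Data Flow" cases can be reproduced with a small matching routine. This Python sketch is an assumption about how such inference might work, not the authors' implementation: the functions `infer_sources` and `classify`, the node names `p1` and `p2`, and the label "well-specified" are introduced here for illustration. The tie-breaking rules for overspecified mashups are not implemented.

```python
def infer_sources(requires, providers):
    """Map each required attribute to the nodes that provide it."""
    return {
        attr: sorted(name for name, provided in providers.items()
                     if attr in provided)
        for attr in sorted(requires)
    }


def classify(sources):
    """Label a consumer by how its required attributes can be satisfied."""
    if any(len(names) == 0 for names in sources.values()):
        return "underspecified"   # some attribute has no provider
    if any(len(names) > 1 for names in sources.values()):
        return "overspecified"    # some attribute has competing providers
    return "well-specified"


requires = {"A", "B"}
cases = {
    "one provider": {"p1": {"A", "B", "C"}},
    "two providers": {"p1": {"A", "C"}, "p2": {"B"}},
    "missing B": {"p1": {"A", "C"}},
    "ambiguous B": {"p1": {"A", "B"}, "p2": {"B"}},
}

labels = {name: classify(infer_sources(requires, providers))
          for name, providers in cases.items()}
for name, label in labels.items():
    print(name, "->", label)
```

An underspecified result would then fall back to defaults, the root environment, or a hand-specified parameter, and an overspecified one would need the tie-breaking rules listed on the slide.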
74. 01/30/15 Bill Howe, eScience Institute 74
Audience-Tailored Mashups
K12 students
Experts
75. 01/30/15 Bill Howe, eScience Institute 75
Conclusions and Future Directions
We want to augment scientists, not programmers
Requires limiting expressiveness -- not yet clear where
to draw the line
More work on semi-automatically tailoring a
mashup at runtime
Automatically insert “context products”
See salinity, add a salinity colorbar
See a time, add a tide chart
See a location, add a map
Re-skin data products
“Dashboard-style” vs. “Wizard-style” apps
76. 01/30/15 Bill Howe, eScience Institute 76
http://escience.washington.edu
(retooled website coming soon)
77. 01/30/15 Bill Howe, eScience Institute 77
Comparison
System | Data Model | Operations | Services
GPL | * | * | Typing, maybe
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
78. 01/30/15 Bill Howe, eScience Institute 78
Mashups serve a diverse audience
student
public
scientist
79. 01/30/15 Bill Howe, eScience Institute 79
Computational Science
Theory
Experiment
Observation
Simulation (in silico)
Analysis (in ferro)
Data acquisition is
hypothesis-driven
Data acquisition is
technology-driven
80. 01/30/15 Bill Howe, eScience Institute 80
Motivation
Explore architectures blending techniques from
• mashups (rapid prototyping),
• visualization (interactivity, richness),
• workflow (data integration, provenance),
• databases (optimization, data independence)
to answer science questions at an Ocean Observatory
81. 01/30/15 Bill Howe, eScience Institute 81
Source: MayaVi website
PLOT3D, GDAL,
ShapeFile, OGC,
.obj, .vtk,
netCDF, HDF5,
FITS, others
Optimized for “throwing datasets”
and interactivity
Declarative query, interoperability,
repeatability generally lacking
Source: http://pogl.wordpress.com/2007/06/
Visualization
82. 01/30/15 Bill Howe, eScience Institute 82
Workflow
Emphasis on integration, web
services, flexibility
Unconstrained boxes-and-arrows
Any operation on any data type
Very expressive, but limited
opportunities for static reasoning
Type safety
Task parallelism
Cache safety
Optimization via rewrite rules
Result size / execution time estimation
Transparent data parallelism
Platform portability
To move the earth, you
need somewhere to
stand
83. 01/30/15 Bill Howe, eScience Institute 83
Databases
Pre-relational DBMS brittleness: if your
data changed, your application broke.
Early RDBMS were buggy and slow (and
often reviled), but required only 5% of the
application code.
physical data independence
logical data independence
files and pointers
relations
views
“Activities of users at terminals and
most application programs should
remain unaffected when the internal
representation of data is changed and
even when some aspects of the
external representation are changed.”
Key Idea: Programs that manipulate tabular
data exhibit an algebraic structure allowing
reasoning and manipulation independent of
physical data representation
85. 01/30/15 Bill Howe, eScience Institute 85
The eScience Elephant
“Like a snake”
“Like a hand fan” “Like a wall” “Like tree trunk”
“Like a spear”
“Like a rope”
There have traditionally been three legs to the scientific stool: theory, experiment, and observation.
These were mutually reinforcing: for example, observations might suggest theories, which could be tested by experiments.
Over the past 50 years, we have augmented the traditional “three legs of the stool” with an incredibly powerful new tool: high-speed computation.
In traditional “computational science,” we use simulation to conduct “virtual experiments” – experiments that can’t be conducted in the lab, for various reasons.
In the past 10 years, a fourth method of scientific discovery has emerged: Acquire data en masse, independent of any hypothesis, and then ask questions about it post hoc.
eScience is about massive and complex data -- data large enough to require automated or semi-automated analysis -- there’s too much to look at manually. Relevant tools are databases, visualization, cluster computing, data mining, machine learning, workflow, web services -- all integrated and optimized for scientific use.
Drowning in data; starving for information
We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
The long tail of eScience -- a huge number of scientists who struggle with data management, but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists.
They rely on spreadsheets, email, and maybe a shared file system.
Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources.
However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel?
Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
Data Management != Storage Management
Storage Management is
SATA/SCSI/Fiber
Backup policies and procedures
redundancy decisions (RAID 0, 1+0, 0+1, 5)
Data Management is
Access methods
Query languages
Data Mining, Analysis, Visualization
Data Integration
The blind men and the eScience elephant.
Collaboration across many state, federal, and university agencies
Very interesting datasets, and all of it is freely available for any purpose.
File formats, programming languages, and DBMS exist that are organized around this simple property
With an unstructured grid, you explicitly track cells of various dimensions and the incidence relationship that connects them.
Specialized “Boundary Representation” data structures exist for special cases, but there is no general data model.
Different semantics for subsetting may be defined. One particular semantics tends to preserve intuitive correctness properties.
On the order of hundreds of points. Manual browsing.
“Make mashups easy to create”
“Raise the level of abstraction”
“Empower non-programmers to be programmers”
The seven people who know your
A climatology is a long-term average
~1 million observations. Can’t render each dot in, say, JavaScript. We need services that can produce these visualizations given parameters. This work is about synchronizing visualizations, blessing them with interactivity, and publishing them on the web.