End-to-End eScience
Integrating Query, Workflow,
Visualization, and Mashups
at an Ocean Observatory
Bill Howe, University of Washington
Harrison Green-Fishback, PSU
David Maier, PSU
Erik Anderson, Utah
Emanuele Santos, Utah
Juliana Freire, Utah
Carlos Scheidegger, Utah
Claudio Silva, Utah
Antonio Baptista, OHSU
Peter Lawson, OSU
Renee Bellinger, OSU
http://dev.pacificfishtrax.org/

01/30/15 Bill Howe, eScience Institute 2
Outline
 eScience
 Brief Demo
 A Domain-Specific Query Algebra
 Mashups

Theory
Experiment
Observation
slide: Ed Lazowska

Theory
Experiment
Observation
Computational
Science
slide: Ed Lazowska

Theory
Experiment
Observation
Computational
Science
eScience

All Science is becoming eScience
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
But: Acquisition now outpaces analysis
 Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
 Medicine: ubiquitous digital records, MRI, ultrasound
 Oceanography: high-resolution models, cheap sensors, satellites
 Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics

The long tail is getting fatter:
notebooks become spreadsheets (MB),
spreadsheets become databases (GB),
databases become clusters (TB)
clusters become clouds (PB)
The Long Tail
[Figure axes: data inventory vs. ordinal position]
Researchers with growing data management challenges
but limited resources for cyberinfrastructure
• No dedicated IT staff
• Overreliance on simple tools (e.g., spreadsheets)
CERN (~15PB/year)
LSST (~100PB)
PanSTARRS (~40PB)
SDSS (~100TB)
CARMEN (~50TB)
Ocean Modelers
Seismologists
Microbiologists
<Spreadsheet users>
“The future is already here. It’s just not very evenly distributed.” -- William Gibson

eScience Institute at UW
 Mission
 Help position the University of Washington at the
forefront of research both in modern eScience
techniques and technologies, and in the fields that
depend upon these techniques and technologies
 Strategy
 Increase the sharing of expertise and facilities
 Bootstrap a cadre of Research Scientists
 Add faculty in key fields
 Make the entire University more effective
 Launched July 1 with $1 million in permanent
funding from the Washington State Legislature
 Sought, and need, $2 million

Facets of Database Research
Web Services
Query Languages
Storage Management
Visualization; Workflow
Data Integration
Knowledge Extraction, Crawlers
Access Methods
Data Mining, Parallel Programming Models, Provenance
complexity-hiding interfaces
My research: customize and optimize for science

The eScience Elephant
eScience
Cloud/Cluster: “Massive data parallelism”
Workflow: “flexibility; web services; integration”
Databases: “query processing; data independence; algebraic optimization; needles in haystacks”
Visualization: “Exploratory science; mapping quantitative data to intuition”
Provenance: “Reproducibility; forensics; sharing/reuse”
Mashups: “Rapid Prototyping; Simplified web programming”

Some eScience Research
Query Algebra for new Data Type
Scientific Workflow Systems
Science Mashups
“Dataspace” systems
[Howe, Freire, Silva, et al. 2008]
[Howe, Green-Fishback, Maier, 2009]
[Howe, Maier, Rayner, Rucker 2008]
[Howe, Maier. 2004, 2005, 2006]
this talk

Outline
 eScience
 Brief Demo
 Science Mashups

VisTrails for Computation

Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management
Peter Lawson (1), Lorenzo Cianelli (2), Bobby Ireland (2)

Enabling Scientific Discourse between
Fishermen and Fisheries Managers

VisTrails for Collaboration
Bill Howe @ CMOP
computes salt flux
using GridFields
Erik Anderson @ Utah
adds vector
streamlines and
adjusts opacity
Bill Howe @ CMOP
adds an isosurface of
salinity
Peter Lawson adds
discussion of the
scientific
interpretation

Outline
 eScience
 Brief Demo
 Mashups

CMOP

Columbia River Estuary
red = high salinity (~34psu)
blue = fresh water (~0 psu)

Accessing Model Results
 CMOP ocean circulation models run in forecast or
hindcast mode
 Models run serially in ~1/5 real time
 On MPICH2, about 10x speedup before overhead dominates
 Forecasts kept for 10 days, hindcasts kept indefinitely
(40TB + 25TB/year)
 Access via a GridFields Web Service
 GFServer optimizes and evaluates GF expressions and returns
the result

Unstructured Grids
“unstructured grids” model
complex domains at multiple
scales simultaneously
red = high salinity (~34psu)
blue = fresh water (~0 psu)
Columbia River Estuary
….but complicate processing

“Structured” Grids
“structured grids” do a poor job of
modeling complex features and
complicate multi-scale analysis.
But: Coastlines are not rectilinear
1) Missing values = wasted effort
Higher resolution = wasted effort in areas of low dynamism
2) Data associated with cells at multiple dimensions
Simple: Isomorphic to multidimensional arrays

Structured grids are easy
 The data model
(Cartesian products of coordinate variables)
 immediately implies a representation,
(multidimensional arrays)
 an API,
(reading and writing subslabs)
 and an efficient implementation
(address calculation using array “shape”)

Structured grid example
f(i, j), x(i), y(j)
f[4:6, 1:4] =
for i in [4:6]:
    for j in [1:4]:
        addr = &f + j*|x| + i
NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
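A minimal Python sketch of the address calculation above. It assumes a flat list stands in for memory, that i (the x index) varies fastest as in addr = &f + j*|x| + i, and that ranges are half-open. The names flat_address and subslab and the 8 x 5 grid are illustrative only.

```python
def flat_address(i, j, nx):
    """Offset of f(i, j) in a flat buffer where i (the x index) varies fastest."""
    return j * nx + i


def subslab(f_flat, nx, i_range, j_range):
    """Read f[i0:i1, j0:j1] from the flat buffer using only index arithmetic."""
    i0, i1 = i_range
    j0, j1 = j_range
    return [[f_flat[flat_address(i, j, nx)] for j in range(j0, j1)]
            for i in range(i0, i1)]


# Hypothetical 8 x 5 grid: |x| = 8, |y| = 5, filled so that f(i, j) = 10*i + j.
nx, ny = 8, 5
f_flat = [10 * i + j for j in range(ny) for i in range(nx)]

# Analogue of f[4:6, 1:4] on the slide.
slab = subslab(f_flat, nx, (4, 6), (1, 4))
```

No lookup structure is needed: the array "shape" alone determines every address, which is what makes structured grids easy.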

Unstructured Grids
(E, I) = [figure: triangle A with edges x, y, z and vertices 2, 3, 4]
E0 = {2,3,4}
E1 = {x,y,z}
E2 = {A}
I = {z2, z4, Az, x2, x3, Ax, Ay, y4, y3}
…plus the transitive closure
Subsetting
Full grid: Eastern Pacific
Subset: mouth of Columbia River
color: bathymetry
Washington
Oregon
California

Correctness properties preserved
Grid is well-supported
(no ragged edges)

Subset semantics
[Figure: four panels labeled Input, Simple, Drop, and “Exact”, showing grid cells with numeric labels]
Cut everything labeled “0”. What should be kept?

What about Visualization Libs?
 Different C++ classes, each dependent on data characteristics.
 Changes to data characteristics require changes to the program
 Logical equivalences obscured
 No data independence
vtkExtractGeometry
vtkThreshold
vtkExtractGrid
vtkExtractVOI
vtkThresholdPoints
We want:
in VTK:

GridField Data Model
A GridField with two attributes bound to the 2-cells
and four attributes bound to the 0-cells
Attributes bound to the 0-cells:
x     y     salt  temp
13.8  10.6  29.4  12.1
13.9  9.4   29.8  12.5
14.3  9.0   28.0  12.0
13.4  9.0   30.1  13.2
Attributes bound to the 2-cells:
flux  area
11.5  3.3
13.9  5.5
13.1  4.5
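One way to picture this data model in Python, as a sketch only: a grid stores its cells by dimension, and each attribute is bound to the cells of one dimension with exactly one value per cell. The class layout and the cell ids (n0..n3, t0..t2) are hypothetical; the attribute values are the ones shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class GridField:
    """A grid plus per-dimension attribute tables (one value per cell)."""
    cells: dict                                     # dimension -> list of cell ids
    attributes: dict = field(default_factory=dict)  # dimension -> {name: values}

    def bind(self, dim, name, values):
        """Attach one attribute to the cells of dimension `dim`."""
        if len(values) != len(self.cells[dim]):
            raise ValueError("need exactly one value per cell")
        self.attributes.setdefault(dim, {})[name] = list(values)

    def row(self, dim, cell_id):
        """Return all attribute values bound to one cell."""
        pos = self.cells[dim].index(cell_id)
        return {name: vals[pos] for name, vals in self.attributes.get(dim, {}).items()}


# Cell ids are hypothetical; the attribute values come from the slide.
gf = GridField(cells={0: ["n0", "n1", "n2", "n3"], 2: ["t0", "t1", "t2"]})
gf.bind(0, "x", [13.8, 13.9, 14.3, 13.4])
gf.bind(0, "y", [10.6, 9.4, 9.0, 9.0])
gf.bind(0, "salt", [29.4, 29.8, 28.0, 30.1])
gf.bind(0, "temp", [12.1, 12.5, 12.0, 13.2])
gf.bind(2, "flux", [11.5, 13.9, 13.1])
gf.bind(2, "area", [3.3, 5.5, 4.5])
```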

GridField Operations
 Lifted set operations
 Union, Intersection, Cross Product
 Scan/Bind
 Read a grid/attribute
 Restrict
 Remove cells that do not satisfy a predicate
 Accrete
 Grow a grid by adding neighbors of cells
 Regrid
 Map the data of one grid onto another

Usage Example (1)
H = Scan(context, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
Restrict arguments: predicate, dimension, GridField
[Figures: H and rH; color: bathymetry]

Usage Example (2)
H = Scan(context, "H")
rH = Restrict("h<500", 0, H)
[Figures: H and rH; color: bathymetry]
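A rough Python analogue of Restrict at dimension 0, not the GridFields API. It assumes the predicate is a Python callable rather than a string, and that a triangle is dropped whenever one of its corner nodes is removed, so that the result has no ragged edges. The function restrict and the four-node grid are hypothetical.

```python
def restrict(predicate, nodes, triangles):
    """Keep nodes satisfying `predicate`, then keep only triangles whose
    three corner nodes all survived, so no triangle references a missing node."""
    kept_nodes = {nid: attrs for nid, attrs in nodes.items() if predicate(attrs)}
    kept_triangles = {
        tid: corners for tid, corners in triangles.items()
        if all(c in kept_nodes for c in corners)
    }
    return kept_nodes, kept_triangles


# Hypothetical mini-grid: node id -> attributes (h = depth), triangle id -> corner nodes.
nodes = {
    1: {"x": 0.0, "y": 0.0, "h": 120.0},
    2: {"x": 1.0, "y": 0.0, "h": 480.0},
    3: {"x": 0.0, "y": 1.0, "h": 300.0},
    4: {"x": 1.0, "y": 1.0, "h": 950.0},
}
triangles = {"A": (1, 2, 3), "B": (2, 4, 3)}

# Analogue of Restrict("h<500", 0, H) from the slide.
r_nodes, r_triangles = restrict(lambda a: a["h"] < 500, nodes, triangles)
```

Other subset semantics are possible; this sketch picks one of them for illustration.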

Longer Example
H : (x,y,b)
V : (z)
H ⊗ V gives (H × V)
r(z>b) gives r(H × V)
b(s) gives b(r(H × V))
r(region) gives r(b(r(H × V)))
render

Optimization
Before: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region)
After: r(x,y) on H(x,y,b) and r(z) on V(z), then ⊗, r(z>b), b(s)
*Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
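The rewrite can be checked on toy data. The sketch below assumes the region predicate splits into an (x,y) part and a z part, which is what allows it to be pushed below the cross product; a predicate such as z>b mixes both inputs and has to stay above it. Plain Python lists stand in for gridfields, and the names in_xy and in_z are hypothetical.

```python
from itertools import product

# Hypothetical tiny inputs: H holds (x, y, b) nodes, V holds z levels.
H = [(0, 0, 5), (1, 0, 8), (2, 0, 3), (3, 0, 9)]
V = [0, 2, 4, 6, 8]


def in_xy(h):
    """Horizontal part of the region predicate."""
    x, y, b = h
    return 1 <= x <= 2


def in_z(z):
    """Vertical part of the region predicate."""
    return z <= 4


# Naive plan: build the full cross product, then restrict to the region.
full = list(product(H, V))
naive = [(h, z) for h, z in full if in_xy(h) and in_z(z)]

# Rewritten plan: restrict each factor first, then take the cross product.
pushed = list(product(filter(in_xy, H), filter(in_z, V)))
```

Both plans return the same 6 pairs, but the rewritten plan never materializes the 20-pair intermediate result.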

Transect (Vertical Slice)
P

Transect: Bad Plan
Plan: H(x,y,b) ⊗ V(z), then r(z>b), b(s), then regrid onto P ⊗ V
1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Spatial Join 1) with 2)

Transect: Optimized Plan
Plan: regrid H(x,y,b) onto P, then ⊗ V(z), b(s), then regrid onto P ⊗ V
1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Spatial Join 2) with 3)
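Steps 1) and 2) can be sketched in Python as follows. This is an illustration under stated assumptions, not the GridFields implementation: cells are triangles given by corner coordinates, point location is a brute-force sign test, and the names cross, contains, H_cells, P, and V are hypothetical.

```python
from itertools import product


def cross(a, b, p):
    """Signed area test: > 0 if p is left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def contains(triangle, p):
    """True if p lies inside or on the boundary of the triangle."""
    a, b, c = triangle
    signs = [cross(a, b, p), cross(b, c, p), cross(c, a, p)]
    return not (min(signs) < 0 and max(signs) > 0)


# Hypothetical inputs: three 2D cells of H, a transect P, and depth levels V.
H_cells = {
    "A": ((0, 0), (1, 0), (0, 1)),
    "B": ((1, 0), (1, 1), (0, 1)),
    "C": ((2, 0), (3, 0), (2, 1)),
}
P = [(0.2, 0.2), (0.8, 0.8)]
V = [0, 1, 2]

# Step 1: find the 2D cell containing each transect point
# (raises StopIteration if a point falls outside every cell).
hits = [next(cid for cid, tri in H_cells.items() if contains(tri, p)) for p in P]

# Step 2: build "stacks" only for the cells that were hit.
stacks = list(product(hits, V))
```

Cell C is never extruded into 3D, which is the saving over the plan that constructs the full-size 3D grid first.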

1) Find cells containing points in P
2) Construct “stacks” of cells
4) Join 2) with 3)

Transect: Results
[Bar chart: run time in secs (axis 0 to 45) on an 800 MB dataset for vtk(3D), interpolate, simple, interp_o, simple_o]
simple = nearest neighbor interpolation
*_o = optimized by restricting to the region of interest

Ongoing work
 NSF Cluster Exploratory Award:
 Where the Ocean Meets the Cloud:
Ad Hoc Longitudinal Analysis of Massive Mesh Data
 Partnership between NSF, IBM, Google
 Data-intensive computing
 massive queries, not massive simulations
 To “Cloud-Enable” GridFields and VisTrails
 Goal: 10+-year climatologies at interactive speeds
 Parallel implementations of GridField operators via Hadoop (and Dryad!)
 Provenance, repeatability, visualization via VisTrails
 Connect rich desktop experience
 Co-PIs from University of Utah
 Claudio Silva and Juliana Freire

Outline
 eScience
 Brief Demo
 Scientific Mashups

Why Mashups?
 Jim Gray: # of datasets scales as N²
 Each pairwise comparison generates a new dataset
 Corollary: # of apps scales as N²
 Every pairwise comparison motivates a new mashup
 To keep up, we need to
 entrain new programmers,
 make existing programmers more productive,
 or both

Satellite Images + Crime Incidence Reports

Twitter Feed + Flickr Stream

Mashup Frameworks
 A bottom up approach
 Start with a GPL, add
 Visual programming
 Interactive type checking
 Exploit a corpus of
previous examples

bootstrapping a mashup

mashup “autocomplete”

emit warnings

Scientific Mashup Characteristics
 Turn over more data per operation
 Involve subtle visualizations
 Must serve a diverse audience

A Model for Scientific Mashups
 The “Data Product” is the currency of scientific
communication with the public
 Scientists are already adept at crafting them
(consider powerpoint slides and figures)
 We take a top down approach:
 Take a static data product ensemble,
 endow it with interactivity,
 publish it online,
 allow others to repurpose it at runtime

Data Product Ensemble

Mashup

CTD: Conductivity, Temperature, Depth

Sampling

Event Detection: Red Water

CTD Cast

Flowthrough

Mashup

Key Concepts
 A mashup is a synchronized
ensemble of data products
 A data product is a mashable that
has been adapted for a particular
purpose
 A mashable is an arbitrarily-complex
computation that returns a relation
 An adaptor displays the relation to
the user and returns a subset
 All adapted mashables accept input
 Hence, user controls are modeled
as adapted mashables just like
“visual” data products
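These concepts can be sketched in Python, assuming a relation is a list of dict rows and that both the mashable and the adaptor are plain callables. The class AdaptedMashable, the casts function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Relation = List[Row]


@dataclass
class AdaptedMashable:
    """A mashable (computation returning a relation) paired with an adaptor
    (shows the relation and returns the subset the user selected)."""
    mashable: Callable[[Row], Relation]
    adaptor: Callable[[Relation], Relation]

    def run(self, inputs: Row) -> Relation:
        return self.adaptor(self.mashable(inputs))


def casts(inputs: Row) -> Relation:
    """Hypothetical mashable: CTD samples, optionally filtered by cast number."""
    data = [{"cast": 1, "depth": 5, "salt": 12.0},
            {"cast": 1, "depth": 20, "salt": 29.5},
            {"cast": 2, "depth": 5, "salt": 3.1}]
    wanted = inputs.get("cast")
    return [r for r in data if wanted is None or r["cast"] == wanted]


# A user control is just another adapted mashable: this adaptor stands in for
# a drop-down in which the user picked cast 1.
picker = AdaptedMashable(
    mashable=lambda inputs: [{"cast": 1}, {"cast": 2}],
    adaptor=lambda rel: [r for r in rel if r["cast"] == 1],
)
# A "visual" data product whose adaptor passes every row through.
profile = AdaptedMashable(mashable=casts, adaptor=lambda rel: rel)

selection = picker.run({})
plotted = profile.run(selection[0])
```

Because the control and the plot share one interface, synchronizing them only requires feeding the output of one into the input of the other.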

Adapted Mashables

Data Flow Graph

Inferring Data Flow
provides: {ABC}
requires: {AB}

Inferring Data Flow
provides: {AC}
requires: {AB}
provides: {B}

Inferring Data Flow
provides: {AC}
requires: {AB}
underspecified mashup
Solution:
1) use defaults
2) root environment
3) hand-specified parameter

Inferring Data Flow
provides: {AB}
requires: {AB}
provides: {B}
overspecified mashup
Solution: Break ties:
1) Prefer nodes on longer paths
2) Use layout information
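A hedged Python sketch of this inference, assuming each data product declares sets of provided and required attributes. The function infer_edges and the node names are hypothetical; the tie-breaking rules (longer paths, layout) are left to the caller, which receives the list of ambiguous choices.

```python
def infer_edges(nodes, defaults=None):
    """nodes: name -> {"provides": set, "requires": set}.
    Returns (edges, from_defaults, ambiguous)."""
    defaults = defaults or {}
    edges, from_defaults, ambiguous = [], [], []
    for consumer, spec in nodes.items():
        for attr in sorted(spec["requires"]):
            providers = sorted(name for name, other in nodes.items()
                               if name != consumer and attr in other["provides"])
            if len(providers) == 1:
                edges.append((providers[0], consumer, attr))
            elif not providers:
                # Underspecified: fall back to a default value, if any.
                from_defaults.append((consumer, attr, defaults.get(attr)))
            else:
                # Overspecified: the caller must break the tie.
                ambiguous.append((consumer, attr, providers))
    return edges, from_defaults, ambiguous


# Well-specified: provides {AC} and provides {B} feed requires {AB}.
ok = infer_edges({
    "p1": {"provides": {"A", "C"}, "requires": set()},
    "p2": {"provides": {"B"}, "requires": set()},
    "c": {"provides": set(), "requires": {"A", "B"}},
})

# Underspecified: nothing provides B, so a default is used.
under = infer_edges({
    "p1": {"provides": {"A", "C"}, "requires": set()},
    "c": {"provides": set(), "requires": {"A", "B"}},
}, defaults={"B": 0})

# Overspecified: two nodes provide B.
over = infer_edges({
    "p1": {"provides": {"A", "B"}, "requires": set()},
    "p2": {"provides": {"B"}, "requires": set()},
    "c": {"provides": set(), "requires": {"A", "B"}},
})
```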

Audience-Tailored Mashups
K12 students
Experts

Conclusions and Future Directions
 We want to augment scientists, not programmers
 Requires limiting expressiveness -- not yet clear where
to draw the line
 More work on semi-automatically tailoring a
mashup at runtime
 Automatically insert “context products”
 See salinity, add a salinity colorbar
 See a time, add a tide chart
 See a location, add a map
 Re-skin data products
 “Dashboard-style” vs. “Wizard-style” apps

http://escience.washington.edu
(retooled website coming soon)

Comparison
System | Data Model | Operations | Services
GPL | * | * | Typing, maybe
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control

Mashups serve a diverse audience
student
public
scientist

Computational Science
 Theory
 Experiment
 Observation
 Simulation (in silico)
 Analysis (in ferro)
Data acquisition is
hypothesis-driven
Data acquisition is
technology-driven

Motivation
Explore architectures blending techniques from
• mashups (rapid prototyping),
• visualization (interactivity, richness),
• workflow (data integration, provenance),
• databases (optimization, data independence)
to answer science questions at an Ocean Observatory

Visualization
PLOT3D, GDAL, ShapeFile, OGC, .obj, .vtk, netCDF, HDF5, FITS, others
Optimized for “throwing datasets” and interactivity
Declarative query, interoperability, repeatability generally lacking
Source: MayaVi website
Source: http://pogl.wordpress.com/2007/06/

Workflow
 Emphasis on integration, web
services, flexibility
 Unconstrained boxes-and-arrows
 Any operation on any data type
 Very expressive, but limited
opportunities for static reasoning
 Type safety
 Task parallelism
 Cache safety
 Optimization via rewrite rules
 Result size / execution time estimation
 Transparent data parallelism
 Platform portability
To move the earth, you
need somewhere to
stand

Databases
Pre-relational DBMS brittleness: if your
data changed, your application broke.
Early RDBMS were buggy and slow (and
often reviled), but required only 5% of the
application code.
physical data independence
logical data independence
files and pointers
relations
views
“Activities of users at terminals and
most application programs should
remain unaffected when the internal
representation of data is changed and
even when some aspects of the
external representation are changed.”
Key Idea: Programs that manipulate tabular
data exhibit an algebraic structure allowing
reasoning and manipulation independent of
physical data representation

Heterogeneity also drives costs
[Figure axes: # of bytes vs. # of data types]
CERN (~15PB/year, particle interactions)
LSST (~100PB; images, objects)
PanSTARRS (~40PB; images, objects, trajectories)
OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
SDSS (~100TB; images, objects)
Biologists (~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)

The eScience Elephant
“Like a snake”
“Like a hand fan”
“Like a wall”
“Like tree trunk”
“Like a spear”
“Like a rope”
