End-to-End eScience
Integrating Query, Workflow,
Visualization, and Mashups
at an Ocean Observatory
Bill Howe,
University of Washington
Harrison Green-Fishback, PSU
David Maier, PSU
Erik Anderson, Utah
Emanuele Santos, Utah
Juliana Freire, Utah
Carlos Scheidegger, Utah
Claudio Silva, Utah
Antonio Baptista, OHSU
Peter Lawson, OSU
Renee Bellinger, OSU
http://dev.pacificfishtrax.org/
01/30/15 Bill Howe, eScience Institute 2
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Mashups
Theory
Experiment
Observation
Computational Science
eScience
slide: Ed Lazowska
All Science is becoming eScience
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
But: Acquisition now outpaces analysis
• Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Medicine: ubiquitous digital records, MRI, ultrasound
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X → Analytical X → Computational X → X-informatics
The long tail is getting fatter:
notebooks become spreadsheets (MB),
spreadsheets become databases (GB),
databases become clusters (TB),
clusters become clouds (PB)
The Long Tail
[Figure: data inventory (y-axis) vs. ordinal position (x-axis). Labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>]
Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Overreliance on simple tools (e.g., spreadsheets)
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
eScience Institute at UW
• Mission
  • Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon these techniques and technologies
• Strategy
  • Increase the sharing of expertise and facilities
  • Bootstrap a cadre of Research Scientists
  • Add faculty in key fields
  • Make the entire University more effective
• Launched July 1 with $1 million in permanent funding from the Washington State Legislature
  • Sought, and need, $2 million
Facets of Database Research
• Web Services
• Query Languages
• Storage Management
• Visualization; Workflow
• Data Integration
• Knowledge Extraction, Crawlers
• Access Methods
• Data Mining, Parallel Programming Models, Provenance
• complexity-hiding interfaces
My research: customize and optimize for science
The eScience Elephant
eScience
• Workflow: “flexibility; web services; integration”
• Databases: “query processing; data independence; algebraic optimization; needles in haystacks”
• Visualization: “Exploratory science; mapping quantitative data to intuition”
• Provenance: “Reproducibility; forensics; sharing/reuse”
• Cloud/Cluster: “Massive data parallelism”
• Mashups: “Rapid Prototyping; Simplified web programming”
Some eScience Research
Query Algebra for new Data Type
Scientific Workflow Systems
Science Mashups
“Dataspace” systems
[Howe, Freire, Silva, et al. 2008]
[Howe, Green-Fishback, Maier, 2009]
[Howe, Maier, Rayner, Rucker 2008]
[Howe, Maier. 2004, 2005, 2006]
this talk
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Science Mashups
VisTrails for Computation
Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management
Peter Lawson¹, Lorenzo Cianelli², Bobby Ireland²
Enabling Scientific Discourse between
Fishermen and Fisheries Managers
VisTrails for Collaboration
• Bill Howe @ CMOP computes salt flux using GridFields
• Erik Anderson @ Utah adds vector streamlines and adjusts opacity
• Bill Howe @ CMOP adds an isosurface of salinity
• Peter Lawson adds discussion of the scientific interpretation
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Mashups
CMOP
Columbia River Estuary
red = high salinity (~34psu)
blue = fresh water (~0 psu)
Accessing Model Results
• CMOP ocean circulation models run in forecast or hindcast mode
  • Models run serially in ~1/5 real time
  • On MPICH2, about 10x speedup before overhead dominates
• Forecasts kept for 10 days, hindcasts kept indefinitely (40TB + 25TB/year)
• Access via a GridFields Web Service
  • GFServer optimizes and evaluates GF expressions and returns the result
Unstructured Grids
“Unstructured grids” model complex domains at multiple scales simultaneously…
…but complicate processing
[Figure: Columbia River Estuary; red = high salinity (~34 psu), blue = fresh water (~0 psu)]
“Structured” Grids
Simple: Isomorphic to multidimensional arrays
But: Coastlines are not rectilinear
1) Missing values = wasted effort
   Higher resolution = wasted effort in areas of low dynamism
2) Data associated with cells at multiple dimensions
“Structured grids” do a poor job of modeling complex features and complicate multi-scale analysis.
Structured grids are easy
• The data model (Cartesian products of coordinate variables)
• immediately implies a representation, (multidimensional arrays)
• an API, (reading and writing subslabs)
• and an efficient implementation (address calculation using array “shape”)
Structured grid example
f(i, j), with coordinate variables x(i) and y(j)
f[4:6, 1:4] =
for i in [4:6]:
    for j in [1:4]:
        addr = &f + j*|x| + i
NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
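The address calculation on this slide can be sketched in Python. This is a minimal illustration, assuming the field is stored as one flat buffer in which the x index i varies fastest, and assuming half-open slice bounds as in Python; the names `flat_address` and `read_subslab` are hypothetical and the data is made up.

```python
def flat_address(i, j, nx):
    """Offset of f(i, j) in a flat buffer where i (the x index) varies fastest."""
    return j * nx + i


def read_subslab(flat, nx, i_range, j_range):
    """Gather f[i_range, j_range] using only address arithmetic."""
    return [[flat[flat_address(i, j, nx)] for j in j_range] for i in i_range]


# Toy field with |x| = 8 and |y| = 5; each value encodes its own (i, j) as 10*j + i.
nx, ny = 8, 5
flat = [10 * j + i for j in range(ny) for i in range(nx)]

sub = read_subslab(flat, nx, range(4, 6), range(1, 4))
print(sub)  # [[14, 24, 34], [15, 25, 35]]
```

No index structure is consulted: the array shape alone determines where every value lives, which is the sense in which the implementation follows from the data model.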
Unstructured Grids
(E, I) = [Figure: triangle A with vertices 2, 3, 4 and edges x, y, z]
E0 = {2,3,4}
E1 = {x,y,z}
E2 = {A}
I = {z2, z4, Az, x2, x3, Ax, Ay, y4, y3} …plus the transitive closure
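The (E, I) representation of this one-triangle grid can be written down directly. The sketch below is an illustration, not the GridFields implementation; it assumes each incidence pair reads as (cell, incident lower-dimensional cell), and `transitive_closure` is a hypothetical helper.

```python
# Cells of the single-triangle grid from the slide, grouped by dimension.
E = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}

# Incidence pairs as listed on the slide, read as (cell, incident lower-dimensional cell).
I = {
    ("z", "2"), ("z", "4"), ("A", "z"),
    ("x", "2"), ("x", "3"), ("A", "x"),
    ("A", "y"), ("y", "4"), ("y", "3"),
}


def transitive_closure(pairs):
    """Repeatedly add (a, c) whenever (a, b) and (b, c) are both present."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


closed = transitive_closure(I)
print(sorted(closed - I))  # [('A', '2'), ('A', '3'), ('A', '4')]
```

The closure adds the triangle-to-vertex pairs, so a query such as "which nodes belong to cell A" needs no intermediate hop through the edges.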
Subsetting
Full grid: Eastern Pacific (Washington, Oregon, California)
Subset: mouth of Columbia River
color: bathymetry
Correctness properties preserved
Grid is well-supported
(no ragged edges)
Subset semantics
[Figure: three panels (Input, Simple Drop, “Exact”) showing a small grid whose cells carry numeric labels]
Cut everything labeled “0”. What should be kept?
What about Visualization Libs?
• Different C++ classes, each dependent on data characteristics.
• Changes to data characteristics require changes to the program
• Logical equivalences obscured
• No data independence
We want:
in VTK: vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints
GridField Data Model
A GridField with two attributes bound to the 2-cells
and four attributes bound to the 0-cells
0-cells:
x     y     salt  temp
13.8  10.6  29.4  12.1
13.9  9.4   29.8  12.5
14.3  9.0   28.0  12.0
13.4  9.0   30.1  13.2
2-cells:
flux  area
11.5  3.3
13.9  5.5
13.1  4.5
GridField Operations
• Lifted set operations
  • Union, Intersection, Cross Product
• Scan/Bind
  • Read a grid/attribute
• Restrict
  • Remove cells that do not satisfy a predicate
• Accrete
  • Grow a grid by adding neighbors of cells
• Regrid
  • Map the data of one grid onto another
Usage Example (1)
H = Scan(context, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
(first argument: predicate; second argument: dimension)
[Figure: H and rH; color: bathymetry]
Usage Example (2)
H = Scan(context, "H")
rH = Restrict("h<500", 0, H)
[Figure: H and rH; color: bathymetry]
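How Restrict behaves on a tiny grid can be sketched in plain Python. This is not the GridFields API: the `ToyGridField` class and `restrict` function are hypothetical, the node values are made up, and the rule that a triangle is dropped when any of its nodes is removed is an assumption chosen so the result has no ragged edges.

```python
from dataclasses import dataclass


@dataclass
class ToyGridField:
    nodes: dict      # node id -> attribute dict bound to the 0-cells
    triangles: dict  # triangle id -> tuple of node ids (the 2-cells)


def restrict(predicate, dim, gf):
    """Keep the cells of dimension `dim` whose attributes satisfy `predicate`.

    Only dim == 0 is handled in this sketch. A triangle survives only if all
    of its nodes survive (an assumed rule, not taken from the slides).
    """
    if dim != 0:
        raise NotImplementedError("sketch handles 0-cells only")
    kept = {n: a for n, a in gf.nodes.items() if predicate(a)}
    tris = {t: ns for t, ns in gf.triangles.items() if all(n in kept for n in ns)}
    return ToyGridField(kept, tris)


H = ToyGridField(
    nodes={
        1: {"x": 0.0, "y": 0.0, "h": 120.0},
        2: {"x": 1.0, "y": 0.0, "h": 300.0},
        3: {"x": 0.0, "y": 1.0, "h": 450.0},
        4: {"x": 1.0, "y": 1.0, "h": 900.0},
    },
    triangles={"A": (1, 2, 3), "B": (2, 3, 4)},
)

rH = restrict(lambda a: a["h"] < 500, 0, H)
print(sorted(rH.nodes), sorted(rH.triangles))  # [1, 2, 3] ['A']
```

The predicate is a Python callable here rather than a string expression, purely to keep the sketch short.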
Longer Example
H : (x,y,b)
V : (z)
Plan, with the intermediate result after each step:
1) H ⊗ V gives (H × V)
2) r(z>b) gives r(H × V)
3) b(s) gives b(r(H × V))
4) r(region) gives r(b(r(H × V)))
5) render
Optimization
Before: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region)
After: r(x,y) on H(x,y,b) and r(z) on V(z), then ⊗, r(z>b), b(s)
*Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
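The kind of rewrite behind this optimization, restricting the inputs before taking the cross product, can be checked on toy data. The sketch below uses plain Python lists rather than GridFields, covers only a region predicate that separates into an (x, y) part and a z part, and uses made-up values and hypothetical function names.

```python
from itertools import product

# Made-up horizontal points (x, y, b) and vertical levels z.
H = [(x, y, 10.0 * x) for x in range(4) for y in range(3)]
V = [float(z) for z in range(0, 40, 5)]


def in_region_xy(h):
    x, y, _ = h
    return 1 <= x <= 2 and y <= 1


def in_region_z(z):
    return z <= 20.0


def naive_plan():
    """Build the full cross product first, then restrict to the region."""
    full = product(H, V)
    return [(h, z) for h, z in full if in_region_xy(h) and in_region_z(z)]


def optimized_plan():
    """Restrict H and V separately, then take the (smaller) cross product."""
    small_h = [h for h in H if in_region_xy(h)]
    small_v = [z for z in V if in_region_z(z)]
    return list(product(small_h, small_v))


assert naive_plan() == optimized_plan()
print(len(H) * len(V), "->", len(optimized_plan()))  # 96 -> 20
```

Both plans return the same result, but the second never materializes the 96-element product.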
Transect (Vertical Slice)
P
Transect: Bad Plan
[Plan diagram: H(x,y,b) ⊗ V(z), then r(z>b), b(s), and a regrid onto P ⊗ V]
1) Construct full-size 3D grid
2) Construct 2D transect grid
3) Spatial Join 1) with 2)
Transect: Optimized Plan
[Plan diagram over P, H(x,y,b), V(z), and P ⊗ V, using regrid, b(s), and ⊗]
1) Find 2D cells containing points
2) Create “stacks” of 2D cells carrying data
3) Create 2D transect grid
4) Spatial Join 2) with 3)
1) Find cells containing points in P
2) Construct “stacks” of cells
4) Join 2) with 3)
Transect: Results
[Bar chart: running time in secs (axis 0 to 45) on an 800 MB dataset for vtk(3D), interpolate, simple, interp_o, simple_o]
simple = nearest neighbor interpolation
*_o = optimized by restricting to the region of interest
Ongoing work
• NSF Cluster Exploratory Award:
  • Where the Ocean Meets the Cloud: Ad Hoc Longitudinal Analysis of Massive Mesh Data
  • Partnership between NSF, IBM, Google
• Data-intensive computing
  • massive queries, not massive simulations
• To “Cloud-Enable” GridFields and VisTrails
  • Goal: 10+-year climatologies at interactive speeds
  • Parallel implementations of GridField operators
    • via Hadoop (and Dryad!)
  • Provenance, repeatability, visualization via VisTrails
    • Connect rich desktop experience
• Co-PIs from University of Utah
  • Claudio Silva and Juliana Freire
Outline
• eScience
• Brief Demo
• A Domain-Specific Query Algebra
• Scientific Mashups
Why Mashups?
• Jim Gray: # of datasets scales as N²
  • Each pairwise comparison generates a new dataset
• Corollary: # of apps scales as N²
  • Every pairwise comparison motivates a new mashup
• To keep up, we need to
  • entrain new programmers,
  • make existing programmers more productive,
  • or both
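The quadratic growth in pairwise comparisons is easy to verify. A minimal sketch, with hypothetical dataset names:

```python
from itertools import combinations
from math import comb

datasets = ["model output", "CTD casts", "satellite imagery", "glider tracks"]
pairs = list(combinations(datasets, 2))

# N datasets yield N*(N-1)/2 unordered pairs, which grows on the order of N^2.
n = len(datasets)
assert len(pairs) == n * (n - 1) // 2 == comb(n, 2)
print(len(pairs))  # 6
```

Doubling the number of datasets roughly quadruples the number of candidate comparisons, and hence of candidate mashups.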
Satellite Images + Crime Incidence Reports
Twitter Feed + Flickr Stream
Mashup Frameworks
• A bottom up approach
• Start with a GPL, add
  • Visual programming
  • Interactive type checking
  • Exploit a corpus of previous examples
    • bootstrapping a mashup
    • mashup “autocomplete”
    • emit warnings
Scientific Mashup Characteristics
• Turn over more data per operation
• Involve subtle visualizations
• Must serve a diverse audience
A Model for Scientific Mashups
• The “Data Product” is the currency of scientific communication with the public
• Scientists are already adept at crafting them (consider PowerPoint slides and figures)
• We take a top down approach:
  • Take a static data product ensemble,
  • endow it with interactivity,
  • publish it online,
  • allow others to repurpose it at runtime
Data Product Ensemble
Mashup
CTD: Conductivity, Temperature, Depth
Sampling
Event Detection: Red Water
CTD Cast
Flowthrough
Mashup
Mashup
Key Concepts
• A mashup is a synchronized ensemble of data products
• A data product is a mashable that has been adapted for a particular purpose
• A mashable is an arbitrarily-complex computation that returns a relation
• An adaptor displays the relation to the user and returns a subset
• All adapted mashables accept input
• Hence, user controls are modeled as adapted mashables just like “visual” data products
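These definitions can be made concrete with a few type aliases. The sketch below is one possible reading, not the authors' implementation: the names `adapt`, `cruise_dates`, and `first_choice` are hypothetical, the rows are made up, and the "display" step is reduced to picking a subset.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Relation = List[Row]
Mashable = Callable[[Relation], Relation]   # a computation that returns a relation
Adaptor = Callable[[Relation], Relation]    # shows a relation, returns the chosen subset


def adapt(mashable: Mashable, adaptor: Adaptor) -> Mashable:
    """An adapted mashable: run the computation, then let the adaptor pick a subset."""
    return lambda inputs: adaptor(mashable(inputs))


def cruise_dates(_inputs: Relation) -> Relation:
    # Hypothetical mashable with made-up rows.
    return [{"date": "2009-05-01"}, {"date": "2009-05-02"}]


def first_choice(relation: Relation) -> Relation:
    # Stand-in for a drop-down control: "displays" the rows and returns the user's pick.
    return relation[:1]


date_picker = adapt(cruise_dates, first_choice)
print(date_picker([]))  # [{'date': '2009-05-01'}]
```

Because an adapted mashable has the same shape as a mashable, a control such as `date_picker` and a plot can be wired together in the same way.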
Adapted Mashables
Data Flow Graph
Inferring Data Flow
provides: {ABC}
requires: {AB}
Inferring Data Flow
provides: {AC}
requires: {AB}
provides: {B}
Inferring Data Flow
provides: {AC}
requires: {AB}
underspecified mashup
Solution:
1) use defaults
2) root environment
3) hand-specified parameter
Inferring Data Flow
provides: {AB}
requires: {AB}
provides: {B}
overspecified mashup
Solution: Break ties:
1) Prefer nodes on longer paths
2) Use layout information
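The provides/requires matching in these slides can be sketched as follows. This is an illustrative reading rather than the authors' algorithm: `infer_data_flow` and the node names are hypothetical, and the tie-breaking rules (longer paths, layout) are left out, so ambiguous attributes are only reported.

```python
def infer_data_flow(nodes):
    """nodes: name -> {"provides": set, "requires": set}.

    Returns (edges, missing, ambiguous):
      edges     - (provider, consumer, attribute) when exactly one provider exists
      missing   - (consumer, attribute) with no provider (underspecified)
      ambiguous - (consumer, attribute, providers) with several providers (overspecified)
    """
    edges, missing, ambiguous = [], [], []
    for consumer, spec in nodes.items():
        for attr in sorted(spec["requires"]):
            providers = sorted(
                name for name, other in nodes.items()
                if name != consumer and attr in other["provides"]
            )
            if not providers:
                missing.append((consumer, attr))
            elif len(providers) == 1:
                edges.append((providers[0], consumer, attr))
            else:
                ambiguous.append((consumer, attr, providers))
    return edges, missing, ambiguous


# The overspecified case from the slides: two nodes can supply B.
nodes = {
    "p1": {"provides": {"A", "B"}, "requires": set()},
    "p2": {"provides": {"B"}, "requires": set()},
    "plot": {"provides": set(), "requires": {"A", "B"}},
}
edges, missing, ambiguous = infer_data_flow(nodes)
print(edges)      # [('p1', 'plot', 'A')]
print(missing)    # []
print(ambiguous)  # [('plot', 'B', ['p1', 'p2'])]
```

An entry in `missing` corresponds to the underspecified case, where a default, the root environment, or a hand-specified parameter would have to supply the value.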
Audience-Tailored Mashups
K12 students
Experts
Conclusions and Future Directions
• We want to augment scientists, not programmers
  • Requires limiting expressiveness -- not yet clear where to draw the line
• More work on semi-automatically tailoring a mashup at runtime
  • Automatically insert “context products”
    • See salinity, add a salinity colorbar
    • See a time, add a tide chart
    • See a location, add a map
  • Re-skin data products
  • “Dashboard-style” vs. “Wizard-style” apps
http://escience.washington.edu
(retooled website coming soon)
Comparison
System | Data Model | Operations | Services
GPL | * | * | typing, maybe
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
MS Dryad | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
Mashups serve a diverse audience
student
public
scientist
Computational Science
• Theory
• Experiment
• Observation
• Simulation (in silico)
• Analysis (in ferro)
Data acquisition is hypothesis-driven
Data acquisition is technology-driven
Motivation
Explore architectures blending techniques from
• mashups (rapid prototyping),
• visualization (interactivity, richness),
• workflow (data integration, provenance),
• databases (optimization, data independence)
to answer science questions at an Ocean Observatory
Visualization
Optimized for “throwing datasets” and interactivity
Declarative query, interoperability, repeatability generally lacking
Formats: PLOT3D, GDAL, ShapeFile, OGC, .obj, .vtk, netCDF, HDF5, FITS, others
Source: MayaVi website
Source: http://pogl.wordpress.com/2007/06/
Workflow
• Emphasis on integration, web services, flexibility
• Unconstrained boxes-and-arrows
  • Any operation on any data type
• Very expressive, but limited opportunities for static reasoning
  • Type safety
  • Task parallelism
  • Cache safety
  • Optimization via rewrite rules
  • Result size / execution time estimation
  • Transparent data parallelism
  • Platform portability
To move the earth, you need somewhere to stand
Databases
Pre-relational DBMS brittleness: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
[Figure: files and pointers, relations, and views, separated by physical data independence and logical data independence]
“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independent of physical data representation
Heterogeneity also drives costs
[Figure: # of bytes vs. # of data types]
CERN (~15PB/year, particle interactions)
LSST (~100PB; images, objects)
PanSTARRS (~40PB; images, objects, trajectories)
OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
SDSS (~100TB; images, objects)
Biologists (~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)
The eScience Elephant
“Like a snake” “Like a hand fan” “Like a wall” “Like a tree trunk” “Like a spear” “Like a rope”
01/30/15 Bill Howe, eScience Institute 86

More Related Content

What's hot

Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Science20brussels osimo april2013
Science20brussels osimo april2013Science20brussels osimo april2013
Science20brussels osimo april2013osimod
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Robert Grossman
 
Open Science and Executable Papers
Open Science and Executable PapersOpen Science and Executable Papers
Open Science and Executable PapersJose Enrique Ruiz
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Alexandru Iosup
 
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...Alexandru Iosup
 
Love for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 versionLove for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 versionLourdes Verdes-Montenegro
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for SciencePaul Groth
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebStefan Dietze
 
Making Small Data BIG (UT Austin, March 2016)
Making Small Data BIG (UT Austin, March 2016)Making Small Data BIG (UT Austin, March 2016)
Making Small Data BIG (UT Austin, March 2016)Kerstin Lehnert
 
Presentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical SocietyPresentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical Societyosimod
 
GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.Alexandru Iosup
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsFrom Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsStefan Dietze
 

What's hot (20)

eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Science20brussels osimo april2013
Science20brussels osimo april2013Science20brussels osimo april2013
Science20brussels osimo april2013
 
Cifar
CifarCifar
Cifar
 
HadoopWorkshopJuly2014
HadoopWorkshopJuly2014HadoopWorkshopJuly2014
HadoopWorkshopJuly2014
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Open Science and Executable Papers
Open Science and Executable PapersOpen Science and Executable Papers
Open Science and Executable Papers
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Dat...
 
Love for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 versionLove for science or 'Academic Prostitution' - DFD2014 version
Love for science or 'Academic Prostitution' - DFD2014 version
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the WebRetrieval, Crawling and Fusion of Entity-centric Data on the Web
Retrieval, Crawling and Fusion of Entity-centric Data on the Web
 
Carpenter "The Future of the Scholarly Record"
Carpenter "The Future of the Scholarly Record"Carpenter "The Future of the Scholarly Record"
Carpenter "The Future of the Scholarly Record"
 
Making Small Data BIG (UT Austin, March 2016)
Making Small Data BIG (UT Austin, March 2016)Making Small Data BIG (UT Austin, March 2016)
Making Small Data BIG (UT Austin, March 2016)
 
Presentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical SocietyPresentation of science 2.0 at European Astronomical Society
Presentation of science 2.0 at European Astronomical Society
 
GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.GrenchMark at CCGrid, May 2006.
GrenchMark at CCGrid, May 2006.
 
From Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web DatasetsFrom Data to Knowledge - Profiling & Interlinking Web Datasets
From Data to Knowledge - Profiling & Interlinking Web Datasets
 

Similar to End-to-End eScience

A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceUniversity of Washington
 
Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce University of Washington
 
Freeing scientific data using CC0
Freeing scientific data using CC0Freeing scientific data using CC0
Freeing scientific data using CC0Karen Cranston
 
Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIPaul Groth
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesBertram Ludäscher
 
Datasalon6 2011 - "Rise of the robo scientists": where is data coming from?
Datasalon6 2011 - "Rise of the robo scientists": where is data coming from?Datasalon6 2011 - "Rise of the robo scientists": where is data coming from?
Datasalon6 2011 - "Rise of the robo scientists": where is data coming from?Pieter Pauwels
 
Data Facilities Workshop - Panel on Current Concepts in Data Sharing & Intero...
Data Facilities Workshop - Panel on Current Concepts in Data Sharing & Intero...Data Facilities Workshop - Panel on Current Concepts in Data Sharing & Intero...
Data Facilities Workshop - Panel on Current Concepts in Data Sharing & Intero...EarthCube
 
GigaScience: a new resource for the big-data community.
GigaScience: a new resource for the big-data community.GigaScience: a new resource for the big-data community.
GigaScience: a new resource for the big-data community.GigaScience, BGI Hong Kong
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overviewdgarijo
 
Understanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceUnderstanding the Big Picture of e-Science
Understanding the Big Picture of e-ScienceAndrew Sallans
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science ServicesIan Foster
 
How can drone data be used in modelling?
How can drone data be used in modelling?How can drone data be used in modelling?
How can drone data be used in modelling?ARDC
 
Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...
Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...
Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...Scott Edmunds
 
Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...
Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...
Scott Edmunds: Data Dissemination: Difficulties, Data Citation, DOI's (and Gi...GigaScience, BGI Hong Kong
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of ScienceGlobus
 
Online Relation Alignment for Linked Datasets
Online Relation Alignment for Linked DatasetsOnline Relation Alignment for Linked Datasets
Online Relation Alignment for Linked DatasetsMaria Koutraki
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 

Similar to End-to-End eScience (20)

A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScience
 
Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce Query-Driven Visualization in the Cloud with MapReduce
Query-Driven Visualization in the Cloud with MapReduce
 
Freeing scientific data using CC0
Freeing scientific data using CC0Freeing scientific data using CC0
Freeing scientific data using CC0
 





End-to-End eScience

  • 1. End-to-End eScience: Integrating Query, Workflow, Visualization, and Mashups at an Ocean Observatory. Bill Howe, University of Washington; Harrison Green-Fishback, PSU; David Maier, PSU; Erik Anderson, Utah; Emanuele Santos, Utah; Juliana Freire, Utah; Carlos Scheidegger, Utah; Claudio Silva, Utah; Antonio Baptista, OHSU; Peter Lawson, OSU; Renee Bellinger, OSU. http://dev.pacificfishtrax.org/
  • 2. 01/30/15 Bill Howe, eScience Institute. Outline: eScience; Brief Demo; A Domain-Specific Query Algebra; Mashups
  • 8. All Science is becoming eScience. Old model: “Query the world” (data acquisition coupled to a specific hypothesis). New model: “Download the world” (data acquired en masse, independent of hypotheses). But: acquisition now outpaces analysis. Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS). Medicine: ubiquitous digital records, MRI, ultrasound. Oceanography: high-resolution models, cheap sensors, satellites. Biology: lab automation, high-throughput sequencing. “Increase Data Collection Exponentially in Less Time, with FlowCAM”. Empirical X → Analytical X → Computational X → X-informatics
  • 9. The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB). [Figure: “The Long Tail”; axes: data inventory vs. ordinal position; labels: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), Ocean Modelers, Seismologists, Microbiologists, <Spreadsheet users>.] Researchers with growing data management challenges but limited resources for cyberinfrastructure: no dedicated IT staff; overreliance on simple tools (e.g., spreadsheets). “The future is already here. It’s just not very evenly distributed.” (William Gibson)
  • 10. eScience Institute at UW. Mission: help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon these techniques and technologies. Strategy: increase the sharing of expertise and facilities; bootstrap a cadre of Research Scientists; add faculty in key fields; make the entire University more effective. Launched July 1 with $1 million in permanent funding from the Washington State Legislature; sought, and need, $2 million
  • 11. Facets of Database Research. [Diagram labels: Web Services; Query Languages; Storage Management; Visualization; Workflow; Data Integration; Knowledge Extraction, Crawlers; Access Methods; Data Mining, Parallel Programming Models, Provenance; complexity-hiding interfaces.] My research: customize and optimize for science
  • 12. The eScience Elephant: eScience Cloud/Cluster Workflow Databases Visualization Provenance “flexibility; web services; integration” “query processing; data independence; algebraic optimization; needles in haystacks” “Exploratory science; mapping quantitative data to intuition” “Reproducibility; forensics; sharing/reuse” “Massive data parallelism” Mashups “Rapid Prototyping; Simplified web programming”
  • 13. Some eScience Research: Query Algebra for new Data Type; Scientific Workflow Systems; Science Mashups; “Dataspace” systems. [Howe, Freire, Silva, et al. 2008] [Howe, Green-Fishback, Maier, 2009] [Howe, Maier, Rayner, Rucker 2008] [Howe, Maier. 2004, 2005, 2006] this talk
  • 14. Outline: eScience; Brief Demo; A Domain-Specific Query Algebra; Science Mashups
  • 15. VisTrails for Computation
  • 16. Spatial Patterns in Fisheries: new techniques, new opportunities for ecosystem-based management. Peter Lawson¹, Lorenzo Cianelli², Bobby Ireland²
  • 17. Enabling Scientific Discourse between Fishermen and Fisheries Managers
  • 21. VisTrails for Collaboration: Bill Howe @ CMOP computes salt flux using GridFields; Erik Anderson @ Utah adds vector streamlines and adjusts opacity; Bill Howe @ CMOP adds an isosurface of salinity; Peter Lawson adds discussion of the scientific interpretation
  • 22. Outline: eScience; Brief Demo; A Domain-Specific Query Algebra; Mashups
  • 23. CMOP
  • 24. Columbia River Estuary: red = high salinity (~34 psu), blue = fresh water (~0 psu)
  • 25. Accessing Model Results. CMOP ocean circulation models run in forecast or hindcast mode. Models run serially in ~1/5 real time; on MPICH2, about 10x speedup before overhead dominates. Forecasts kept for 10 days, hindcasts kept indefinitely (40TB + 25TB/year). Access via a GridFields Web Service: GFServer optimizes and evaluates GF expressions and returns the result
  • 26. Unstructured Grids: “unstructured grids” model complex domains at multiple scales simultaneously, but complicate processing. Columbia River Estuary: red = high salinity (~34 psu), blue = fresh water (~0 psu)
  • 27. “Structured” Grids. Simple: isomorphic to multidimensional arrays. But: “structured grids” do a poor job of modeling complex features and complicate multi-scale analysis. Coastlines are not rectilinear. 1) Missing values = wasted effort; higher resolution = wasted effort in areas of low dynamism. 2) Data associated with cells at multiple dimensions
  • 28. Structured grids are easy. The data model (Cartesian products of coordinate variables) immediately implies a representation (multidimensional arrays), an API (reading and writing subslabs), and an efficient implementation (address calculation using array “shape”)
  • 29. Structured grid example: f(i, j) over coordinate variables x(i) and y(j). The loop "for i in [4:6]: for j in [1:4]: addr = &f + j*|x| + i" reads the same values as the subslab f[4:6, 1:4]. Systems: NetCDF, MATLAB, RasDaMan, SciDB (soon), many more
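The address calculation on the slide above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names, the grid size, and the sample values are made up, and the slide's ranges [4:6] and [1:4] are assumed to be half-open like Python slices.

```python
def flat_address(i, j, nx):
    """Offset of f(i, j) in a flat array where i varies fastest (j*|x| + i)."""
    return j * nx + i


def subslab(f, nx, i_range, j_range):
    """Read f[i_range, j_range] from the flat array using only address arithmetic."""
    return [[f[flat_address(i, j, nx)] for j in j_range] for i in i_range]


# Hypothetical 8 x 6 structured grid; the stored value encodes its own (i, j).
nx, ny = 8, 6
f = [10 * j + i for j in range(ny) for i in range(nx)]

# Assumed reading of the slide's f[4:6, 1:4]: i in {4, 5}, j in {1, 2, 3}.
block = subslab(f, nx, range(4, 6), range(1, 4))
print(block)
```

No index structure is needed: the array shape alone determines where every value lives, which is the property that unstructured grids lose.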
  • 30. Unstructured Grids: a grid is a pair (E, I) of cells and an incidence relation. Example with one 2-cell A, edges x, y, z, and vertices 2, 3, 4: E0 = {2,3,4}, E1 = {x,y,z}, E2 = {A}; I = {(z,2), (z,4), (A,z), (x,2), (x,3), (A,x), (A,y), (y,4), (y,3)}, plus the transitive closure
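A small sketch of the (E, I) representation from the slide above, using the same cell names. The set-of-pairs encoding and the closure function are assumptions made for illustration; they are not the GridFields implementation.

```python
# Cells grouped by dimension, as on the slide: E0 (vertices), E1 (edges), E2 (faces).
cells = {0: {"2", "3", "4"}, 1: {"x", "y", "z"}, 2: {"A"}}

# Incidence pairs copied from the slide, read as (cell, lower-dimensional cell).
incidence = {
    ("z", "2"), ("z", "4"), ("A", "z"),
    ("x", "2"), ("x", "3"), ("A", "x"),
    ("A", "y"), ("y", "4"), ("y", "3"),
}


def transitive_closure(pairs):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are both present."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


closed = transitive_closure(incidence)

# With the closure, the face A is directly related to its vertices.
vertices_of_A = sorted(v for (c, v) in closed if c == "A" and v in cells[0])
print(vertices_of_A)
```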
  • 31. Subsetting. Full grid: Eastern Pacific (Washington, Oregon, California). Subset: mouth of Columbia River. Color: bathymetry
  • 32. Correctness properties preserved: grid is well-supported (no ragged edges)
  • 33. Subset semantics. Cut everything labeled “0”. What should be kept? [Figure: cells labeled 0 or 1 on an input grid, with the results under different subset semantics; panel labels as extracted: Input, Simple, Drop, “Exact”]
  • 34. What about Visualization Libs? Different C++ classes, each dependent on data characteristics. Changes to data characteristics require changes to the program. Logical equivalences obscured. No data independence. The slide contrasts “We want:” with “in VTK:”; the VTK side lists vtkExtractGeometry, vtkThreshold, vtkExtractGrid, vtkExtractVOI, vtkThresholdPoints
  • 35. GridField Data Model: a GridField with two attributes bound to the 2-cells and four attributes bound to the 0-cells. 0-cell attributes (x, y, salt, temp): (13.8, 10.6, 29.4, 12.1), (13.9, 9.4, 29.8, 12.5), (14.3, 9.0, 28.0, 12.0), (13.4, 9.0, 30.1, 13.2). 2-cell attributes (flux, area): (11.5, 3.3), (13.9, 5.5), (13.1, 4.5)
  • 36. GridField Operations. Lifted set operations: Union, Intersection, Cross Product. Scan/Bind: read a grid/attribute. Restrict: remove cells that do not satisfy a predicate. Accrete: grow a grid by adding neighbors of cells. Regrid: map the data of one grid onto another
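To make Restrict concrete, here is a minimal sketch over plain Python dictionaries. It is not the GridFields API: the function name, the triangle list, and the choice to drop any triangle that loses a vertex (so that no ragged cells remain) are assumptions. The vertex attributes are the 0-cell values from the data model slide above.

```python
def restrict_vertices(predicate, vertices, triangles):
    """Keep vertices satisfying the predicate, and only triangles whose
    vertices all survive (one possible 'well-supported' subset semantics)."""
    kept = {vid: attrs for vid, attrs in vertices.items() if predicate(attrs)}
    kept_triangles = [tri for tri in triangles if all(v in kept for v in tri)]
    return kept, kept_triangles


# 0-cell attributes from the slide; the vertex ids are hypothetical.
vertices = {
    0: {"x": 13.8, "y": 10.6, "salt": 29.4, "temp": 12.1},
    1: {"x": 13.9, "y": 9.4, "salt": 29.8, "temp": 12.5},
    2: {"x": 14.3, "y": 9.0, "salt": 28.0, "temp": 12.0},
    3: {"x": 13.4, "y": 9.0, "salt": 30.1, "temp": 13.2},
}
# Hypothetical connectivity; the slide does not give the incidence relation.
triangles = [(0, 1, 2), (0, 1, 3)]

kept, kept_triangles = restrict_vertices(
    lambda attrs: attrs["salt"] > 29.0, vertices, triangles
)
print(sorted(kept), kept_triangles)
```

Vertex 2 fails the predicate, so the triangle that touches it is removed along with it; other subset semantics would keep different cells.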
  • 37. Usage Example (1): H = Scan(context, "H"); rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H). The first argument is the predicate and the second is the dimension (here 0). [Figures: H and rH, colored by bathymetry]
  • 38. Usage Example (2): H = Scan(context, "H"); rH = Restrict("h<500", 0, H). [Figures: H and rH, colored by bathymetry]
  • 39. Longer Example. Inputs: H : (x,y,b) and V : (z). Plan, bottom-up: H ⊗ V gives (H × V); r(z>b) gives r(H × V); b(s) gives b(r(H × V)); r(region) gives r(b(r(H × V))); then render
  • 40. Optimization. Original plan: H(x,y,b) ⊗ V(z), then r(z>b), b(s), r(region). Optimized plan: the same operators with r(region) replaced by r(x,y) and r(z), presumably applied to H and V before the cross product. *Howe, Maier, Algebraic Manipulation of Scientific Datasets. VLDB Journal, 14:4, 2005
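The rewrite on the optimization slide relies on a region predicate that separates into a horizontal part and a vertical part. The sketch below checks that equivalence on toy data. It models the cross product as itertools.product over lists of records, which is an assumption for illustration and ignores grid topology; the predicates and sizes are made up. A predicate such as r(z>b), which couples H and V, cannot be split this way.

```python
from itertools import product

# Toy horizontal grid H(x, y) and vertical grid V(z).
H = [{"x": x, "y": y} for x in range(4) for y in range(4)]
V = [{"z": z} for z in range(5)]


def in_xy(h):
    """Horizontal part of the region predicate, r(x,y)."""
    return 1 <= h["x"] <= 2 and h["y"] >= 2


def in_z(v):
    """Vertical part of the region predicate, r(z)."""
    return v["z"] < 3


# Restrict late: build the full product, then filter.
late = [(h, v) for h, v in product(H, V) if in_xy(h) and in_z(v)]

# Restrict early: filter each input, then build a much smaller product.
small_H = [h for h in H if in_xy(h)]
small_V = [v for v in V if in_z(v)]
early = list(product(small_H, small_V))

print(len(H) * len(V), len(early), late == early)
```

Both plans return the same 12 cells, but the late plan materializes all 80 product cells first.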
  • 41. Transect (Vertical Slice) P
  • 42. Transect: Bad Plan. Operators: H(x,y,b) ⊗ V(z), r(z>b), b(s); P ⊗ V; regrid. Steps: 1) construct full-size 3D grid; 2) construct 2D transect grid; 3) spatial join 1) with 2)
  • 43. Transect: Optimized Plan. Operators: P, H(x,y,b), V(z), P ⊗ V, b(s), with two regrid and two ⊗ steps. Steps: 1) find 2D cells containing points; 2) create “stacks” of 2D cells carrying data; 3) create 2D transect grid; 4) spatial join 2) with 3)
  • 44. 1) Find cells containing points in P
  • 45. 1) Find cells containing points in P; 2) construct “stacks” of cells; 4) join 2) with 3)
  • 46. Transect: Results. [Bar chart: time in secs (axis 0 to 45) for vtk(3D), interpolate, simple, interp_o, simple_o; 800 MB dataset.] simple = nearest neighbor interpolation; *_o = optimized by restricting to the region of interest
  • 47. Ongoing work. NSF Cluster Exploratory Award: “Where the Ocean Meets the Cloud: Ad Hoc Longitudinal Analysis of Massive Mesh Data”. Partnership between NSF, IBM, Google. Data-intensive computing: massive queries, not massive simulations. To “Cloud-Enable” GridFields and VisTrails. Goal: 10+-year climatologies at interactive speeds. Parallel implementations of GridField operators via Hadoop (and Dryad!). Provenance, repeatability, visualization via VisTrails. Connect rich desktop experience. Co-PIs from University of Utah: Claudio Silva and Juliana Freire
  • 48. Outline: eScience; Brief Demo; A Domain-Specific Query Algebra; Scientific Mashups
  • 49. Why Mashups? Jim Gray: # of datasets scales as N^2 (each pairwise comparison generates a new dataset). Corollary: # of apps scales as N^2 (every pairwise comparison motivates a new mashup). To keep up, we need to entrain new programmers, make existing programmers more productive, or both
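The quadratic growth in the slide above is easy to check by counting pairs. The dataset names are hypothetical; the exact count of unordered pairs is N(N-1)/2, which grows as N^2.

```python
from itertools import combinations


def pairwise_comparisons(datasets):
    """Every unordered pair of datasets is a candidate comparison (or mashup)."""
    return list(combinations(datasets, 2))


datasets = ["salinity", "temperature", "catch", "tides"]  # hypothetical names
pairs = pairwise_comparisons(datasets)

# N(N-1)/2 pairs for a few values of N.
counts = {n: n * (n - 1) // 2 for n in (4, 10, 100)}
print(len(pairs), counts)
```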
  • 50. Satellite Images + Crime Incidence Reports
  • 51. Twitter Feed + Flickr Stream
  • 52. Mashup Frameworks. A bottom-up approach: start with a GPL, add visual programming and interactive type checking. Exploit a corpus of previous examples: bootstrapping a mashup; mashup “autocomplete”; emit warnings
  • 56. Scientific Mashup Characteristics: turn over more data per operation; involve subtle visualizations; must serve a diverse audience
  • 57. A Model for Scientific Mashups. The “Data Product” is the currency of scientific communication with the public; scientists are already adept at crafting them (consider PowerPoint slides and figures). We take a top-down approach: take a static data product ensemble, endow it with interactivity, publish it online, and allow others to repurpose it at runtime
  • 58. Data Product Ensemble
  • 59. Mashup
  • 60. CTD: Conductivity, Temperature, Depth
  • 61. Sampling
  • 62. Event Detection: Red Water
  • 63. CTD Cast
  • 64. Flowthrough
  • 65. Mashup
  • 66. Mashup
  • 67. Key Concepts
      • A mashup is a synchronized ensemble of data products
      • A data product is a mashable that has been adapted for a particular purpose
      • A mashable is an arbitrarily complex computation that returns a relation
      • An adaptor displays the relation to the user and returns a subset
      • All adapted mashables accept input
      • Hence, user controls are modeled as adapted mashables just like “visual” data products
  • 68. Adapted Mashables
  • 69. Data Flow Graph
  • 70. Inferring Data Flow
      • provides: {ABC}
      • requires: {AB}
  • 71. Inferring Data Flow
      • provides: {AC}
      • requires: {AB}
      • provides: {B}
  • 72. Inferring Data Flow: underspecified mashup
      • provides: {AC}
      • requires: {AB}
      • Solution: 1) use defaults 2) root environment 3) hand-specified parameter
  • 73. Inferring Data Flow: overspecified mashup
      • provides: {AB}
      • requires: {AB}
      • provides: {B}
      • Solution: Break ties: 1) Prefer nodes on longer paths 2) Use layout information
  • 74. Audience-Tailored Mashups: K12 students; Experts
  • 75. Conclusions and Future Directions
      • We want to augment scientists, not programmers
          • Requires limiting expressiveness -- not yet clear where to draw the line
      • More work on semi-automatically tailoring a mashup at runtime
          • Automatically insert “context products”
              • See salinity, add a salinity colorbar
              • See a time, add a tide chart
              • See a location, add a map
          • Re-skin data products
              • “Dashboard-style” vs. “Wizard-style” apps
  • 76. http://escience.washington.edu (retooled website coming soon)
  • 77. Comparison (Data Model | Operations | Services)
      • GPL: * | * | typing, maybe
      • Workflow: * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
      • Relational Algebra: relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, data parallelism
      • MapReduce: [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance
      • MS Dryad: IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
      • MPI: Arrays/Matrices | 70+ ops | data parallelism, full control
  • 78. Mashups serve a diverse audience: student, public, scientist
  • 79. Computational Science
      • Theory
      • Experiment
      • Observation
      • Simulation (in silico)
      • Analysis (in ferro)
      • Data acquisition is hypothesis-driven
      • Data acquisition is technology-driven
  • 80. Motivation
      • Explore architectures blending techniques from
          • mashups (rapid prototyping),
          • visualization (interactivity, richness),
          • workflow (data integration, provenance),
          • databases (optimization, data independence)
      • to answer science questions at an Ocean Observatory
  • 81. Visualization
      • PLOT3D, GDAL, ShapeFile, OGC, .obj, .vtk, netCDF, HDF5, FITS, others
      • Optimized for “throwing datasets” and interactivity
      • Declarative query, interoperability, repeatability generally lacking
      • Image sources: MayaVi website; http://pogl.wordpress.com/2007/06/
  • 82. Workflow
      • Emphasis on integration, web services, flexibility
      • Unconstrained boxes-and-arrows
      • Any operation on any data type
      • Very expressive, but limited opportunities for static reasoning
          • Type safety
          • Task parallelism
          • Cache safety
          • Optimization via rewrite rules
          • Result size / execution time estimation
          • Transparent data parallelism
          • Platform portability
      • To move the earth, you need somewhere to stand
  • 83. Databases
      • Pre-relational DBMS brittleness: if your data changed, your application broke.
      • Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
      • Diagram labels: files and pointers; relations; views; physical data independence; logical data independence
      • “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
      • Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independent of physical data representation
  • 84. Heterogeneity also drives costs (plot axes: # of bytes vs. # of data types)
      • CERN (~15PB/year, particle interactions)
      • LSST (~100PB; images, objects)
      • PanSTARRS (~40PB; images, objects, trajectories)
      • OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more)
      • SDSS (~100TB; images, objects)
      • Biologists (~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogeny trees)
  • 85. The eScience Elephant: “Like a snake”, “Like a hand fan”, “Like a wall”, “Like tree trunk”, “Like a spear”, “Like a rope”
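
The Key Concepts and Inferring Data Flow slides (67, 70-73) can be read as a small algorithm. The following is only a rough sketch of that reading, not the authors' implementation. Every name in it (`adapt`, `Node`, `infer_data_flow`, the sentinel strings, and the example data products) is hypothetical. It also assumes that nodes are listed in layout order with providers before consumers, and that "longer path" means the longest chain of inferred links ending at a node.

```python
from dataclasses import dataclass, field

# Hypothetical sentinels for parameters not supplied by another data product.
DEFAULT, ROOT, UNBOUND = "<default>", "<root>", "<hand-specified>"


def adapt(mashable, adaptor):
    """Build an adapted mashable.

    `mashable` maps a parameter dict to a relation (here, a list of dicts);
    `adaptor` maps that relation to the subset the user selected.
    """
    return lambda params: adaptor(mashable(params))


@dataclass
class Node:
    """Attribute signature of one adapted mashable in a mashup."""
    name: str
    provides: frozenset = frozenset()
    requires: frozenset = frozenset()
    defaults: dict = field(default_factory=dict)


def infer_data_flow(nodes, root_env=None):
    """Choose a source for every required attribute of every node.

    Overspecified attributes (several providers) are resolved by preferring
    the provider on the longer path, then the later one in layout order.
    Underspecified attributes fall back to the node's defaults, then the
    root environment, and are otherwise marked as needing a hand-specified
    parameter.
    """
    root_env = root_env or {}
    depth = {}  # node name -> length of the longest inferred path ending there
    flow = {}   # node name -> {attribute: chosen source}
    for position, node in enumerate(nodes):
        sources = {}
        for attr in sorted(node.requires):
            candidates = [
                (depth[n.name], i, n.name)
                for i, n in enumerate(nodes[:position])
                if attr in n.provides
            ]
            if candidates:
                sources[attr] = max(candidates)[2]
            elif attr in node.defaults:
                sources[attr] = DEFAULT
            elif attr in root_env:
                sources[attr] = ROOT
            else:
                sources[attr] = UNBOUND
        flow[node.name] = sources
        upstream = [depth[s] for s in sources.values() if s in depth]
        depth[node.name] = 1 + max(upstream, default=-1)
    return flow


# Hypothetical mashup: two pickers feed a plot; a tide chart needs a time "T".
nodes = [
    Node("cruise_picker", provides=frozenset("AB")),
    Node("cast_picker", provides=frozenset("B"), requires=frozenset("A")),
    Node("profile_plot", requires=frozenset("AB")),
    Node("tide_chart", requires=frozenset("T")),
]
flow = infer_data_flow(nodes, root_env={"T": 0})

# A user control modeled as an adapted mashable: the adaptor returns a subset.
casts = lambda params: [{"cast": 1, "depth": 5}, {"cast": 2, "depth": 40}]
cast_control = adapt(casts, lambda relation: relation[:1])
```

In this sketch `profile_plot` takes attribute B from `cast_picker` rather than `cruise_picker`, because `cast_picker` sits on the longer path. `tide_chart` has no provider for T, so it falls back to the root environment.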

Editor's Notes

  1. There have traditionally been three legs to the scientific stool: theory, experiment, and observation.
  2. These were mutually reinforcing: for example, observations might suggest theories, which could be tested by experiments.
  3. Over the past 50 years, we have augmented the traditional “three legs of the stool” with an incredibly powerful new tool: high-speed computation. In traditional “computational science,” we use simulation to conduct “virtual experiments” – experiments that can’t be conducted in the lab, for various reasons.
  4. In the past 10 years, a fourth method of scientific discovery has emerged: Acquire data en masse, independent of any hypothesis, and then ask questions about it post hoc. eScience is about massive and complex data -- data large enough to require automated or semi-automated analysis -- there’s too much to look at manually. Relevant tools are databases, visualization, cluster computing, data mining, machine learning, workflow, web services -- all integrated and optimized for scientific use.
  5. Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
  6. The long tail of eScience -- huge number of scientists who struggle with data management, but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists. They rely on spreadsheets, email, and maybe a shared file system. Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources. However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel? Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
  7. Data Management != Storage Management. Storage Management is: SATA/SCSI/Fiber; backup policies and procedures; redundancy decisions (RAID 0, 1+0, 0+1, 5). Data Management is: access methods; query languages; data mining, analysis, visualization; data integration.
  8. The blind men and the eScience elephant.
  9. Collaboration across many state, federal, and university agencies
  11. Very interesting datasets, and all of it is freely available for any purpose.
  12. File formats, programming languages, and DBMS exist that are organized around this simple property
  13. With an unstructured grid, you explicitly track cells of various dimensions and the incidence relationship that connects them. Specialized “Boundary Representation” data structures exist for special cases, but there is no general data model.
  14. Different semantics for subsetting may be defined. One particular semantics tends to preserve intuitive correctness properties.
  21. On the order of hundreds of points. Manual browsing.
  22. “Make mashups easy to create” “Raise the level of abstraction” “Empower non-programmers to be programmers”
  23. The seven people who know your
  24. Climatology is long-term average
  25. ~1 million observations. Can’t render each dot in, say, javascript. We need services that can produce these visualizations given parameters. This work is about synchronizing visualizations, blessing them with interactivity, and publishing them on the web.
  26. The blind men and the eScience elephant.
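
Note 24 defines a climatology as a long-term average, and slide 47 sets a goal of 10+-year climatologies. A minimal sketch of that idea follows. It assumes a monthly climatology over (date, value) observations; the function name `monthly_climatology` and the sample salinity values are made up for illustration and do not come from the observatory's data.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_climatology(observations):
    """Average a variable by calendar month across all years.

    `observations` is an iterable of (date, value) pairs; the result maps
    month number (1-12) to the mean of every value seen in that month.
    """
    by_month = defaultdict(list)
    for day, value in observations:
        by_month[day.month].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Made-up salinity samples spanning two years.
observations = [
    (date(2005, 1, 15), 30.0),
    (date(2006, 1, 15), 32.0),
    (date(2005, 7, 15), 20.0),
    (date(2006, 7, 15), 22.0),
]
climatology = monthly_climatology(observations)
```

The same grouping could run as a map step (emit month, value) and a reduce step (average per month), which is the kind of massive query, as opposed to massive simulation, that slide 47 mentions.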