Visual Data Analytics in the Cloud for Exploratory Science
1. Visual Data Analytics in the Cloud
for Exploratory Science
Bill Howe, UW
Huy Vo, Utah
Claudio Silva, Utah
Juliana Freire, Utah
YingYi Bu, UW
2. 3/12/09 Bill Howe, UW 2VisTrails + GridFields
Data acquisition is no longer the bottleneck
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
3. 3/12/09 Bill Howe, UW 3VisTrails + GridFields
Two dimensions: # of bytes vs. # of apps
[Chart: projects from Biology, Oceanography, and Astronomy placed along both axes: LSST, SDSS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons, PanSTARRS]
4. 3/12/09 Bill Howe, UW 4VisTrails + GridFields
This Talk
# of Bytes: MapReduce for Scientific Viz
# of Apps: Other VDA Projects
5. 3/12/09 Bill Howe, UW 5VisTrails + GridFields
Converging Requirements
Vis DB
6. 3/12/09 Bill Howe, UW 6VisTrails + GridFields
Why Vis Needs DB
“Transferring the whole data generated … to a storage device or a visualization
machine could become a serious bottleneck, because I/O would take most of the …
time. A more feasible approach is to reduce and prepare the data in situ for
subsequent visualization and data analysis tasks.”
-- SciDAC Review
Current Research Topics in Vis:
• “Query-driven Visualization”
• “In Situ Visualization”
• “Remote Visualization”
8. 3/12/09 Bill Howe, UW 8VisTrails + GridFields
Why DB Needs Vis (2)
“What does the salt wedge look like?”
9. 3/12/09 Bill Howe, UW 9VisTrails + GridFields
Thesis
We can no longer afford to build separate
visualization and data management systems
Data is increasingly destined for the cloud
First Attack: Implement Vis primitives in an
existing “cloud” DM system
10. 3/12/09 Bill Howe, UW 10VisTrails + GridFields
Core Vis Algorithms in MapReduce
Scalar/Volume Rendering
Isosurface Extraction
Mesh Simplification
11. 3/12/09 Bill Howe, UW 11VisTrails + GridFields
Some distributed algorithm…
Map
(Shuffle)
Reduce
12. 3/12/09 Bill Howe, UW 12VisTrails + GridFields
CluE Cluster
410 nodes
Dual Intel Xeon 2.8GHz, hyperthreading
8GB main memory each
Hadoop, no access to OS
Google provided, IBM maintained, NSF funded
24. 3/12/09 Bill Howe, UW 24VisTrails + GridFields
Roadmap
# of Bytes: MapReduce for Scientific Viz
# of Apps: Other VDA projects
Azure Ocean
SQLShare
Automating Mashups
25. 3/12/09 Bill Howe, UW 25VisTrails + GridFields
[John Delaney, University of Washington]
26. 3/12/09 Bill Howe, UW 26VisTrails + GridFields
Azure Ocean
COVE for Visualization + Trident for Processing + Azure for Data
27. 3/12/09 Bill Howe, UW 27VisTrails + GridFields
SQLShare: Query Services
for Ad Hoc Research Data
28. 3/12/09 Bill Howe, UW 28VisTrails + GridFields
Ad Hoc Research Data
5/18/10 Garret Cole, eScience Institute
Fasta format
Spread sheets
Tabular data
29. 3/12/09 Bill Howe, UW 29VisTrails + GridFields
Problem
“I spend 90% of my time handling
data rather than doing science”
-- Robin Kodner, Postdoc, Armbrust Lab
30. 3/12/09 Bill Howe, UW 30VisTrails + GridFields
An observation about “handling data”
How often does each RNA hit appear inside my
annotated surface group?
SELECT hit, COUNT(*) as cnt FROM tigrfamannotation_surface
GROUP BY hit ORDER BY cnt DESC
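A minimal sketch of running the query above, assuming an in-memory SQLite database stands in for the real backend. Only the table name and the `hit` column come from the slide; the `sample` column and the row values are made-up placeholders.

```python
import sqlite3


def count_hits(rows):
    """Load (sample, hit) rows into SQLite and run the slide's GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: the slide only shows the table name and the `hit` column.
    conn.execute("CREATE TABLE tigrfamannotation_surface (sample TEXT, hit TEXT)")
    conn.executemany("INSERT INTO tigrfamannotation_surface VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT hit, COUNT(*) AS cnt FROM tigrfamannotation_surface "
        "GROUP BY hit ORDER BY cnt DESC"
    ).fetchall()
    conn.close()
    return result


# Placeholder rows for illustration only.
rows = [("s1", "hitA"), ("s2", "hitA"), ("s3", "hitB")]
print(count_hits(rows))
```

The scientist's question maps onto one GROUP BY; everything else in the sketch is setup that a hosted query service would take off their hands.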
31. 3/12/09 Bill Howe, UW 31VisTrails + GridFields
Discovery: SQL Does not Terrify Scientists
33. 3/12/09 Bill Howe, UW 33VisTrails + GridFields
Technology used in 1st Gen
Component Stack
34. 3/12/09 Bill Howe, UW 34VisTrails + GridFields
SQLShare Redux
Conventional wisdom says “Scientists won’t write SQL”
We don’t believe it!
Instead, we implicate difficulty in
installation
configuration
schema design
performance tuning
data ingest
over-reliance on GUIs
Critical need for visualization
Clear role for Tableau!
We are asking “What kind of platform will
make SQL useful for scientific inquiry?”
36. 3/12/09 Bill Howe, UW 36VisTrails + GridFields
Why Mashups?
Jim Gray: # of datasets scales as N^2
Each pairwise comparison generates a new dataset
Corollary: # of apps scales as N^2
Every pairwise comparison motivates a new mashup
To keep up, we need to
entrain new programmers,
make existing programmers more productive,
or both
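To make the N^2 claim on this slide concrete, here is a small sketch that enumerates pairwise comparisons. The dataset names are hypothetical; the count is N(N-1)/2, which grows quadratically in N.

```python
from itertools import combinations


def pairwise_comparisons(datasets):
    """Return every unordered pair of datasets, one per potential mashup."""
    return list(combinations(datasets, 2))


# Hypothetical dataset names, for illustration only.
names = ["salinity", "temperature", "zooplankton", "velocity"]
pairs = pairwise_comparisons(names)
print(len(pairs), "pairs from", len(names), "datasets")
```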
39. 3/12/09 Bill Howe, UW 39VisTrails + GridFields
Why Mashups?
The time of one’s data fitting into a 15 page research paper is past.
Datasets are too large and complex to be conveyed with a handful
of static images
Prediction: succinct, targeted, interactive web apps will become the
currency of scientific communication
with the public
with policy makers
with colleagues in other disciplines
with peers
with students (K12 - grad)
41. 3/12/09 Bill Howe, UW 41VisTrails + GridFields
Conclusions
Converging requirements for DB and Vis
At high scale:
A Vis library in MapReduce
At high complexity:
Azure Ocean
Data + Workflow + Vis
“Client + Cloud”, “Computational mobility”
SQLShare
Ad Hoc data -- “anything goes”
Visualization critical
(semi-)automated mashups
“Show me what’s interesting”
42. 3/12/09 Bill Howe, UW 42VisTrails + GridFields
Acknowledgments
http://escience.washington.edu
47. 3/12/09 Bill Howe, UW 47VisTrails + GridFields
Azure Ocean
COVE for Visualization + Trident for Processing + Azure for Data
48. COVE
Research into new interfaces for cross-disciplinary ocean science
Extensive instrument and cable layout for creating experiments
Flexible terrain and image engine for visualizing site
True 3D/4D science dataset visualization
Field tested in RSN observatory layout and on ocean expeditions
Cross platform and extensible with python and workflow systems
49. 3/12/09 Bill Howe, UW 49VisTrails + GridFields
Trident
Microsoft Research scientific workflow system
Visual programming environment for connecting tasks
Science-specific task libraries including one for ocean sciences
Automated provenance capture, monitoring, and fault tolerance
Runs on local system, Windows server, or HPC Cluster
Cross platform with Silverlight and web service interface
50. 3/12/09 Bill Howe, UW 50VisTrails + GridFields
Azure
Microsoft’s cloud computing platform
Provides storage and computing as pay-as-you-go services
From development standpoint, system looks like provisioned VMs
SQL, table, and blob (file system) storage models are included
Access to storage via RESTful HTTP interface
51. 3/12/09 Bill Howe, UW 51VisTrails + GridFields
Azure Ocean
COVE + Trident + Azure provides visual analytics to scientists
Any component (Visualization, Computing, or Data) can be provisioned locally, on a server, or in the cloud
When on same machine, system APIs are leveraged for speed
When distributed, communication is through HTTP and RESTful APIs
Flexible platform for the diverse ocean science needs
53. 3/12/09 Bill Howe, UW 53VisTrails + GridFields
MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
Processes input key/value pair
Produces set of intermediate pairs
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
map (in_key, in_value) -> list(out_key, intermediate_value)
reduce (out_key, list(intermediate_value)) -> list(out_value)
slide source: Google, Inc.
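The programming model above can be sketched in a few lines of single-process Python. This is an illustration of the map / shuffle / reduce contract only, not Hadoop's implementation; the function names and the salinity records are assumptions made for the example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle (group by key), and reduce over (key, value) records."""
    intermediate = defaultdict(list)
    # Map phase: each input pair yields zero or more intermediate pairs.
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            # Shuffle: collect all intermediate values under their key.
            intermediate[out_key].append(value)
    # Reduce phase: combine the values for each key into output values.
    return {key: reduce_fn(key, values) for key, values in sorted(intermediate.items())}


def map_month(in_key, in_value):
    """Emit (month, salinity) for one observation."""
    month, salinity = in_value
    return [(month, salinity)]


def reduce_mean(out_key, values):
    """Merge all salinities for a month into a single average."""
    return [sum(values) / len(values)]


# Made-up observations: (observation id, (month, salinity in psu)).
records = [("obs1", ("Feb", 30.0)), ("obs2", ("Feb", 32.0)), ("obs3", ("May", 20.0))]
print(map_reduce(records, map_month, reduce_mean))
```

In a real cluster the shuffle moves data between machines; here it is just a dictionary, which is enough to show where the grouping happens.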
55. 3/12/09 Bill Howe, UW 55VisTrails + GridFields
Isosurface Example
<Vis movie>
Key idea: Zooplankton correlated with temperature
57. 3/12/09 Bill Howe, UW 57VisTrails + GridFields
Example Query: Climatology
[Animation: Average Surface Salinity (psu) by Month, Columbia River Plume 1999-2006; panels for Feb and May; map labels: Columbia River, Washington, Oregon]
58. 3/12/09 Bill Howe, UW 58VisTrails + GridFields
UW + Utah CluE Program
Goals
10+-year “climatologies” at interactive speeds
…with provenance, reproducibility, collaboration …on a
shared-nothing, commodity platform
In general: Explore the intersection of scientific
databases and scientific visualization, at scale
Methods
“Cloud-Enable” two projects
GridFields: Query algebra for mesh data
VisTrails: Scientific workflow and provenance
60. 3/12/09 Bill Howe, UW 60VisTrails + GridFields
Converging Requirements
Vis: “Query-driven Visualization”
Vis: “In Situ Visualization”
Vis: “Remote Visualization”
DB: Millions of tuples per result
Vis DB
61. 3/12/09 Bill Howe, UW 61VisTrails + GridFields
Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
62. 3/12/09 Bill Howe, UW 62VisTrails + GridFields
Core Vis Algorithms in MapReduce
Scalar/Volume Rendering
Map: Rasterization
Reduce: Compositing, blending
Isosurface Extraction
Map: Isosurface Extraction
Reduce: Combine like isovalues
Mesh Simplification
Map: Bin vertices
Reduce: Collapse binned triangles
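As one illustration of the map/reduce split listed above, here is a single-process sketch of mesh simplification by vertex clustering. It is an assumption about how "bin vertices" and "collapse binned triangles" might look, not the code used in this work; the uniform grid, the `cell_size` parameter, and the mean-position representative are all choices made for the example.

```python
from collections import defaultdict


def bin_vertex(vertex, cell_size):
    """Map step: assign a vertex to the integer grid cell that contains it."""
    return tuple(int(c // cell_size) for c in vertex)


def simplify(triangles, cell_size):
    """Collapse all vertices in a grid cell to one point and drop degenerate triangles."""
    # Map: bin every vertex occurrence by grid cell.
    cells = defaultdict(list)
    for tri in triangles:
        for v in tri:
            cells[bin_vertex(v, cell_size)].append(v)
    # One representative per cell: the mean of the vertex occurrences binned there.
    reps = {
        cell: tuple(sum(coord) / len(vs) for coord in zip(*vs))
        for cell, vs in cells.items()
    }
    # Reduce: rewrite triangles in terms of cells, dropping collapsed and duplicate ones.
    seen = set()
    simplified = []
    for tri in triangles:
        keys = tuple(bin_vertex(v, cell_size) for v in tri)
        if len(set(keys)) < 3:
            continue  # two or more corners fell into the same cell
        if frozenset(keys) in seen:
            continue  # another triangle already covers these three cells
        seen.add(frozenset(keys))
        simplified.append(tuple(reps[k] for k in keys))
    return simplified


# Made-up mesh: the first triangle fits inside one cell and disappears.
triangles = [
    ((0.1, 0.1, 0.0), (0.2, 0.1, 0.0), (0.1, 0.2, 0.0)),
    ((0.1, 0.1, 0.0), (1.5, 0.1, 0.0), (0.1, 1.5, 0.0)),
]
print(len(simplify(triangles, cell_size=1.0)), "triangle(s) remain")
```

The grid cell is a natural shuffle key: vertices in the same cell can be sent to the same reducer regardless of which mapper read them.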
66. 3/12/09 Bill Howe, UW 66VisTrails + GridFields
“Query-Driven Visualization”
Vis perspective:
query = subsetting
DB perspective:
query = manipulation, preparation, restructuring, index-building,
aggregation, regridding, downsampling, simplification,
reformatting, etc.
Database Maxims:
1. Push the computation to the data.
2. Declarative programming is a good thing.
67. 3/12/09 Bill Howe, UW 67VisTrails + GridFields
Why Cloud?
“Cloud”?
Software as a Service (SaaS)
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Working definition:
General, elastic, data-intensive, scalable computing
This work: Vis techniques + DB techniques in the Cloud
68. 3/12/09 Bill Howe, UW 68VisTrails + GridFields
Shared Nothing Parallel Databases
Teradata
Greenplum
Netezza
Aster Data Systems
Datallegro
Vertica
MonetDB (recently commercialized as “Vectorwise”)
Microsoft
69. 3/12/09 Bill Howe, UW 69VisTrails + GridFields
Taxonomy of Parallel Architectures
Easiest to program, but
$$$$
Scales to 1000s of nodes
70. 3/12/09 Bill Howe, UW 70VisTrails + GridFields
VisTrails
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
71. 3/12/09 Bill Howe, UW 71VisTrails + GridFields
Version Tree
72. 3/12/09 Bill Howe, UW 72VisTrails + GridFields
Collaboration
Bill Howe @ UW
computes salt flux
using GridFields
Erik Anderson @ Utah
adds vector
streamlines and
adjusts opacity
Bill Howe @ UW adds
an isosurface of
salinity
Peter Lawson adds
discussion of the
scientific
interpretation
Howe et al., eScience 2008
73. 3/12/09 Bill Howe, UW 73VisTrails + GridFields
Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
75. 3/12/09 Bill Howe, UW 75VisTrails + GridFields
Hadoop in VisTrails
Wrap Hadoop Streaming/HDFS Operations
Plug “PreProcess” to actual Vis Pipeline
76. 3/12/09 Bill Howe, UW 76VisTrails + GridFields
Hadoop in VisTrails
Provenance and Monitoring
77. 3/12/09 Bill Howe, UW 77VisTrails + GridFields
Preliminary results
Managing Hadoop jobs with VisTrails
GridField queries in Hadoop
Core Visualization algorithms in Hadoop
78. 3/12/09 Bill Howe, UW 78VisTrails + GridFields
All Science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical -> Analytical -> Computational -> X-informatics
79. 3/12/09 Bill Howe, UW 79VisTrails + GridFields
Key Idea: Declarative Languages
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
Query plan:
join (o.item = i.item)
  select (date = today())
    scan Order o
  scan Item i
Find all orders from today, along with the items ordered
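The declarative query on this slide leaves the choice of plan to the system. As a rough sketch of the scan / select / join plan it compiles to, here are the three operators written as Python generators. The row layout (dicts with `item`, `date`, and `name` keys) and the `today` parameter are assumptions for the example.

```python
import datetime
from collections import defaultdict


def scan(table):
    """Scan: produce every row of a table."""
    yield from table


def select(rows, predicate):
    """Select: keep only rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def join(left, right, key):
    """Hash join: index the right input on `key`, then probe with each left row."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**r, **l}


def orders_today(orders, items, today):
    """Plan: join(select(scan(Order), date = today), scan(Item)) on item."""
    todays_orders = select(scan(orders), lambda o: o["date"] == today)
    return list(join(todays_orders, scan(items), "item"))


# Made-up rows for illustration.
today = datetime.date(2009, 3, 12)
orders = [
    {"item": 1, "date": today},
    {"item": 2, "date": today - datetime.timedelta(days=1)},
]
items = [{"item": 1, "name": "sensor"}, {"item": 2, "name": "cable"}]
print(orders_today(orders, items, today))
```

The same SQL text could equally be executed by joining first and filtering afterwards; picking between such plans is the optimizer's job, not the user's.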
80. 3/12/09 Bill Howe, UW 80VisTrails + GridFields
Example System: Teradata
AMP = unit of parallelism
81. 3/12/09 Bill Howe, UW 81VisTrails + GridFields
Example System: Teradata
AMP 1, AMP 2, AMP 3 (each): scan Order o, then select date=today(), then hash h(item)
AMP 4, AMP 5, AMP 6
82. 3/12/09 Bill Howe, UW 82VisTrails + GridFields
Example System: Teradata
AMP 1, AMP 2, AMP 3 (each): scan Item i, then hash h(item)
AMP 4, AMP 5, AMP 6
83. 3/12/09 Bill Howe, UW 83VisTrails + GridFields
Example System: Teradata
AMP 4, AMP 5, AMP 6 (each): join on o.item = i.item
AMP 4 contains all orders and all lines where hash(item) = 1
AMP 5 contains all orders and all lines where hash(item) = 2
AMP 6 contains all orders and all lines where hash(item) = 3
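The three Teradata slides walk through a hash-partitioned parallel join. A single-process sketch of the idea follows, assuming three partitions stand in for AMP 4 to AMP 6 and Python's built-in `hash` stands in for h(item); none of this reflects Teradata's actual internals.

```python
from collections import defaultdict


def partition(rows, key, n_parts):
    """Route each row to a partition chosen by hashing its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def local_join(orders, items):
    """Join the rows that landed in one partition; no other partition is consulted."""
    by_item = defaultdict(list)
    for i in items:
        by_item[i["item"]].append(i)
    return [{**i, **o} for o in orders for i in by_item.get(o["item"], [])]


def parallel_join(orders, items, n_amps=3):
    """Hash both inputs on `item`, then join each partition independently."""
    order_parts = partition(orders, "item", n_amps)
    item_parts = partition(items, "item", n_amps)
    result = []
    for o_part, i_part in zip(order_parts, item_parts):
        # Each iteration corresponds to the work of one receiving AMP.
        result.extend(local_join(o_part, i_part))
    return result


# Made-up rows for illustration.
orders = [{"item": k % 5, "order_id": k} for k in range(10)]
items = [{"item": k, "name": f"item{k}"} for k in range(5)]
print(len(parallel_join(orders, items)), "joined rows")
```

Because matching rows always hash to the same partition, the per-partition joins never need to exchange data, which is what lets the join scale across shared-nothing nodes.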
84. 3/12/09 Bill Howe, UW 84VisTrails + GridFields
Workflow Execution Plans
Need execution plans spanning client/server/cloud
85. 3/12/09 Bill Howe, UW 85VisTrails + GridFields
Example: Isosurface Browsing
86. 3/12/09 Bill Howe, UW 86VisTrails + GridFields
Example: Isosurface Browsing
Plan A
For each of tstep 0, tstep 1, tstep 2, tstep 3: Subset
87. 3/12/09 Bill Howe, UW 87VisTrails + GridFields
Example: Isosurface Browsing
Plan B: Build an index
Build Index, e.g., an Interval Tree (Cignoni 97)
For each of tstep 0, tstep 1, tstep 2, tstep 3: Subset, then Isosurface, then Render
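To show why Plan B pays off when a user browses many isovalues, here is a deliberately simple index sketch. It is not an interval tree: it sorts cells by their minimum value and uses binary search, a stand-in chosen for brevity. The `(cell_id, vmin, vmax)` layout and the class name are assumptions.

```python
import bisect


class MinSortedIndex:
    """Index cells by minimum scalar value to skip cells that cannot contain an isovalue."""

    def __init__(self, cells):
        # cells: iterable of (cell_id, vmin, vmax) scalar ranges, built once per timestep.
        self._cells = sorted(cells, key=lambda c: c[1])
        self._mins = [c[1] for c in self._cells]

    def query(self, isovalue):
        """Return ids of cells whose [vmin, vmax] range contains the isovalue."""
        # Cells past this position have vmin > isovalue and are never touched.
        hi = bisect.bisect_right(self._mins, isovalue)
        return [cid for cid, vmin, vmax in self._cells[:hi] if vmax >= isovalue]


# Made-up cell ranges for illustration.
index = MinSortedIndex([("a", 0.0, 1.0), ("b", 0.5, 2.0), ("c", 3.0, 4.0)])
for iso in (0.75, 2.5, 3.0):
    print(iso, index.query(iso))
```

The index is built once and reused for every isovalue the user sweeps through; an interval tree would also prune on the maximum, at the cost of a more involved structure.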
88. 3/12/09 Bill Howe, UW 88VisTrails + GridFields
Example: Isosurface Browsing
Plan C: Build a spatial index to support panning
Plan D: Build a multi-resolution index to support zoom
…and so on
Why not precompute all appropriate indexes?
Some will (partially) reside on client
Storage is not as cheap as we pretend
Need a flexible system where
a “query result” can be explored interactively, and
we prepare for similar queries
similarity defined by natural “browsing patterns” in visualization
systems
90. 3/12/09 Bill Howe, UW 90VisTrails + GridFields
Why MapReduce/Hadoop?
Popular
AWS Elastic MapReduce
100s of startups
# of downloads
# of blog posts
Free as in Speech
Free as in Beer
Flexible, Lightweight
Scalable
Fault-tolerant
98. 3/12/09 Bill Howe, UW 98VisTrails + GridFields
As a GridField Expression
⊗
H0 : (x,y,b) V0 : (σ )
apply(0, z=(surf − b) * σ )
bind(0, surf)
C
H = Scan(contxt, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
T = Scan(contxt, "T")
V = Scan(contxt, "V")
HxV = Cross(H, V)
HxVxT = Cross(HxV, T)
salt = Bind(contxt, HxVxT, "salt")
onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())
99. 3/12/09 Bill Howe, UW 99VisTrails + GridFields
As a SQL Query
Select hpos, vpos, avg(salt)
from ocean
group by hpos, vpos
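Both the GridField expression and the SQL query describe the same aggregation: average salinity over time at each (hpos, vpos) position. A plain-Python sketch of that computation follows, assuming rows arrive as `(hpos, vpos, salt)` tuples with one row per timestep; the sample values are made up.

```python
from collections import defaultdict


def average_salt_by_position(rows):
    """Group rows by (hpos, vpos) and average salt within each group."""
    sums = defaultdict(lambda: [0.0, 0])  # position -> [running total, row count]
    for hpos, vpos, salt in rows:
        acc = sums[(hpos, vpos)]
        acc[0] += salt
        acc[1] += 1
    return {pos: total / count for pos, (total, count) in sums.items()}


# Made-up observations: two timesteps at (0, 0), one at (0, 1).
rows = [(0, 0, 30.0), (0, 0, 32.0), (0, 1, 10.0)]
print(average_salt_by_position(rows))
```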
100. 3/12/09 Bill Howe, UW 100VisTrails + GridFields
Scientific Workflow Systems
Value proposition: More time on science, less time on code
How: By providing language features emphasizing sharing,
reuse, reproducibility, rapid prototyping, efficiency
Provenance
Visual programming
Caching
Integration with domain-specific tools
Scheduling
101. 3/12/09 Bill Howe, UW 101VisTrails + GridFields
Related Vis Work
Parallel visualization systems
ParaView, VisIt
Query-Driven Visualization
[Bethel et al 2006,2008,2009]
FastBit Index
[Shoshani et al 2007]
DB Vis systems
Tableau
102. 3/12/09 Bill Howe, UW 102VisTrails + GridFields
Feeding the Pipeline
source: Ken Moreland
missing step?
104. 3/12/09 Bill Howe, UW 104VisTrails + GridFields
Role 2: Move Computation to the Data
“Transferring the whole data generated … to a storage device or a
visualization machine could become a serious bottleneck, because I/O
would take most of the … time. A more feasible approach is to reduce
and prepare the data in situ for subsequent visualization and data
analysis tasks.”
-- SciDAC Review
105. 3/12/09 Bill Howe, UW 105VisTrails + GridFields
Remote Visualization
Reduce and render remotely, transfer images
++ transfers less data
-- specialized hardware, high load
Reduce remotely, transfer data/geometry, render locally
++ uses local graphics pipeline
-- transfers more data
107. 3/12/09 Bill Howe, UW 107VisTrails + GridFields
Scientific Vis System Roundup
General
ParaView [KitWare, Los Alamos, Sandia]
VisIt [LLNL]
Specialized
SALSA, particles, Quinn, UW
VISUS, streaming/progressive, Jones, LLNL
SAGE,
Hyperwall, tiled display, NASA
Editor's Notes
Drowning in data; starving for information
We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
“Typical large pharmas today are generating 20 terabytes of data daily. That’s probably going up to 100 terabytes per day in the next year or so.”
“tens of terabytes of data per day” -- genome center at Washington University
Increase data collection exponentially with flowcam
Analytics and Visualization are mutually dependent
Scalability
Fault-tolerance
Exploit shared-nothing, commodity clusters
In general: Move computation to the data
Data is ending up in the cloud; we need to figure out how to use it.
Visualization is a more efficient way to query data -- you can browse and explore.
But you need to be able to switch back and forth between interactive browsing and symbolic querying
What exactly is Ad Hoc Research data?
It is data that can come in any size, shape, or form, and that is heterogeneous in structure, format, quality, and more.
(granted we had a minute for Bill (clearly Bill) to describe this new eScience movement)
We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve.
Essentially, we want to remove the speed-bump of data handling from the scientists.
To begin, we ask, what kind of questions would you ask your data once you have it ready to be worked on?
Just about EVERY question that we have heard a scientist would ask, we have found an equivalent SQL statement counterpart.
If we could just turn their questions into SQL our job would be done, but there are many other problems to solve before that becomes a reality. For example, their data may not reside in a relational database.
This brings us to part of our next problem: how can we bring the power of SQL to the scientists to solve their questions without the overhead of everything that a database administrator would need to do?
One claim we are trying to prove with this project is that scientists are not afraid to learn a bit of SQL.
In our first-generation deployment, we used an ASP.NET front end on the Windows Azure cloud to host our web service and Amazon’s EC2 cloud as the back end to host our Microsoft SQL Server database.
Data products are the currency of scientific and statistical communication with the public
Ex: Obama map
Ex: Mars Rover pictures generate 218M hits in 24 hrs
But: Datasets are growing too big and too complex to view through a few static images
Scientists want to create interactive visualizations that allow others to explore their results
Ex: NASA 3D with Photosynth
Ex: CAMERA
Ex:
On the order of hundreds of points. Manual browsing.
Data-intensive science
This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
Need to consider private clouds
Not just renting hardware: general-purpose data processing
The goal here is to make Shared Nothing Architectures easier to program.
We only wrap the interface for Hadoop Streaming in VisTrails, with additional support for HDFS operations to upload and download data and libraries for the job.
Hadoop Streaming is plugged into a local VTK rendering pipeline that grabs data from the cloud and generates an animation on the VisTrails Spreadsheet.
Users can specify their own Python source as the mapper and reducer. In this case, a VTK script is specified in the mapper. The VTK libraries are also shipped along with the code to the computing node, using the underlying -cacheArchive option of Hadoop Streaming.
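For readers unfamiliar with Hadoop Streaming, here is a minimal sketch of what a Python mapper and reducer look like. Streaming feeds each step lines of text and expects tab-separated key/value lines back, with the reducer's input sorted by key. The example averages salinity by month rather than running VTK, and the `month,salinity` input format is an assumption.

```python
from itertools import groupby


def mapper(lines):
    """Turn 'month,salinity' input lines into 'month<TAB>salinity' output lines."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) != 2:
            continue  # skip malformed input lines
        month, salinity = fields
        yield f"{month}\t{salinity}"


def reducer(lines):
    """Average the values for each key; input lines must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for month, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [float(value) for _, value in group]
        yield f"{month}\t{sum(values) / len(values)}"


# Local dry run: sorted() plays the role of Hadoop's shuffle-and-sort.
sample = ["Feb,30.0\n", "May,20.0\n", "Feb,32.0\n"]
for out in reducer(sorted(mapper(sample))):
    print(out)
```

In an actual streaming job each function would sit in its own script that reads `sys.stdin` and prints to standard output; the local dry run is a cheap way to check the logic before submitting.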
By default, Hadoop logs are written to the standard output of the VisTrails app. Jobs are killed by terminating the program and running an extra command returned by Hadoop. However, one can plug a HadoopTrackerCell into the end of the pipeline to have log messages monitored on the VisTrails Spreadsheet. There are also buttons to kill the job or show the Job Tracker, which automatically connects through the CluE-specific proxy to show additional logs and error messages for jobs.
Need to assign workflows to resources for execution in a heterogeneous compute environment. Parts of this workflow can be compiled into Hadoop jobs, parts should be run locally so that they exploit hardware acceleration.
But this is not just computation placement -- there are different execution plans, similar to relational execution plans.
GridFields expressions can be algebraically optimized, for example.
We can’t just precompute the indexes, since they may reside on
Upper left: Average
Sweeping through the velocity fields quickly exposed the location of the “upstream” salt flux -- where salty water made its way back upstream.