Visual Data Analytics in the Cloud for Exploratory Science

Bill Howe, UW
Huy Vo, Utah
Claudio Silva, Utah
Juliana Freire, Utah
YingYi Bu, UW

3/12/09 Bill Howe, UW 2VisTrails + GridFields
Data acquisition is no longer the bottleneck
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
 Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
 Oceanography: high-resolution models, cheap sensors, satellites
 Biology: lab automation, high-throughput sequencing

Two dimensions
[Chart axes: # of bytes, # of apps. Domains: Biology, Oceanography, Astronomy. Projects shown: LSST, SDSS, Galaxy, BioMart, GEO, IOOS, OOI, LANL HIV, Pathway Commons, PanSTARRS]

This Talk
 # of Bytes: MapReduce for Scientific Viz
 # of Apps: Other VDA Projects

Converging Requirements
Vis DB

Why Vis Needs DB
“Transferring the whole data generated … to a storage device or a visualization
machine could become a serious bottleneck, because I/O would take most of the …
time. A more feasible approach is to reduce and prepare the data in situ for
subsequent visualization and data analysis tasks.”
-- SciDAC Review
Current Research Topics in Vis:
• “Query-driven Visualization”
• “In Situ Visualization”
• “Remote Visualization”

Why DB Needs Vis

Why DB Needs Vis (2)
“What does the salt wedge look like?”

Thesis
 We can no longer afford to build separate
visualization and data management systems
 Data is increasingly destined for the cloud
 First Attack: Implement Vis primitives in an
existing “cloud” DM system

Core Vis Algorithms in MapReduce
 Scalar/Volume Rendering
 Isosurface Extraction
 Mesh Simplification

Some distributed algorithm…
Map
(Shuffle)
Reduce

CluE Cluster
 410 nodes
 Dual Intel Xeon 2.8GHz, hyperthreading
 8GB main memory each
 Hadoop, no access to OS
 Google provided, IBM maintained, NSF funded

CluE Cluster Scaling

Isosurface Example

Isosurface Extraction

Isosurface Results
O(N^2), O(N)

Scalable Rendering

Scalable Rendering
 Left: Atlas
 18GB
 500M triangles
 Right: St. Matthew
 13GB
 372M triangles
 Laser scans, Digital Michelangelo Project
source: Digital Michelangelo Project

Rendering Results

Roadmap
 # of Bytes: MapReduce for Scientific Viz
 # of Apps: Other VDA projects
 Azure Ocean
 SQLShare
 Automating Mashups

[John Delaney, University of Washington]

Azure Ocean
COVE for Visualization + Trident for Processing + Azure for Data

SQLShare: Query Services
for Ad Hoc Research Data

Ad Hoc Research Data
5/18/10 Garret Cole, eScience Institute
Fasta format
Spread sheets
Tabular data

Problem
“I spend 90% of my time handling
data rather than doing science”
-- Robin Kodner, Postdoc, Armbrust Lab

An observation about “handling data”
 How often does each RNA hit appear inside my
annotated surface group?
 SELECT hit, COUNT(*) as cnt FROM tigrfamannotation_surface
GROUP BY hit ORDER BY cnt DESC
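
The query above can be tried end to end with a small in-memory database. This is a minimal sketch using Python's built-in sqlite3 module; only the table name, the column name, and the query come from the slide, while the sample rows (and the single-column schema) are hypothetical.

```python
import sqlite3

# Hypothetical sample data; the real table presumably has more columns.
rows = [
    ("TIGR00001",), ("TIGR00002",), ("TIGR00001",),
    ("TIGR00003",), ("TIGR00001",), ("TIGR00002",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tigrfamannotation_surface (hit TEXT)")
conn.executemany("INSERT INTO tigrfamannotation_surface VALUES (?)", rows)

# The query from the slide, unchanged apart from layout.
query = """
SELECT hit, COUNT(*) AS cnt
FROM tigrfamannotation_surface
GROUP BY hit
ORDER BY cnt DESC
"""
hit_counts = conn.execute(query).fetchall()

for hit, cnt in hit_counts:
    print(hit, cnt)
```

The point of the sketch is that the whole "data handling" task is one declarative statement; no loops or custom parsing are needed once the data is in a table.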

Discovery: SQL Does not Terrify Scientists

Technology used in 1st Gen
Component Stack

SQLShare Redux
 Conventional wisdom says “Scientists won’t write SQL”
 We don’t believe it!
 Instead, we implicate difficulty in
 installation
 configuration
 schema design
 performance tuning
 data ingest
 over-reliance on GUIs
 Critical need for visualization
 Clear role for Tableau!
We are asking “What kind of platform will
make SQL useful for scientific inquiry?”

Automating Mashups

Why Mashups?
 Jim Gray: # of datasets scales as N^2
 Each pairwise comparison generates a new dataset
 Corollary: # of apps scales as N^2
 Every pairwise comparison motivates a new mashup
 To keep up, we need to
 entrain new programmers,
 make existing programmers more productive,
 or both
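
The quadratic growth can be made concrete by counting unordered pairs: N datasets yield N(N-1)/2 pairwise comparisons, which grows as N^2. A minimal sketch, with hypothetical dataset names:

```python
from itertools import combinations


def pairwise_comparisons(datasets):
    """Return every unordered pair of datasets (each pair is a candidate mashup)."""
    return list(combinations(datasets, 2))


# Hypothetical dataset names, for illustration only.
datasets = ["salinity", "temperature", "zooplankton", "satellite"]
pairs = pairwise_comparisons(datasets)

n = len(datasets)
print(len(pairs), "pairs from", n, "datasets")
print("closed form N(N-1)/2 =", n * (n - 1) // 2)
```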

Satellite Images + Crime Incidence Reports

Twitter Feed + Flickr Stream

Why Mashups?
 The days when one’s data fit into a 15-page research paper are past.
 Datasets are too large and complex to be conveyed with a handful
of static images
 Prediction: succinct, targeted, interactive web apps will become the
currency of scientific communication
 with the public
 with policy makers
 with colleagues in other disciplines
 with peers
 with students (K-12 through grad)

Tableau
Mashups

Conclusions
 Converging requirements for DB and Vis
 At high scale:
 A Vis library in MapReduce
 At high complexity:
 Azure Ocean

Data + Workflow + Vis

“Client + Cloud”, “Computational mobility”
 SQLShare

Ad Hoc data -- “anything goes”

Visualization critical
 (semi-)automated mashups

“Show me what’s interesting”

Acknowledgments
http://escience.washington.edu

BACKUP SLIDES

[John Delaney, University of Washington]

John Delaney

COVE for Visualization + Trident for Processing + Azure for Data

COVE
 Research into new interfaces for cross-disciplinary ocean science
 Extensive instrument and cable layout for creating experiments
 Flexible terrain and image engine for visualizing site
 True 3D/4D science dataset visualization
 Field tested in RSN observatory layout and on ocean expeditions
 Cross platform and extensible with Python and workflow systems

Trident
 Microsoft Research scientific workflow system
 Visual programming environment for connecting tasks
 Science-specific task libraries including one for ocean sciences
 Automated provenance capture, monitoring, and fault tolerance
 Runs on local system, Windows server, or HPC Cluster
 Cross platform with Silverlight and web service interface

Azure
 Microsoft’s cloud computing platform
 Provides storage and computing as pay-as-you-go services
 From development standpoint, system looks like provisioned VMs
 SQL, table, and blob (file system) storage models are included
 Access to storage via RESTful HTTP interface

 COVE + Trident + Azure provide visual analytics to scientists
 Any component (Visualization, Computing, or Data) can be provisioned locally, on a server, or in the cloud
 When on same machine, system APIs are leveraged for speed
 When distributed, communication is through HTTP and RESTful APIs
 Flexible platform for the diverse ocean science needs

MapReduce Programming Model
 Input & Output: each a set of key/value pairs
 Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
 Processes input key/value pair
 Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
 Combines all intermediate values for a particular key
 Produces a set of merged output values (usually just one)
slide source: Google, Inc.
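
The programming model can be illustrated with a tiny single-process driver. This is a sketch of the model only, not of Hadoop's API: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the word-count task and input strings are chosen purely for illustration.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """Map: process one input pair, emit a list of (out_key, intermediate_value)."""
    return [(word, 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """Reduce: combine all intermediate values for one key into a list of outputs."""
    return [sum(intermediate_values)]


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, group by key (the shuffle), then reduce, all in one process."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)  # shuffle: route values to their key
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Hypothetical input: (line number, line text) pairs.
inputs = [(0, "salt wedge salt"), (1, "wedge plume")]
result = run_mapreduce(inputs, map_fn, reduce_fn)
print(result)
```

In a real cluster the map calls, the shuffle, and the reduce calls run on different machines; the function signatures are the only part the programmer supplies.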

Isosurface Example

Isosurface Example
[Vis movie]
Key idea: Zooplankton correlated with temperature

Example Query Results

Example Query: Climatology
[Animation: Average Surface Salinity by Month, Columbia River Plume 1999-2006. Panels: Feb, May. Scale in psu. Map labels: Columbia River, Washington, Oregon]

UW + Utah CluE Program
 Goals
 10+-year “climatologies” at interactive speeds
 …with provenance, reproducibility, collaboration …on a
shared-nothing, commodity platform
 In general: Explore the intersection of scientific
databases and scientific visualization, at scale
 Methods
 “Cloud-Enable” two projects

GridFields: Query algebra for mesh data

VisTrails: Scientific workflow and provenance

Converging Requirements
Vis: “Query-driven Visualization”
Vis: “In Situ Visualization”
Vis: “Remote Visualization”
DB: Millions of tuples per result
Vis DB

Preliminary results
 Managing Hadoop jobs with VisTrails
 GridField queries in Hadoop
 Core Visualization algorithms in Hadoop

Core Vis Algorithms in MapReduce
 Scalar/Volume Rendering
 Map: Rasterization
 Reduce: Compositing, blending
 Isosurface Extraction
 Map: Isosurface Extraction
 Reduce: Combine like isovalues
 Mesh Simplification
 Map: Bin vertices
 Reduce: Collapse binned triangles
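
As an illustration of the first decomposition (Map: rasterization, Reduce: compositing), the sketch below keys fragments by pixel and keeps the nearest one. It is a deliberate simplification and not the talk's implementation: primitives are axis-aligned rectangles with a constant depth rather than triangles, compositing is a plain nearest-depth test with no blending, and all names are hypothetical.

```python
from collections import defaultdict


def rasterize(prim):
    """Map: emit ((x, y), (depth, color)) for every pixel the primitive covers."""
    x0, y0, x1, y1 = prim["bbox"]  # half-open pixel ranges [x0, x1) x [y0, y1)
    return [((x, y), (prim["depth"], prim["color"]))
            for y in range(y0, y1) for x in range(x0, x1)]


def composite(pixel, fragments):
    """Reduce: keep the color of the fragment with the smallest depth."""
    depth, color = min(fragments)
    return color


def render(primitives):
    """Single-process stand-in for map -> shuffle (group by pixel) -> reduce."""
    fragments_by_pixel = defaultdict(list)
    for prim in primitives:
        for pixel, fragment in rasterize(prim):
            fragments_by_pixel[pixel].append(fragment)
    return {pixel: composite(pixel, frags)
            for pixel, frags in fragments_by_pixel.items()}


# Two overlapping hypothetical primitives; the blue one is closer to the camera.
primitives = [
    {"bbox": (0, 0, 2, 2), "depth": 5.0, "color": "red"},
    {"bbox": (1, 1, 3, 3), "depth": 2.0, "color": "blue"},
]
image = render(primitives)
print(image[(0, 0)], image[(1, 1)])
```

Using the pixel as the shuffle key means each reducer sees every fragment for its pixels, so compositing needs no communication between reducers.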

ATLAS dataset

Rendering (not CluE)
# of mappers
57-node Nehalem

Isosurface Extraction (Preliminary)
32
48
64
96
128

“Query-Driven Visualization”
 Vis perspective:
 query = subsetting
 DB perspective:
 query = manipulation, preparation, restructuring, index-building,
aggregation, regridding, downsampling, simplification,
reformatting, etc.
Database Maxims:
1. Push the computation to the data.
2. Declarative programming is a good thing.

Why Cloud?
 “Cloud”?
 Software as a Service (SaaS)
 Infrastructure as a Service (IaaS)
 Platform as a Service (PaaS)
 Working definition:
General, elastic, data-intensive, scalable computing
This work: Vis techniques + DB techniques in the Cloud

Shared Nothing Parallel Databases
 Teradata
 Greenplum
 Netezza
 Aster Data Systems
 DATAllegro (Microsoft)
 Vertica
 MonetDB (recently commercialized as “Vectorwise”)

Taxonomy of Parallel Architectures
Easiest to program, but
$$$$
Scales to 1000s of nodes

screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
VisTrails

screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
Version Tree

Collaboration
Bill Howe @ UW
computes salt flux
using GridFields
Erik Anderson @ Utah
adds vector
streamlines and
adjusts opacity
Bill Howe @ UW adds
an isosurface of
salinity
Peter Lawson adds
discussion of the
scientific
interpretation
Howe et al., eScience 2008

Preliminary results

Hadoop in VisTrails
 Wrap Hadoop Streaming/HDFS Operations
 Plug “PreProcess” to actual Vis Pipeline

Hadoop in VisTrails
 Provenance and Monitoring

Preliminary results

All Science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, independent of hypotheses)
 Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
 Medicine: ubiquitous digital records, MRI, ultrasound
 Oceanography: high-resolution models, cheap sensors, satellites
 Biology: lab automation, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
Empirical X -> Analytical X -> Computational X -> X-informatics

Key Idea: Declarative Languages
SELECT *
FROM Order o, Item i
WHERE o.item = i.item
AND o.date = today()
Plan:
join (o.item = i.item)
  select (date = today())
    scan Order o
  scan Item i
Find all orders from today, along with the items ordered
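
The declarative query leaves the system free to choose a plan such as scan, select, then join. The sketch below writes that plan out by hand as composable Python operators so the correspondence is visible. The operator functions, the row layout, and the sample data are all hypothetical, and the join is implemented as a simple hash join, which is one possible choice among several.

```python
import datetime


def scan(table):
    """Scan: yield every row of a table."""
    yield from table


def select(rows, predicate):
    """Select: keep only rows satisfying the predicate."""
    return (row for row in rows if predicate(row))


def join(left, right, left_key, right_key):
    """Hash join: build a lookup on the right input, probe it with the left."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    for l in left:
        for r in index.get(l[left_key], []):
            merged = {"o." + k: v for k, v in l.items()}
            merged.update({"i." + k: v for k, v in r.items()})
            yield merged


today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)

# Hypothetical tables.
orders = [
    {"id": 1, "item": "A", "date": today},
    {"id": 2, "item": "B", "date": yesterday},
    {"id": 3, "item": "B", "date": today},
]
items = [{"item": "A", "name": "buoy"}, {"item": "B", "name": "sensor"}]

# join( select( scan(Order) ), scan(Item) ), mirroring the plan on the slide.
plan = join(select(scan(orders), lambda o: o["date"] == today),
            scan(items), "item", "item")
results = list(plan)
for row in results:
    print(row["o.id"], row["i.name"])
```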

Example System: Teradata
AMP = unit of parallelism

AMP 1, AMP 2, AMP 3 (each): scan Order o -> select date=today() -> hash h(item)
Hashed tuples are sent to AMP 4, AMP 5, AMP 6

AMP 1, AMP 2, AMP 3 (each): scan Item i -> hash h(item)
Hashed tuples are sent to AMP 4, AMP 5, AMP 6

AMP 4, AMP 5, AMP 6 (each): join on o.item = i.item
Each join AMP contains all orders and all lines for its hash values, e.g., where hash(item) = 1
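
The repartitioning step can be simulated in a single process: both relations are hashed on the join key so that matching tuples land on the same AMP, and each AMP then joins only its own partition. This is a sketch of the general partitioned hash join idea, not of Teradata internals; the hash function, the tables, and the helper names are hypothetical, and the date filter on orders is omitted for brevity.

```python
from collections import defaultdict

NUM_JOIN_AMPS = 3  # stands in for AMP 4, AMP 5, AMP 6


def h(item):
    """Deterministic toy hash (Python's built-in hash() of str varies per process)."""
    return sum(item.encode()) % NUM_JOIN_AMPS


def repartition(rows, key):
    """Route each row to the AMP chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[h(row[key])].append(row)
    return partitions


def local_join(orders, items):
    """Join performed independently on one AMP."""
    by_item = defaultdict(list)
    for i in items:
        by_item[i["item"]].append(i)
    return [(o["id"], i["name"]) for o in orders for i in by_item[o["item"]]]


# Hypothetical tables.
orders = [{"id": 1, "item": "A"}, {"id": 2, "item": "B"},
          {"id": 3, "item": "C"}, {"id": 4, "item": "A"}]
items = [{"item": "A", "name": "buoy"}, {"item": "B", "name": "sensor"},
         {"item": "C", "name": "glider"}]

order_parts = repartition(orders, "item")
item_parts = repartition(items, "item")

parallel_result = sorted(
    pair
    for amp in range(NUM_JOIN_AMPS)
    for pair in local_join(order_parts[amp], item_parts[amp])
)
print(parallel_result)
```

Because equal keys always hash to the same AMP, the union of the local joins equals the join computed on a single node.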

Workflow Execution Plans
Need execution plans spanning client/server/cloud

Example: Isosurface Browsing

 Plan A
[Diagram: Subset, Isosurface, Render applied to each of tstep 0, tstep 1, tstep 2, tstep 3]
 Plan B: Build an index
Build Index, e.g., an Interval Tree (Cignoni 97)
[Diagram: Build Index, then Subset, Isosurface, Render for each of tstep 0, tstep 1, tstep 2, tstep 3]
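
The benefit of building an index before browsing isovalues can be sketched as follows. Note the substitution: this is not the interval tree cited on the slide but a simpler stand-in, a list of cells sorted by minimum value and searched with bisect, which prunes cells whose minimum exceeds the isovalue. The class name and the cell data are hypothetical.

```python
import bisect


class MinMaxIndex:
    """Cells sorted by minimum scalar value. A simplification, NOT an interval tree."""

    def __init__(self, cells):
        # cells: iterable of (cell_id, min_value, max_value)
        self.cells = sorted(cells, key=lambda c: c[1])
        self.mins = [c[1] for c in self.cells]

    def query(self, isovalue):
        """Return ids of cells whose [min, max] range contains the isovalue."""
        # Only cells with min <= isovalue can contain it; binary search finds them.
        end = bisect.bisect_right(self.mins, isovalue)
        return sorted(cid for cid, lo, hi in self.cells[:end] if hi >= isovalue)


# Hypothetical per-cell salinity ranges.
cells = [(0, 10.0, 14.0), (1, 12.0, 20.0), (2, 25.0, 31.0), (3, 18.0, 26.0)]
index = MinMaxIndex(cells)

active = index.query(19.0)
print(active)
```

The index is built once and reused for every isovalue the user browses, which is the trade the plan makes: extra preparation and storage in exchange for faster repeated queries.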

 Plan C: Build a spatial index to support panning
 Plan D: Build a multi-resolution index to support zoom
 …and so on
 Why not precompute all appropriate indexes?
 Some will (partially) reside on client
 Storage is not as cheap as we pretend
 Need a flexible system where
 a “query result” can be explored interactively, and
 we prepare for similar queries
 similarity defined by natural “browsing patterns” in visualization
systems

Why MapReduce/Hadoop?
 Popular

AWS Elastic MapReduce

100s of startups

# of downloads

# of blog posts
 Free as in Speech
 Free as in Beer
 Flexible, Lightweight
 Scalable
 Fault-tolerant

Reducing Latency
 Online processing/progressive refinement
 Deliver approximate/partial results
 Standing Queries/Prepared plans
 Exploit indexes
Changes to Hadoop and/or other
tools required (e.g., HBase)

Masking Latency
 Caching/materialized views
 Reuse old results
 Pre-fetching
 Stage and prepare new results
 Speculative processing
 Anticipate future results
No change to Hadoop required
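
Two of these techniques, caching and pre-fetching, can be layered on the client side, which is why no change to Hadoop is needed. A minimal sketch, assuming a hypothetical `extract_isosurface` function that stands in for an expensive remote job; here the pre-fetch runs synchronously for simplicity, whereas a real system would issue it in the background.

```python
from functools import lru_cache

call_log = []  # records which requests actually reached the "expensive" job


@lru_cache(maxsize=128)
def extract_isosurface(timestep, isovalue):
    """Hypothetical stand-in for a slow job (e.g., one submitted to Hadoop)."""
    call_log.append((timestep, isovalue))
    return f"isosurface(t={timestep}, v={isovalue})"


def browse(timestep, isovalue):
    """Return the requested result, then stage the next timestep in the cache."""
    result = extract_isosurface(timestep, isovalue)
    extract_isosurface(timestep + 1, isovalue)  # pre-fetch (synchronous in this sketch)
    return result


browse(0, 28.0)   # computes timesteps 0 and 1
browse(1, 28.0)   # timestep 1 is a cache hit; computes timestep 2
browse(0, 28.0)   # fully served from cache
print(call_log)
```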

source: Antonio Baptista, NSF CMOP STC

Why Visualization? (2)
north
channel
south
channel

MapReduce?
 Hadoop simplifies parallel data processing
 ++ scalability
 ++ fault tolerance
 ++ less programming
 -- latency is an issue

Climatology Queries
[Figure (b): numbered elements, salinity in psu]

As a GridField Expression
⊗
H0 : (x,y,b) V0 : (σ )
apply(0, z=(surf − b) * σ )
bind(0, surf)
C
H = Scan(contxt, "H")
rH = Restrict("(326<x) & (x<345) & (287<y) & (y<302)", 0, H)
T = Scan(contxt, "T")
V = Scan(contxt, "V")
HxV = Cross(H, V)
HxVxT = Cross(HxV, T)
salt = Bind(contxt, HxVxT, "salt")
onemonth = Regrid(salt, HxV, equijoin("hpos,vpos"), avg())

As a SQL Query
Select hpos, vpos, avg(salt)
from ocean
group by hpos, vpos
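
The SQL form of the climatology query can be run as-is on a toy table. This is a sketch with Python's built-in sqlite3 module; the `ocean` table name and the `hpos`, `vpos`, `salt` columns come from the slide, while the time column `t`, the sample values, and the added ORDER BY (used only to make the output order deterministic) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ocean (hpos INTEGER, vpos INTEGER, t INTEGER, salt REAL)"
)

# Hypothetical samples: several timesteps per (hpos, vpos) position.
samples = [
    (0, 0, 0, 30.0), (0, 0, 1, 32.0),
    (0, 1, 0, 20.0), (0, 1, 1, 24.0),
    (1, 0, 0, 10.0),
]
conn.executemany("INSERT INTO ocean VALUES (?, ?, ?, ?)", samples)

climatology = conn.execute(
    """
    SELECT hpos, vpos, avg(salt)
    FROM ocean
    GROUP BY hpos, vpos
    ORDER BY hpos, vpos
    """
).fetchall()
print(climatology)
```

Grouping by position and averaging over the remaining (time) dimension plays the same role here as the Regrid with avg() in the GridField expression.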

Scientific Workflow Systems
 Value proposition: More time on science, less time on code
 How: By providing language features emphasizing sharing,
reuse, reproducibility, rapid prototyping, efficiency
 Provenance
 Visual programming
 Caching
 Integration with domain-specific tools
 Scheduling

Related Vis Work
 Parallel visualization systems
 ParaView, VisIt
 Query-Driven Visualization
 [Bethel et al 2006,2008,2009]
 FastBit Index
 [Shoshani et al 2007]
 DB Vis systems
 Tableau

Feeding the Pipeline
source: Ken Moreland
missing step?

Cannot Ignore “Preprocessing”
Hadoop

Role 2: Move Computation to the Data
“Transferring the whole data generated … to a storage device or a
visualization machine could become a serious bottleneck, because I/O
would take most of the … time. A more feasible approach is to reduce
and prepare the data in situ for subsequent visualization and data
analysis tasks.”
-- SciDAC Review

Remote Visualization
 Reduce and render remotely, transfer images
 ++ transfers less data
 -- specialized hardware, high load
 Reduce remotely, transfer data/geometry, render locally
 ++ uses local graphics pipeline
 -- transfers more data

Scientific Vis System Roundup
 General
 ParaView [KitWare, Los Alamos, Sandia]
 VisIt [LLNL]
 Specialized
 SALSA, particles, Quinn, UW
 VISUS, streaming/progressive, Jones, LLNL
 SAGE,
 Hyperwall, tiled display, NASA
