SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Downloaden Sie, um offline zu lesen
Active Data: Data Life Cycle Management
Across Heterogeneous Systems and
Infrastructures
Anthony Simonet, Gilles Fedak (INRIA)
Matei Ripeanu, Samer Al-Kiswany (UCB)
Kyle Chard, Ian Foster (ANL/UC)
Hot Topics in High-Performance Distributed Computing Workshop
IBM Almaden Research Center
San Jose, California
March 12, 2015
1/13
G. Fedak() Active Data March 12, 2015
Big Data ...
Huge and growing volume of information originating from multiple
sources.
!"#$%&'("$#) *+'&,-./"#)0+1)*2+("2() 34(")5-$-)!"$(%"($)
. . . or Big Bottlenecks ?
how to scale the infrastructure ?
end-to-end performance improvement, inter-system optimization.
how to improve productivity of data-intensive scientist ?
data-oriented programming language, data quality, improve
automation and errors recovery
2/13
G. Fedak() Active Data March 12, 2015
Data Life Cycle
DeïŹnition
Data Life Cycle (DLC) is the course of operational stages through which
data pass from the time when they enter a set of systems to the time
when they leave it.
!"#$%&%'()* +,-.,("-&&%)/* 01(,2/-*
!)234&%&*
!)234&%&*
Challenges :
Expose high level view DLC across distributed systems and
infrastructures
Expose interactions between the infrastructure and the DLC (e.g
failures)
3/13
G. Fedak() Active Data March 12, 2015
Active Data
Active Data:
Allow to reason about data sets handled by heterogeneous software
and infrastructures.
A formal model that captures the essential life cycle stages and
properties: creation, deletion, faults, replication, error checking . . .
programming model to develop easily data life cycle management
applications.
Allows legacy systems to expose their intrinsic data life cycle.
4/13
G. Fedak() Active Data March 12, 2015
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
‱
Created
t1
Written
t2
Read
t3
t4
Terminated
Each token has a unique identiïŹer, corresponding to the actual data
item’s.
5/13
G. Fedak() Active Data March 12, 2015
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
Created
t1
‱
Written
t2
Read
t3
t4
Terminated
A transition is ïŹred whenever a data state changes.
5/13
G. Fedak() Active Data March 12, 2015
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
Created
t1
‱
Written
t2
Read
t3
t4
Terminated
public void handler () {
computeMD5 ();
}
Code may be plugged by clients to transitions.
It is executed whenever the transition is ïŹred.
5/13
G. Fedak() Active Data March 12, 2015
Active Data Framework
6/13
G. Fedak() Active Data March 12, 2015
Life Cycle View
File transferFile Dataset Metadata
Guard
Code Execution
}Tagged Tokens
NotiïŹcation
Framework features:
Captures data events in legacy systems
High-level life cycle-centered view of data
Single namespace for all the ïŹles,
datasets and metadata
Powerful ïŹlters based on Data Tags
Install Taggers on Transitions
Guarded Transitions : only executes on
token which have speciïŹc tags.
Publish/subscribe transitions
Custom user reaction to data progress
Custom code execution
Custom notiïŹcations (twitter, email,
gdoc, ifttt . . . )
Use Case: Advanced Photon Source
Globus Catalog Globus
Detector Local Storage Compute Cluster
1. Local
Transfer
2. Extract
Metadata
3. Globus
Transfer
4. Swift Parallel Analysis
3 to 5 TB of data per week on this detector
Raw data are pre-processed and registered in the Globus Catalog :
Data are curated by several applications
Data are shared amongst scientiïŹc user
7/13
G. Fedak() Active Data March 12, 2015
Data Surveillance Framework
4 goals (that would otherwise require a lot of scripting and hacking):
Monitoring Data Set Progress
Better Automation
Sharing & NotiïŹcation
Error Discovery & Recovery
8/13
G. Fedak() Active Data March 12, 2015
APS Data Life Cycle Model
Created Start transfer
Terminated
End
Detector
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
End transfer
Globus transfer
Created
End
Terminated
Start transfer
Shared storage
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
Start Swift
Globus transfer
CreatedExtract
Update
TerminatedRemove
Globus Catalog
Created
Initialize
Set
End
Failure
Terminated
Derive
Swift
Data life cycle model composed of 6 systems.
9/13
G. Fedak() Active Data March 12, 2015
Error Detection & Recovery
10/13
G. Fedak() Active Data March 12, 2015
Example scenario
Recover from system-wide errors: faulty acquired ïŹles are detected only
after Swift fails to process them.
In this situation, the user manually:
Drops the whole dataset
Removes any associated ïŹle and metadata
Re-acquire the dataset using the same parameters
E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent
Use-case: APS data life cycle model
Created Start transfer
Terminated
End
Detector
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
End transfer
Globus transfer
Created
End
Terminated
Start transfer
Shared storage
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
Start Swift
Globus transfer
CreatedExtract
Update
TerminatedRemove
Globus Catalog
Created
Initialize
Set
End
Failure
Terminated
Derive
Swift
Data life cycle model composed of 6 systems.
Avalon June 3rd, 2014 22/30
Active Data Client
Failure
and
Recovery
Handler
remove metadata from the catalog
Filters
data
likely to
fail
Tagger
Active Data Client
Guard
Handler
run the Globus
catalog UI scripts
11/13
G. Fedak() Active Data March 12, 2015
Handler Code
TransitionHandler handler = new TransitionHandler () {
public void handler(Transition t, boolean isLocal , Token [] inTokens , Token [] outTokens) {
// Get the dataset identifier
LifeCycle lc = ad. getLifeCycle (inTokens [0]);
datasetId = lc.getTokens("Shared storage.Created")[0]. getUid ();
// Remove the dataset annotations from the catalog
String url = "https :// catalog.globus.org/dataset/" + datasetId;
Runtime r = Runtime.getRuntime ();
Process p = r.exec(" catalog_client .py remove " + url);
p.waitFor ();
// Locally , remove the datasets
String path = "~/aps/" + datasetId;
FileUtils. deleteDirectory (new File(path));
// Publish the " Detector .End"
Token root = lc.getTokens("Detector.Created")[0];
ad. publishTransition ("Detector.End", lc);
// Notify the user
sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId);
}
};
HandlerGuard guard = new HandlerGuard () {
public boolean accept ( Transition t , Token [] inTokens , Token [] outTokens ) {
return input [0]. hasTag(" f a i l u r e c o r r u p t e d ");
}}
ad.subscribeTo("Swift.Failure", handler , guard);
12/13
G. Fedak() Active Data March 12, 2015
Conclusion
Active Data
allows to expose Data Life Cycle across heterogeneous systems and
infrastructures
transition-based programming model for DLC management
application
Monitoring, automation, error detection & recovery
X-systems optimizations: incremental computing, data staging,
caching, throttling etc. . .
Perspectives :
Use AD to deploy data management software stack on IaaS (Asma
Ben Cheick, Heithem Abbes, Univ. Tunis)
Big Data Apache stack X-optimization (H. He, CAS, Beijing)
Volunteer & crowd computing (M. Moca, BBU, Romania)
13/13
G. Fedak() Active Data March 12, 2015
Thank you!
Questions?

Weitere Àhnliche Inhalte

Was ist angesagt?

SnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationSnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationNicolas FrÀnkel
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...National Institute of Informatics
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKANChengjen Lee
 
Revamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation toolRevamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation toolBharath Nunepalli
 
Redirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OSRedirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OSBharath Nunepalli
 
balloon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of servicesballoon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of servicesKai Schlegel
 
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Informationballoon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference InformationKai Schlegel
 
Introduction to R2DBC
Introduction to R2DBCIntroduction to R2DBC
Introduction to R2DBCRob Hedgpeth
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesTanu Malik
 
Thomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaimThomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaimRepository Fringe
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
JOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen CacheJOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen CacheNicolas FrÀnkel
 
Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01jade_22
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science ServicesIan Foster
 
Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)Mark Ivan Ligason
 
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...Anirudh Prabhu
 
Use of Open Data in Hong Kong
Use of Open Data in Hong KongUse of Open Data in Hong Kong
Use of Open Data in Hong KongSammy Fung
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 

Was ist angesagt? (20)

SnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationSnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy application
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN
 
Revamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation toolRevamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation tool
 
Redirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OSRedirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OS
 
balloon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of servicesballoon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of services
 
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Informationballoon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
 
Introduction to R2DBC
Introduction to R2DBCIntroduction to R2DBC
Introduction to R2DBC
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging Services
 
Thomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaimThomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaim
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
JOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen CacheJOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen Cache
 
Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science Services
 
Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)
 
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
 
Use of Open Data in Hong Kong
Use of Open Data in Hong KongUse of Open Data in Hong Kong
Use of Open Data in Hong Kong
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 

Andere mochten auch

SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...Gilles Fedak
 
Active Data PDSW'13
Active Data PDSW'13Active Data PDSW'13
Active Data PDSW'13Gilles Fedak
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data CenterGilles Fedak
 
Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsGilles Fedak
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesGilles Fedak
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingGilles Fedak
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetGilles Fedak
 
Information Management Life Cycle
Information Management Life CycleInformation Management Life Cycle
Information Management Life CycleCollabor8now Ltd
 

Andere mochten auch (8)

SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
 
Active Data PDSW'13
Active Data PDSW'13Active Data PDSW'13
Active Data PDSW'13
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data Center
 
Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, Optimizations
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud Computing
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the Internet
 
Information Management Life Cycle
Information Management Life CycleInformation Management Life Cycle
Information Management Life Cycle
 

Ähnlich wie Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.pptpadalamail
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Presentation for use
Presentation   for usePresentation   for use
Presentation for usebolu804
 
Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010Claire Stewart
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
 
Presentation
PresentationPresentation
Presentationbolu804
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft DryadColin Clark
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...Raffaele Montella
 
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...Andreas Schreiber
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunk
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Jeffrey T. Pollock
 
Presentation for slideshare
Presentation   for slidesharePresentation   for slideshare
Presentation for slidesharebolu804
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunk
 
Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Paris Carbone
 
20090701 Climate Data Staging
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data StagingHenning Bergmeyer
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolThierry Badard
 
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and ConfigurationIOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and ConfigurationBobby Curtis
 

Ähnlich wie Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures (20)

60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Presentation for use
Presentation   for usePresentation   for use
Presentation for use
 
Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
Presentation
PresentationPresentation
Presentation
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
 
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Presentation for slideshare
Presentation   for slidesharePresentation   for slideshare
Presentation for slideshare
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...
 
20090701 Climate Data Staging
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data Staging
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL tool
 
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and ConfigurationIOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
 

KĂŒrzlich hochgeladen

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Christopher Logan Kennedy
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 

KĂŒrzlich hochgeladen (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 

Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

  • 1. Active Data: Data Life Cycle Management Across Heterogeneous Systems and Infrastructures Anthony Simonet, Gilles Fedak (INRIA) Matei Ripeanu, Samer Al-Kiswany (UCB) Kyle Chard, Ian Foster (ANL/UC) Hot Topics in High-Performance Distributed Computing Workshop IBM Almaden Research Center San Jose, California March 12, 2015 1/13 G. Fedak() Active Data March 12, 2015
  • 2. Big Data ... Huge and growing volume of information originating from multiple sources. !"#$%&'("$#) *+'&,-./"#)0+1)*2+("2() 34(")5-$-)!"$(%"($) . . . or Big Bottlenecks ? how to scale the infrastructure ? end-to-end performance improvement, inter-system optimization. how to improve productivity of data-intensive scientist ? data-oriented programming language, data quality, improve automation and errors recovery 2/13 G. Fedak() Active Data March 12, 2015
  • 3. Data Life Cycle DeïŹnition Data Life Cycle (DLC) is the course of operational stages through which data pass from the time when they enter a set of systems to the time when they leave it. !"#$%&%'()* +,-.,("-&&%)/* 01(,2/-* !)234&%&* !)234&%&* Challenges : Expose high level view DLC across distributed systems and infrastructures Expose interactions between the infrastructure and the DLC (e.g failures) 3/13 G. Fedak() Active Data March 12, 2015
  • 4. Active Data Active Data: Allow to reason about data sets handled by heterogeneous software and infrastructures. A formal model that captures the essential life cycle stages and properties: creation, deletion, faults, replication, error checking . . . programming model to develop easily data life cycle management applications. Allows legacy systems to expose their intrinsic data life cycle. 4/13 G. Fedak() Active Data March 12, 2015
  • 5. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations ‱ Created t1 Written t2 Read t3 t4 Terminated Each token has a unique identiïŹer, corresponding to the actual data item’s. 5/13 G. Fedak() Active Data March 12, 2015
  • 6. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations Created t1 ‱ Written t2 Read t3 t4 Terminated A transition is ïŹred whenever a data state changes. 5/13 G. Fedak() Active Data March 12, 2015
  • 7. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations Created t1 ‱ Written t2 Read t3 t4 Terminated public void handler () { computeMD5 (); } Code may be plugged by clients to transitions. It is executed whenever the transition is ïŹred. 5/13 G. Fedak() Active Data March 12, 2015
  • 8. Active Data Framework 6/13 G. Fedak() Active Data March 12, 2015 Life Cycle View File transferFile Dataset Metadata Guard Code Execution }Tagged Tokens NotiïŹcation Framework features: Captures data events in legacy systems High-level life cycle-centered view of data Single namespace for all the ïŹles, datasets and metadata Powerful ïŹlters based on Data Tags Install Taggers on Transitions Guarded Transitions : only executes on token which have speciïŹc tags. Publish/subscribe transitions Custom user reaction to data progress Custom code execution Custom notiïŹcations (twitter, email, gdoc, ifttt . . . )
  • 9. Use Case: Advanced Photon Source Globus Catalog Globus Detector Local Storage Compute Cluster 1. Local Transfer 2. Extract Metadata 3. Globus Transfer 4. Swift Parallel Analysis 3 to 5 TB of data per week on this detector Raw data are pre-processed and registered in the Globus Catalog : Data are curated by several applications Data are shared amongst scientiïŹc user 7/13 G. Fedak() Active Data March 12, 2015
  • 10. Data Surveillance Framework 4 goals (that would otherwise require a lot of scripting and hacking): Monitoring Data Set Progress Better Automation Sharing & NotiïŹcation Error Discovery & Recovery 8/13 G. Fedak() Active Data March 12, 2015
  • 11. APS Data Life Cycle Model Created Start transfer Terminated End Detector Created SuccessFailure SucceededFailed EndEnd Terminated End transfer Globus transfer Created End Terminated Start transfer Shared storage Created SuccessFailure SucceededFailed EndEnd Terminated Start Swift Globus transfer CreatedExtract Update TerminatedRemove Globus Catalog Created Initialize Set End Failure Terminated Derive Swift Data life cycle model composed of 6 systems. 9/13 G. Fedak() Active Data March 12, 2015
  • 12. Error Detection & Recovery 10/13 G. Fedak() Active Data March 12, 2015 Example scenario Recover from system-wide errors: faulty acquired ïŹles are detected only after Swift fails to process them. In this situation, the user manually: Drops the whole dataset Removes any associated ïŹle and metadata Re-acquire the dataset using the same parameters
  • 13. E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent Use-case: APS data life cycle model Created Start transfer Terminated End Detector Created SuccessFailure SucceededFailed EndEnd Terminated End transfer Globus transfer Created End Terminated Start transfer Shared storage Created SuccessFailure SucceededFailed EndEnd Terminated Start Swift Globus transfer CreatedExtract Update TerminatedRemove Globus Catalog Created Initialize Set End Failure Terminated Derive Swift Data life cycle model composed of 6 systems. Avalon June 3rd, 2014 22/30 Active Data Client Failure and Recovery Handler remove metadata from the catalog Filters data likely to fail Tagger Active Data Client Guard Handler run the Globus catalog UI scripts 11/13 G. Fedak() Active Data March 12, 2015
  • 14. Handler Code TransitionHandler handler = new TransitionHandler () { public void handler(Transition t, boolean isLocal , Token [] inTokens , Token [] outTokens) { // Get the dataset identifier LifeCycle lc = ad. getLifeCycle (inTokens [0]); datasetId = lc.getTokens("Shared storage.Created")[0]. getUid (); // Remove the dataset annotations from the catalog String url = "https :// catalog.globus.org/dataset/" + datasetId; Runtime r = Runtime.getRuntime (); Process p = r.exec(" catalog_client .py remove " + url); p.waitFor (); // Locally , remove the datasets String path = "~/aps/" + datasetId; FileUtils. deleteDirectory (new File(path)); // Publish the " Detector .End" Token root = lc.getTokens("Detector.Created")[0]; ad. publishTransition ("Detector.End", lc); // Notify the user sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId); } }; HandlerGuard guard = new HandlerGuard () { public boolean accept ( Transition t , Token [] inTokens , Token [] outTokens ) { return input [0]. hasTag(" f a i l u r e c o r r u p t e d "); }} ad.subscribeTo("Swift.Failure", handler , guard); 12/13 G. Fedak() Active Data March 12, 2015
  • 15. Conclusion Active Data allows to expose Data Life Cycle across heterogeneous systems and infrastructures transition-based programming model for DLC management application Monitoring, automation, error detection & recovery X-systems optimizations: incremental computing, data staging, caching, throttling etc. . . Perspectives : Use AD to deploy data management software stack on IaaS (Asma Ben Cheick, Heithem Abbes, Univ. Tunis) Big Data Apache stack X-optimization (H. He, CAS, Beijing) Volunteer & crowd computing (M. Moca, BBU, Romania) 13/13 G. Fedak() Active Data March 12, 2015