Empowering R Scientific Applications with Pegasus WMS
Samrat Jha, Rafael Ferreira da Silva, Ewa Deelman, Karan Vahi, Mats Rynge, Rajiv Mayani
University of Southern California / Information Sciences Institute
• Pegasus is a system for mapping and executing
abstract application workflows over a range of
execution environments.
• The same abstract workflow can, at different times,
be mapped onto different execution environments such
as XSEDE, OSG, commercial and academic clouds,
campus grids, and clusters.
• Pegasus can easily scale both the size of the
workflow and the resources that the workflow is
distributed over. Pegasus runs workflows ranging
from just a few computational tasks up to one million.
• Pegasus Workflow Management System (WMS)
consists of three main components: the Pegasus
Mapper, HTCondor DAGMan, and the HTCondor
Schedd.
Overview
Acknowledgments:
• Pegasus WMS is funded by the National Science Foundation OCI SDCI program grant #1148515.
• HTCondor: Miron Livny, Kent Wenger, University of Wisconsin-Madison
• LIGO: Larne Pekowsky, Duncan Brown, Alexander H. Nitz, and others from the LIGO CBC group
http://pegasus.isi.edu
DAX Generator API
Easy-to-use APIs in Python,
Java, Perl, and now R to
generate an abstract
workflow describing the
user's computation.
Above is a simple two-node
hello-world example in R.
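The idea behind a DAX generator API can be sketched as follows. This is an illustrative stand-in, not the actual Pegasus API (the real Python/Java/Perl/R bindings differ in names and detail): jobs reference files only by logical names, and the builder serializes the whole graph to XML.

```python
# Illustrative sketch of a DAX-generator-style API (hypothetical class names;
# the real Pegasus APIs differ in detail).
import xml.etree.ElementTree as ET

class AbstractWorkflow:
    def __init__(self, name):
        self.root = ET.Element("adag", name=name)

    def add_job(self, job_id, transformation, uses=()):
        job = ET.SubElement(self.root, "job", id=job_id, name=transformation)
        for lfn, link in uses:  # files are referenced only by logical names
            ET.SubElement(job, "uses", file=lfn, link=link)
        return job_id

    def add_dependency(self, parent, child):
        ET.SubElement(self.root, "child", ref=child).append(
            ET.Element("parent", ref=parent))

    def to_xml(self):
        return ET.tostring(self.root, encoding="unicode")

# Two-node hello/world example, analogous to the poster's R snippet
wf = AbstractWorkflow("hello_world")
wf.add_job("ID1", "hello", uses=[("f.a", "input"), ("f.b", "output")])
wf.add_job("ID2", "world", uses=[("f.b", "input"), ("f.c", "output")])
wf.add_dependency("ID1", "ID2")
print(wf.to_xml())
```

Note that nothing here mentions physical paths or execution sites; that information is supplied later, at mapping time.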
Abstract Workflow (DAX)
The abstract workflow rendered as XML.
It only captures the computations the
user wants to do and is devoid of any
physical paths. Input and output files are
identified by logical identifiers. This
representation is portable between
different execution environments.
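For reference, a two-node hello/world workflow rendered as a DAX might look roughly like this (a hand-written sketch loosely following the DAX 3 schema; the exact namespace version and attributes in generated files may differ):

```xml
<adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.4" name="hello_world">
  <job id="ID0000001" name="hello">
    <uses name="f.a" link="input"/>
    <uses name="f.b" link="output"/>
  </job>
  <job id="ID0000002" name="world">
    <uses name="f.b" link="input"/>
    <uses name="f.c" link="output"/>
  </job>
  <child ref="ID0000002">
    <parent ref="ID0000001"/>
  </child>
</adag>
```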
Abstract to Executable Workflow (Condor DAG) Mapping
The DAX is passed to the Pegasus Mapper, which generates an
HTCondor DAGMan workflow that can be run on actual
resources.
The above example highlights the addition of data movement
nodes to stage in the input data and stage out the output
data; data cleanup nodes to remove data that is
no longer required; and registration nodes to catalog
output data locations for future discovery.
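The core of this augmentation step can be illustrated with a toy planner. This is a simplification for intuition only (the real Pegasus Mapper also performs site selection, transformation lookup, and clustering); all function and node names here are made up:

```python
# Toy sketch of abstract-to-executable mapping: add stage-in, stage-out,
# registration, and cleanup nodes around the compute jobs.
def plan(jobs):
    """jobs: list of (job_id, inputs, outputs) with logical file names.
    Returns a flat list of (node_name, node_type) executable nodes."""
    produced = {f for _, _, outs in jobs for f in outs}
    consumed = {f for _, ins, _ in jobs for f in ins}
    executable = []
    for job_id, ins, outs in jobs:
        # stage in any input that no upstream job produces
        for f in ins:
            if f not in produced:
                executable.append(("stage_in_" + f, "transfer"))
        executable.append((job_id, "compute"))
    for f in produced:
        if f not in consumed:
            # final outputs: stage out and register for future discovery
            executable.append(("stage_out_" + f, "transfer"))
            executable.append(("register_" + f, "registration"))
        else:
            # intermediates: clean up once no longer required
            executable.append(("cleanup_" + f, "cleanup"))
    return executable
```

For a two-job chain reading `f.a` and producing `f.c` via `f.b`, this yields a stage-in node for `f.a`, a cleanup node for `f.b`, and stage-out plus registration nodes for `f.c`, mirroring the poster's figure.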
Workflow Design and Mapping
EPA Workflow
Annotating Genomic Variants – Basic Bioconductor Workflow
Bioconductor: Bioconductor is an open source software project that
provides tools for the analysis and comprehension of high-throughput
genomic data. It is based primarily on the R programming language.
Description: This workflow focused on variants located in the Transient
Receptor Potential Vanilloid gene family on chromosome 17. The single
provided script was partitioned into separate jobs to allow for easy cross-site
runs.
Cross-Site Run: A single workflow can be executed on multiple sites, with
Pegasus taking care of the data movement between the sites.
Monitoring and Debugging
At runtime, a database is populated with workflow and
task runtime provenance, including which software was
used and with what parameters, execution environment,
runtime statistics and exit status.
Pegasus comes with command-line monitoring and
debugging tools. A web dashboard now allows users to
monitor their running workflows and check job status
and output.
A PMC job is expressed as a DAG, and PMC uses the MPI master-worker paradigm to farm out individual tasks
to worker nodes. PMC acts as a scheduler and considers the core and memory requirements of the tasks when
making scheduling decisions. PMC can be easier to set up than pilot jobs.
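The resource-aware placement PMC performs can be pictured with a minimal greedy sketch. This is only an illustration of the idea, assuming a first-fit policy; the real PMC runs over MPI and is considerably more sophisticated:

```python
# Toy illustration of PMC-style master-worker scheduling: assign each task
# to the first worker with enough free cores and memory (first-fit; the
# actual PMC scheduler is more elaborate).
def assign(tasks, workers):
    """tasks: list of (name, cores, mem_mb).
    workers: list of [free_cores, free_mem_mb], mutated as tasks are placed.
    Returns {task_name: worker_index} for the tasks that fit."""
    placement = {}
    for name, cores, mem in tasks:
        for i, w in enumerate(workers):
            if w[0] >= cores and w[1] >= mem:
                w[0] -= cores   # reserve cores on this worker
                w[1] -= mem     # reserve memory on this worker
                placement[name] = i
                break
    return placement
```

With two 4-core/8 GB workers, a 4-core task is steered away from a worker already half-occupied by a 2-core task, which is exactly the kind of decision the poster attributes to PMC.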
Advanced Bioconductor Workflow
Basic EPA workflow run on Pegasus:
• Scripts obtained from EPA’s website:
https://www3.epa.gov/caddis/da_software_rscript12.html
• Three sample scripts were obtained and collected into a simple
split-workflow form.
• The goal of the workflow was to run the three scripts, which
fetch input files from EPA's website, scrape them for the
necessary information, perform the analysis, and store the results in
output files.
• The entire run was executed in the Pegasus environment with no
further involvement from the user; it was completely automated.
Basic Bioconductor Workflow
Workflow Challenges
• Bioconductor provides its
workflows as one contiguous
script, which is very expensive to
execute all at once.
• Simple errors in the script will
break the entire execution.
Solution:
• The workflow is partitioned into
independent sub-graphs, which
are submitted as self-contained
Pegasus MPI Cluster (PMC) jobs to
the remote sites.
• The jobs can now run in parallel
on separate clusters, allowing for
much more efficient execution
and preventing errors in one
sub-graph from affecting the rest
of the independently running workflow.
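The partitioning step above amounts to splitting the task dependency graph into its connected pieces. A minimal sketch of that idea, treating dependencies as undirected for grouping (function names are hypothetical):

```python
# Sketch of workflow partitioning: split a task dependency graph into
# independent (weakly) connected components, each of which could be
# submitted as a self-contained PMC job.
from collections import defaultdict

def components(edges, nodes):
    """edges: list of (parent, child); nodes: all task names.
    Returns a list of sets, one per independent sub-graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, part = [n], set()
        while stack:                 # iterative depth-first search
            v = stack.pop()
            if v in part:
                continue
            part.add(v)
            stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts
```

An error inside one component then only aborts that component's PMC job; the other sub-graphs keep running independently.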