Oozie: Scheduling Workflows on the Grid

Mohammad K Islam
kamrul@yahoo-inc.com

Agenda

•  Oozie Overview
•  Oozie 3.x features:
   –  Bundle
   –  Scalability
   –  Usability
•  Challenges
•  Future Plan
•  Q&A


Overview: Workflow

•  Oozie executes a workflow defined as a DAG of jobs.
•  Job types include Map-Reduce, Pipes, Streaming, Pig, custom Java code, etc.
•  Introduced in Oozie 1.x.

[Figure: example workflow DAG. Nodes: start, M/R job, fork, M/R streaming job, Pig job, join, decision (branches MORE and ENOUGH), M/R job, FS job, Java job, end.]
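To make the "DAG of jobs" idea concrete, here is a minimal Python sketch (3.9 or later, for `graphlib`). The node names follow the figure, but the edges are an assumption made for illustration, and the MORE branch with its M/R job is left out because its wiring cannot be recovered from the slide. It is not how Oozie itself schedules nodes.

```python
from graphlib import TopologicalSorter

# Each key maps a node to the set of nodes that must finish before it.
# Hypothetical wiring: only the node names come from the slide's figure.
WORKFLOW = {
    "mr_job": {"start"},
    "fork": {"mr_job"},
    "streaming_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"streaming_job", "pig_job"},
    "decision": {"join"},
    "fs_job": {"decision"},      # assumed to sit on the ENOUGH branch
    "java_job": {"fs_job"},
    "end": {"java_job"},
}


def execution_order(graph):
    """Return one valid run order in which every node follows its predecessors."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(WORKFLOW)
```

The two jobs between fork and join have no ordering between them, which is what allows them to run in parallel.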

Overview: Coordinator

•  Oozie executes workflows based on:
   –  Time dependency (frequency)
   –  Data dependency
•  Introduced in Oozie 2.x.


[Figure: architecture diagram. Labels: Oozie Client, WS API, Oozie Server, Oozie Coordinator, Oozie Workflow, Check Data Availability, Hadoop.]
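The two dependency types can be sketched as a readiness check. This is an illustrative Python sketch, not Oozie's implementation: the function names, the dates, the paths, and the in-memory set standing in for a file-system lookup are all assumptions.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Yield the nominal time of each coordinator action in [start, end)."""
    current = start
    while current < end:
        yield current
        current += frequency


def is_ready(nominal_time, now, required_paths, existing_paths):
    """An action is ready once its time has come and all of its inputs exist."""
    time_ok = now >= nominal_time                                   # time dependency
    data_ok = all(path in existing_paths for path in required_paths)  # data dependency
    return time_ok and data_ok


start = datetime(2011, 6, 1, 0, 0)
times = list(nominal_times(start, start + timedelta(hours=3), timedelta(hours=1)))

# Hypothetical stand-in for checking which input directories exist.
existing = {"/data/2011/06/01/00"}

ready = is_ready(times[0], start + timedelta(minutes=5),
                 ["/data/2011/06/01/00"], existing)
blocked = is_ready(times[1], start + timedelta(hours=1),
                   ["/data/2011/06/01/01"], existing)
```

Here the first action runs because both conditions hold, while the second has reached its nominal time but still waits for its input directory.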

Oozie 3.x: Bundle

•  Users can define and execute a group of coordinator applications.
•  Users can start/stop/suspend/resume/rerun at the bundle level.
•  Benefit: makes large data pipeline applications easy to maintain and control for the Service Engineering team.

[Figure: architecture diagram. Labels: Oozie Client, WS API, Oozie Server, Bundle, Coordinator, Workflow, Check Data Availability, Hadoop.]

Oozie Abstraction Layers

Layer 1: Bundle
Layer 2: Coord Job 1 (Coord Action 1, Coord Action 2) and Coord Job 2 (Coord Action 1, Coord Action 2)
Layer 3: WF Job 1 (one per action of Coord Job 1) and WF Job 2 (one per action of Coord Job 2), running M/R, PIG and FS jobs
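Bundle-level control over the layers above can be sketched as follows. This is a simplified Python illustration under assumed names (`Bundle`, `Coordinator`, the job names) and with only two of the states; it is not Oozie's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    name: str
    status: str = "RUNNING"


@dataclass
class Bundle:
    name: str
    coordinators: list = field(default_factory=list)

    def _set_all(self, from_status, to_status):
        # Only touch coordinators that are currently in the expected state.
        for coord in self.coordinators:
            if coord.status == from_status:
                coord.status = to_status

    def suspend(self):
        """One bundle-level call suspends every running coordinator."""
        self._set_all("RUNNING", "SUSPENDED")

    def resume(self):
        """One bundle-level call resumes every suspended coordinator."""
        self._set_all("SUSPENDED", "RUNNING")


bundle = Bundle("pipeline", [Coordinator("coord-job-1"), Coordinator("coord-job-2")])
bundle.suspend()
suspended = [c.status for c in bundle.coordinators]
bundle.resume()
resumed = [c.status for c in bundle.coordinators]
```

The point of the design is that an operator issues one command per pipeline instead of one per coordinator.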
Enhanced Stability and Scalability

•  Issue:
   –  At very high load, Oozie becomes slow.
   –  This accounts for 90% of all Oozie support incidents.
•  Reason:
   –  Many active but non-progressing jobs.
   –  The Oozie internal queue is full.
•  Resolution:
   –  Throttle the number of active jobs per coordinator.
   –  Put non-progressing jobs into a timeout state.
   –  Enforce uniqueness of Oozie queue elements.
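Two of the resolutions above, queue-element uniqueness and throttling, can be sketched in a few lines of Python. The class and function names, the key format, and the size limit are assumptions chosen for illustration; Oozie's internal queue is not shown in the slides.

```python
from collections import deque


class UniqueQueue:
    """FIFO queue that rejects an element whose key is already queued."""

    def __init__(self, max_size):
        self._items = deque()
        self._keys = set()
        self._max_size = max_size

    def offer(self, key, command):
        """Queue the command; return False if it is a duplicate or the queue is full."""
        if key in self._keys or len(self._items) >= self._max_size:
            return False
        self._keys.add(key)
        self._items.append((key, command))
        return True

    def poll(self):
        """Remove and return the oldest command, or None when the queue is empty."""
        if not self._items:
            return None
        key, command = self._items.popleft()
        self._keys.discard(key)
        return command


def can_materialize(active_actions, throttle):
    """Throttle: allow a new action only while the coordinator is under its cap."""
    return active_actions < throttle


queue = UniqueQueue(max_size=2)
results = [
    queue.offer("job-1:check", "check job-1"),
    queue.offer("job-1:check", "check job-1"),  # duplicate key, rejected
    queue.offer("job-2:check", "check job-2"),
    queue.offer("job-3:check", "check job-3"),  # queue full, rejected
]
```

Rejecting duplicates keeps a non-progressing job from filling the queue with repeated copies of the same command.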


Improved Usability

•  Issue:
   –  The coordinator job's status is not intuitive and confuses Oozie users.
•  Reason:
   –  Status SUCCEEDED doesn't mean the job was successful!
   –  Status PREMATER is for Oozie internal use only, but it was exposed to users.
•  Resolution:
   –  Redesign the coordinator status.

Coordinator Status Redesign

Current states: PREP, PREMATER, RUNNING, SUCCEEDED, SUSPENDED, KILLED, FAILED
New states: PREP, RUNNING, PAUSED, SUSPENDED, SUCCEEDED, DONE_WITH_ERROR, KILLED, FAILED
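The new state set can be written down as an enumeration. The state names below come from the slide; the rule in `final_status` is an assumption made for illustration (the slide does not spell out the transitions), showing how a DONE_WITH_ERROR state lets SUCCEEDED mean that every action actually succeeded.

```python
from enum import Enum


class CoordStatus(Enum):
    PREP = "PREP"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    SUCCEEDED = "SUCCEEDED"
    DONE_WITH_ERROR = "DONE_WITH_ERROR"
    KILLED = "KILLED"
    FAILED = "FAILED"


def final_status(action_outcomes):
    """Assumed rule: SUCCEEDED only if every action succeeded, else DONE_WITH_ERROR."""
    if not action_outcomes:
        raise ValueError("a finished coordinator job needs at least one action")
    if all(outcome == "SUCCEEDED" for outcome in action_outcomes):
        return CoordStatus.SUCCEEDED
    return CoordStatus.DONE_WITH_ERROR


clean = final_status(["SUCCEEDED", "SUCCEEDED"])
mixed = final_status(["SUCCEEDED", "KILLED", "SUCCEEDED"])
```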

The Second Year ...

•  Number of releases:
   –  Feature releases: 3
   –  Patches: 9
•  Backward compatibility is strongly maintained.
•  No need to resubmit jobs if Oozie is restarted.
•  Code overhaul:
   –  Redesigned the command pattern to avoid DB connection leaks and to improve DB connection usage.

Oozie Usage

•  Y! internal usage:
   –  Total number of users: 377
   –  Total number of processed jobs ≈ 600K/month
•  External downloads:
   –  1500+ in the last 8 months from GitHub
   –  A large number of downloads via packaging maintained by third parties.



Oozie Usage Cont.

•  User community:
   –  Membership
      •  Y! internal: 265
      •  External: 163
   –  Messages (approximate):
      •  Y! internal: 9/day
      •  External: 7/day

Challenges 1: Data Availability Check

•  Issue:
   –  Currently checks the directory every minute (polling based).
   –  Increases NameNode (NN) overhead and does not scale well.
•  Reason: No metadata system with an appropriate notification mechanism.
•  Planned resolution: Integrate with the HCatalog metadata system.
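The scaling problem with polling, and the shape of a notification-based alternative, can be sketched in Python. Everything here is hypothetical: the registry class, the path, and the figure of 1000 pending dependencies are illustrative, and none of it is the HCatalog API.

```python
from collections import defaultdict


def polling_checks(pending_dependencies, minutes_waited):
    """Polling: one directory check per pending dependency per minute."""
    return pending_dependencies * minutes_waited


class NotificationRegistry:
    """Hypothetical push model: actions wait on a path until it is announced."""

    def __init__(self):
        self._waiting = defaultdict(list)

    def wait_for(self, path, action_id):
        self._waiting[path].append(action_id)

    def notify(self, path):
        """Called once when the path becomes available; returns the released actions."""
        return self._waiting.pop(path, [])


# 1000 pending dependencies that each wait an hour cost 60,000 checks when polled.
checks = polling_checks(pending_dependencies=1000, minutes_waited=60)

registry = NotificationRegistry()
registry.wait_for("/data/2011/06/01", "action-1")
registry.wait_for("/data/2011/06/01", "action-2")
released = registry.notify("/data/2011/06/01")
```

With notifications, the cost is one event per dataset arrival regardless of how long the actions waited.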


Challenges 2: Adaptability to Hadoop

•  Issue: If the Hadoop NameNode (NN) or JobTracker (JT) is down, Oozie still submits the job, which then fails. User intervention is required when the Hadoop server is back.
•  Impact: Inconvenient for Oozie users. For example, if Hadoop is restarted on Friday night, the job will not run until next Monday.
•  Planned resolution: Graceful handling of Hadoop downtime:
   –  If Hadoop is down, block submission.
   –  When Hadoop becomes available:
      •  Submit the blocked jobs.
      •  Auto-resubmit the untraced jobs.
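The block-then-submit part of the planned resolution can be sketched as a small gate in front of job submission. This Python sketch uses assumed names (`SubmissionGate`, the job ids) and does not cover auto-resubmission of untraced jobs, which would need job-tracking state the slides do not describe.

```python
from collections import deque


class SubmissionGate:
    """Holds submissions while the cluster is down and flushes them on recovery."""

    def __init__(self, submit):
        self._submit = submit        # callable that actually sends a job
        self._blocked = deque()
        self.hadoop_up = True

    def request(self, job_id):
        if self.hadoop_up:
            self._submit(job_id)
        else:
            self._blocked.append(job_id)   # block instead of submitting and failing

    def on_hadoop_down(self):
        self.hadoop_up = False

    def on_hadoop_up(self):
        """Submit the blocked jobs in their original order."""
        self.hadoop_up = True
        while self._blocked:
            self._submit(self._blocked.popleft())


submitted = []
gate = SubmissionGate(submitted.append)
gate.request("job-1")
gate.on_hadoop_down()
gate.request("job-2")
gate.request("job-3")
during_outage = list(submitted)
gate.on_hadoop_up()
```

In the Friday-night example, the blocked jobs would go out as soon as Hadoop returns instead of waiting for someone to resubmit them on Monday.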


Challenges 3: Horizontally Scalable

•  Issue: One instance of Oozie cannot efficiently handle a very large number of jobs (say, 100K per hour). In addition, Oozie doesn't support load balancing.
•  Reason: The Oozie internal task queue is not synchronized across multiple Oozie instances.
•  Planned resolution: Use ZooKeeper for coordination.
•  Benefit: As the load increases, add extra Oozie servers.

Future Plan

•  Automatic failover: Using ZooKeeper.
•  Monitoring: Rich WS API for application monitoring/alerting.
•  Improved usability:
   –  DistCp action
   –  Hive action
•  Asynchronous data processing.
•  Incremental data processing.
•  Apache migration: Work initiated.


Q&A

•  GitHub link: http://yahoo.github.com/oozie
•  Mailing list: Oozie-users@yahoogroups.com

Mohammad K Islam
kamrul@yahoo-inc.com

Backup Slides

Oozie Workflow Application

•  Contents
   –  A workflow.xml file
   –  Resource files, config files and Pig scripts
   –  All necessary JAR and native library files
•  Parameters
   –  The workflow.xml is parameterized; parameters can be propagated to map-reduce, pig & ssh jobs
•  Deployment
   –  In a directory in the HDFS of the Hadoop cluster where the Hadoop & Pig jobs will run
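Parameterization can be illustrated with placeholder substitution. This Python sketch uses `string.Template`, whose `${name}` placeholders resemble the parameters in a workflow definition; the XML fragment and property names are hypothetical and not a complete or validated workflow.xml, and Oozie's own resolution mechanism is not shown here.

```python
from string import Template

# Hypothetical fragment: the element and property names are illustrative only.
FRAGMENT = Template(
    "<property><name>mapred.input.dir</name><value>${input}</value></property>\n"
    "<property><name>mapred.output.dir</name><value>${output}</value></property>"
)


def resolve(template, parameters):
    """Replace every ${name} placeholder; raises KeyError if one is missing."""
    return template.substitute(parameters)


resolved = resolve(
    FRAGMENT,
    {"input": "/data/2008/input", "output": "/data/2008/output"},
)
```

The `input` and `output` values mirror the job parameters passed on the command line when a workflow job is started.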




Oozie: Running a Workflow Job

Workflow Application Deployment

    $ hadoop fs -mkdir hdfs://bar.corp:9000/usr/tucu/wordcount-wf
    $ hadoop fs -mkdir hdfs://bar.corp:9000/usr/tucu/wordcount-wf/lib
    $ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://bar.corp:9000/usr/tucu/wordcount-wf
    $ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://bar.corp:9000/usr/tucu/wordcount-wf/lib

Workflow Job Execution

    $ oozie run -o http://foo.corp:8080/oozie \
                -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf \
                input=/data/2008/input output=/data/2008/output
     Workflow job id [1234567890-wordcount-wf]

Workflow Job Status

    $ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
     Workflow job status [RUNNING]
      ...






WordPress Websites for Engineers: Elevate Your Brand
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 

Oozie Summit 2011

  • 1. Oozie: Scheduling Workflows on the Grid
 Mohammad K Islam
 kamrul@yahoo-inc.com

  • 2. Agenda
 •  Oozie Overview
 •  Oozie 3.x features:
 –  Bundle
 –  Scalability
 –  Usability
 •  Challenges
 •  Future Plan
 •  Q&A


  • 3. Overview: Workflow
 •  Oozie executes a workflow defined as a DAG of jobs.
 •  Job types include Map-Reduce, Pipes, Streaming, Pig, custom Java code, etc.
 •  Introduced in Oozie 1.x.
 [Figure: example workflow DAG with the nodes start, M/R job, fork, M/R streaming job, Pig job, join, decision (branches MORE and ENOUGH), M/R job, FS job, Java job, end]
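
 The "DAG of jobs" idea can be illustrated with a small Python sketch. This is not how Oozie defines workflows (it uses workflow.xml); the node names and the execution_waves helper are hypothetical, and decision nodes are left out. It only shows that jobs after a fork can run in parallel and that the join waits for all of them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a node, each value is the set of nodes
# that must finish before it can start (its predecessors in the DAG).
WORKFLOW = {
    "mr_job": {"start"},
    "fork": {"mr_job"},
    "streaming_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"streaming_job", "pig_job"},
    "end": {"join"},
}


def execution_waves(dag):
    """Group nodes into waves; nodes in the same wave can run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = execution_waves(WORKFLOW)
for number, wave in enumerate(WAVES):
    print(number, wave)
```

 In this sketch the two jobs behind the fork land in the same wave, and the join only appears in the following one.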

  • 4. Overview: Coordinator
 •  Oozie executes workflows based on:
 –  Time Dependency (Frequency)
 –  Data Dependency
 •  Introduced in Oozie 2.x.
 [Figure: architecture diagram showing the Oozie Client, the WS API, the Oozie Server (Oozie Coordinator and Oozie Workflow), a "Check Data Availability" link, and Hadoop]
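
 The two kinds of dependency can be sketched in Python as follows. This is a simplified illustration under assumptions, not Oozie's coordinator logic: the function names, the hourly frequency and the idea of identifying each input by its nominal time are all hypothetical.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Time dependency: one action per frequency tick in [start, end)."""
    times = []
    current = start
    while current < end:
        times.append(current)
        current += frequency
    return times


def ready_actions(times, available_inputs, now):
    """Data dependency: an action is ready once its nominal time has passed
    and its input for that time is available."""
    return [t for t in times if t <= now and t in available_inputs]


START = datetime(2011, 1, 1, 0, 0)
TIMES = nominal_times(START, START + timedelta(hours=3), timedelta(hours=1))

# Assume only the first and third hourly inputs have arrived so far.
AVAILABLE = {TIMES[0], TIMES[2]}
READY = ready_actions(TIMES, AVAILABLE, now=START + timedelta(hours=1, minutes=30))
print(READY)
```

 Here the 01:00 action is due but still waits for its data, and the 02:00 action has data but is not yet due, so only the 00:00 action is ready.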

  • 5. Oozie 3.x: Bundle
 •  Users can define and execute a set of coordinator applications.
 •  Users can start/stop/suspend/resume/rerun at the bundle level.
 •  Benefits: Easy for the Service Engineering team to maintain and control large data pipeline applications.
 [Figure: architecture diagram showing the Oozie Client, the WS API, the Oozie Server (Bundle, Coordinator and Workflow), a "Check Data Availability" link, and Hadoop]

  • 6. Oozie Abstraction Layers
 [Figure: three abstraction layers. Layer 1: Bundle. Layer 2: Coord Job 1 and Coord Job 2, with their Coord Actions and the WF Jobs those actions start. Layer 3: the Pig, M/R and FS jobs run by the workflows]
  • 7. Enhanced Stability and Scalability
 •  Issue:
 –  At very high load, Oozie becomes slow.
 –  This accounts for 90% of all Oozie support incidents.
 •  Reason:
 –  Lots of active but non-progressing jobs.
 –  Oozie internal queue is full.
 •  Resolution:
 –  Throttle the number of active jobs per coordinator.
 –  Put the job into timeout state.
 –  Enforce uniqueness of Oozie queue elements.
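
 Two of the resolutions above, throttling and queue uniqueness, can be sketched in Python. This is an illustration under assumptions rather than Oozie's internal queue: the UniqueQueue class, the can_activate helper and the string keys are hypothetical.

```python
from collections import deque


class UniqueQueue:
    """Bounded FIFO queue that refuses a command whose key is already queued."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()
        self._keys = set()

    def offer(self, key, command):
        # Reject duplicates, and refuse new work when the queue is full.
        if key in self._keys or len(self._items) >= self.max_size:
            return False
        self._items.append((key, command))
        self._keys.add(key)
        return True

    def poll(self):
        # Return the oldest command, or None when the queue is empty.
        if not self._items:
            return None
        key, command = self._items.popleft()
        self._keys.discard(key)
        return command

    def __len__(self):
        return len(self._items)


def can_activate(active_count, max_active):
    """Throttle: allow another active job only below the per-coordinator cap."""
    return active_count < max_active


QUEUE = UniqueQueue(max_size=2)
RESULTS = [
    QUEUE.offer("job-1:check", "check job-1"),
    QUEUE.offer("job-1:check", "check job-1"),  # duplicate key, rejected
    QUEUE.offer("job-2:check", "check job-2"),
    QUEUE.offer("job-3:check", "check job-3"),  # queue full, rejected
]
print(RESULTS)
```

 Keeping a set of keys next to the queue makes the duplicate check cheap, so a non-progressing job cannot fill the queue with repeated copies of the same command.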


  • 8. Improved Usability
 •  Issue:
 –  The coordinator job status is not intuitive and confuses Oozie users.
 •  Reason:
 –  Status SUCCEEDED doesn't mean the job is successful!
 –  Status PREMATER is for Oozie internal use only, but it was exposed to users.
 •  Resolution:
 –  Redesign Coordinator status

  • 9. Coordinator Status Redesign
 •  Current: SUSPENDED, KILLED, PREP, PREMATER, RUNNING, SUCCEEDED, FAILED
 •  New: SUSPENDED, KILLED, SUCCEEDED, PREP, RUNNING, DONE_WITH_ERROR, PAUSED, FAILED

  • 10. The Second Year ...
 •  Number of Releases
 –  Feature Releases: 3
 –  Patches: 9
 •  Backward compatibility is strongly maintained.
 •  No need to resubmit the job if Oozie is restarted.
 •  Code Overhaul:
 –  Re-designed the command pattern to avoid DB connection leaks and to improve DB connection usage.

  • 11. Oozie Usage
 •  Y! internal usage:
 –  Total number of users: 377
 –  Total number of processed jobs ≈ 600K/month
 •  External downloads:
 –  1500+ in the last 8 months from GitHub
 –  A large number of downloads via packaging maintained by 3rd parties.



  • 12. Oozie Usage (Cont.)
 •  User Community:
 –  Membership
 •  Y! internal: 265
 •  External: 163
 –  Messages (approximate):
 •  Y! internal: 9/day
 •  External: 7/day

  • 13. Challenge 1: Data Availability Check
 •  Issue:
 –  Oozie currently checks the directory every minute (polling based).
 –  This increases NN overhead and does not scale well.
 •  Reason: No metadata system with an appropriate notification mechanism.
 •  Planned resolution: Integrate with the HCatalog metadata system.


  • 14. Challenge 2: Adaptability to Hadoop
 •  Issue: If the Hadoop NN or JT is down, Oozie still submits the job, which then fails. User intervention is required when the Hadoop server is back.
 •  Impact: Inconvenient for Oozie users. For example, if Hadoop is restarted on Friday night, the job will not run until next Monday.
 •  Planned Resolution: Graceful handling of Hadoop downtime:
 –  If Hadoop is down, block submission.
 –  When Hadoop becomes available:
 •  Submit the blocked job.
 •  Auto-resubmit the untraced job.
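
 The planned "block, then submit when Hadoop is back" behaviour could look roughly like the Python sketch below. It is a sketch under assumptions, not the Oozie design: the Submitter class, its callbacks and the job ids are hypothetical, and the auto-resubmit of untraced jobs is not modelled.

```python
class Submitter:
    """Holds jobs back while the cluster is down instead of letting them fail."""

    def __init__(self, is_cluster_up, submit):
        self._is_cluster_up = is_cluster_up  # callable returning True or False
        self._submit = submit                # callable taking a job id
        self.blocked = []

    def request(self, job_id):
        if self._is_cluster_up():
            self._submit(job_id)
        else:
            self.blocked.append(job_id)  # block instead of submitting

    def on_cluster_available(self):
        # Drain blocked jobs in arrival order once the cluster is reachable.
        pending, self.blocked = self.blocked, []
        for job_id in pending:
            self._submit(job_id)


STATE = {"up": False}
SUBMITTED = []
submitter = Submitter(lambda: STATE["up"], SUBMITTED.append)

submitter.request("job-a")  # cluster down: blocked
submitter.request("job-b")  # cluster down: blocked
BLOCKED_DURING_OUTAGE = list(submitter.blocked)

STATE["up"] = True
submitter.on_cluster_available()  # blocked jobs are submitted now
submitter.request("job-c")        # cluster up: submitted directly
print(SUBMITTED)
```

 With this approach a Friday-night outage only delays the blocked jobs until the cluster returns, with no user intervention.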


  • 15. Challenge 3: Horizontal Scalability
 •  Issue: One instance of Oozie cannot efficiently handle a very large number of jobs (say 100K/hour). In addition, Oozie doesn't support load balancing.
 •  Reason: The Oozie internal task queue is not synchronized across multiple Oozie instances.
 •  Planned Resolution: Use ZooKeeper for coordination.
 •  Benefits: As the load increases, add extra Oozie servers.

  • 16. Future Plan
 •  Automatic Failover: Using ZooKeeper.
 •  Monitoring: Rich WS API for application monitoring/alerting.
 •  Improved Usability:
 –  DistCp action
 –  Hive action
 •  Asynchronous data processing.
 •  Incremental data processing.
 •  Apache Migration: Work initiated.


  • 19. Oozie Workflow Application
 •  Contents
 –  A workflow.xml file
 –  Resource files, config files and Pig scripts
 –  All necessary JAR and native library files
 •  Parameters
 –  The workflow.xml is parameterized; parameters can be propagated to map-reduce, pig & ssh jobs
 •  Deployment
 –  In a directory in the HDFS of the Hadoop cluster where the Hadoop & Pig jobs will run

  • 20. Running a Workflow Job (Oozie cmd)
 Workflow Application Deployment
 $ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf
 $ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf/lib
 $ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://usr/tucu/wordcount-wf
 $ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://usr/tucu/wordcount-wf/lib
 Workflow Job Execution
 $ oozie run -o http://foo.corp:8080/oozie -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf input=/data/2008/input output=/data/2008/output
 Workflow job id [1234567890-wordcount-wf]
 Workflow Job Status
 $ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
 Workflow job status [RUNNING]
 ...