Oozie: Towards a Scalable Workflow Management System for Hadoop

Mohammad Islam and Virag Kothari
Accepted Paper
• Workshop in ACM/SIGMOD, May 2012.
• It is a team effort!

Mohammad Islam       Angelo Huang
Mohamed Battisha     Michelle Chiang
Santhosh Srinivasan  Craig Peters

Andreas Neumann      Alejandro Abdelnur
Presentation Workflow


[Flow diagram: Oozie Tutorial → Design Decisions → Results → Questions? → Address Question → END]
Installing Oozie

Step 1: Download the Oozie tarball
curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz

Step 2: Unpack the tarball
tar -xzvf <PATH_TO_OOZIE_TAR>

Step 3: Run the setup script
bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip

Step 4: Start Oozie
bin/oozie-start.sh

Step 5: Check the status of Oozie
bin/oozie admin -oozie http://localhost:11000/oozie -status
Running an Example

• Standalone Map-Reduce job
    $ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir

• Using Oozie
    [Diagram, "Example DAG": Start → MapReduce wordcount; on OK → End; on ERROR → Kill]
    [Diagram, "Workflow.xml": skeleton of the workflow file with <workflow-app name=..>, <start..>, <action>, <map-reduce>, ……, </workflow-app>]
Example Workflow
<action name='wordcount'>
  <map-reduce>
    <configuration>
      <property>
        <name>mapred.mapper.class</name>
        <value>org.myorg.WordCount.Map</value>
      </property>
      <property>
        <name>mapred.reducer.class</name>
        <value>org.myorg.WordCount.Reduce</value>
      </property>
      <property>
        <name>mapred.input.dir</name>
        <value>/usr/joe/inputDir</value>
      </property>
      <property>
        <name>mapred.output.dir</name>
        <value>/usr/joe/outputDir</value>
      </property>
    </configuration>
  </map-reduce>
</action>

Settings called out next to the XML on the slide:
  mapred.mapper.class  = org.myorg.WordCount.Map
  mapred.reducer.class = org.myorg.WordCount.Reduce
  mapred.input.dir     = inputDir
  mapred.output.dir    = outputDir
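
As a rough illustration of how the four settings map onto the <configuration> block, the following Python sketch generates an equivalent <action> element from a dictionary. The helper name build_wordcount_action is hypothetical and is not part of Oozie; only the property names and values come from the slide, and the input path is written with a leading slash as an assumption.

```python
import xml.etree.ElementTree as ET

# Property names and values as shown on the slide
# (leading slash on the input path is an assumption).
WORDCOUNT_PROPS = {
    "mapred.mapper.class": "org.myorg.WordCount.Map",
    "mapred.reducer.class": "org.myorg.WordCount.Reduce",
    "mapred.input.dir": "/usr/joe/inputDir",
    "mapred.output.dir": "/usr/joe/outputDir",
}


def build_wordcount_action(name, props):
    """Hypothetical helper: build an <action> element mirroring the slide's fragment."""
    action = ET.Element("action", {"name": name})
    map_reduce = ET.SubElement(action, "map-reduce")
    configuration = ET.SubElement(map_reduce, "configuration")
    for key, value in props.items():
        # One <property> with a <name>/<value> pair per setting.
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = key
        ET.SubElement(prop, "value").text = value
    return action


action = build_wordcount_action("wordcount", WORDCOUNT_PROPS)
xml_text = ET.tostring(action, encoding="unicode")
print(xml_text)
```

The sketch reproduces only the fragment shown on the slide; a complete workflow definition contains further elements that the slide elides.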
A Workflow Application
A workflow application has three components:

1) workflow.xml:
   Contains the job definition

2) Libraries:
   Optional 'lib/' directory containing .jar/.so files

3) Properties file:
   • Parameterizes the workflow XML
   • The mandatory property is oozie.wf.application.path
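
To make the role of the properties file concrete, here is a minimal Python sketch that renders a job.properties file and checks that the mandatory key is present. The function names and all values (including the HDFS path and the inputDir/outputDir parameters) are hypothetical; only the key oozie.wf.application.path comes from the slide.

```python
def render_properties(props):
    """Render a dict as simple key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_job_properties(props):
    """Raise ValueError if the mandatory property is missing or empty."""
    if not props.get("oozie.wf.application.path"):
        raise ValueError("oozie.wf.application.path is required")
    return True


# Hypothetical values; only the mandatory key is taken from the slide.
job_props = {
    "oozie.wf.application.path": "hdfs://localhost:9000/user/joe/wordcount-wf",
    "inputDir": "/usr/joe/inputDir",
    "outputDir": "/usr/joe/outputDir",
}
text = render_properties(job_props)
check_job_properties(parse_properties(text))
print(text)
```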
Workflow Submission
Run Workflow Job

  $ oozie job -run -config job.properties -oozie http://localhost:11000/oozie/
  Workflow ID: 00123-123456-oozie-wrkf-W

Check Workflow Job Status

  $ oozie job -info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/
  -----------------------------------------------------------------------
  Workflow Name: test-wf
  App Path: hdfs://localhost:11000/user/your_id/oozie/
  Workflow job status [RUNNING]
   ...
  ------------------------------------------------------------------------
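
The two commands above can be assembled programmatically. The sketch below only builds the argument lists and parses a status line; it does not invoke the CLI. The function names are hypothetical, the flags are the ones shown on the slide, and the status-line format is copied from the slide's sample output, so real CLI output may differ.

```python
import shlex

OOZIE_URL = "http://localhost:11000/oozie/"  # server URL from the slide


def run_command(config_file, oozie_url=OOZIE_URL):
    """Argument list for submitting and starting a workflow job."""
    return ["oozie", "job", "-run", "-config", config_file, "-oozie", oozie_url]


def info_command(job_id, oozie_url=OOZIE_URL):
    """Argument list for querying the status of a workflow job."""
    return ["oozie", "job", "-info", job_id, "-oozie", oozie_url]


def parse_status(info_output):
    """Extract the status from a line like 'Workflow job status [RUNNING]'.

    The line format is an assumption taken from the slide's sample output.
    """
    prefix = "Workflow job status ["
    for line in info_output.splitlines():
        line = line.strip()
        if line.startswith(prefix) and line.endswith("]"):
            return line[len(prefix):-1]
    return None


print(shlex.join(run_command("job.properties")))
print(shlex.join(info_command("00123-123456-oozie-wrkf-W")))
```

Either list can be passed to subprocess.run on a machine where the oozie client is installed.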
Key Features and Design Decisions
• Multi-tenant
• Security
  – Authenticate every request
  – Pass appropriate token to Hadoop job
• Scalability
  – Vertical: Add extra memory/disk
  – Horizontal: Add machines
Oozie Job Processing

[Diagram: End user → Job → Oozie Server (Oozie Security) → Secure Access (Kerberos) → Hadoop]
Oozie-Hadoop Security

[Diagram repeated: End user → Job → Oozie Server (Oozie Security) → Secure Access (Kerberos) → Hadoop]
Oozie-Hadoop Security

 •   Oozie is a multi-tenant system
 •   A job can be scheduled to run later
 •   Oozie submits and maintains the Hadoop jobs
 •   Hadoop needs a security token for each request

Question: Who should provide the security token to Hadoop, and how?
Oozie-Hadoop Security Contd.

• Answer: Oozie
• How?
  – Hadoop considers Oozie a super-user
  – Hadoop does not check the end user's credential
  – Hadoop only checks the credential of the Oozie process

• BUT the Hadoop job is executed as the end user.
• Oozie utilizes the doAs() functionality of Hadoop.
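
The super-user arrangement can be pictured with a toy model. The Python sketch below is not the Hadoop API: ToyHadoop, Credential, and the submit signature are hypothetical names used only to show that the cluster checks the caller's (Oozie's) credential while the job is owned by the end user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    principal: str
    valid: bool = True


class ToyHadoop:
    """Toy cluster that trusts a configured set of super-users to act for others."""

    def __init__(self, super_users):
        self.super_users = set(super_users)

    def submit(self, caller, run_as, job_name):
        # Only the caller's credential is checked, never the end user's.
        if not caller.valid:
            raise PermissionError("caller credential rejected")
        # Acting on behalf of another user is reserved for super-users.
        if run_as != caller.principal and caller.principal not in self.super_users:
            raise PermissionError(f"{caller.principal} may not act as {run_as}")
        # The job is recorded as owned by the end user, not by the caller.
        return {"job": job_name, "owner": run_as, "submitted_by": caller.principal}


hadoop = ToyHadoop(super_users={"oozie"})
job = hadoop.submit(Credential("oozie"), run_as="joe", job_name="wordcount")
print(job)
```

In this model a caller outside the super-user set that tries to submit on behalf of "joe" is rejected, which is the property the slide relies on.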
User-Oozie Security

[Diagram repeated: End user → Job → Oozie Server (Oozie Security) → Secure Access (Kerberos) → Hadoop]
Why Oozie Security?

• One user should not modify another user's job
• Hadoop doesn't authenticate the end user
• Oozie has to verify its user before passing the job to Hadoop
How does Oozie Support Security?

• Built-in authentication
  – Kerberos
  – Non-secured (default)
• Design Decision
  – Pluggable authentication
  – Easy to include new type of authentication
  – Yahoo supports 3 types of authentication.
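
One way to picture the pluggable design is a small registry of authenticators behind a common interface. This Python sketch is illustrative only: the class names, the registry, and the request fields ("user.name", "token") are assumptions, and TokenAuthenticator is a stand-in for a secure scheme rather than a Kerberos implementation.

```python
from abc import ABC, abstractmethod


class Authenticator(ABC):
    @abstractmethod
    def authenticate(self, request):
        """Return the authenticated user name or raise PermissionError."""


class SimpleAuthenticator(Authenticator):
    """Non-secured mode: trust the user name supplied in the request."""

    def authenticate(self, request):
        user = request.get("user.name")
        if not user:
            raise PermissionError("no user name supplied")
        return user


class TokenAuthenticator(Authenticator):
    """Stand-in for a secure scheme: map an opaque token to a user."""

    def __init__(self, tokens):
        self._tokens = dict(tokens)

    def authenticate(self, request):
        user = self._tokens.get(request.get("token"))
        if user is None:
            raise PermissionError("invalid token")
        return user


# Registry of available schemes; new types plug in without touching callers.
REGISTRY = {"simple": SimpleAuthenticator}


def register(kind, factory):
    REGISTRY[kind] = factory


def make_authenticator(kind, **kwargs):
    return REGISTRY[kind](**kwargs)


register("token", TokenAuthenticator)
auth = make_authenticator("token", tokens={"abc123": "joe"})
print(auth.authenticate({"token": "abc123"}))
```

Adding a new authentication type then amounts to writing one class and registering it, which is the design goal the slide describes.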
Job Submission to Hadoop

• Oozie is designed to handle thousands of
  jobs at the same time

• Question: Should the Oozie server
  – Submit the Hadoop job directly?
  – Wait for it to finish?


 • Answer: No
Job Submission Contd.
• Reason
  – Resource constraints: A single Oozie process
    can't simultaneously run thousands of threads,
    one per Hadoop job. (Scaling limitation)
  – Isolation: Running user code on the Oozie server
    might destabilize Oozie
• Design Decision
  – Create a launcher Hadoop job
  – Execute the actual user job from the launcher.
  – Wait asynchronously for the job to finish.
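
The launcher pattern can be sketched with a toy simulation in Python. Here a thread pool stands in for the Hadoop cluster, and ToyOozieServer, launcher, and user_job are hypothetical names; the point is only that user code runs off the server and completion arrives through a callback instead of a blocked server thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def user_job(name):
    """Stand-in for the actual user job (for example a map-reduce job)."""
    return f"{name}: SUCCEEDED"


def launcher(name):
    """Runs on the 'cluster', not in the server: starts the user job and waits for it."""
    return user_job(name)


class ToyOozieServer:
    def __init__(self, cluster):
        self.cluster = cluster
        self.status = {}
        self._lock = threading.Lock()

    def submit(self, name):
        with self._lock:
            self.status[name] = "RUNNING"
        future = self.cluster.submit(launcher, name)
        # Completion is delivered as a callback; submit() returns immediately.
        future.add_done_callback(lambda f, n=name: self._on_done(n, f.result()))

    def _on_done(self, name, result):
        with self._lock:
            self.status[name] = result


# The thread pool plays the role of the Hadoop cluster in this toy model.
with ThreadPoolExecutor(max_workers=4) as cluster:
    server = ToyOozieServer(cluster)
    for i in range(10):
        server.submit(f"wf-{i}")
# Leaving the 'with' block waits for all launchers and their callbacks.
print(server.status)
```

In this model, capacity grows by enlarging the pool (the cluster) rather than the server, mirroring the horizontal-scalability argument.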
Job Submission to Hadoop


[Diagram: the Oozie Server on the left and, inside the Hadoop Cluster, the Job Tracker, the Launcher Mapper, and the Actual M/R Job, connected by numbered steps 1 to 5]
Job Submission Contd.

• Advantages
  – Horizontal scalability: If load increases, add
    machines into Hadoop cluster
  – Stability: Isolation of user code and system
    process
• Disadvantages
  – Each job occupies an extra map slot for its launcher.
Production Setup

• Total number of nodes: 42K+
• Total number of Clusters: 25+
• Total number of processed jobs ≈ 750K/month
• Data presented here comes from two clusters
• Each of them has nearly 4K nodes
• Total number of users per cluster = 50
Oozie Usage Pattern @ Y!
[Bar chart: Distribution of Job Types on Production Clusters. x-axis: job type (fs, java, map-reduce, pig); y-axis: percentage (0 to 50); series: Cluster #1 and Cluster #2]

• Pig and Java are the most popular job types
• Pure Map-Reduce jobs are fewer in number
Experimental Setup

•   Number of nodes: 7
•   Number of map-slots: 28
•   4 cores, 16 GB RAM
•   64-bit RHEL
•   Oozie Server
    – 3 GB RAM
    – Internal queue size = 10 K
    – Number of worker threads = 300
Job Acceptance
[Chart: Workflow Acceptance Rate. x-axis: number of submission threads (2 to 640); y-axis: workflows accepted per minute (0 to 1400)]

Observation: Oozie can accept a large number of jobs
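Timeline of an Oozie Job

[Timeline: User submits job → Oozie submits to Hadoop → Job completes at Hadoop → Job completes at Oozie]

Preparation overhead: from user submission until Oozie submits to Hadoop
Completion overhead: from job completion at Hadoop until job completion at Oozie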

Total Oozie Overhead = Preparation + Completion
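
A small worked example of this sum, as a Python sketch. It assumes that preparation overhead is the gap between the user's submission and Oozie's submission to Hadoop, and that completion overhead is the gap between completion at Hadoop and completion at Oozie. The class name and all timestamps are hypothetical, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class JobTimeline:
    """Four timestamps of one job, in milliseconds (hypothetical values)."""
    user_submit: int
    hadoop_submit: int
    hadoop_complete: int
    oozie_complete: int

    @property
    def preparation(self):
        # Time Oozie spends before handing the job to Hadoop.
        return self.hadoop_submit - self.user_submit

    @property
    def completion(self):
        # Time Oozie spends after Hadoop reports the job finished.
        return self.oozie_complete - self.hadoop_complete

    @property
    def total_overhead(self):
        return self.preparation + self.completion


# Illustrative numbers only.
t = JobTimeline(user_submit=0, hadoop_submit=400,
                hadoop_complete=60_400, oozie_complete=60_700)
print(t.preparation, t.completion, t.total_overhead)
```

The time the job spends running on Hadoop (60,000 ms in this example) is deliberately excluded from the overhead.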
Oozie Overhead
[Bar chart: Per Action Overhead. x-axis: number of actions per workflow (1, 5, 10, 50); y-axis: overhead in milliseconds (0 to 1800)]

Observation: Per-action Oozie overhead is lower when multiple
actions are in the same workflow.
Oozie Futures

• Scalability
  – Hot-Hot/Load balancing service
  – Replace SQL DB with Zookeeper
• Improved Usability
• Extend the benchmarking scope
• Monitoring WS API
Takeaways

• Oozie is
 – Easier to use
 – Scalable
 – Secure and multi-tenant
Q&A




 Mohammad K Islam    kamrul@yahoo-inc.com
 Virag Kothari       virag@yahoo-inc.com
    http://incubator.apache.org/oozie/

SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop

  • 1. Oozie: Towards a Scalable Workflow Management System for Hadoop. Mohammad Islam and Virag Kothari
  • 2. Accepted Paper
      • Workshop in ACM/SIGMOD, May 2012.
      • It is a team effort! Mohammad Islam, Angelo Huang, Mohamed Battisha, Michelle Chiang, Santhosh Srinivasan, Craig Peters, Andreas Neumann, Alejandro Abdelnur
  • 3. Presentation Workflow [diagram: Oozie Tutorial, Design Decisions, Results, Questions?, Address Question, END]
  • 4. Installing Oozie
      Step 1: Download the Oozie tarball
        curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz
      Step 2: Unpack the tarball
        tar -xzvf <PATH_TO_OOZIE_TAR>
      Step 3: Run the setup script
        bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip
      Step 4: Start Oozie
        bin/oozie-start.sh
      Step 5: Check the status of Oozie
        bin/oozie admin -oozie http://localhost:11000/oozie -status
  • 5. Running an Example
      • Standalone Map-Reduce job
        $ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir
      • Using Oozie
        Example DAG: Start → MapReduce wordcount → End (on OK) or Kill (on ERROR)
        Workflow.xml:
          <workflow-app name=..>
            <start..>
            <action>
              <map-reduce>
              ……
              ……
          </workflow-app>
  • 6. Example Workflow
      <action name='wordcount'>
        <map-reduce>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
              <name>mapred.reducer.class</name>
              <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
              <name>mapred.input.dir</name>
              <value>/usr/joe/inputDir</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/usr/joe/outputDir</value>
            </property>
          </configuration>
        </map-reduce>
      </action>
      Slide callouts mapping each property to the standalone command:
        mapred.mapper.class = org.myorg.WordCount.Map
        mapred.reducer.class = org.myorg.WordCount.Reduce
        mapred.input.dir = inputDir
        mapred.output.dir = outputDir
  • 7. A Workflow Application
      A workflow application has three components:
      1) Workflow.xml: contains the job definition
      2) Libraries: an optional 'lib/' directory containing .jar/.so files
      3) Properties file:
         • Parameterizes the workflow XML
         • The mandatory property is oozie.wf.application.path
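As a rough illustration of the properties file on slide 7, the sketch below parses a job.properties-style text and checks that oozie.wf.application.path is present. The helper names (parse_properties, check_job_properties), the HDFS URL, and the inputDir/outputDir parameter names are assumptions made for this example; this is not Oozie's own validation code.

```python
MANDATORY_PROPERTY = "oozie.wf.application.path"


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            props[key.strip()] = value.strip()
    return props


def check_job_properties(props):
    """Return the application path, or raise ValueError if it is missing."""
    path = props.get(MANDATORY_PROPERTY)
    if not path:
        raise ValueError("missing mandatory property: " + MANDATORY_PROPERTY)
    return path


# Hypothetical job.properties content for the wordcount example.
SAMPLE = """
# all values below are illustrative
oozie.wf.application.path=hdfs://localhost:9000/user/joe/wordcount-wf
inputDir=/usr/joe/inputDir
outputDir=/usr/joe/outputDir
"""

if __name__ == "__main__":
    print(check_job_properties(parse_properties(SAMPLE)))
```

The remaining keys are free-form parameters that the workflow XML can reference; only the application path is required.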
  • 8. Workflow Submission
      Run Workflow Job
        $ oozie job -run -config job.properties -oozie http://localhost:11000/oozie/
        Workflow ID: 00123-123456-oozie-wrkf-W
      Check Workflow Job Status
        $ oozie job -info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/
        -----------------------------------------------------------------------
        Workflow Name: test-wf
        App Path: hdfs://localhost:11000/user/your_id/oozie/
        Workflow job status [RUNNING]
        ...
        -----------------------------------------------------------------------
  • 9. Key Features and Design Decisions
      • Multi-tenant
      • Security
        – Authenticate every request
        – Pass the appropriate token to the Hadoop job
      • Scalability
        – Vertical: add extra memory/disk
        – Horizontal: add machines
  • 10. Oozie Job Processing [diagram; labels as extracted: Oozie Security, Hadoop Access Secure Job, Kerberos, Oozie Server, End user]
  • 11. Oozie-Hadoop Security [diagram; labels as extracted: Oozie Security, Hadoop Access Secure Job, Kerberos, Oozie Server, End user]
  • 12. Oozie-Hadoop Security
      • Oozie is a multi-tenant system
      • A job can be scheduled to run later
      • Oozie submits and maintains the Hadoop jobs
      • Hadoop needs a security token for each request
      Question: Who should provide the security token to Hadoop, and how?
  • 13. Oozie-Hadoop Security Contd.
      • Answer: Oozie
      • How?
        – Hadoop considers Oozie a super-user
        – Hadoop does not check the end-user credential
        – Hadoop only checks the credential of the Oozie process
      • BUT the Hadoop job is executed as the end user.
      • Oozie utilizes the doAs() functionality of Hadoop.
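The trust relationship on slide 13 can be pictured with a small toy model. The sketch below is not the Hadoop doAs() API; ToyCluster, its submit method, and the credential strings are all hypothetical names invented for illustration. It only shows the idea that the cluster verifies the caller's own credential, lets a configured super-user act on behalf of someone else, and records the end user as the job owner.

```python
class ToyCluster:
    """Toy stand-in for a cluster that trusts a set of super-users."""

    def __init__(self, credentials, superusers):
        self.credentials = dict(credentials)  # caller name -> expected credential
        self.superusers = set(superusers)

    def submit(self, caller, credential, run_as, job_name):
        # Only the caller's credential is checked, never the end user's.
        expected = self.credentials.get(caller)
        if expected is None or expected != credential:
            raise PermissionError("caller not authenticated: " + caller)
        # Acting as another user is allowed only for super-users.
        if run_as != caller and caller not in self.superusers:
            raise PermissionError(caller + " may not act as " + run_as)
        return {"job": job_name, "owner": run_as, "submitted_by": caller}


if __name__ == "__main__":
    cluster = ToyCluster({"oozie": "oozie-ticket", "joe": "joe-ticket"}, {"oozie"})
    # The server process presents its own credential but the job runs as joe.
    print(cluster.submit("oozie", "oozie-ticket", run_as="joe", job_name="wordcount"))
```

In this model the end user never presents a credential to the cluster, which is why the server itself has to verify its users first.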
  • 14. User-Oozie Security [diagram; labels as extracted: Oozie Security, Hadoop Access Secure Job, Kerberos, Oozie Server, End user]
  • 15. Why Oozie Security?
      • One user should not modify another user's job
      • Hadoop doesn't authenticate the end user
      • Oozie has to verify its user before passing the job to Hadoop
  • 16. How does Oozie Support Security?
      • Built-in authentication
        – Kerberos
        – Non-secured (default)
      • Design Decision
        – Pluggable authentication
        – Easy to include a new type of authentication
        – Yahoo supports 3 types of authentication.
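One way to picture the pluggable-authentication decision on slide 16 is a small interface with interchangeable implementations. The sketch below is a generic illustration, not Oozie's actual plug-in interface: the class names, the request dictionary layout, and the shared-secret scheme are assumptions made for this example.

```python
from abc import ABC, abstractmethod


class Authenticator(ABC):
    """Hypothetical plug-in interface: map a request to a verified user name."""

    @abstractmethod
    def authenticate(self, request):
        """Return the user name, or raise PermissionError."""


class NonSecuredAuthenticator(Authenticator):
    """Trusts whatever user name the client claims (no verification)."""

    def authenticate(self, request):
        user = request.get("user")
        if not user:
            raise PermissionError("request carries no user name")
        return user


class SharedSecretAuthenticator(Authenticator):
    """Illustrative extra plug-in: checks a per-user secret."""

    def __init__(self, secrets):
        self.secrets = dict(secrets)

    def authenticate(self, request):
        user = request.get("user")
        expected = self.secrets.get(user)
        if expected is None or expected != request.get("secret"):
            raise PermissionError("bad credentials")
        return user


def handle_request(authenticator, request):
    """Every request is authenticated before any job handling happens."""
    user = authenticator.authenticate(request)
    return "accepted job from " + user


if __name__ == "__main__":
    print(handle_request(NonSecuredAuthenticator(), {"user": "joe"}))
```

Because handle_request depends only on the interface, adding a new authentication type means adding one class rather than changing the request path.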
  • 17. Job Submission to Hadoop
      • Oozie is designed to handle thousands of jobs at the same time
      • Question: Should the Oozie server
        – Submit the Hadoop job directly?
        – Wait for it to finish?
      • Answer: No
  • 18. Job Submission Contd.
      • Reason
        – Resource constraints: a single Oozie process can't simultaneously create thousands of threads, one for each Hadoop job (scaling limitation)
        – Isolation: running user code on the Oozie server might destabilize Oozie
      • Design Decision
        – Create a launcher Hadoop job
        – Execute the actual user job from the launcher
        – Wait asynchronously for the job to finish
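The design decision on slide 18 can be sketched as a small, single-threaded simulation: the server hands each job to a launcher on the cluster and returns immediately, and completion arrives later as a callback instead of a blocked thread. ToyCluster, ToyServer, and the status strings are hypothetical names for this illustration and do not reflect Oozie's internal classes.

```python
from collections import deque


class ToyCluster:
    """Stand-in for the Hadoop cluster: launchers run user code here."""

    def __init__(self):
        self.running = {}

    def start_launcher(self, job_id, user_job, on_done):
        # Returns immediately; the launcher executes the user job later.
        self.running[job_id] = (user_job, on_done)

    def finish_all(self):
        # Simulate every launcher running its user job and reporting back.
        for job_id, (user_job, on_done) in list(self.running.items()):
            del self.running[job_id]
            on_done(job_id, user_job())  # user code runs on the cluster side


class ToyServer:
    """Stand-in for the workflow server: it never runs user code itself."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.queue = deque()  # internal command queue
        self.status = {}
        self.results = {}

    def submit(self, job_id, user_job):
        self.status[job_id] = "PREP"
        self.queue.append(("start", job_id, user_job))

    def _on_done(self, job_id, result):
        # Completion is just another queued command, not a waiting thread.
        self.queue.append(("complete", job_id, result))

    def drain(self):
        while self.queue:
            kind, job_id, payload = self.queue.popleft()
            if kind == "start":
                self.cluster.start_launcher(job_id, payload, self._on_done)
                self.status[job_id] = "RUNNING"
            else:
                self.results[job_id] = payload
                self.status[job_id] = "SUCCEEDED"


if __name__ == "__main__":
    cluster = ToyCluster()
    server = ToyServer(cluster)
    for i in range(3):
        server.submit("job-%d" % i, lambda i=i: i * i)
    server.drain()        # all launchers started, nothing blocked
    cluster.finish_all()  # cluster runs the user jobs
    server.drain()        # server processes the completion callbacks
    print(server.status, server.results)
```

The server's cost per job is a queue entry and a status record, so its thread count does not grow with the number of running jobs.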
  • 19. Job Submission to Hadoop [diagram: Oozie Server and, inside the Hadoop Cluster, Job Tracker, Launcher Mapper, and Actual M/R Job; interactions numbered 1 to 5]
  • 20. Job Submission Contd.
      • Advantages
        – Horizontal scalability: if load increases, add machines to the Hadoop cluster
        – Stability: isolation of user code from the system process
      • Disadvantages
        – An extra map slot is occupied by each job.
  • 21. Production Setup
      • Total number of nodes: 42K+
      • Total number of clusters: 25+
      • Total number of processed jobs ≈ 750K/month
      • Data are presented from two clusters
      • Each of them has nearly 4K nodes
      • Total number of users per cluster = 50
  • 22. Oozie Usage Pattern @ Y!
      [chart: Distribution of Job Types on Production Clusters; x-axis: job type (fs, java, map-reduce, pig); y-axis: percentage (0 to 50); series: #1 Cluster, #2 Cluster]
      • Pig and Java are the most popular
      • Pure Map-Reduce jobs are fewer
  • 23. Experimental Setup
      • Number of nodes: 7
      • Number of map slots: 28
      • 4 cores, RAM: 16 GB
      • 64-bit RHEL
      • Oozie Server
        – 3 GB RAM
        – Internal queue size = 10 K
        – Number of worker threads = 300
  • 24. Job Acceptance
      [chart: Workflow Acceptance Rate; x-axis: number of submission threads (2, 6, 10, 14, 20, 40, 52, 100, 120, 200, 320, 640); y-axis: workflows accepted/min (0 to 1400)]
      Observation: Oozie can accept a large number of jobs
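  • 25. Timeline of an Oozie Job
      [timeline: User submits job → Oozie submits to Hadoop → Job completes at Hadoop → Job completes at Oozie]
      • Preparation overhead: from user submission to Oozie's submission to Hadoop
      • Completion overhead: from job completion at Hadoop to job completion at Oozie
      Total Oozie Overhead = Preparation + Completion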
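The overhead definition on slide 25 amounts to simple arithmetic over four timestamps. The sketch below works it through; the function name and the millisecond values are made up for illustration and are not measurements from the talk.

```python
def oozie_overhead(user_submit, hadoop_submit, hadoop_complete, oozie_complete):
    """Return (preparation, completion, total) overhead from four timestamps.

    All arguments are timestamps in the same unit (milliseconds here).
    """
    preparation = hadoop_submit - user_submit       # before the job reaches Hadoop
    completion = oozie_complete - hadoop_complete   # after Hadoop has finished
    return preparation, completion, preparation + completion


if __name__ == "__main__":
    # Illustrative timestamps only: the Hadoop job itself runs for 60 s
    # and is excluded from the overhead.
    print(oozie_overhead(0, 400, 60400, 60900))
```

The time the job spends running on Hadoop does not appear in the total, so the overhead figure isolates what the workflow layer adds.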
  • 26. Oozie Overhead
      [chart: Per Action Overhead; x-axis: number of actions/workflow (1, 5, 10, 50 actions); y-axis: overhead in milliseconds (0 to 1800)]
      Observation: Per-action Oozie overhead is lower when multiple actions are in the same workflow.
  • 27. Oozie Futures
      • Scalability
        – Hot-Hot/Load balancing service
        – Replace SQL DB with Zookeeper
      • Improved Usability
      • Extend the benchmarking scope
      • Monitoring WS API
  • 28. Take Away
      • Oozie is
        – Easier to use
        – Scalable
        – Secure and multi-tenant
  • 29. Q&A
      Mohammad K Islam (kamrul@yahoo-inc.com), Virag Kothari (virag@yahoo-inc.com)
      http://incubator.apache.org/oozie/