Over the past three years, Oozie has become the de facto workflow scheduling system for Hadoop. It has proven itself as a scalable, secure, and multi-tenant service, and it stably processes more than 45% of the jobs run across more than 25 Hadoop clusters at Yahoo. At the same time, adoption in other enterprises has increased substantially since Oozie was contributed to the Apache community. We attribute these achievements to design decisions described in a paper that was selected for presentation at a workshop during the ACM/SIGMOD conference. This presentation covers the key architectural design choices described in the paper, uses operational metrics to illustrate production experience at Yahoo, and includes a quick tutorial.
May 2012 HUG: Oozie: Towards a scalable Workflow Management System for Hadoop
1. Oozie: Towards a Scalable Workflow Management System for Hadoop
Mohammad Islam and Virag Kothari
2. Accepted Paper
• Workshop in ACM/SIGMOD, May 2012.
• It is a team effort!
Mohammad Islam, Angelo Huang, Mohamed Battisha, Michelle Chiang, Santhosh Srinivasan, Craig Peters, Andreas Neumann, Alejandro Abdelnur
4. Installing Oozie
Step 1: Download the Oozie tarball
curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz
Step 2: Unpack the tarball
tar -xzvf <PATH_TO_OOZIE_TAR>
Step 3: Run the setup script
bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip
Step 4: Start Oozie
bin/oozie-start.sh
Step 5: Check the status of Oozie
bin/oozie admin -oozie http://localhost:11000/oozie -status
5. Running an Example
• Standalone Map-Reduce job
$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir
• Using Oozie
[Figure: example DAG. Start leads to the MapReduce wordcount action; on OK the workflow goes to End, on ERROR it goes to Kill.]
Example workflow.xml (abridged on the slide):
<workflow-app name=..>
<start..>
<action>
<map-reduce>
……
……
</workflow-app>
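The DAG on this slide can be sketched as a complete workflow definition. The Python snippet below is a minimal sketch, not the presenters' actual file: the workflow name, the node names, and the ${jobTracker} and ${nameNode} parameters are assumptions chosen for illustration, and the <configuration> section that a real map-reduce action would need is omitted. It embeds a small workflow.xml and follows the ok/error transitions to show how the single action leads either to End or to Kill.

```python
import xml.etree.ElementTree as ET

# Hypothetical workflow.xml for the wordcount DAG; all names are illustrative.
WORKFLOW_XML = """
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>wordcount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
"""

NS = {"wf": "uri:oozie:workflow:0.1"}


def walk(xml_text, outcomes):
    """Follow transitions from <start>; outcomes maps action name to 'ok' or 'error'."""
    root = ET.fromstring(xml_text.strip())
    # Index every named node (action, kill, end) by its name attribute.
    nodes = {child.get("name"): child for child in root if child.get("name")}
    path = ["start"]
    current = root.find("wf:start", NS).get("to")
    while True:
        node = nodes[current]
        path.append(current)
        kind = node.tag.split("}", 1)[1]  # drop the namespace part of the tag
        if kind in ("end", "kill"):
            return path
        result = outcomes.get(current, "ok")  # actions default to success
        current = node.find("wf:" + result, NS).get("to")


print(walk(WORKFLOW_XML, {"wordcount": "ok"}))     # ['start', 'wordcount', 'end']
print(walk(WORKFLOW_XML, {"wordcount": "error"}))  # ['start', 'wordcount', 'kill']
```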
9. Key Features and Design Decisions
• Multi-tenant
• Security
– Authenticate every request
– Pass appropriate token to Hadoop job
• Scalability
– Vertical: Add extra memory/disk
– Horizontal: Add machines
10. Oozie Job Processing
[Figure: security diagram showing the End user, the Oozie Server, and Hadoop, with the labels Oozie Security, Secure Access, Job, and Kerberos.]
11. Oozie-Hadoop Security
[Figure: security diagram showing the End user, the Oozie Server, and Hadoop, with the labels Oozie Security, Secure Access, Job, and Kerberos.]
12. Oozie-Hadoop Security
• Oozie is a multi-tenant system
• A job can be scheduled to run later
• Oozie submits and maintains the Hadoop jobs
• Hadoop needs a security token for each request
Question: Who should provide the security token to Hadoop, and how?
13. Oozie-Hadoop Security Contd.
• Answer: Oozie
• How?
– Hadoop considers Oozie a super-user
– Hadoop does not check the end user's credential
– Hadoop only checks the credential of the Oozie process
• But the Hadoop job is executed as the end user.
• Oozie uses the doAs() functionality of Hadoop.
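The trust model on this slide can be illustrated with a toy simulation. This is a sketch in Python, not Hadoop's API: ToyHadoop, its superusers set, and the run_as method are hypothetical names standing in for the doAs() mechanism. The cluster checks only the identity of the calling process, and if that caller is a trusted super-user, it runs the job under the end user's identity.

```python
class ToyHadoop:
    """Hypothetical stand-in for a cluster that trusts configured super-users."""

    def __init__(self, superusers):
        self.superusers = set(superusers)

    def run_as(self, caller, end_user, job):
        # Only the caller's identity is checked, never the end user's credential.
        if caller != end_user and caller not in self.superusers:
            raise PermissionError(f"{caller} may not impersonate {end_user}")
        # The job itself executes under the end user's identity.
        return job(end_user)


def whoami(effective_user):
    return f"job ran as {effective_user}"


hadoop = ToyHadoop(superusers={"oozie"})
print(hadoop.run_as("oozie", "joe", whoami))  # job ran as joe
```

In this model a caller that is not a configured super-user, and is not the end user itself, is rejected before any job code runs.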
14. User-Oozie Security
[Figure: security diagram showing the End user, the Oozie Server, and Hadoop, with the labels Oozie Security, Secure Access, Job, and Kerberos.]
15. Why Oozie Security?
• One user should not modify another user’s job
• Hadoop doesn’t authenticate the end user
• Oozie has to verify its user before passing the job to Hadoop
16. How does Oozie Support Security?
• Built-in authentication
– Kerberos
– Non-secured (default)
• Design Decision
– Pluggable authentication
– Easy to add a new type of authentication
– Yahoo supports 3 types of authentication.
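The pluggable-authentication decision can be sketched as an interface plus a registry. The Python below is an illustration only and does not reflect Oozie's real classes: Authenticator, SimpleAuthenticator, TokenAuthenticator, and make_authenticator are hypothetical names, and the token scheme is a stand-in for a secured mechanism such as Kerberos (no real Kerberos is involved).

```python
from abc import ABC, abstractmethod


class Authenticator(ABC):
    """Hypothetical plug-in interface: one implementation per scheme."""

    @abstractmethod
    def authenticate(self, request):
        """Return the authenticated user name or raise PermissionError."""


class SimpleAuthenticator(Authenticator):
    """Non-secured mode: trust the user name supplied in the request."""

    def authenticate(self, request):
        return request["user.name"]


class TokenAuthenticator(Authenticator):
    """Stand-in for a secured scheme; maps known tokens to user names."""

    def __init__(self, valid_tokens):
        self.valid_tokens = valid_tokens

    def authenticate(self, request):
        try:
            return self.valid_tokens[request["token"]]
        except KeyError:
            raise PermissionError("unknown or missing token") from None


# Adding a new scheme means registering one more class here.
AUTHENTICATORS = {"simple": SimpleAuthenticator, "token": TokenAuthenticator}


def make_authenticator(kind, **kwargs):
    return AUTHENTICATORS[kind](**kwargs)


auth = make_authenticator("token", valid_tokens={"abc123": "joe"})
print(auth.authenticate({"token": "abc123"}))  # joe
```

The server code only ever calls authenticate(), so switching schemes is a configuration choice rather than a code change.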
17. Job Submission to Hadoop
• Oozie is designed to handle thousands of jobs at the same time
• Question: Should the Oozie server
– Submit the hadoop job directly?
– Wait for it to finish?
• Answer: No
18. Job Submission Contd.
• Reason
– Resource constraints: A single Oozie process cannot simultaneously run thousands of threads, one for each Hadoop job. (Scaling limitation)
– Isolation: Running user code on the Oozie server might destabilize Oozie
• Design Decision
– Create a launcher Hadoop job
– Execute the actual user job from the launcher.
– Wait asynchronously for the job to finish.
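The launcher design can be sketched with a thread pool playing the role of the cluster. This is a toy model in Python, not Oozie's implementation: ToyCluster, ToyServer, and launcher are hypothetical names, and a completion callback stands in for whatever notification mechanism the real server uses. The server hands a launcher to the cluster and returns at once; the launcher runs the user's job on the cluster, so a failure in user code never executes inside the server.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class ToyCluster:
    """Stand-in for a Hadoop cluster; each running task occupies one slot."""

    def __init__(self, slots):
        self._pool = ThreadPoolExecutor(max_workers=slots)

    def submit(self, fn, *args):
        return self._pool.submit(fn, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def launcher(cluster, user_job):
    """Runs on the cluster: starts the actual user job and waits for it there."""
    return cluster.submit(user_job).result()


class ToyServer:
    """Stand-in for the workflow server; it never runs user code itself."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.status = {}
        self.launchers = []

    def submit(self, job_id, user_job):
        self.status[job_id] = "RUNNING"
        future = self.cluster.submit(launcher, self.cluster, user_job)
        # Record the outcome when the launcher finishes; do not block here.
        future.add_done_callback(lambda f: self._on_done(job_id, f))
        self.launchers.append(future)

    def _on_done(self, job_id, future):
        self.status[job_id] = "KILLED" if future.exception() else "SUCCEEDED"


def failing_job():
    raise ValueError("bug in user code")


cluster = ToyCluster(slots=4)  # enough slots for two launchers plus their jobs
server = ToyServer(cluster)
server.submit("wordcount", lambda: 42)
server.submit("broken", failing_job)
wait(server.launchers)  # demo only: block until every launcher has finished
cluster.shutdown()      # joins the workers, so all callbacks have run
print(server.status)
```

Note that each launcher holds a slot while its job runs, so in this toy model a cluster with too few slots could fill up with launchers alone.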
19. Job Submission to Hadoop
[Figure: numbered steps 1 to 5 connecting the Oozie Server with the Job Tracker, the Launcher Mapper, and the Actual M/R Job inside the Hadoop Cluster.]
20. Job Submission Contd.
• Advantages
– Horizontal scalability: If load increases, add machines to the Hadoop cluster
– Stability: Isolation of user code and system process
• Disadvantages
– Each job occupies an extra map slot.
21. Production Setup
• Total number of nodes: 42K+
• Total number of Clusters: 25+
• Total number of processed jobs ≈ 750K/month
• Data presented from two clusters
• Each of them has nearly 4K nodes
• Total number of users per cluster = 50
22. Oozie Usage Pattern @ Y!
[Figure: bar chart "Distribution of Job Types on Production Clusters". X-axis: job type (fs, java, map-reduce, pig). Y-axis: percentage (0 to 50). Series: Cluster #1 and Cluster #2.]
• Pig and Java are the most popular
• Pure Map-Reduce jobs are fewer
23. Experimental Setup
• Number of nodes: 7
• Number of map-slots: 28
• 4 Core, RAM: 16 GB
• 64 bit RHEL
• Oozie Server
– 3 GB RAM
– Internal Queue size = 10 K
– # Worker Thread = 300
24. Job Acceptance
[Figure: chart "Workflow Acceptance Rate". X-axis: number of submission threads (2 to 640). Y-axis: workflows accepted per minute (0 to 1400).]
Observation: Oozie can accept a large number of jobs
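25. Timeline of an Oozie Job
[Figure: timeline with four events in order: user submits job, Oozie submits to Hadoop, job completes at Hadoop, job completes at Oozie. Preparation overhead is marked at the start of the timeline and completion overhead at the end.]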
Total Oozie Overhead = Preparation + Completion
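The formula can be worked through with a short Python sketch. It assumes that preparation overhead is the gap between the user's submission and Oozie's submission to Hadoop, and that completion overhead is the gap between the job completing at Hadoop and completing at Oozie. The timestamps are made up for illustration and are not measurements from this presentation.

```python
def oozie_overhead(user_submit, hadoop_submit, hadoop_complete, oozie_complete):
    """Return (preparation, completion, total) overhead in the input's time unit."""
    preparation = hadoop_submit - user_submit      # before Hadoop sees the job
    completion = oozie_complete - hadoop_complete  # after Hadoop has finished
    return preparation, completion, preparation + completion


# Hypothetical timestamps in milliseconds, measured from the user's submission.
prep, comp, total = oozie_overhead(0, 400, 60_400, 60_900)
print(prep, comp, total)  # 400 500 900
```

The time the job spends running on Hadoop (here 60,000 ms) is deliberately excluded, since it is not overhead added by Oozie.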
26. Oozie Overhead
[Figure: chart "Per Action Overhead". X-axis: number of actions per workflow (1, 5, 10, 50). Y-axis: overhead in milliseconds (0 to 1800).]
Observation: Per-action Oozie overhead is lower when multiple actions are in the same workflow.
27. Oozie Futures
• Scalability
– Hot-Hot/Load balancing service
– Replace SQL DB with ZooKeeper
• Improved Usability
• Extend the benchmarking scope
• Monitoring WS API
28. Takeaways
• Oozie is
– Easier to use
– Scalable
– Secure and multi-tenant
29. Q&A
Mohammad K Islam: kamrul@yahoo-inc.com
Virag Kothari: virag@yahoo-inc.com
http://incubator.apache.org/oozie/