Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL.
View the HD video of this talk here: http://vimeo.com/chug/oozie-overview
Everything you wanted to know, but were afraid to ask about Oozie
1. Everything that you ever wanted to know about Oozie, but were afraid to ask
B. Lublinsky, A. Yakubovich
2. Apache Oozie
• Oozie is a workflow/coordination system to
manage Apache Hadoop jobs.
• A single Oozie server implements all four
functional Oozie components:
– Oozie workflow
– Oozie coordinator
– Oozie bundle
– Oozie SLA.
3. Main components
[Architecture diagram] An Oozie server hosts the Bundle, Coordinator and Workflow engines. Third-party applications and the Oozie command-line interface talk to the server through its WS API. A bundle groups coordinators; a coordinator monitors time and data conditions and triggers workflows; a workflow holds the flow logic and handles job submission and monitoring of its actions (e.g. MapReduce jobs on Hadoop). Workflow definitions, state and the Oozie shared libraries are kept in HDFS.
5. Workflow Language
• Flow-control nodes
– Decision (workflow:DECISION) – expresses “switch-case” logic
– Fork (workflow:FORK) – splits one path of execution into multiple concurrent paths
– Join (workflow:JOIN) – waits until every concurrent execution path of a previous fork node arrives to it
– Kill (workflow:KILL) – forces a workflow job to kill (abort) itself
• Action nodes
– java (workflow:JAVA) – invokes the main() method of the specified Java class
– fs (workflow:FS) – manipulates files and directories in HDFS; supports the move, delete and mkdir commands
– MapReduce (workflow:MAP-REDUCE) – starts a Hadoop map/reduce job; this can be a Java MR job, a streaming job or a pipes job
– Pig (workflow:PIG) – runs a Pig job
– Sub-workflow (workflow:SUB-WORKFLOW) – runs a child workflow job
– Hive * (workflow:HIVE) – runs a Hive job
– Shell * (workflow:SHELL) – runs a shell command
– ssh * (workflow:SSH) – starts a shell command on a remote machine as a remote secure shell
– Sqoop * (workflow:SQOOP) – runs a Sqoop job
– Email * (workflow:EMAIL) – sends emails from an Oozie workflow application
– Distcp – under development (Yahoo)
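For illustration only (the workflow name, node names and the ${doParallel} property below are made up), a minimal hPDL sketch combining the decision, fork and join nodes from the table above:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="decide-path"/>
  <!-- decision node: “switch-case” on a workflow parameter -->
  <decision name="decide-path">
    <switch>
      <case to="fork-steps">${doParallel eq "true"}</case>
      <default to="single-step"/>
    </switch>
  </decision>
  <!-- fork/join: run two actions concurrently -->
  <fork name="fork-steps">
    <path start="mr-step"/>
    <path start="pig-step"/>
  </fork>
  <action name="mr-step">
    <map-reduce><!-- job-tracker, name-node, configuration --></map-reduce>
    <ok to="join-steps"/>
    <error to="fail"/>
  </action>
  <action name="pig-step">
    <pig><!-- script and parameters --></pig>
    <ok to="join-steps"/>
    <error to="fail"/>
  </action>
  <join name="join-steps" to="end"/>
  <action name="single-step">
    <map-reduce><!-- ... --></map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>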
9. Extending Oozie workflow
• Oozie provides a “minimal” workflow language, which
contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism –
custom action nodes. Custom action nodes allow extending
Oozie’s language with additional actions (verbs).
• Creating a custom action requires implementing the
following:
– A Java action implementation, which extends the
ActionExecutor class.
– An XML schema for the action, defining the action’s
configuration parameters.
– Packaging of the Java implementation and the configuration
schema into an action jar, which has to be added to the Oozie war.
– Extending oozie-site.xml to register the custom executor
with the Oozie runtime.
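A rough sketch of a synchronous custom executor (the class name and action name are illustrative; only the simplest case is shown). It would then be listed in oozie-site.xml under the oozie.service.ActionService.executor.ext.classes property:

import org.apache.oozie.action.ActionExecutor;
import org.apache.oozie.action.ActionExecutorException;
import org.apache.oozie.client.WorkflowAction;

public class MyActionExecutor extends ActionExecutor {

    public MyActionExecutor() {
        super("my-action");                  // element name used in hPDL: <my-action>
    }

    @Override
    public void start(Context context, WorkflowAction action) throws ActionExecutorException {
        String conf = action.getConf();      // the action’s XML fragment, validated against our schema
        // ... do the actual work here (synchronously, in this simple sketch) ...
        context.setExecutionData("OK", null);
    }

    @Override
    public void end(Context context, WorkflowAction action) throws ActionExecutorException {
        if ("OK".equals(action.getExternalStatus())) {
            context.setEndData(WorkflowAction.Status.OK, WorkflowAction.Status.OK.toString());
        } else {
            context.setEndData(WorkflowAction.Status.ERROR, WorkflowAction.Status.ERROR.toString());
        }
    }

    @Override
    public void check(Context context, WorkflowAction action) throws ActionExecutorException {
        // nothing to poll for a synchronous action
    }

    @Override
    public void kill(Context context, WorkflowAction action) throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.KILLED, "ERROR");
    }

    @Override
    public boolean isCompleted(String externalStatus) {
        return true;
    }
}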
10. Oozie Workflow Client
• Oozie provides an easy way to integrate with enterprise
applications through the Oozie client APIs. It provides two
types of APIs:
• REST HTTP API
A number of HTTP requests:
• Info requests (job status, job configuration)
• Job management (submit, start, suspend, resume, kill)
Example: job definition info request
GET /oozie/v0/job/job-ID?show=definition
• Java API - package org.apache.oozie.client
– OozieClient
start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
– WorkflowJob, WorkflowAction
– CoordinatorJob, CoordinatorAction
– SLAEvent
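A minimal usage sketch of the Java client (the Oozie URL, application path and property values are placeholders):

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // job configuration: the same key/value pairs as in a job.properties file
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/app");
        conf.setProperty("nameNode", "hdfs://namenode:8020");

        String jobId = client.run(conf);              // submit and start the workflow
        WorkflowJob job = client.getJobInfo(jobId);   // query job status
        System.out.println(jobId + " : " + job.getStatus());
    }
}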
11. Oozie workflow good, bad and ugly
• Good
– Nice integration with the Hadoop ecosystem, making it easy to build
processes encompassing synchronized execution of multiple MapReduce,
Hive, Pig, etc. jobs.
– Nice UI for tracking execution progress
– Simple APIs for integration with other applications
– Simple extensibility APIs
• Bad
– Process has to be expressed directly in hPDL with no visual support
– No support for Uber Jars (but we added our own)
• Ugly
– Static forking (but you can regenerate the workflow and invoke it on the fly)
– No support for loops
13. Coordinator language
• coordinator-app – top-level element of a coordinator instance; attributes: frequency, start, end
• controls – specify the execution policy for the coordinator and its elements (workflow actions); sub-elements: timeout (actions), concurrency (actions), execution order (workflow instances)
• action – required singular element specifying the associated workflow; the jobs specified in the workflow consume and produce dataset instances; sub-element: workflow name
• datasets – collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances
• input event – specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action
• output event – specifies the dataset that should be produced by a coordinator action
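A minimal coordinator sketch (the application name, dataset name, paths and frequency below are illustrative) tying these elements together:

<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2012-12-01T00:00Z" end="2012-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <timeout>60</timeout>
    <concurrency>1</concurrency>
    <execution>FIFO</execution>
  </controls>
  <datasets>
    <dataset name="probes" frequency="${coord:days(1)}"
             initial-instance="2012-12-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://namenode/data/probes/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="probes">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://namenode/user/demo/app</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>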
18. SLA Navigation
[Diagram: navigating SLA information through the Oozie database tables and their key columns]
• SLA_EVENT – event_id, alert_contact, alert_frequency, sla_id, …
• COORD_JOBS – id, app_name, app_path, …
• COORD_ACTIONS – id, action_number, action_xml, external_id, …
• WF_JOBS – id, app_name, app_path, …
• WF_ACTIONS – id, conf, console_url, …
20. Using Probes to analyze/monitor Places
• Select probe data for specified time/location
• Validate – Filter - Transform probe data
• Calculate statistics on available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
If an exception condition happens at any step, report failure; if all steps succeed, report success.
25. Configuring workflow
• Oozie provides three overlapping mechanisms to configure a workflow –
config-default.xml, the job properties file and job arguments that can
be passed to Oozie as part of the command-line invocation.
• Oozie processes these three sets of parameters as
follows:
– Use all of the parameters from the command-line invocation
– For the remaining unresolved parameters, the job properties file is used
– Use config-default.xml for everything else
• Although the documentation does not clearly describe when to use
which, the overall recommendation is as follows:
– Use config-default.xml for defining parameters that never change for a
given workflow
– Use the job properties file for parameters that are common to a given
deployment of a workflow
– Use command-line arguments for parameters that are specific to
a given workflow invocation.
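As an illustration, a deployment-level job.properties might look like the following (oozie.wf.application.path is the standard property Oozie requires; the other names and values are conventional placeholders):

nameNode=hdfs://namenode:8020
jobTracker=jobtracker:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/demo/app

An invocation-specific value can then be overridden when the job is submitted, e.g. from the command line.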
26. Accessing and storing process variables
• Accessing
– Through the arguments passed to the Java action’s main() method
• Storing
// Oozie tells the java action, via a system property, where to write its output properties
String ooziePropFileName = System.getProperty("oozie.action.output.properties");
OutputStream os = new FileOutputStream(new File(ooziePropFileName));
Properties props = new Properties();
props.setProperty(key, value);
props.store(os, "");
os.close();
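On the accessing side, a sketch of the java action’s entry point (the argument name is illustrative; values come from the <arg> elements of the action definition):

public static void main(String[] args) {
    String inputDir = args[0];   // e.g. passed as <arg>${inputDir}</arg> in the workflow definition
    // ... run the actual logic ...
}

For the stored properties to become visible to later nodes, the java action has to declare <capture-output/>; downstream nodes can then read them with the EL function wf:actionData('action-name')['key'] (the action name and key are placeholders).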
27. Validating data presence
• Oozie provides two possible approaches for validating the
presence of resource file(s):
– using the Oozie coordinator’s input events based on a dataset –
technically the simplest implementation approach, but it does
not provide the more complex decision support that might be
required; it simply either runs the corresponding workflow or not.
– a custom java node inside the Oozie workflow – allows extending the
decision logic, for example by sending notifications about data absence,
running on partial data under certain timing conditions, etc.
• Additional configuration parameters for the Oozie coordinator –
for example, the ability to wait for file arrival – can expand
the applicability of the Oozie coordinator.
28. Invoking MapReduce jobs
• Oozie provides two different ways of invoking a MapReduce
job – the MapReduce action and the java action.
• Invoking a MapReduce job with a java action is similar
to invoking the job with the Hadoop command line
from an edge node: you specify a driver class for the
java action and Oozie invokes the driver. This approach
has two main advantages:
– The same driver class can be used for both – running the Map-
Reduce job from an edge node and as a java action in an Oozie
process.
– A driver provides a convenient place for executing additional
code, for example clean-up required for MapReduce execution.
• The driver requires a proper shutdown hook to ensure that
there are no lingering MapReduce jobs (see the sketch below).
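A rough driver sketch (class, job and argument names are illustrative) showing such a shutdown hook, which kills the submitted job if the launching JVM is terminated:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProbeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        final Job job = Job.getInstance(conf, "probe-processing");
        job.setJarByClass(ProbeDriver.class);
        // job.setMapperClass(...), job.setReducerClass(...), output types, etc.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // If the Oozie launcher (or the edge-node JVM) dies, make sure the
        // Hadoop job does not keep running unattended.
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    if (!job.isComplete()) {
                        job.killJob();
                    }
                } catch (Exception e) {
                    // best-effort cleanup only
                }
            }
        });

        if (!job.waitForCompletion(true)) {
            throw new RuntimeException("MapReduce job failed");
        }
    }
}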
29. Implementing predefined looping and forking
• hPDL is an XML dialect with a well-defined
schema.
• This means that the actual workflow can be easily
manipulated using JAXB objects, which can be
generated from the hPDL schema using the xjc compiler.
• As a result, we can create the complete
workflow programmatically, based on a calculated
number of fork branches, or implement loops
as repeated actions.
• The other option is to create a template process
and modify it based on calculated parameters.
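A rough sketch of the idea, assuming classes generated by xjc from the hPDL schema (the generated names used below – ObjectFactory and its factory methods, WORKFLOWAPP, FORK and the package name – are illustrative and depend on the schema version):

import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class WorkflowGenerator {

    public String buildWorkflow(int branches) throws Exception {
        // xjc-generated object factory and types for the hPDL schema (illustrative names)
        ObjectFactory f = new ObjectFactory();
        WORKFLOWAPP wf = f.createWORKFLOWAPP();
        wf.setName("generated-wf");

        // build a fork with the calculated number of branches
        FORK fork = f.createFORK();
        fork.setName("fork-" + branches);
        for (int i = 0; i < branches; i++) {
            // ... add a fork path and a corresponding action node for each branch ...
        }
        // ... add start, join, kill and end nodes in the same way ...

        // marshal the object tree back into hPDL XML that can be deployed to HDFS
        JAXBContext ctx = JAXBContext.newInstance("org.example.hpdl.generated");
        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        StringWriter out = new StringWriter();
        m.marshal(f.createWorkflowApp(wf), out);
        return out.toString();
    }
}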
30. Oozie client security (or lack of)
• By default the Oozie client reads the client’s identity from the
local machine OS and passes it to the Oozie server,
which uses this identity for MR job invocation.
• Impersonation can be implemented by overriding the
OozieClient class’ createConfiguration method; the
desired user can be supplied through a new constructor.
// overridden in a subclass of OozieClient; “user” is set through the subclass constructor
public Properties createConfiguration() {
    Properties conf = new Properties();
    if (user == null) {
        conf.setProperty(USER_NAME, System.getProperty("user.name"));
    } else {
        conf.setProperty(USER_NAME, user);
    }
    return conf;
}
31. Uber jars with Oozie
[Diagram] An uber jar contains resources – other jars, .so libraries, zip files. The Oozie server runs the launcher class as a java action; the launcher unpacks the resources from the uber jar into the current directory, sets an inverse classloader, invokes the MR driver with the passed arguments, sets a shutdown hook and waits for the MapReduce job (its mappers) to complete.
<java>
  …
  <main-class>${wfUberLauncher}</main-class>
  <arg>-appStart=${wfAppMain}</arg>
  …
</java>