The first part of the talk describes the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we focus on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional, and optional processing; priority processing; late processing; and BCP management. The second part of the talk focuses on out-of-the-box support for Spark jobs.
Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.
4. Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2000+ projects
255 monthly users
90,00 workflow jobs daily (June 2016, one busy cluster)
Between 1 and 8 actions; avg. 4 actions/workflow
Extreme use case: 100-200 workflow jobs submitted per minute
2,277 coordinator jobs daily (June 2016, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25% : < 15 min)
99% of workflow jobs kicked off by coordinators
97 bundle jobs daily (June 2016, one busy cluster)
5. Agenda
1. Oozie at Yahoo
2. Data Pipelines and Complex Dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work
6. Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
10. Current limitations of the Oozie coordinator
• All datasets are required
• All instances are mandatory
• We can't combine datasets from multiple providers
• There is no way to assign priority among datasets
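For context, here is a minimal sketch of how inputs are declared in a classic coordinator, without input logic. The data-in names follow the later slides; the dataset definitions are omitted and the instance ranges are illustrative assumptions. With this declaration the action waits until every listed instance of both A and B exists, which is the behavior the limitations above describe.

```xml
<!-- Fragment of a coordinator definition, not a complete file. -->
<!-- dataset_A and dataset_B are assumed to be declared in <datasets>. -->
<input-events>
  <!-- All five instances of A must exist before the action can start. -->
  <data-in name="A" dataset="dataset_A">
    <start-instance>${coord:current(-5)}</start-instance>
    <end-instance>${coord:current(-1)}</end-instance>
  </data-in>
  <!-- B is also required; there is no way to mark it optional or lower priority. -->
  <data-in name="B" dataset="dataset_B">
    <instance>${coord:current(0)}</instance>
  </data-in>
</input-events>
```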
13. BCP Support
Pull data from A or B. Specify the dataset as AorB. The action starts running as soon as
either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
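To show where the input-logic block sits, the following is a sketch of a complete coordinator, assuming the 0.5 coordinator schema and that the resolved input is read with coord:dataIn using the name given to the or element. The application name, dates, cluster hosts, URI templates, and the inputDir property are hypothetical.

```xml
<coordinator-app name="bcp-example" frequency="${coord:hours(1)}"
                 start="2016-06-01T00:00Z" end="2016-06-02T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.5">
  <datasets>
    <!-- The same feed published on two clusters (hypothetical hosts). -->
    <dataset name="dataset_A" frequency="${coord:hours(1)}"
             initial-instance="2016-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://cluster-a/data/feed/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
    </dataset>
    <dataset name="dataset_B" frequency="${coord:hours(1)}"
             initial-instance="2016-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://cluster-b/data/feed/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="A" dataset="dataset_A">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="B" dataset="dataset_B">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <!-- The dataset attribute here refers to the data-in names above. -->
  <input-logic>
    <or name="AorB">
      <data-in dataset="A"/>
      <data-in dataset="B"/>
    </or>
  </input-logic>
  <action>
    <workflow>
      <app-path>${appPath}</app-path>
      <configuration>
        <property>
          <!-- Assumption: the workflow reads whichever input was resolved. -->
          <name>inputDir</name>
          <value>${coord:dataIn('AorB')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

The workflow itself stays unaware of which cluster supplied the data; it only sees the resolved path in inputDir.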
14. Minimum availability processing
Sometimes we want to process even if only partial data is available.
<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>
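The min attribute only has meaning relative to a range of instances. As a sketch, assuming A is declared over six hourly instances (the range and the dataset name dataset_A are illustrative), the action can start once at least four of the six are available:

```xml
<!-- Fragment of a coordinator definition, not a complete file. -->
<input-events>
  <!-- Six instances: current(-5) through current(0). -->
  <data-in name="A" dataset="dataset_A">
    <start-instance>${coord:current(-5)}</start-instance>
    <end-instance>${coord:current(0)}</end-instance>
  </data-in>
</input-events>
<input-logic>
  <!-- Proceed once any four of the six instances exist. -->
  <data-in dataset="A" min="4"/>
</input-logic>
```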
15. Optional feeds
Dataset B is optional: Oozie starts processing as soon as A is available. It includes
the data from A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
16. Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
17. Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary
only after waiting for a specified amount of time.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
18. Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is
missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
25. PySpark Example
Yahoo Confidential & Proprietary
Automatically sets up pyspark.zip and py4j-src.zip from Sharelib
<spark xmlns="uri:oozie:spark-action:0.2">
<master>yarn</master>
<mode>cluster</mode>
<name>PySparkExample</name>
<jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
<spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
</spark>
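The spark element above is an action body, so it has to live inside a workflow. A minimal sketch of the surrounding workflow follows, assuming the 0.5 workflow schema; the workflow and node names are hypothetical, and job-tracker and name-node are added because the spark-action 0.2 schema expects them unless they are supplied elsewhere.

```xml
<workflow-app name="pyspark-example-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="pyspark-node"/>
  <action name="pyspark-node">
    <spark xmlns="uri:oozie:spark-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn</master>
      <mode>cluster</mode>
      <name>PySparkExample</name>
      <!-- For PySpark the jar element points at the Python script; no class is given. -->
      <jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
      <spark-opts>--queue default</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

For the Sharelib pieces to be picked up, the job would typically be submitted with oozie.use.system.libpath=true in its properties.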
26. Modes supported
• In local and yarn-client mode, the driver runs inside the Oozie launcher itself, so any
property meant for the driver must be prefixed with oozie.launcher.
• For example, to increase driver memory, set oozie.launcher.mapreduce.map.memory.mb and
oozie.launcher.mapreduce.map.java.opts.
Master      Mode
local[*]
yarn        client
yarn        cluster
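As a sketch of the oozie.launcher. prefix described above, the properties can be set in the action's configuration block. The memory values, class name, and jar path below are illustrative assumptions, not recommendations from the talk.

```xml
<spark xmlns="uri:oozie:spark-action:0.2">
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <configuration>
    <!-- Container size for the launcher, which hosts the driver in client mode. -->
    <property>
      <name>oozie.launcher.mapreduce.map.memory.mb</name>
      <value>4096</value>
    </property>
    <!-- JVM heap for the launcher; kept below the container size. -->
    <property>
      <name>oozie.launcher.mapreduce.map.java.opts</name>
      <value>-Xmx3276m</value>
    </property>
  </configuration>
  <master>yarn</master>
  <mode>client</mode>
  <name>ClientModeExample</name>
  <!-- Hypothetical application class and jar. -->
  <class>org.example.Main</class>
  <jar>${nameNode}/apps/spark/lib/example.jar</jar>
</spark>
```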
27. Recent enhancements
• Support for PySpark jobs
• Show Spark Job URLs in Oozie UI under Child Jobs Tab
• Automatically include spark-defaults.conf from Sharelib
• Support for <file> and <archive>
• Faster job launch time
• Simplify setting up of classpath
• Avoid re-uploading jars for localization by reusing HDFS paths in
mapreduce.job.cache.files
• A couple of bug fixes
29. Future Work
Oozie Unit testing framework
No unit tests today; jobs are tested directly by running them in staging
Coordinator Dependency management
Better reprocessing
Aperiodic processing
Managed through workarounds