Cascading on Starfish
Fei Dong
Duke University
dongfei@cs.duke.edu
December 10, 2011
1 Introduction
Hadoop [6] is a software framework installed on a cluster to permit large scale
distributed data analysis. It provides the robust Hadoop Distributed File
System (HDFS) as well as a Java-based API that allows parallel processing
across the nodes of the cluster. Programs employ a Map/Reduce execution
engine which functions as a fault-tolerant distributed computing system over
large data sets.
In addition to Hadoop, which is a top-level Apache project, there are
sub-projects related to Hadoop workflows, such as Hive [8], a data warehouse
framework used for ad hoc querying (with an SQL-like query language),
and Pig [9], a high-level data-flow language and execution framework whose
compiler produces sequences of Map/Reduce programs for execution within
Hadoop. Cascading [2] is an API for defining and executing fault-tolerant
data processing workflows on a Hadoop cluster. All of these projects
simplify some of the work for developers, allowing them to write more
traditional procedural or SQL-style code that, under the covers, creates a
sequence of Hadoop jobs. In this report, we focus on Cascading as the main
data-parallel workflow choice.
1.1 Cascading Introduction
Cascading is a Java application framework that allows you to more easily
write scripts to access and manipulate data inside Hadoop. There are a
number of key features provided by this API:
• Dependency-Based 'Topological Scheduler' and MapReduce Planning
- Two key components of the Cascading API are its scheduler and its
planner. The scheduler invokes flows based on their dependencies; the
execution order is independent of the construction order, which often
allows portions of flows and cascades to be invoked concurrently. In
addition, the steps of the various flows are intelligently converted
into MapReduce invocations against the Hadoop cluster.
• Event Notification - The various steps of the flow can perform notifi-
cations via callbacks, allowing for the host application to report and
respond to the progress of the data processing.
• Scriptable - The Cascading API has scriptable interfaces for Jython,
Groovy, and JRuby.
Although Cascading provides the above benefits, we still need to consider
the balance between performance and productivity in Cascading. Marz [5]
gives some rules for optimizing Cascading Flows. Experienced Cascading
users can gain some performance improvement by following those high-level
principles. One interesting question is whether there are ways to improve
workflow performance without expert knowledge. In other words, we want to
optimize the workflow at the physical level. In Starfish [7], the authors
demonstrate the power of self-tuning jobs on Hadoop, and Herodotou has
successfully applied this optimization technology to Pig. This report
discusses automatic optimization of Cascading with the help of Starfish.
2 Terminology
First, we introduce some concepts widely used in Cascading.
• Stream: data input and output.
• Tuple: a stream is composed of a series of Tuples. A Tuple is an
ordered set of data.
• Tap: an abstraction on top of Hadoop files. Source - a source Tap is
read from and acted upon. Actions on source Taps result in Pipes.
Sink - a sink Tap is a location to be written to. A sink Tap can later
serve as a source in the same script.
• Operations: define what to do with the data, e.g. Each(), Group(),
CoGroup(), Every().
• Pipe: ties Operations together. When an operation is executed upon a
Tap, the result is a Pipe. In other words, a flow is a pipe with data
flowing through it. Pipes can use other Pipes as input, thereby
wrapping themselves into a series of operations.
• Filter: records pass through it so that useless records are removed,
e.g. RegexFilter(), And(), Or().
• Aggregator: a function applied after a group operation, e.g. Count(),
Average(), Min(), Max().
• Step: a logical unit in a Flow. It represents a Map-only or MapReduce
job.
• Flow: a combination of a Source, a Sink and Pipes.
• Cascade: a series of Flows.
3 Cascading Structure
Figure 1: A Typical Cascading Structure.
Figure 1 shows the structure of a Cascading application. The top level
is a Cascade, which is composed of several Flows. Each Flow defines a
source Tap, a sink Tap and Pipes. One Flow can have multiple Pipes that
perform data operations such as filtering, grouping and aggregation.
Internally, a Cascade is constructed through the CascadeConnector class,
by building an internal graph that makes each Flow a 'vertex' and each file
an 'edge'. A topological walk on this graph touches each vertex in order
of its dependencies. When all of a vertex's incoming edges are available,
it is scheduled on the cluster. Figure 2 gives an example whose goal is to
count requests per second and per minute in Apache logs. The dataflow is
represented as a graph. The first step imports and parses the source data.
It is followed by two steps that process the "second" and "minute" counts
respectively.
The execution order for Log Analysis is:
1. Calculate the dependencies between flows, which gives Flow1 → Flow2.
2. Start Flow1.
2.1 Initialize the "import" flowStep and construct Job1.
2.2 Submit the "import" Job1 to Hadoop.
3. Start Flow2.
3.1 Initialize the "minute and second statistics" flowSteps and construct
Job2 and Job3.
3.2 Submit Job2 and Job3 to Hadoop.
The complete code is given in the Appendix.
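The dependency-driven ordering described above can be sketched in plain Java.
This is only an illustration under stated assumptions: FlowScheduleSketch is a
hypothetical class that applies Kahn's algorithm to a graph of flow names, and
it is not the CascadeConnector implementation. The names Flow1 and Flow2 come
from the Log Analysis example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a flow dependency graph (not Cascading code). */
final class FlowScheduleSketch {
    // flow name -> flows that read a file written by it
    private final Map<String, List<String>> successors = new LinkedHashMap<>();
    // flow name -> number of upstream flows it waits for
    private final Map<String, Integer> pending = new LinkedHashMap<>();

    void addFlow(String name) {
        successors.putIfAbsent(name, new ArrayList<>());
        pending.putIfAbsent(name, 0);
    }

    /** Declares that 'to' reads a file written by 'from'. */
    void addDependency(String from, String to) {
        addFlow(from);
        addFlow(to);
        successors.get(from).add(to);
        pending.merge(to, 1, Integer::sum);
    }

    /** Returns one valid execution order using Kahn's algorithm. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(pending);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey()); // no incoming edges: can start at once
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String flow = ready.remove();
            order.add(flow);
            for (String next : successors.get(flow)) {
                // a flow becomes ready when all of its inputs are available
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("flow graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        FlowScheduleSketch graph = new FlowScheduleSketch();
        graph.addFlow("Flow2");                  // constructed first
        graph.addDependency("Flow1", "Flow2");   // but depends on Flow1
        System.out.println(graph.executionOrder()); // prints [Flow1, Flow2]
    }
}
```

Flow2 is registered before Flow1 on purpose, to show that the execution order
follows the dependencies rather than the construction order.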
Figure 2: Workflow Sample: Log Analysis.
4 Cascading on Starfish
4.1 Change to new Hadoop API
Cascading is currently based on the Hadoop Old-API. Since Starfish only
works with the New-API, the first task is to connect these heterogeneous
systems. Herodotos worked on supporting the Hadoop Old-API in Starfish,
while I worked on replacing the Old-API in Cascading with the New-API.
Although the Hadoop community recommends the new API and provides some
upgrade advice [11], the translation still took considerable effort. One
reason is the complexity of the system (40K lines of code). To make the
change work, we sacrificed some advanced features such as S3fs,
TemplateTap, ZipSplit, Stats reports and Strategy. In the end, we provide a
revised version of Cascading that only uses the Hadoop New-API. In the
meantime, Herodotos has recently updated Starfish to support the Old-API.
This report, however, only considers the New-API version of Cascading.
4.2 Cascading Profiler
First, we need to decide when to capture the profiles. Since the modified
Cascading uses the Hadoop New-API, the place to enable the Profiler is the
same as for a single MapReduce job. We choose the return point of
blockTillCompleteOrStopped in cascading.flow.FlowStepJob to collect the job
execution files when a job completes. When all jobs have finished and the
execution files have been collected, we build a profile graph that
represents the dataflow dependencies among the jobs. In order to build the
job DAG, we flatten the hierarchy of Cascade and Flows. As we saw before,
the Log Analysis workflow has two dependent Flows and eventually submits
three MapReduce jobs to Hadoop. Figure 3 shows the original workflow in
Cascading and the translated job graph in Starfish. We propose the
following algorithm to build the job DAG.
Algorithm 1 Build Job DAG Pseudo-Code
procedure BuildJobDAG(flowGraph)
    for flow ∈ flowGraph do    ▷ Iterate over all flows
        for flowStep ∈ flow.flowStepGraph do    ▷ Add the job vertices
            Create the jobVertex from the flowStep
        end for
        for edge ∈ flow.flowStepGraph.edgeSet do    ▷ Add the job edges within a flow
            Create the corresponding edge in the jobGraph
        end for
    end for
    for flowEdge ∈ flowGraph.edgeSet do    ▷ Iterate over all flow edges (source → target)
        sourceFlowSteps ← flowEdge.sourceFlow.getLeafFlowSteps
        targetFlowSteps ← flowEdge.targetFlow.getRootFlowSteps
        for sourceFS ∈ sourceFlowSteps do
            for targetFS ∈ targetFlowSteps do
                Create the job edge from corresponding source to target
            end for
        end for
    end for
end procedure
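Algorithm 1 can be illustrated with a small, self-contained Java sketch. The
types below are assumptions made for illustration only: JobDagSketch and its
nested Flow class are hypothetical, flow steps are represented by plain
strings, and none of this is the Starfish or Cascading implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class JobDagSketch {
    /** Hypothetical minimal flow: named steps plus edges between them. */
    static final class Flow {
        final List<String> steps;
        final List<String[]> stepEdges = new ArrayList<>();

        Flow(String... steps) {
            this.steps = Arrays.asList(steps);
        }

        Flow edge(String from, String to) {
            stepEdges.add(new String[] {from, to});
            return this;
        }

        /** Steps without outgoing edges inside this flow. */
        List<String> leafSteps() {
            return stepsNeverAt(0);
        }

        /** Steps without incoming edges inside this flow. */
        List<String> rootSteps() {
            return stepsNeverAt(1);
        }

        // Steps that never occur at the given end (0 = source, 1 = target).
        private List<String> stepsNeverAt(int end) {
            List<String> result = new ArrayList<>(steps);
            for (String[] e : stepEdges) {
                result.remove(e[end]);
            }
            return result;
        }
    }

    /** Returns job -> successor jobs; each flowEdge is {sourceFlow, targetFlow}. */
    static Map<String, Set<String>> buildJobDag(List<Flow> flows, List<Flow[]> flowEdges) {
        Map<String, Set<String>> jobGraph = new LinkedHashMap<>();
        for (Flow flow : flows) {
            for (String step : flow.steps) {          // add the job vertices
                jobGraph.put(step, new LinkedHashSet<>());
            }
            for (String[] e : flow.stepEdges) {       // add job edges within a flow
                jobGraph.get(e[0]).add(e[1]);
            }
        }
        for (Flow[] flowEdge : flowEdges) {           // connect leaves to roots across flows
            for (String source : flowEdge[0].leafSteps()) {
                for (String target : flowEdge[1].rootSteps()) {
                    jobGraph.get(source).add(target);
                }
            }
        }
        return jobGraph;
    }

    public static void main(String[] args) {
        Flow importFlow = new Flow("import");
        Flow statsFlow = new Flow("secondCount", "minuteCount");
        Map<String, Set<String>> dag = buildJobDag(
                Arrays.asList(importFlow, statsFlow),
                Collections.singletonList(new Flow[] {importFlow, statsFlow}));
        // prints {import=[secondCount, minuteCount], secondCount=[], minuteCount=[]}
        System.out.println(dag);
    }
}
```

The example in main mirrors the Log Analysis workflow: one flow with a single
import step feeds a second flow whose two steps have no edges between them, so
both become successors of the import job.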
4.3 Cascading What-if Engine and Optimizer
The What-if Engine predicts the behavior of a workflow W. To achieve
that, the DAG profiles, the data model, the cluster and the DAG
configurations are given as parameters. Building the Conf Graph follows the
same idea as building the Job Graph. We intercept the return point of
initializeNewJobMap in cascading.cascade, where we process what-if requests
and exit the program afterwards.
Figure 3: Log Analysis. (a) Cascading representation; (b) dataflow
translation.
For the Cascading optimizer, I make use of the dataflow optimizer and
supply the interfaces it requires. When running the Optimizer, we keep the
default Optimizer mode, crossjob + dynamic.
4.4 Program Interface
Cascading on Starfish is simple to use. Users do not need to change their
source code or import a new package. Some typical use cases are listed
below.
    profile cascading jar loganalysis.jar
Profiler: collects task profiles while running a workflow and generates the
profile files in PROFILER_OUTPUT_DIR.
    execute cascading jar loganalysis.jar
Execute: only runs the program, without collecting profiles.
    analyze cascading details workflow_20111017205527
Analyze: lists basic or detailed statistical information regarding all jobs
found in PROFILER_OUTPUT_DIR.
    whatif details workflow_20111018014128 cascading jar loganalysis.jar
What-if Engine: asks a hypothetical question about a particular workflow
and returns the predicted profiles.
    optimize run workflow_20111018014128 cascading jar loganalysis.jar
Optimizer: executes a MapReduce workflow using the configuration parameter
settings automatically suggested by the Cost-based Optimizer.
5 Evaluation
5.1 Experiment Environment
In the experimental evaluation, we used Hadoop clusters running on Amazon
EC2. The setup is as follows.
• Cluster Type: 10 m1.large nodes. Each node has 7.5 GB of memory,
2 virtual cores and 850 GB of storage, and is configured to run 3 map
tasks and 2 reduce tasks concurrently.
• Hadoop Version: 0.20.203.
• Cascading Version: modified v1.2.4 (uses the Hadoop New-API).
• Data Sets: 20 GB TPC-H [10], 10 GB random text, 10 GB page graphs for
PageRank, 5 GB paper-author pairs.
• Optimizer Type: cross-job and dynamic.
5.2 Description of Data-parallel Workflows
We evaluate the end-to-end performance of optimizers on seven representa-
tive workflows used in different domains.
Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF calculates
weights representing the importance of each word to a document in a
collection. The workflow contains three jobs: 1) Count the total terms in
each document. 2) Calculate the number of documents containing each term.
3) Calculate tf * idf. Job 2 depends on Job 1 and Job 3 depends on Job 2.
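As a rough in-memory illustration of the three TF-IDF stages, the following
Java sketch computes the weights for a tiny collection. It is not the
Cascading workflow used in the experiments. The class name TfIdfSketch is
hypothetical, and the weighting tf = count / totalTerms and
idf = ln(N / documentFrequency) is an assumption, since the report does not
state the exact formula.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {
    /** Returns document id -> (term -> tf * idf weight). */
    static Map<String, Map<String, Double>> weights(Map<String, List<String>> docs) {
        // Stage 1: total terms and per-term counts in each document.
        Map<String, Integer> totalTerms = new HashMap<>();
        Map<String, Map<String, Integer>> termCounts = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            totalTerms.put(doc.getKey(), doc.getValue().size());
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc.getValue()) {
                counts.merge(term, 1, Integer::sum);
            }
            termCounts.put(doc.getKey(), counts);
        }

        // Stage 2: number of documents containing each term.
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Integer> counts : termCounts.values()) {
            for (String term : counts.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        // Stage 3: tf * idf (assumed weighting, see the note above).
        Map<String, Map<String, Double>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : termCounts.entrySet()) {
            Map<String, Double> docWeights = new HashMap<>();
            for (Map.Entry<String, Integer> tc : doc.getValue().entrySet()) {
                double tf = (double) tc.getValue() / totalTerms.get(doc.getKey());
                double idf = Math.log((double) docs.size() / docFreq.get(tc.getKey()));
                docWeights.put(tc.getKey(), tf * idf);
            }
            result.put(doc.getKey(), docWeights);
        }
        return result;
    }
}
```

Each stage only needs the output of the previous one, which is why the
workflow forms a linear chain of three jobs.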
Top 20 Coauthor Pairs: Suppose you have a large dataset of papers and
authors. You want to find the most frequent coauthor pairs and to see
whether there is any correlation between being collaborative and being a
prolific author. It takes three jobs: 1) Group authors by paper.
2) Generate co-authorship pairs (map) and count them (reduce). 3) Sort by
count. Job 2 depends on Job 1 and Job 3 depends on Job 2.
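The three coauthor jobs can be mimicked in memory with a short Java sketch.
The class CoauthorPairsSketch, the "A|B" pair encoding and the tie-breaking by
pair name are assumptions made for this illustration; the actual workflow runs
as three MapReduce jobs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class CoauthorPairsSketch {
    /** Each input element is {paperId, author}; returns the k most frequent pairs. */
    static List<Map.Entry<String, Integer>> topPairs(List<String[]> paperAuthor, int k) {
        // Job 1 analogue: group authors by paper (sorted, without duplicates).
        Map<String, TreeSet<String>> byPaper = new HashMap<>();
        for (String[] pa : paperAuthor) {
            byPaper.computeIfAbsent(pa[0], p -> new TreeSet<>()).add(pa[1]);
        }

        // Job 2 analogue: emit every unordered pair once per paper and count it.
        Map<String, Integer> counts = new HashMap<>();
        for (TreeSet<String> authors : byPaper.values()) {
            List<String> sorted = new ArrayList<>(authors);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    counts.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }

        // Job 3 analogue: sort by count (descending), then by pair name.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return ranked.subList(0, Math.min(k, ranked.size()));
    }
}
```

Calling topPairs(records, 20) would correspond to the "Top 20" query in the
workflow name.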
Log Analysis: Given an Apache log, parse it with a specified format,
count the requests per minute and per second separately, and dump each
result. There are three jobs: 1) Import and parse the raw log. 2) Group by
minute and count. 3) Group by second and count. Jobs 2 and 3 both depend
on Job 1.
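The per-second and per-minute counts differ only in the bucket width used to
truncate each timestamp. The sketch below shows the idea in plain Java, using
the same truncation expression, ts - (ts % (60 * 1000)), that appears in the
appendix listing. ArrivalRateSketch and countPerBucket are hypothetical names,
and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

final class ArrivalRateSketch {
    /** Counts timestamps (in milliseconds) per bucket of the given width. */
    static Map<Long, Integer> countPerBucket(long[] timestampsMs, long bucketMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            long bucket = ts - (ts % bucketMs); // truncate to the bucket start
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] requests = {1000L, 1500L, 61000L};
        System.out.println(countPerBucket(requests, 1000L));      // prints {1000=2, 61000=1}
        System.out.println(countPerBucket(requests, 60 * 1000L)); // prints {0=2, 60000=1}
    }
}
```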
PageRank: The goal is to find the ranking of web pages. The algorithm
can be implemented as an iterative workflow containing two jobs: 1) Join
two datasets on the pageId. 2) Calculate the new ranking of each web page.
Job 2 depends on Job 1.
TPC-H: We use the TPC-H benchmark as a representative example of a
complex SQL query. Query 3 is implemented as a four-job workflow: 1) Join
the order and customer tables, with filter conditions on each table.
2) Join lineitem with the result table of Job 1. 3) Calculate the volume
by discount. 4) Get the sum after grouping by some keys. Job 2 depends on
Job 1, Job 3 depends on Job 2, and Job 4 depends on Job 3.
HTML Parser and WordCount: The workflow processes a collection of web
source pages. It has three jobs: 1) Parse the raw data with an HTML SAX
parser. 2) Count the number of words per URL. 3) Aggregate the total word
count. Job 2 depends on Job 1 and Job 3 depends on Job 2.
User-defined Partition: It splits the dataset into three parts by key
range. Some statistics are collected on each part. It runs as three jobs,
each of which is responsible for one part of the dataset. There is no
dependency between the three jobs, which means they can run in parallel.
The source code for the experiments is available in the Starfish
repository.
5.3 Speedup with Starfish Optimizer
Figure 4 shows the timeline for the TPC-H Query 3 workflow. With the
profiler enabled, the workflow takes 20% more time. The cross-job optimizer
finally yields a 1.3x speedup. Figure 5 shows the speedup for each of the
other six workflows. The optimizer is effective for most workflows, the
only exception being the user-defined partition. One possible reason is
that this workflow generates three parallel jobs, which compete with each
other for the limited cluster resources (30 available map slots and 20
available reduce slots).
Figure 4: Running TPC-H Query 3 with no Optimizer, with the Profiler, and
with the Optimizer.
Figure 5: Speedup with Starfish Optimizer. (a) Log Analysis; (b) Coauthor
Pairs; (c) PageRank; (d) User-defined Partition; (e) TF-IDF; (f) HTML
Parser and WordCount.
5.4 Overhead on Profiler
Figure 6 shows the profiling overhead, measured by comparing against the
same job run with profiling turned off. On average, profiling consumes 20%
of the running time.
5.5 Compare with Pig
We are interested in comparing different workflow frameworks on the same
datasets. We ran the identical workflows written by Harold Lim. Figure 7
shows that Pig clearly outperforms Cascading, even when Cascading is
optimized. We see several possible reasons.
• Cascading does not support Combiners. One article [4] discusses
hand-rolled join optimizations.
• Pig performs many optimizations at the physical and logical layers,
while Cascading's planner is not well optimized. In the user-defined
partition workflow, Pig needs only one MapReduce job while Cascading
generates three jobs.
• Cascading always uses its custom InputFormat and InputSplit, called
MultiInputFormat and MultiInputSplit, even for a single job or a single
input source.
• Cascading's CoGroup() join is not meant to be used with large data
files.
• Using RegexSplit to parse files into tuples is not efficient.
• Compression is disabled.
Figure 6: Profiling overhead. (a) Log Analysis; (b) Coauthor Pairs;
(c) PageRank; (d) User-defined Partition; (e) TF-IDF; (f) HTML Parser and
WordCount.
Figure 7: Pig versus Cascading on performance.
6 Conclusion
Cascading aims to help developers build powerful applications quickly and
simply, through a well-reasoned API, without needing to think in MapRe-
duce, while leaving the heavy lifting of data distribution, replication, dis-
tributed process management, and liveness to Hadoop.
With the Starfish Optimizer, we can speed up the original Cascading
programs by 20% to 200% without modifying any source code. The experiments
also show that, although Cascading and Pig express workflows with similar
constructs, their results differ markedly: Pig performs much better than
Cascading in most cases.
Considering code size, learning cost and performance, we recommend Pig
for simple queries, as it is more suitable and performant. We also note
that Cascalog [3], a data processing and querying library for Clojure, is
another choice for writing workflows on Hadoop.
Cascading 2.0 [1] is about to be released and is expected to improve
performance considerably and to fix bugs of the previous version. As
future work, once Starfish supports the old API, we can import the latest
version of Cascading and rerun the experiments.
7 Acknowledgement
I would like to thank Herodotos Herodotou, the lead contributor of Starfish,
who gave me so much help with the system design and Hadoop's internal
mechanisms. This report could not have been done without him. I also want
to thank Harold Lim, who gave me support on the benchmarks.
I thank Professor Shivnath Babu for his help in supervising this work and
for holding meetings for us to exchange ideas.
References
[1] Cascading 2.0 Early Access. http://www.cascading.org/2011/10/cascading-20-early-access.html.
[2] Cascading. http://www.cascading.org/.
[3] Cascalog. https://github.com/nathanmarz/cascalog.
[4] Pseudo Combiners in Cascading. http://blog.rapleaf.com/dev/2010/06/10/pseudo-combiners-in-cascading/.
[5] Tips for Optimizing Cascading Flows. http://nathanmarz.com/blog/tips-for-optimizing-cascading-flows.html.
[6] Apache Hadoop. http://hadoop.apache.org/.
[7] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu.
Starfish: A Self-tuning System for Big Data Analytics. In CIDR, 2011.
[8] Hive. http://hadoop.apache.org/hive/.
[9] Pig. http://hadoop.apache.org/pig/.
[10] TPC. TPC Benchmark H Standard Specification, 2009. http://www.tpc.org/tpch/spec/tpch2.9.0.pdf.
[11] Upgrading to the New Map Reduce API. http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api.
8 Appendix
8.1 Complete Source Code of Log Analysis
Listing 1: LogAnalysis.java
package loganalysis;

import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.util.*;

import cascading.cascade.*;
import cascading.flow.*;
import cascading.operation.aggregator.Count;
import cascading.operation.expression.ExpressionFunction;
import cascading.operation.regex.RegexParser;
import cascading.operation.text.DateParser;
import cascading.pipe.*;
import cascading.scheme.TextLine;
import cascading.tap.*;
import cascading.tuple.Fields;

public class LogAnalysis extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // copy the Hadoop parameters into the Cascading properties
        Properties properties = new Properties();
        Iterator<Map.Entry<String, String>> iter = getConf().iterator();
        while (iter.hasNext()) {
            Map.Entry<String, String> entry = iter.next();
            properties.put(entry.getKey(), entry.getValue());
        }

        FlowConnector.setApplicationJarClass(properties, LogAnalysis.class);
        FlowConnector flowConnector = new FlowConnector(properties);
        CascadeConnector cascadeConnector = new CascadeConnector();

        String inputPath = args[0];
        String logsPath = args[1] + "/logs/";
        String arrivalRatePath = args[1] + "/arrivalrate/";
        String arrivalRateSecPath = arrivalRatePath + "sec";
        String arrivalRateMinPath = arrivalRatePath + "min";

        // create an assembly to import an Apache log file and store on DFS
        // declares: "ip", "time", "method", "event", "status", "size"
        Fields apacheFields = new Fields("ip", "time", "method", "event",
                "status", "size");
        String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
        int[] apacheGroups = { 1, 2, 3, 4, 5, 6 };
        RegexParser parser = new RegexParser(apacheFields, apacheRegex,
                apacheGroups);
        Pipe importPipe = new Each("import", new Fields("line"), parser);

        // create a tap to read the raw log from the default filesystem
        Tap logTap = new Hfs(new TextLine(), inputPath);
        // create a tap to read/write from the default filesystem
        Tap parsedLogTap = new Hfs(apacheFields, logsPath);

        // connect the assembly to source and sink taps
        Flow importLogFlow = flowConnector.connect(logTap, parsedLogTap,
                importPipe);

        // create an assembly to parse out the time field into a timestamp
        // then count the number of requests per second and per minute

        // apply a text parser to create a timestamp with 'second' granularity
        // declares field "ts"
        DateParser dateParser = new DateParser(new Fields("ts"),
                "dd/MMM/yyyy:HH:mm:ss Z");
        Pipe tsPipe = new Each("arrival rate", new Fields("time"), dateParser,
                Fields.RESULTS);

        // name the per second assembly and split on tsPipe
        Pipe tsCountPipe = new Pipe("tsCount", tsPipe);
        tsCountPipe = new GroupBy(tsCountPipe, new Fields("ts"));
        tsCountPipe = new Every(tsCountPipe, Fields.GROUP, new Count());

        // apply expression to create a timestamp with 'minute' granularity
        // declares field "tm"
        Pipe tmPipe = new Each(tsPipe, new ExpressionFunction(new Fields("tm"),
                "ts - (ts % (60 * 1000))", long.class));

        // name the per minute assembly and split on tmPipe
        Pipe tmCountPipe = new Pipe("tmCount", tmPipe);
        tmCountPipe = new GroupBy(tmCountPipe, new Fields("tm"));
        tmCountPipe = new Every(tmCountPipe, Fields.GROUP, new Count());

        // create taps to write the results to the default filesystem
        Tap tsSinkTap = new Hfs(new TextLine(), arrivalRateSecPath, true);
        Tap tmSinkTap = new Hfs(new TextLine(), arrivalRateMinPath, true);

        // a convenience method for binding taps and pipes, order is significant
        Map<String, Tap> sinks = Cascades.tapsMap(Pipe.pipes(tsCountPipe,
                tmCountPipe), Tap.taps(tsSinkTap, tmSinkTap));

        // connect the assembly to the source and sink taps
        Flow arrivalRateFlow = flowConnector.connect(parsedLogTap, sinks,
                tsCountPipe, tmCountPipe);

        // optionally print out the arrivalRateFlow to a graph file for import
        // into a graphics package
        // arrivalRateFlow.writeDOT("arrivalrate.dot");

        // connect the flows by their dependencies, order is not significant
        Cascade cascade = cascadeConnector.connect(importLogFlow,
                arrivalRateFlow);

        // execute the cascade, which in turn executes each flow in dependency
        // order
        cascade.complete();
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new LogAnalysis(), args);
        System.exit(res);
    }
}