Cascading on Starfish
                                Fei Dong
                             Duke University
                           dongfei@cs.duke.edu

                            December 10, 2011


1     Introduction
Hadoop [6] is a software framework installed on a cluster to permit large scale
distributed data analysis. It provides the robust Hadoop Distributed File
System (HDFS) as well as a Java-based API that allows parallel processing
across the nodes of the cluster. Programs employ a Map/Reduce execution
engine which functions as a fault-tolerant distributed computing system over
large data sets.
    In addition to Hadoop, which is a top-level Apache project, there are sub-
projects related to Hadoop workflows, such as Hive [8], a data warehouse
framework used for ad hoc querying (with an SQL-like query language);
Pig [9], a high-level data-flow language and execution framework whose
compiler produces sequences of Map/Reduce programs for execution within
Hadoop; and Cascading [2], an API for defining and executing fault-tolerant
data processing workflows on a Hadoop cluster. All of these projects simplify
some of the work for developers, allowing them to write more traditional
procedural or SQL-style code that, under the covers, creates a sequence of
Hadoop jobs. In this report, we focus on Cascading as the main data-parallel
workflow choice.

1.1   Cascading Introduction
Cascading is a Java application framework that allows you to more easily
write scripts to access and manipulate data inside Hadoop. There are a
number of key features provided by this API:

    • Dependency-Based ’Topological Scheduler’ and MapReduce Planning -
      Two key components of the Cascading API are its ability to schedule
      the invocation of flows based on their dependencies, with the execution
      order being independent of construction order, often allowing for
      concurrent invocation of portions of flows and cascades; and its ability
      to intelligently convert the steps of the various flows into MapReduce
      invocations against the Hadoop cluster.

    • Event Notification - The various steps of the flow can perform notifi-
      cations via callbacks, allowing for the host application to report and
      respond to the progress of the data processing.

    • Scriptable - The Cascading API has scriptable interfaces for Jython,
      Groovy, and JRuby.

    Although Cascading provides the above benefits, we still need to consider
the balance between performance and productivity when using Cascading.
Marz [5] gives some rules for optimizing Cascading flows, and experienced
Cascading users can gain some performance improvement by following those
high-level principles. One interesting question is whether there are ways to
improve workflow performance without expert knowledge; in other words, we
want to optimize workflows at the physical level. In Starfish [7], the authors
demonstrate the power of self-tuning jobs on Hadoop, and Herodotou has
successfully applied this optimization technology to Pig. This report discusses
auto-optimization of Cascading with the help of Starfish.


2    Terminology
First, we introduce some concepts widely used in Cascading; a short code
sketch following the list shows how they fit together.

    • Stream: the data input and output.

    • Tuple: a stream is composed of a series of Tuples. Tuples are sets of
      ordered data.

    • Tap: an abstraction on top of Hadoop files. Source - a source tap is
      read from and acted upon; actions on source taps result in pipes.
      Sink - a sink tap is a location to be written to. A sink tap can later
      serve as a source in the same script.

    • Operations: define what to do on the data, e.g., Each(), Group(),
      CoGroup(), Every().

    • Pipe: ties Operations together. When an operation is executed upon a
      Tap, the result is a Pipe; in other words, a flow is a pipe with data
      flowing through it. Pipes can use other Pipes as input, thereby
      wrapping themselves into a series of operations.

    • Filter: records pass through it and useless records are removed, e.g.,
      RegexFilter(), And(), Or().

    • Aggregator: a function applied after a group operation, e.g., Count(),
      Average(), Min(), Max().

    • Step: a logical unit in a Flow. It represents a Map-only or MapReduce
      job.

    • Flow: a combination of a Source, a Sink, and Pipes.

    • Cascade: a series of Flows.
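
    To make these concepts concrete, here is a minimal sketch (not part of the
report's experiments) that ties them together using the same Cascading 1.2
classes as Listing 1 in the Appendix; the paths and the regular expression are
placeholders, and the imports are the same as in Listing 1.

    // A stream of text lines flows from a source Tap, through a Pipe assembly of
    // Operations (an Each with a RegexParser, a GroupBy, an Every with a Count
    // Aggregator), into a sink Tap; the FlowConnector turns the assembly into a Flow.
    Fields wordField = new Fields("word");
    RegexParser firstToken = new RegexParser(wordField, "^([^ ]*).*$"); // keep first token of each line

    Pipe assembly = new Each("wordcount", new Fields("line"), firstToken); // Operation applied per Tuple
    assembly = new GroupBy(assembly, wordField);                           // group the stream by "word"
    assembly = new Every(assembly, Fields.GROUP, new Count());             // Aggregator after the grouping

    Tap source = new Hfs(new TextLine(), "input/path");                    // source Tap: read from
    Tap sink = new Hfs(new TextLine(), "output/path", true);               // sink Tap: written to

    Flow flow = new FlowConnector(new Properties()).connect(source, sink, assembly);
    flow.complete(); // the planner runs the Flow as one or more MapReduce steps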


3    Cascading Structure




                  Figure 1: A Typical Cascading Structure.

    Figure 1 shows the Cascading structure. The top level is the Cascade,
which is composed of several flows. Each flow defines a source Tap, a sink
Tap, and Pipes. Note that one flow can have multiple pipes performing data
operations such as filtering, grouping, and aggregation.
    Internally, a Cascade is constructed through the CascadeConnector class,
which builds an internal graph that makes each Flow a ’vertex’ and each file
an ’edge’. A topological walk on this graph touches each vertex in order
of its dependencies. When a vertex has all its incoming edges available, it
will be scheduled on the cluster. Figure 2 gives an example whose goal is to
compute per-second and per-minute request counts from Apache logs. The
dataflow is represented as a graph. The first step imports and parses the
source data; it is followed by two steps that process the "second" and
"minute" counts respectively.
    The execution order for Log Analysis is:
1. Calculate the dependency between flows, which gives Flow1 → Flow2.
2. Start Flow1:
       2.1 initialize the "import" flowStep and construct Job1;
       2.2 submit the "import" Job1 to Hadoop.
3. Start Flow2:
       3.1 initialize the "minute and second statistics" flowSteps and construct
Job2 and Job3;
       3.2 submit Job2 and Job3 to Hadoop.
    The complete code is attached in the Appendix.
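
    The dependency Flow1 → Flow2 above is never declared explicitly: the
CascadeConnector infers it because the sink Tap of the import flow is also the
source Tap of the arrival-rate flow. The wiring, condensed from Listing 1:

    // Two flows share the tap parsedLogTap (sink of the first, source of the second),
    // so the Cascade infers importLogFlow -> arrivalRateFlow and runs them in that order.
    Tap logTap = new Hfs(new TextLine(), inputPath);       // raw Apache logs
    Tap parsedLogTap = new Hfs(apacheFields, logsPath);    // sink of Flow1, source of Flow2

    Flow importLogFlow = flowConnector.connect(logTap, parsedLogTap, importPipe);
    Flow arrivalRateFlow = flowConnector.connect(parsedLogTap, sinks,
                                                 tsCountPipe, tmCountPipe);

    Cascade cascade = new CascadeConnector().connect(importLogFlow, arrivalRateFlow);
    cascade.complete(); // submits Job1, then Job2 and Job3, respecting the dependency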

Figure 2: Workflow Sample: Log Analysis.


4     Cascading on Starfish
4.1   Change to new Hadoop API
We notice that the current Cascading release is based on the Hadoop Old-API.
Since Starfish only works with the New-API, the first task is to connect these
heterogeneous systems. Herodotos worked on supporting the Hadoop Old-API in
Starfish, while I worked on replacing the Old-API in Cascading with the New-API.
Although the Hadoop community recommends the new API and provides some upgrade
advice [11], the translation still took considerable effort. One reason is the
complexity of the system (40K lines of code); we sacrificed some advanced
features such as S3fs, TemplateTap, ZipSplit, Stats reports, and Strategy to
make the change work. In the end, we provide a revised version of Cascading
that uses only the Hadoop New-API. In the meantime, Herodotos recently updated
Starfish to support the Old-API, but this report considers only the New-API
version of Cascading.
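
    As a rough illustration of the kind of change involved (a generic Hadoop
example, not code from the modified Cascading; MyJob is a hypothetical driver
class), a job configured with the Old-API goes through JobConf and JobClient,
while the New-API goes through org.apache.hadoop.mapreduce.Job:

    // Old-API: job configuration and submission via mapred.JobConf / JobClient.
    org.apache.hadoop.mapred.JobConf conf = new org.apache.hadoop.mapred.JobConf(MyJob.class);
    conf.setJobName("example");
    org.apache.hadoop.mapred.JobClient.runJob(conf);

    // New-API: job configuration and submission via mapreduce.Job.
    org.apache.hadoop.mapreduce.Job job =
        new org.apache.hadoop.mapreduce.Job(new org.apache.hadoop.conf.Configuration(), "example");
    job.setJarByClass(MyJob.class);
    job.waitForCompletion(true);

Mapper, Reducer, InputFormat, and OutputFormat classes moved packages in the
same way, which is part of why the change touched so much of Cascading's code.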

4.2   Cascading Profiler
First, we need to decide when to capture the profiles. Since the modified
Cascading uses the Hadoop New-API, the place to enable the Profiler is the
same as for a single MapReduce job. We choose the return point of
blockTillCompleteOrStopped in cascading.flow.FlowStepJob to collect the job
execution files when a job completes. When all jobs have finished and the
execution files have been collected, we build a profile graph that represents
the dataflow dependencies among the jobs. In order to build the job DAG, we
decouple the hierarchy of Cascades and Flows. As seen before, the Log Analysis
workflow has two dependent Flows and ultimately submits three MapReduce jobs
to Hadoop. Figure 3 shows the original workflow in Cascading and the translated
JobGraph in Starfish. We propose the following algorithm to build the Job DAG.

Algorithm 1 Build Job DAG Pseudo-Code
 1: procedure BuildJobDAG(flowGraph)
 2:    for flow ∈ flowGraph do                          ▷ Iterate over all flows
 3:       for flowStep ∈ flow.flowStepGraph do          ▷ Add the job vertices
 4:          Create the jobVertex from the flowStep
 5:       end for
 6:       for edge ∈ flow.flowStepGraph.edgeSet do      ▷ Add the job edges within a flow
 7:          Create the corresponding edge in the jobGraph
 8:       end for
 9:    end for
10:    for flowEdge ∈ flowGraph.edgeSet do              ▷ Iterate over all flow edges (source → target)
11:       sourceFlowSteps ← flowEdge.sourceFlow.getLeafFlowSteps
12:       targetFlowSteps ← flowEdge.targetFlow.getRootFlowSteps
13:       for sourceFS ∈ sourceFlowSteps do
14:          for targetFS ∈ targetFlowSteps do
15:             Create the job edge from the corresponding source to target
16:          end for
17:       end for
18:    end for
19: end procedure
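
    A self-contained Java rendering of Algorithm 1 is sketched below. The small
FlowGraph, Flow, FlowStep, and JobGraph classes are hypothetical stand-ins for
the Cascading/Starfish internals (they are not the real classes); only the
wiring logic (one job vertex per flow step, intra-flow edges copied directly,
cross-flow edges drawn from the leaf steps of the source flow to the root steps
of the target flow) follows the pseudocode.

    import java.util.*;

    // Hypothetical rendering of Algorithm 1 with minimal stand-in graph classes.
    final class JobDagBuilder {

      static final class FlowStep { final String name; FlowStep(String n) { name = n; } }

      static final class Flow {
        final List<FlowStep> steps = new ArrayList<>();
        final List<FlowStep[]> stepEdges = new ArrayList<>();   // [source, target] pairs
        List<FlowStep> getLeafFlowSteps() {                      // steps with no outgoing edge
          List<FlowStep> leaves = new ArrayList<>(steps);
          for (FlowStep[] e : stepEdges) leaves.remove(e[0]);
          return leaves;
        }
        List<FlowStep> getRootFlowSteps() {                      // steps with no incoming edge
          List<FlowStep> roots = new ArrayList<>(steps);
          for (FlowStep[] e : stepEdges) roots.remove(e[1]);
          return roots;
        }
      }

      static final class FlowGraph {
        final List<Flow> flows = new ArrayList<>();
        final List<Flow[]> flowEdges = new ArrayList<>();        // [sourceFlow, targetFlow] pairs
      }

      static final class JobGraph {
        final Set<FlowStep> vertices = new LinkedHashSet<>();
        final List<FlowStep[]> edges = new ArrayList<>();
        void addVertex(FlowStep v) { vertices.add(v); }
        void addEdge(FlowStep s, FlowStep t) { edges.add(new FlowStep[] { s, t }); }
      }

      static JobGraph buildJobDAG(FlowGraph flowGraph) {
        JobGraph jobGraph = new JobGraph();
        for (Flow flow : flowGraph.flows) {
          for (FlowStep step : flow.steps)    jobGraph.addVertex(step);       // job vertices
          for (FlowStep[] e : flow.stepEdges) jobGraph.addEdge(e[0], e[1]);   // intra-flow edges
        }
        for (Flow[] flowEdge : flowGraph.flowEdges) {                         // cross-flow edges
          for (FlowStep source : flowEdge[0].getLeafFlowSteps())
            for (FlowStep target : flowEdge[1].getRootFlowSteps())
              jobGraph.addEdge(source, target);
        }
        return jobGraph;
      }
    }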



4.3     Cascading What-if Engine and Optimizer
The What-if Engine predicts the behavior of a workflow W. To achieve this, the
DAG profiles, data model, cluster description, and DAG configurations are given
as parameters. Building the configuration graph follows the same idea as
building the job graph. We capture the return point of initializeNewJobMap in
cascading.cascade, where we process the what-if requests and exit the program
afterwards.

            Figure 3: Log Analysis: (a) Cascading representation; (b) dataflow translation.


   For the Cascading optimizer, I make use of the Starfish data-flow optimizer
and wire it to the related interface. When running the Optimizer, we keep the
default optimizer mode, crossjob + dynamic.

4.4   Program Interface
The usage of Cascading on Starfish is simple and user-friendly. Users do not
need to change their source code or import a new package. Some example
invocations are listed below.
   profile cascading jar loganalysis.jar
   Profiler: collect task profiles while running a workflow and generate the
profile files in PROFILER_OUTPUT_DIR.

   execute cascading jar loganalysis.jar
   Execute: only run the program, without collecting profiles.

   analyze cascading details workflow_20111017205527
   Analyze: list basic or detailed statistical information regarding all jobs
found in PROFILER_OUTPUT_DIR.

   whatif details workflow_20111018014128 cascading jar loganalysis.jar
   What-if Engine: ask a hypothetical question about a particular workflow and
return the predicted profiles.

   optimize run workflow_20111018014128 cascading jar loganalysis.jar
   Optimizer: execute a MapReduce workflow using the configuration parameter
settings automatically suggested by the cost-based Optimizer.



5     Evaluation
5.1   Experiment Environment
In the experimental evaluation, we used Hadoop clusters running on Amazon
EC2. The detailed setup is as follows.

    • Cluster type: 10 m1.large nodes. Each node has 7.5 GB memory,
      2 virtual cores, and 850 GB storage, and is configured with 3 concurrent
      map tasks and 2 concurrent reduce tasks.
    • Hadoop version: 0.20.203.
    • Cascading version: modified v1.2.4 (using the Hadoop New-API).
    • Data sets: 20 GB TPC-H [10], 10 GB random text, 10 GB page graphs for
      PageRank, 5 GB paper-author pairs.
    • Optimizer type: cross-job and dynamic.

5.2   Description of Data-parallel Workflows
We evaluate the end-to-end performance of the optimizers on seven representa-
tive workflows used in different domains.
    Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF
calculates weights representing the importance of each word to a document in
a collection. The workflow contains three jobs: 1) count the total terms in
each document; 2) calculate the number of documents containing each term;
3) calculate tf * idf. Job 2 depends on Job 1 and Job 3 depends on Job 2.
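    The report does not fix the exact weighting formula, but one common variant
that the three jobs above can compute is, for a term t and a document d in a
collection of N documents,

        tf-idf(t, d) = tf(t, d) * log( N / df(t) ),

where tf(t, d) is the frequency of t in d (normalized by the per-document term
totals from Job 1) and df(t) is the number of documents containing t from
Job 2; Job 3 combines the two.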
    Top 20 Coauthor Pairs: Suppose you have a large dataset of papers
and authors. You want to find the top coauthor pairs and see whether there
is any correlation between being collaborative and being a prolific author.
It takes three jobs: 1) group authors by paper; 2) generate co-authorship
pairs (map) and count them (reduce); 3) sort by count. Job 2 depends on
Job 1 and Job 3 depends on Job 2.
    Log Analysis: Given an Apache log, parse it with the specified format,
compute the per-minute and per-second counts separately, and dump each
result. There are three jobs: 1) import and parse the raw log; 2) group by
minute and compute counts; 3) group by second and compute counts. Job 2
and Job 3 depend on Job 1.
    PageRank: The goal is to find the ranking of web pages. The algorithm
can be implemented as an iterative workflow containing two jobs: 1) join the
two datasets on pageId; 2) calculate the new ranking of each webpage. Job 2
depends on Job 1.
    TPC-H: We use the TPC-H benchmark as a representative example of a
complex SQL query. Query 3 is implemented as a four-job workflow: 1) join
the order and customer tables, with filter conditions on each table; 2) join
lineitem with the result of Job 1; 3) calculate the volume by discount; 4)
compute the sum after grouping by some keys. Job 2 depends on Job 1, Job 3
depends on Job 2, and Job 4 depends on Job 3.
    HTML Parser and WordCount: The workflow processes a collection
of web source pages. It has three jobs: 1) parse the raw data with an HTML
SAX parser; 2) count the number of words per URL; 3) aggregate the total
word count. Job 2 depends on Job 1 and Job 3 depends on Job 2.
    User-defined Partition: It splits the dataset into three parts by key
range, and some statistics are collected on each part. It runs as three jobs,
each responsible for one part of the dataset. There is no dependency among
those three jobs, which means the three jobs can run in parallel, as sketched
below.
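    A hypothetical sketch of this structure (not the benchmark code; the key
ranges, paths, and filters are made up) expresses the three parts as three
independent flows handed to one CascadeConnector. Because no flow's sink is
another flow's source, the Cascade finds no inter-flow dependencies and may
schedule the flows concurrently:

    // Three independent flows over the same input, each keeping a different
    // key range via a RegexFilter and writing to its own sink.
    Tap source = new Hfs(new TextLine(), "input/records");

    Pipe partA = new Each("partA", new Fields("line"), new RegexFilter("^[a-i].*"));
    Pipe partB = new Each("partB", new Fields("line"), new RegexFilter("^[j-r].*"));
    Pipe partC = new Each("partC", new Fields("line"), new RegexFilter("^[s-z].*"));

    FlowConnector fc = new FlowConnector(new Properties());
    Flow flowA = fc.connect(source, new Hfs(new TextLine(), "out/a", true), partA);
    Flow flowB = fc.connect(source, new Hfs(new TextLine(), "out/b", true), partB);
    Flow flowC = fc.connect(source, new Hfs(new TextLine(), "out/c", true), partC);

    Cascade cascade = new CascadeConnector().connect(flowA, flowB, flowC);
    cascade.complete(); // no inter-flow edges, so the three flows can run in parallel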
    The source code for the experiment workflows is submitted in the Starfish
repository.

5.3   Speedup with Starfish Optimizer
Figure 4 shows the timeline for the TPC-H Query 3 workflow. With the Profiler
enabled, the workflow takes about 20% longer. The final cross-job optimizer
yields a 1.3x speedup. Figure 5 analyzes the speedup for six workflows. The
optimizer is effective for most of the workflows, with the only exception being
the user-defined partition. One possible reason is that this workflow generates
three jobs in parallel, which compete with each other for the limited cluster
resources (30 available map slots and 20 available reduce slots).




   Figure 4: Running TPC-H Query 3 with no Optimizer, with the Profiler, and with the Optimizer.




                  Figure 5: Speedup with the Starfish Optimizer: (a) Log Analysis; (b) Coauthor
                  Pairs; (c) PageRank; (d) User-defined Partition; (e) TF-IDF; (f) HTML Parser
                  and Wordcount.


5.4   Overhead on Profiler
Figure 6 shows the profiling overhead by comparing against the same jobs run
with profiling turned off. On average, profiling consumes about 20% of the
running time.

5.5   Comparison with Pig
We are very interested in comparing different workflow frameworks on the same
datasets. We ran identical workflows written by Harold Lim. Figure 7 shows
that Pig clearly outperforms Cascading, even when Cascading is optimized. We
see several possible reasons.
   • Cascading does not support Combiners; one article [4] discusses hand-rolled
     "pseudo combiner" workarounds.
   • Pig does a lot of optimization work at the physical and logical layers,
     while Cascading does not optimize its planner well. For the user-defined
     partition, Pig produces only one MapReduce job while Cascading produces
     three jobs.
   • Cascading always uses its own custom InputFormat and InputSplit, called
     MultiInputFormat and MultiInputSplit, even for a single job with a single
     input source.
   • Cascading's CoGroup() join is not meant to be used with large data files.
   • Using RegexSplit to parse files into tuples is not efficient.
   • Compression is disabled.

                   Figure 6: Overhead of profiling: (a) Log Analysis; (b) Coauthor Pairs;
                   (c) PageRank; (d) User-defined Partition; (e) TF-IDF; (f) HTML Parser
                   and Wordcount.

                   Figure 7: Pig versus Cascading on performance.

6    Conclusion
Cascading aims to help developers build powerful applications quickly and
simply, through a well-reasoned API, without needing to think in MapReduce,
while leaving the heavy lifting of data distribution, replication, distributed
process management, and liveness to Hadoop.
    With the Starfish Optimizer, we can speed up the original Cascading programs
by 20% to 200% without modifying any source code. The experiments also show
that, although Cascading workflows resemble Pig scripts in how they express a
dataflow, the results differ markedly: Pig performs much better than Cascading
in most cases.
    Considering code size, learning cost, and performance, we recommend Pig as
the more suitable and performant choice for simple queries. We also note that
Cascalog [3], a data processing and querying library for Clojure, is another
option for writing workflows on Hadoop.
    We notice that Cascading 2.0 [1] is about to be released, promising large
performance improvements and bug fixes over the previous version. As future
work, when Starfish supports the old API, we can import the latest version of
Cascading and rerun the experiments.


7    Acknowledgement
I would like to thank Herodotos Herodotou, the lead contributor of Starfish,
who gave me a great deal of help on the system design and on Hadoop's internal
mechanisms. This report could not have been done without him. I also want to
thank Harold Lim, who supported me with the benchmarks.
    I thank Professor Shivnath Babu for his help, for supervising this work,
and for holding meetings for us to exchange ideas.


References
 [1] Cascading 2.0 Early Access. http://www.cascading.org/2011/10/cascading-20-early-access.html.
 [2] Cascading. http://www.cascading.org/.
 [3] Cascalog. https://github.com/nathanmarz/cascalog.
 [4] Pseudo Combiners in Cascading. http://blog.rapleaf.com/dev/2010/06/10/pseudo-combiners-in-cascading/.
 [5] Tips for Optimizing Cascading Flows. http://nathanmarz.com/blog/tips-for-optimizing-cascading-flows.html.
 [6] Apache Hadoop. http://hadoop.apache.org/.
 [7] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu.
     Starfish: A Self-tuning System for Big Data Analytics. In CIDR, 2011.
 [8] Hive. http://hadoop.apache.org/hive/.
 [9] Pig. http://hadoop.apache.org/pig/.
[10] TPC. TPC Benchmark H Standard Specification, 2009. http://www.tpc.org/tpch/spec/tpch2.9.0.pdf.
[11] Upgrading to the New Map Reduce API. http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api.


8        Appendix
8.1       Complete Source Code of Log Analysis

        Listing 1: LogAnalysis.java
package loganalysis;

import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.util.*;

import cascading.cascade.*;
import cascading.flow.*;
import cascading.operation.aggregator.Count;
import cascading.operation.expression.ExpressionFunction;
import cascading.operation.regex.RegexParser;
import cascading.operation.text.DateParser;
import cascading.pipe.*;
import cascading.scheme.TextLine;
import cascading.tap.*;
import cascading.tuple.Fields;

public class LogAnalysis extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // set the Hadoop parameters
    Properties properties = new Properties();
    Iterator<Map.Entry<String, String>> iter = getConf().iterator();
    while (iter.hasNext()) {
      Map.Entry<String, String> entry = iter.next();
      properties.put(entry.getKey(), entry.getValue());
    }

    FlowConnector.setApplicationJarClass(properties, LogAnalysis.class);
    FlowConnector flowConnector = new FlowConnector(properties);
    CascadeConnector cascadeConnector = new CascadeConnector();

    String inputPath = args[0];
    String logsPath = args[1] + "/logs/";
    String arrivalRatePath = args[1] + "/arrivalrate/";
    String arrivalRateSecPath = arrivalRatePath + "sec";
    String arrivalRateMinPath = arrivalRatePath + "min";

    // create an assembly to import an Apache log file and store on DFS
    // declares: "time", "method", "event", "status", "size"
    Fields apacheFields = new Fields("ip", "time", "method", "event", "status", "size");
    String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
    int[] apacheGroups = { 1, 2, 3, 4, 5, 6 };
    RegexParser parser = new RegexParser(apacheFields, apacheRegex, apacheGroups);
    Pipe importPipe = new Each("import", new Fields("line"), parser);

    // create a tap to read the raw logs from the default filesystem
    Tap logTap = new Hfs(new TextLine(), inputPath);
    // create a tap to read/write the parsed logs from the default filesystem
    Tap parsedLogTap = new Hfs(apacheFields, logsPath);

    // connect the assembly to source and sink taps
    Flow importLogFlow = flowConnector.connect(logTap, parsedLogTap, importPipe);

    // create an assembly to parse out the time field into a timestamp,
    // then count the number of requests per second and per minute

    // apply a text parser to create a timestamp with 'second' granularity
    // declares field "ts"
    DateParser dateParser = new DateParser(new Fields("ts"), "dd/MMM/yyyy:HH:mm:ss Z");
    Pipe tsPipe = new Each("arrival rate", new Fields("time"), dateParser, Fields.RESULTS);

    // name the per-second assembly and split on tsPipe
    Pipe tsCountPipe = new Pipe("tsCount", tsPipe);
    tsCountPipe = new GroupBy(tsCountPipe, new Fields("ts"));
    tsCountPipe = new Every(tsCountPipe, Fields.GROUP, new Count());

    // apply an expression to create a timestamp with 'minute' granularity
    // declares field "tm"
    Pipe tmPipe = new Each(tsPipe, new ExpressionFunction(new Fields("tm"),
        "ts - (ts % (60 * 1000))", long.class));

    // name the per-minute assembly and split on tmPipe
    Pipe tmCountPipe = new Pipe("tmCount", tmPipe);
    tmCountPipe = new GroupBy(tmCountPipe, new Fields("tm"));
    tmCountPipe = new Every(tmCountPipe, Fields.GROUP, new Count());

    // create taps to write the results to the default filesystem, using the given fields
    Tap tsSinkTap = new Hfs(new TextLine(), arrivalRateSecPath, true);
    Tap tmSinkTap = new Hfs(new TextLine(), arrivalRateMinPath, true);

    // a convenience method for binding taps and pipes; order is significant
    Map<String, Tap> sinks = Cascades.tapsMap(
        Pipe.pipes(tsCountPipe, tmCountPipe), Tap.taps(tsSinkTap, tmSinkTap));

    // connect the assembly to the source and sink taps
    Flow arrivalRateFlow = flowConnector.connect(parsedLogTap, sinks, tsCountPipe, tmCountPipe);

    // optionally print out the arrivalRateFlow to a DOT file for import into a graphics package
    // arrivalRateFlow.writeDOT("arrivalrate.dot");

    // connect the flows by their dependencies; order is not significant
    Cascade cascade = cascadeConnector.connect(importLogFlow, arrivalRateFlow);

    // execute the cascade, which in turn executes each flow in dependency order
    cascade.complete();
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new LogAnalysis(), args);
    System.exit(res);
  }
}




