Data Processing with Cascading Java API on Apache Hadoop
1. “Enterprise Data Workflows with Cascading”
• Cascading is an application framework that lets developers, data analysts, and data scientists simply build robust data analytics and data management applications on Apache Hadoop.
• Cascading is a query API and query planner for defining, sharing, and executing data processing workflows on a distributed data grid or cluster, built upon Apache Hadoop.
• Cascading greatly simplifies the complexities of Hadoop application development, job creation, and job scheduling.
• Cascading was developed to allow organizations to rapidly develop complex data processing applications.
2. Cascading API
Tap
• LFS
• DFS
• HFS
• MultiSourceTap
• MultiSinkTap
• TemplateTap
• GlobHfs
• S3fs (deprecated)
Scheme
• TextLine
• TextDelimited
• SequenceFile
• WritableSequenceFile
Sink modes (how a Tap behaves as a sink)
• SinkMode.KEEP
• SinkMode.REPLACE
• SinkMode.UPDATE
Tuple
• A single ‘row’ of data being processed
• Each column is named
• Can access data by name or position
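The name-or-position access described above can be sketched in plain Java. This `TupleSketch` class is a hypothetical illustration, not the Cascading API; in Cascading itself the `Fields` plus `TupleEntry`/`Tuple` pair provides the same two access paths.

```java
import java.util.List;

// Plain-Java sketch of tuple access by name or by position,
// mirroring what Cascading's Fields + Tuple pair provides.
public class TupleSketch {
    private final List<String> names;
    private final List<Object> values;

    TupleSketch(List<String> names, List<Object> values) {
        this.names = names;
        this.values = values;
    }

    Object get(int pos) {             // access by position
        return values.get(pos);
    }

    Object get(String name) {         // access by field name
        return values.get(names.indexOf(name));
    }

    public static void main(String[] args) {
        TupleSketch row = new TupleSketch(List.of("user", "age"), List.of("alice", 30));
        System.out.println(row.get("user")); // alice
        System.out.println(row.get(1));      // 30
    }
}
```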
How to create a Tap?
Fields fields = new Fields( "field_1", "field_2", ..., "field_n" );
TextDelimited scheme = new TextDelimited( fields, "\t" );
Hfs input = new Hfs( scheme, "path/to/hdfs_file", SinkMode.REPLACE );
3. Pipe Assemblies
• Pipe assemblies define what work should be done against a tuple stream, where during runtime
tuple streams are read from Tap sources and are written to Tap sinks.
• Pipe assemblies may have multiple sources and multiple sinks and they can define splits,
merges, and joins to manipulate how the tuple streams interact.
Assemblies
• Pipe
• Each
• GroupBy
• CoGroup
• Every
• SubAssembly
Function
[Applicable to Each operation]
• Identity Function
• Debug Function
• Sample and Limit Functions
• Insert Function
• Text Functions
• Regular Expression Operations
• Java Expression Operations
• "first-name" is a valid field name for use with Cascading, but the Java expression first-name.trim() will fail, because '-' is not legal in a Java identifier.
• Custom Functions
Flow flow = new FlowConnector( new Properties() ).connect( "flow-name", source, sink, pipe );
flow.complete();
4. Filter
[Applicable to Each operation]
• And
• Or
• Not
• Xor
• NotNull
• Null
• RegexFilter
• Custom Filter
Aggregator
[Applicable to Every operation]
• Average
• Count
• First
• Last
• Max
• Min
• Sum
• Custom Aggregator
Buffer
• It is very similar to the typical Reducer interface
• It is very useful when header or footer values need to be inserted into a grouping, or when values need to be inserted into the middle of the group values
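Because a Buffer sees the whole group at once, it can emit extra tuples before, between, or after the grouped values, which an Aggregator cannot do. A plain-Java sketch of the header/footer idea (this `HeaderFooterBuffer` class is illustrative only, not the Cascading `Buffer` interface):

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of what a Buffer enables: since the whole group is
// visible at once, values can be emitted around the group's own values.
public class HeaderFooterBuffer {

    static List<String> operate(String groupKey, List<String> groupValues) {
        List<String> out = new ArrayList<>();
        out.add("HEADER:" + groupKey);           // emitted before the group
        out.addAll(groupValues);                 // the group values themselves
        out.add("FOOTER:" + groupValues.size()); // emitted after the group
        return out;
    }

    public static void main(String[] args) {
        System.out.println(operate("user-42", List.of("login", "click", "logout")));
        // [HEADER:user-42, login, click, logout, FOOTER:3]
    }
}
```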
5. CoGroup
• InnerJoin
• OuterJoin
• LeftJoin
• RightJoin
• MixedJoin
Flow
• To create a Flow, it must be planned through the FlowConnector object.
• The connect() method creates new Flow instances from a set of source Taps, sink Taps, and a pipe assembly.
Cascades
• Groups of Flows are called Cascades
• Custom MapReduce jobs can participate in a Cascade
Flow flow = new FlowConnector(new Properties()).connect( "flow-name", source, sink, pipe );
flow.complete();
CascadeConnector cascadeConnector = new CascadeConnector();
Cascade cascade = cascadeConnector.connect(flow1, flow2, flow3);
cascade.complete();
LHS = [0,a] [1,b] [2,c]
RHS = [0,A] [2,C] [3,D]
InnerJoin: [0,a,0,A] [2,c,2,C]
OuterJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C] [null,null,3,D]
LeftJoin: [0,a,0,A] [1,b,null,null] [2,c,2,C]
RightJoin: [0,a,0,A] [2,c,2,C] [null,null,3,D]
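The four result sets above can be reproduced with a plain-Java simulation of the join semantics (this is not the Cascading CoGroup API, just a sketch of what each joiner keeps, using the same LHS/RHS tuples keyed on their first field):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of CoGroup join semantics: inner keeps only matched
// keys; keepLeft/keepRight pad unmatched tuples with nulls, giving
// left, right, and outer joins.
public class JoinSemantics {

    static List<String> join(Map<Integer, String> lhs, Map<Integer, String> rhs,
                             boolean keepLeft, boolean keepRight) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> l : lhs.entrySet()) {
            if (rhs.containsKey(l.getKey())) {
                out.add("[" + l.getKey() + "," + l.getValue() + ","
                        + l.getKey() + "," + rhs.get(l.getKey()) + "]");
            } else if (keepLeft) {
                out.add("[" + l.getKey() + "," + l.getValue() + ",null,null]");
            }
        }
        if (keepRight) {
            for (Map.Entry<Integer, String> r : rhs.entrySet()) {
                if (!lhs.containsKey(r.getKey())) {
                    out.add("[null,null," + r.getKey() + "," + r.getValue() + "]");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> lhs = new LinkedHashMap<>();
        lhs.put(0, "a"); lhs.put(1, "b"); lhs.put(2, "c");
        Map<Integer, String> rhs = new LinkedHashMap<>();
        rhs.put(0, "A"); rhs.put(2, "C"); rhs.put(3, "D");

        System.out.println("Inner: " + join(lhs, rhs, false, false));
        System.out.println("Outer: " + join(lhs, rhs, true, true));
        System.out.println("Left:  " + join(lhs, rhs, true, false));
        System.out.println("Right: " + join(lhs, rhs, false, true));
    }
}
```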
6. Flow Connector
• HadoopFlowConnector
• LocalFlowConnector
Monitoring
Implement FlowListener interface
• onStarting
• onStopping
• onCompleted
• onThrowable
Testing
• Use ClusterTestCase if you want to launch an embedded Hadoop cluster inside your TestCase
• A few validation and Hadoop helper functions are provided
• The Hadoop 0.21 testing library is not supported
• Use CascadingTestCase with JUnit
7. Field Algebra
Field sets are constant values on the Fields class and can be used in many places where a Fields instance is expected. They are:
• Fields.ALL
• Fields.RESULTS
• Fields.REPLACE
• Fields.SWAP
• Fields.ARGS
• Fields.GROUP
• Fields.VALUES
• Fields.UNKNOWN
Pipe assembly = new Each( assembly, Fields.ALL, function, Fields.RESULTS );
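What the output selector decides can be sketched in plain Java (this `FieldAlgebraSketch` class is illustrative, not the Cascading API): with Fields.ALL as the outgoing selector, the input fields and the operation's result fields are both kept; with Fields.RESULTS, only the result fields survive.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the Fields.ALL vs Fields.RESULTS output selectors:
// a function adds a new result field, and the selector decides which
// fields survive into the outgoing tuple.
public class FieldAlgebraSketch {

    static List<String> apply(List<String> inputFields, String resultField, boolean keepAll) {
        List<String> out = new ArrayList<>();
        if (keepAll) {
            out.addAll(inputFields);  // Fields.ALL: inputs plus results
        }
        out.add(resultField);         // Fields.RESULTS: results only
        return out;
    }

    public static void main(String[] args) {
        List<String> in = List.of("first", "last");
        System.out.println(apply(in, "full-name", true));  // [first, last, full-name]
        System.out.println(apply(in, "full-name", false)); // [full-name]
    }
}
```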
8. Word Counting the Hadoop Way with Cascading
As “word counting” has become the customary “Hello, World!” application for programmers who are new to Hadoop and the MapReduce paradigm, let’s look at an example in Cascading on Hadoop:
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
.setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
9. Just a few notes worth mentioning about the overall logic adopted in the preceding code listing:
• The code creates Pipe objects for the input/output of data, then uses a RegexSplitGenerator (which applies a regex that splits the text on word boundaries) inside an iterator (the Each object); this returns another Pipe (the docPipe) that splits the document text into a token stream.
• A GroupBy is defined to count the occurrences of each token, and the pipes are then connected together, as in a “cascade”, via the flowDef object.
• Finally, a DOT file is generated to depict the Cascading flow graphically.
The DOT file can be loaded into OmniGraffle or Visio, and is really helpful for troubleshooting MapReduce workflows in Cascading.
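The token-split-and-count logic that the flow above expresses can also be sketched in plain Java (no Hadoop, no Cascading) to show what it computes; the split regex mirrors the delimiter class passed to the RegexSplitGenerator:

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of what the Cascading word-count flow computes:
// split each line on the same delimiter class the RegexSplitGenerator
// uses, then group identical tokens and count them.
public class WordCountSketch {

    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split("[ \\[\\](),.]")) {
                if (!token.isEmpty()) {           // skip empty splits
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] doc = { "to be, or not to be" };
        System.out.println(count(doc)); // {be=2, not=1, or=1, to=2}
    }
}
```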