2. What is Cascading?
“Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
▪ Java API for large-scale batch processing
▪ Programs are specified as data flows
• pipes, taps, flow, cascade, …
• each, groupBy, every, coGroup, merge, …
▪ Originally for Hadoop MapReduce
• Compiled to workflows of Hadoop MapReduce jobs
▪ Open source (Apache License 2.0)
• Developed by Concurrent
3. Why Cascading?
▪ Vastly simplified API compared to pure MR API
• Reuse of code, connecting flows, …
▪ Automatic translation to MR jobs
• Minimizes number of MR jobs
▪ Rock-solid execution due to Hadoop MapReduce
▪ More APIs have been put on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluent (fluent Java API)
▪ Runs in many production settings
• Twitter, Soundcloud, Etsy, Airbnb, …
4. Cascading Example
▪ Compute TF-IDF scores for a set of documents
• TF-IDF: Term Frequency / Inverse Document Frequency
• Used for weighting the relevance of terms in search engines
▪ Building this against the MapReduce API is painful
Example taken from docs.cascading.org/impatient
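To make the TF-IDF computation concrete, here is a minimal plain-Java sketch that does not use Cascading at all. The weighting `tf * ln(nDocs / (1 + df))` is one common smoothed variant and is an assumption here, as are the class name, the method names, and the tiny sample corpus.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {

    /** Term frequency: how often each term occurs in one document. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Document frequency: in how many documents each term occurs at least once. */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Assumed weighting variant: tf * ln(nDocs / (1 + df)). */
    static double tfIdf(int tf, int df, int nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    public static void main(String[] args) {
        // Hypothetical, already tokenized sample documents.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("flink", "runs", "cascading"),
            Arrays.asList("cascading", "runs", "on", "hadoop"),
            Arrays.asList("hadoop", "runs", "mapreduce"),
            Arrays.asList("tez", "runs", "cascading", "flows"));

        Map<String, Integer> df = documentFrequencies(docs);
        for (int i = 0; i < docs.size(); i++) {
            for (Map.Entry<String, Integer> e : termFrequencies(docs.get(i)).entrySet()) {
                double score = tfIdf(e.getValue(), df.get(e.getKey()), docs.size());
                System.out.printf("doc %d  %-10s %.4f%n", i, e.getKey(), score);
            }
        }
    }
}
```

With this variant, a term that occurs in every document (here "runs") gets a slightly negative weight, while rare terms score highest. In a distributed setting each of the counting steps becomes a grouping operation, which is why the data flow version needs several GroupBy and CoGroup pipes.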
5. Cascading 3.0
▪ Released in June 2015
▪ A new planner
• Execution backend can be changed
▪ Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs
6. Why Cascading on Flink?
▪ Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory on- & off-heap
• Efficient in-memory & out-of-core operators
• Sorting and hashing on binary data
• No tuning for robust operation (OOME, GC)
▪ YARN integration
7. Cascading on Flink released
▪ Available on GitHub
• Apache License V2
▪ Depends on
• Cascading 3.1 WIP
• Flink 0.10-SNAPSHOT
• Will be pinned to the next releases of Cascading and Flink
▪ Check GitHub for details:
http://github.com/dataartisans/cascading-flink
8. Executing Cascading on Flink
▪ Cascading programs are translated into Flink programs
▪ Execution leverages all runtime features
• Memory-safe execution
• In-memory operators
• Pipelining
• Native serializers & binary comparators (if program provides data types)
▪ Use Flink’s regular execution clients
9. Current limitations
▪ HashJoin is only supported as InnerJoin
• HashJoin can be replaced by CoGroup
▪ Support will be added once Flink supports hash-based outer joins
• This is work in progress
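The reason a CoGroup can stand in for an outer HashJoin is that grouping both inputs by key makes it easy to see which keys have no partner. The following plain-Java sketch illustrates that idea for a left outer join. It does not use Cascading or Flink classes; the class and method names and the sample records are hypothetical. In Cascading itself the corresponding step would be a CoGroup with an outer joiner such as `LeftJoin`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CoGroupLeftJoinSketch {

    /** Groups (key, value) pairs by key; stands in for the grouping a co-group performs. */
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] pair : pairs) {
            groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return groups;
    }

    /** Left outer join: every left record is kept, a missing right side is padded with null. */
    static List<String> leftOuterJoin(List<String[]> left, List<String[]> right) {
        Map<String, List<String>> leftGroups = groupByKey(left);
        Map<String, List<String>> rightGroups = groupByKey(right);
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : leftGroups.entrySet()) {
            // No matching right group: join against a single null value instead.
            List<String> matches = rightGroups.getOrDefault(
                group.getKey(), Collections.<String>singletonList(null));
            for (String l : group.getValue()) {
                for (String r : matches) {
                    joined.add(group.getKey() + "," + l + "," + r);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(new String[] {"a", "1"}, new String[] {"b", "2"});
        List<String[]> right = Arrays.asList(new String[] {"a", "x"}, new String[] {"c", "z"});
        System.out.println(leftOuterJoin(left, right)); // prints [a,1,x, b,2,null]
    }
}
```

The trade-off is that a co-group shuffles and groups both inputs, whereas a hash join only builds a hash table on one side, so the replacement may be slower for small build sides.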
10. How to run Cascading on Flink
▪ No binaries available yet
• Clone the repository
• And build it (mvn -DskipTests clean install)
▪ Add the cascading-flink Maven dependency to your Cascading project
▪ Change just one line of code in your Cascading program
• Replace Hadoop2MR1FlowConnector with FlinkConnector
• Do not change any application logic (except replacing HashJoin for non-InnerJoins)
▪ Execute Cascading program as regular Flink program
▪ Detailed instructions on GitHub
11. Example: TF-IDF
▪ Taken from “Cascading for the impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin
http://docs.cascading.org/impatient
12. TF-IDF on MapReduce
▪ Cascading on MapReduce translates the TF-IDF program to 9 MapReduce jobs
▪ Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS
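The per-job steps listed above can be sketched as a tiny in-memory simulation in plain Java. This is only an illustration of the order of the steps, not Hadoop code: the input list stands in for the HDFS read, the returned map for the HDFS write, and word counting is an arbitrary example job. All names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceStepsSketch {

    /** Map step: emit one (word, 1) pair per token of an input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle and sort: collect all values per key, with keys in sorted order. */
    static TreeMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the steps of a single job in order. */
    static Map<String, Integer> runJob(List<String> input) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {        // read (stand-in for HDFS)
            mapped.addAll(map(line));      // map
        }
        TreeMap<String, List<Integer>> grouped = shuffleAndSort(mapped); // shuffle + sort
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));       // reduce
        }
        return output;                     // write (stand-in for HDFS)
    }

    public static void main(String[] args) {
        System.out.println(runJob(Arrays.asList("flink runs cascading", "cascading runs on hadoop")));
        // prints {cascading=2, flink=1, hadoop=1, on=1, runs=2}
    }
}
```

With 9 chained jobs, the read and write at both ends of `runJob` happen 9 times against HDFS, which is the overhead the following slides contrast with a single pipelined Flink job.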
13. TF-IDF on Flink
▪ Cascading on Flink translates the TF-IDF job into one Flink job
14. TF-IDF on Flink
▪ Shuffle is pipelined
▪ Intermediate results are not written to or read from HDFS
15. TF-IDF: MR vs. Flink
▪ 8 worker nodes
• 8 CPUs, 30 GB RAM, 2 local SSDs
▪ Hadoop 2.7.1 (YARN, HDFS, MapReduce)
▪ Flink 0.10-SNAPSHOT
▪ 80 GB data (intermediate data larger)
Cascading on Flink -> 3:24h
Cascading on MapReduce -> 8:33h
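As a rough reading of these two numbers, and assuming both are wall-clock runtimes in hours:minutes, the speedup works out as follows:

```latex
\[
\text{speedup}
  = \frac{8\,\mathrm{h}\ 33\,\mathrm{min}}{3\,\mathrm{h}\ 24\,\mathrm{min}}
  = \frac{513\ \mathrm{min}}{204\ \mathrm{min}}
  \approx 2.5,
\qquad
1 - \frac{204}{513} \approx 0.60
\]
```

So on this setup the Flink run is about 2.5 times faster, which is about 60% less runtime. This is a single measurement on one workload and cluster.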
16. Conclusion
▪ Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes
▪ Apache Flink’s runtime is very versatile
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache Samoa (incubating)
• + Flink’s own APIs and libraries…