2. What is Cascading?
“Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
Java API for large-scale batch processing
Programs are specified as data flows
• pipes, taps, flow, cascade, …
• each, groupBy, every, coGroup, merge, …
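As a loose plain-Java analogy (not the Cascading API itself), the each/groupBy/every model maps roughly onto per-tuple functions, grouping, and per-group aggregation; a word-count sketch:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class DataflowAnalogy {
    // Rough analogy of a Cascading pipe assembly in plain Java streams:
    //   each    (per-tuple function)    -> filter/map
    //   groupBy (group on a key)        -> Collectors.groupingBy
    //   every   (per-group aggregator)  -> Collectors.counting
    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())          // each: drop empty tuples
                .collect(Collectors.groupingBy(     // groupBy: key on the word
                        w -> w,
                        Collectors.counting()));    // every: count per group
    }
}
```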
Originally for Hadoop MapReduce
• Compiled to workflows of Hadoop MapReduce jobs
Open Source (AL2)
• Developed by Concurrent
3. Why Cascading?
Vastly simplified API compared to pure MR API
• Reuse of code, connecting flows, …
Automatic translation to MR jobs
• Minimizes number of MR jobs
Rock-solid execution due to Hadoop MapReduce
More APIs have been put on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluent (fluent Java API)
Runs in many production settings
• Twitter, Soundcloud, Etsy, Airbnb, …
4. Cascading Example
Compute TF-IDF scores for a set of documents
• TF-IDF: Term Frequency × Inverse Document Frequency
• Used for weighting the relevance of terms in search engines
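The score itself is simple; a self-contained plain-Java sketch of the standard tf × idf formula (independent of Cascading; the unsmoothed idf variant used here is one common choice among several):

```java
import java.util.List;

public class TfIdf {
    // Term frequency: occurrences of the term / total terms in the document
    static double tf(List<String> doc, String term) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: log(N / number of documents containing the term)
    static double idf(List<List<String>> docs, String term) {
        long containing = docs.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) docs.size() / containing);
    }

    // TF-IDF weight of a term in one document, relative to the whole corpus
    static double tfIdf(List<String> doc, List<List<String>> docs, String term) {
        return tf(doc, term) * idf(docs, term);
    }
}
```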
Building this against the MapReduce API is painful
Example taken from docs.cascading.org/impatient
5. Cascading 3.0
Released in June 2015
A new planner
• Execution backend can be changed
Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs
6. Why Cascading on Flink?
Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory on- & off-heap
• Efficient in-memory & out-of-core operators
• Sorting and hashing on binary data
• No tuning needed for robust operation (no OOM errors, low GC pressure)
YARN integration
7. Cascading on Flink released
Available on Github
• Apache License V2
Depends on
• Cascading 3.1 WIP
• Flink 0.10-SNAPSHOT
• Will be pinned to the next releases of Cascading and Flink
Check Github for details:
http://github.com/dataartisans/cascading-flink
8. Executing Cascading on Flink
Cascading programs are translated into Flink programs
Execution leverages all runtime features
• Memory-safe execution
• In-memory operators
• Pipelining
• Native serializers & binary comparators
(if program provides data types)
Use Flink’s regular execution clients
9. Current limitations
HashJoin only supported as InnerJoin
• HashJoin can be replaced by CoGroup
Support will be added once Flink supports hash-based outer joins
• This is work in progress
10. How to run Cascading on Flink
No binaries available yet
• Clone the repository
• And build it (mvn -DskipTests clean install)
Add the cascading-flink Maven dependency to your
Cascading project
Change just one line of code in your Cascading program
• Replace Hadoop2MR1FlowConnector with FlinkConnector
• Do not change any application logic (except replacing HashJoin with CoGroup for non-inner joins)
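The one-line change can be sketched as follows (the FlinkConnector package path and constructor shown here are assumptions based on the project layout; check the Github README for the exact coordinates, and note this fragment needs the Cascading and cascading-flink dependencies to compile):

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import com.dataartisans.flink.cascading.FlinkConnector;

public class RunOnFlink {
    public static void main(String[] args) {
        Properties properties = new Properties();

        // Before: FlowConnector connector = new Hadoop2MR1FlowConnector(properties);
        // After: the only line that changes
        FlowConnector connector = new FlinkConnector(properties);

        FlowDef flowDef = FlowDef.flowDef().setName("tfidf");
        // ... attach the same taps and pipes as before; application logic is unchanged

        connector.connect(flowDef).complete();
    }
}
```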
Execute Cascading program as regular Flink program
Detailed instructions on Github
11. Example: TF-IDF
Taken from “Cascading for the impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin
http://docs.cascading.org/impatient
12. TF-IDF on MapReduce
Cascading on MapReduce translates the TF-IDF program into 9 MapReduce jobs
Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS
13. TF-IDF on Flink
Cascading on Flink translates the TF-IDF program into one Flink job
14. TF-IDF on Flink
Shuffle is pipelined
Intermediate results are not written to or read from HDFS
15. TF-IDF: MR vs. Flink
8 worker nodes
• 8 CPUs, 30GB RAM, 2 local SSDs
Hadoop 2.7.1 (YARN, HDFS, MapReduce)
Flink 0.10-SNAPSHOT
80 GB of input data (intermediate data is larger)
Cascading on Flink -> 3:24 h
Cascading on MapReduce -> 8:33 h (Flink is roughly 2.5x faster)
16. Conclusion
Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes
Apache Flink’s runtime is very versatile
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache Samoa (incubating)
• + Flink’s own APIs and libraries…