SlideShare ist ein Scribd-Unternehmen logo
1 von 16
Cascading on Flink
Fabian Hueske
@fhueske
What is Cascading?
“Cascading is the proven application
development platform for building data
applications on Hadoop.”
(www.cascading.org)
 Java API for large-scale batch processing
 Programs are specified as data flows
• pipes, taps, flow, cascade, …
• each, groupBy, every, coGroup, merge, …
 Originally for Hadoop MapReduce
• Compiled to workflows of Hadoop MapReduce jobs
 Open Source (AL2)
• Developed by Concurrent
2
Why Cascading?
 Vastly simplified API compared to pure MR API
• Reuse of code, connecting flows, …
 Automatic translation to MR jobs
• Minimizes number of MR jobs
 Rock-solid execution due to Hadoop MapReduce
 More APIs have been put on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluent (fluent Java API)
 Runs in many production settings
• Twitter, Soundcloud, Etsy, Airbnb, …
3
Cascading Example
4
 Compute TF-IDF scores for a set of documents
• TF-IDF: Term-Frequency / Inverted-Document-Frequency
• Used for weighting the relevance of terms in search engines
 Building this against the MapReduce API is painful
Example taken from docs.cascading.org/impatient
Cascading 3.0
 Released in June 2015
 A new planner
• Execution backend can be changed
 Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs
5
Why Cascading on Flink?
 Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory on- & off-heap
• Efficient in-memory & out-of-core operators
• Sorting and hashing on binary data
• No tuning for robust operation (OOME, GC)
 YARN integration
6
Cascading on Flink released
 Available on Github
• Apache License V2
 Depends on
• Cascading 3.1 WIP
• Flink 0.10-SNAPSHOT
• Will be fixed to next releases of Cascading and Flink
 Check Github for details:
http://github.com/dataartisans/cascading-flink
7
Executing Cascading on Flink
 Cascading programs are translated into Flink
programs
 Execution leverages all runtime features
• Memory-safe execution
• In-memory operators
• Pipelining
• Native serializers & binary comparators
(if program provides data types)
 Use Flink’s regular execution clients
8
Current limitations
 HashJoin only supported as InnerJoin
• HashJoin can be replaced by CoGroup
 Support will be added once Flink supports
hash-based outer joins
• This is work in progress
9
How to run Cascading on Flink
 No binaries available yet 
• Clone the repository
• And build it (mvn –DskipTests clean install)
 Add the cascading-flink Maven dependency to your
Cascading project
 Change just one line of code in your Cascading program
• Replace Hadoop2MR1FlowConnector by FlinkConnector
• Do not change any application logic (except replacing HashJoin
for non-InnerJoins)
 Execute Cascading program as regular Flink program
 Detailed instructions on Github
10
Example: TF-IDF
 Taken from “Cascading for the impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin
11http://docs.cascading.org/impatient
TF-IDF on MapReduce
 Cascading on MapReduce translates the
TF-IDF program to 9 MapReduce jobs
 Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS
12
TF-IDF on Flink
 Cascading on Flink translates the TF-
IDF job into one Flink job
13
TF-IDF on Flink
 Shuffle is pipelined
 Intermediate results are not
written to or read from HDFS
14
TF-IDF: MR vs. Flink
 8 worker node
• 8 CPUs, 30GB RAM, 2 local SSDs
 Hadoop 2.7.1 (YARN, HDFS, MapReduce)
 Flink 0.10-SNAPSHOT
 80GB data (intermediate data larger)
15
Cascading on Flink -> 3:24h
Cascading on MapReduce -> 8:33h
Conclusion
 Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes
 Apache Flink’s runtime is very versatile
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache Samoa (incubating)
• + Flink’s own APIs and libraries…
16

Weitere ähnliche Inhalte

Was ist angesagt?

Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
confluent
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
viirya
 

Was ist angesagt? (20)

Migrating pipelines into Docker
Migrating pipelines into DockerMigrating pipelines into Docker
Migrating pipelines into Docker
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft Library
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
 
HBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsHBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbms
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 
Spark optimization
Spark optimizationSpark optimization
Spark optimization
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 

Andere mochten auch

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
Data flow in the data center
Data flow in the data centerData flow in the data center
Data flow in the data center
Adam Cataldo
 
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Keith Kraus
 

Andere mochten auch (20)

Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache Twill
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Streaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & ElasticsearchStreaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & Elasticsearch
 
Building Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on HadoopBuilding Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on Hadoop
 
Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.
 
Building Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and CascadingBuilding Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and Cascading
 
Data flow in the data center
Data flow in the data centerData flow in the data center
Data flow in the data center
 
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
 
Workshop Consul .- Service Discovery & Failure Detection
Workshop Consul .- Service Discovery & Failure DetectionWorkshop Consul .- Service Discovery & Failure Detection
Workshop Consul .- Service Discovery & Failure Detection
 

Ähnlich wie Overview of Cascading 3.0 on Apache Flink

Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Flink Forward
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
DataWorks Summit
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
DataWorks Summit
 

Ähnlich wie Overview of Cascading 3.0 on Apache Flink (20)

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 

Mehr von Cascading

Mehr von Cascading (11)

Predicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingPredicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using Cascading
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
 
Cascading 2015 User Survey Results
Cascading 2015 User Survey ResultsCascading 2015 User Survey Results
Cascading 2015 User Survey Results
 
Breathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoopBreathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoop
 
How To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with DrivenHow To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with Driven
 
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
 
The Cascading (big) data application framework - AndrĂŠ Keple, Sr. Engineer, C...
The Cascading (big) data application framework - AndrĂŠ Keple, Sr. Engineer, C...The Cascading (big) data application framework - AndrĂŠ Keple, Sr. Engineer, C...
The Cascading (big) data application framework - AndrĂŠ Keple, Sr. Engineer, C...
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearn
 
Introduction to Cascading
Introduction to Cascading  Introduction to Cascading
Introduction to Cascading
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with Cascading
 

KĂźrzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

KĂźrzlich hochgeladen (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Overview of Cascading 3.0 on Apache Flink

  • 1. Cascading on Flink Fabian Hueske @fhueske
  • 2. What is Cascading? “Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)  Java API for large-scale batch processing  Programs are specified as data flows • pipes, taps, flow, cascade, … • each, groupBy, every, coGroup, merge, …  Originally for Hadoop MapReduce • Compiled to workflows of Hadoop MapReduce jobs  Open Source (AL2) • Developed by Concurrent 2
  • 3. Why Cascading?  Vastly simplified API compared to pure MR API • Reuse of code, connecting flows, …  Automatic translation to MR jobs • Minimizes number of MR jobs  Rock-solid execution due to Hadoop MapReduce  More APIs have been put on top • Scalding (Scala) by Twitter • Cascalog (Datalog) • Lingual (SQL) • Fluent (fluent Java API)  Runs in many production settings • Twitter, Soundcloud, Etsy, Airbnb, … 3
  • 4. Cascading Example 4  Compute TF-IDF scores for a set of documents • TF-IDF: Term-Frequency / Inverted-Document-Frequency • Used for weighting the relevance of terms in search engines  Building this against the MapReduce API is painful Example taken from docs.cascading.org/impatient
  • 5. Cascading 3.0  Released in June 2015  A new planner • Execution backend can be changed  Apache Tez executor • Cascading programs are compiled to Tez jobs • No identity mappers • No writing to HDFS between jobs 5
  • 6. Why Cascading on Flink?  Flink’s unique batch processing runtime • Pipelined data exchange • Actively managed memory on- & off-heap • Efficient in-memory & out-of-core operators • Sorting and hashing on binary data • No tuning for robust operation (OOME, GC)  YARN integration 6
  • 7. Cascading on Flink released  Available on Github • Apache License V2  Depends on • Cascading 3.1 WIP • Flink 0.10-SNAPSHOT • Will be fixed to next releases of Cascading and Flink  Check Github for details: http://github.com/dataartisans/cascading-flink 7
  • 8. Executing Cascading on Flink  Cascading programs are translated into Flink programs  Execution leverages all runtime features • Memory-safe execution • In-memory operators • Pipelining • Native serializers & binary comparators (if program provides data types)  Use Flink’s regular execution clients 8
  • 9. Current limitations  HashJoin only supported as InnerJoin • HashJoin can be replaced by CoGroup  Support will be added once Flink supports hash-based outer joins • This is work in progress 9
  • 10. How to run Cascading on Flink  No binaries available yet  • Clone the repository • And build it (mvn –DskipTests clean install)  Add the cascading-flink Maven dependency to your Cascading project  Change just one line of code in your Cascading program • Replace Hadoop2MR1FlowConnector by FlinkConnector • Do not change any application logic (except replacing HashJoin for non-InnerJoins)  Execute Cascading program as regular Flink program  Detailed instructions on Github 10
  • 11. Example: TF-IDF  Taken from “Cascading for the impatient” • 2 CoGroup, 7 GroupBy, 1 HashJoin 11http://docs.cascading.org/impatient
  • 12. TF-IDF on MapReduce  Cascading on MapReduce translates the TF-IDF program to 9 MapReduce jobs  Each job • Reads data from HDFS • Applies a Map function • Shuffles the data over the network • Sorts the data • Applies a Reduce function • Writes the data to HDFS 12
  • 13. TF-IDF on Flink  Cascading on Flink translates the TF- IDF job into one Flink job 13
  • 14. TF-IDF on Flink  Shuffle is pipelined  Intermediate results are not written to or read from HDFS 14
  • 15. TF-IDF: MR vs. Flink  8 worker node • 8 CPUs, 30GB RAM, 2 local SSDs  Hadoop 2.7.1 (YARN, HDFS, MapReduce)  Flink 0.10-SNAPSHOT  80GB data (intermediate data larger) 15 Cascading on Flink -> 3:24h Cascading on MapReduce -> 8:33h
  • 16. Conclusion  Executing Cascading jobs on Apache Flink • Improves runtime • Reduces parameter tuning and avoids failures • Virtually no code changes  Apache Flink’s runtime is very versatile • Apache Hadoop MR • Apache Storm • Google Dataflow • Apache Samoa (incubating) • + Flink’s own APIs and libraries… 16