Unified, Efficient, and Portable Data Processing with Apache Beam
1. Abstract
Unbounded, unordered, global scale datasets are increasingly common in day-to-day business, and consumers of
these datasets have detailed requirements for latency, cost, and completeness. Apache Beam defines a new data
processing programming model that evolved from more than a decade of experience building Big Data infrastructure
within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow.
Apache Beam handles both batch and streaming use cases, offering a powerful, unified model. It neatly separates
properties of the data from run-time characteristics, allowing pipelines to be portable across multiple run-time
environments, both open source (including Apache Apex, Apache Flink, Apache Gearpump, and Apache Spark) and
proprietary. Finally, Beam's model enables newer optimizations, like dynamic work rebalancing and autoscaling,
resulting in efficient execution.
This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in its powerful
programming model. We'll show how Beam unifies batch and streaming use cases, and demonstrate efficient execution in
real-world scenarios. Finally, we'll demonstrate pipeline portability across Apache Apex, Apache Flink, Apache Spark,
and Google Cloud Dataflow in a live setting.
This session is a Technical (Intermediate) talk in our IoT and Streaming track. It focuses on Apache Flink, Apache
Kafka, Apache Spark, Cloud and is geared towards Architect, Data Scientist, Developer / Engineer audiences.
2. Unified, Efficient and
Portable Data Processing
with Apache Beam
Davor Bonaci
PMC Chair, Apache Beam
Software Engineer, Google Inc.
3. Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming
algorithms using one unified API
● Cleanly separates data processing logic
from runtime requirements
● Supports execution on multiple distributed
processing runtime environments
4. The evolution of Apache Beam
[Diagram: Google infrastructure lineage leading to Apache Beam — MapReduce, Colossus, BigTable, Dremel, Megastore, Flume, Spanner, PubSub, MillWheel, and Cloud Dataflow]
5. Agenda
1. Expressing data-parallel pipelines with the Beam model
2. The Beam vision for portability
3. Parallel and portable pipelines in practice
6. Apache Beam is
a unified programming model
designed to provide
portable data processing pipelines
(efficient too)
9. The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
12. The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
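To make the snippet's semantics concrete, here is a stdlib-only sketch (no Beam dependency; class and method names are illustrative, not Beam API): each (key, value, event-timestamp) element is assigned to a 2-minute fixed window by truncating its event timestamp, and values are then summed per key and window, which is what Window.into(FixedWindows...) followed by Sum.integersPerKey() computes.

```java
import java.util.*;

// Stdlib-only illustration of 2-minute fixed windowing + per-key sums.
public class FixedWindowSum {
    static final long WINDOW_MILLIS = 2 * 60 * 1000; // 2-minute windows

    // Each event is {String key, Integer value, Long eventTimeMillis}.
    // Returns "key@windowStart" -> sum of values in that window.
    static Map<String, Integer> sumPerKeyAndWindow(List<Object[]> events) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Object[] e : events) {
            String key = (String) e[0];
            int value = (Integer) e[1];
            long ts = (Long) e[2];
            // Truncate the event timestamp to its window's start boundary.
            long windowStart = ts - (ts % WINDOW_MILLIS);
            sums.merge(key + "@" + windowStart, value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Object[]> events = Arrays.asList(
            new Object[] {"alice", 5, 0L},       // window [0, 120000)
            new Object[] {"alice", 3, 60_000L},  // same window
            new Object[] {"alice", 7, 120_000L}, // next window
            new Object[] {"bob",   2, 30_000L});
        // alice@0 -> 8, alice@120000 -> 7, bob@0 -> 2
        System.out.println(sumPerKeyAndWindow(events));
    }
}
```

The key point the slide makes: windows are defined in *event time* (the element's own timestamp), not in the time the element happens to arrive at the pipeline.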
18. Customizing What / Where / When / How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
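The difference between the "Streaming" and "Streaming + Accumulation" modes is the "How" question: when a window fires more than once (for example, early speculative results followed by late data), each pane can either accumulate everything seen so far or report only the new elements. A stdlib-only sketch of those two behaviors (illustrative names, not Beam API):

```java
import java.util.*;

// Stdlib-only illustration of accumulating vs. discarding window panes.
public class PaneModes {
    // "firings" groups the elements of one window by successive firings.
    // accumulating=true  -> each pane sums everything seen so far
    // accumulating=false -> each pane sums only the newly arrived elements
    static List<Integer> panes(List<List<Integer>> firings, boolean accumulating) {
        List<Integer> out = new ArrayList<>();
        int runningTotal = 0;
        for (List<Integer> firing : firings) {
            int paneSum = firing.stream().mapToInt(Integer::intValue).sum();
            runningTotal += paneSum;
            out.add(accumulating ? runningTotal : paneSum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> firings =
            Arrays.asList(Arrays.asList(3, 4), Arrays.asList(5), Arrays.asList(8));
        System.out.println(panes(firings, true));  // [7, 12, 20]
        System.out.println(panes(firings, false)); // [7, 5, 8]
    }
}
```

Accumulating panes refine the previous answer in place; discarding panes emit deltas that a downstream consumer must combine itself.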
19. The Beam vision for portability
"Write once, run anywhere"
20. Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: language-specific SDKs (Language A, B, C) construct pipelines against the Beam Model, which any of several runners (Runner 1, 2, 3) can execute]
21. Beam Vision: as of April 2017
● Beam’s Java SDK runs on multiple runtime environments, including:
○ Apache Apex
○ Apache Spark
○ Apache Flink
○ Google Cloud Dataflow
○ [in development] Apache Gearpump
● Cross-language infrastructure is in progress.
○ Beam’s Python SDK currently runs on Google Cloud Dataflow
[Diagram: Java and Python pipeline construction on top of the Beam Model, with Fn Runners for Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Cloud Dataflow]
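In practice, switching runners is a matter of build dependencies and a pipeline option, not code changes. A hedged sketch in the spirit of the Beam WordCount quickstart (profile and flag names follow that quickstart and may differ across Beam versions):

```shell
# Run the same WordCount pipeline on different runners.
# Locally, on the direct runner:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Pdirect-runner -Dexec.args="--inputFile=pom.xml --output=counts"

# On Apache Flink, only the profile and --runner flag change:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Pflink-runner -Dexec.args="--runner=FlinkRunner --inputFile=pom.xml --output=counts"
```

The pipeline code itself is untouched; the runner is chosen at launch time through PipelineOptions.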
22. Example Beam Runners
Apache Spark
● Open-source cluster-computing framework
● Large ecosystem of APIs and tools
● Runs on premise or in the cloud
Apache Flink
● Open-source distributed data processing engine
● High-throughput and low-latency stream processing
● Runs on premise or in the cloud
Google Cloud Dataflow
● Fully-managed service for batch and stream data processing
● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
23. How do you build an abstraction layer?
[Diagram: Apache Spark, Cloud Dataflow, and Apache Flink as backends beneath an unspecified (???) abstraction layer]
30. Getting Started with Apache Beam
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Extensive documentation
31. Related sessions
Hadoop Summit San Jose 2016
● Apache Beam: A Unified Model for Batch and Streaming Data Processing
○ Speaker: Davor Bonaci
Hadoop Summit Melbourne 2016
● Stream/Batch processing portable across on-premise and Cloud with Apache Beam
○ Speaker: Eric Anderson
DataWorks Summit San Jose 2017
● Realizing the promise of portable data processing with Apache Beam
○ Speaker: Davor Bonaci
● Stateful processing of massive out-of-order streams with Apache Beam
○ Speaker: Kenneth Knowles
32. Apache Beam is
a unified programming model
designed to provide
portable data processing pipelines
(efficient too)