2. A lack of etiquette and manners is a huge turn-off.
KnolX Etiquette
Punctuality
Join the session 5 minutes before the scheduled start time. We start on time and conclude on time!
Feedback
Make sure to submit constructive feedback for every session, as it is very helpful to the presenter.
Silent Mode
Keep your mobile devices in silent mode, and feel free to step out of the session if you need to take an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
3. Our Agenda
01 What is Apache Beam
02 Architecture of Apache Beam
03 Why Apache Beam
04 Beam Programming Model
05 Fundamentals of Apache Beam
06 Demo
4. Introduction to Apache Beam
● Apache Beam (Batch + strEAM) is a unified programming model for batch and streaming data processing jobs. It provides software development kits (SDKs) to define and construct data processing pipelines, as well as runners to execute them.
● Apache Beam is designed to provide a portable programming layer.
● Apache Beam increases portability and flexibility: we focus on our pipeline logic rather than on the underlying details, and we can change the data processing backend at any time.
● There are Java, Python, Go, and Scala SDKs available for Apache Beam, so everybody on the team can use it in the language of their choice.
5. ● An Apache Beam pipeline is a directed graph of operations (transforms) that describes a data processing workflow, as sketched after this list.
● Apache Spark, Apache Flink, Apache Apex, Google Cloud Dataflow, and Apache Samza are some of the well-known frameworks supported by Beam at the moment.
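To make the graph idea concrete, here is a minimal sketch using the Beam Python SDK in which a single PCollection feeds two independent branches; the data and step names are illustrative, not part of the original deck.

import apache_beam as beam

with beam.Pipeline() as p:
    numbers = p | "Create" >> beam.Create([1, 2, 3, 4, 5])
    # The same PCollection feeds two branches, so the pipeline
    # is a directed graph rather than a straight line of steps.
    squares = numbers | "Square" >> beam.Map(lambda n: n * n)
    evens = numbers | "KeepEvens" >> beam.Filter(lambda n: n % 2 == 0)
    squares | "PrintSquares" >> beam.Map(lambda n: print("square:", n))
    evens | "PrintEvens" >> beam.Map(lambda n: print("even:", n))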
6. Why Apache Beam
● Portable: We can run the same code with different runners and backends on premises, in the cloud, or locally, for example Spark, Flink, or Cloud Dataflow (see the sketch after this list).
● Unified: The same unified model covers batch and stream processing, whereas other frameworks expose separate APIs for each.
● Extensible model and SDK: The extensible API can define custom sources and sinks that read and write in parallel.
Execution-platform agnostic
Data agnostic
Programming-language agnostic
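A rough sketch of that portability in the Python SDK: the pipeline code stays the same and only the options select the backend. The Flink master address below is a hypothetical placeholder, not a value from the deck.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Run locally with the Direct Runner; swapping in the commented
# options would target a Flink cluster without touching the pipeline.
options = PipelineOptions(runner="DirectRunner")
# options = PipelineOptions(runner="FlinkRunner", flink_master="localhost:8081")

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "a"])
     | "Count" >> beam.combiners.Count.PerElement()
     | "Print" >> beam.Map(print))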
7. Fundamental Concepts of Apache Beam
● Pipeline: All the operations, inputs, and outputs are defined within the scope of a pipeline. It is also possible to configure where and how a pipeline runs.
● PCollection: An immutable collection of elements of a specific type. It can contain either a bounded or an unbounded number of elements. PCollections are the inputs and outputs of every PTransform.
● PTransform: An operation, or step, in the pipeline. It takes one or more input PCollections and transforms them into zero or more output PCollections.
● Runner: A runner translates the Beam pipeline into the compatible API of the chosen distributed processing backend, such as the Direct Runner, Apache Flink, or Apache Spark.
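These concepts map directly onto code. A minimal Python SDK sketch, with illustrative names and data:

import apache_beam as beam

# Pipeline: the scope that owns all operations, inputs, and outputs.
with beam.Pipeline() as pipeline:
    # PCollection: an immutable, typed collection (bounded in this case).
    names = pipeline | "CreateNames" >> beam.Create(["ada", "grace", "alan"])
    # PTransform: consumes a PCollection and produces a new one.
    upper = names | "ToUpper" >> beam.Map(str.upper)
    upper | "Print" >> beam.Map(print)
# Runner: with no options set, the local Direct Runner executes the graph.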
8. Architecture of Apache Beam
● Write the pipeline in the programming language SDK of your choice: Java, Python, or Go.
● The Runner API converts it into a language-agnostic standard representation that execution engines can consume.
● The Fn API provides language-specific SDK workers that act as an RPC interface for the user-defined functions (UDFs) embedded in the pipeline as function specifications.
● The selected runner executes the pipeline on the underlying resources; choosing the right runner is key to efficient execution.
9. Apache Beam Programming Model
There are three considerations when developing an Apache Beam pipeline (see the sketch after this list):
● How and where is your input data stored, and how are you going to read it?
● What transformations are required? For example, do the general Beam operators meet the data transformation needs, or is it necessary to write custom transforms using ParDo?
● What will the output format be, and where will it be stored? This determines which write transforms are applied at the end of the pipeline.
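The three considerations line up with the read, transform, and write stages of a pipeline. A minimal word-count sketch in the Python SDK; the input path and output prefix are assumptions for illustration:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     # 1. Where is the input, and how do we read it?
     | "Read" >> beam.io.ReadFromText("input.txt")
     # 2. Which transformations are required?
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "CountPerWord" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
     # 3. What is the output format, and where does it go?
     | "Write" >> beam.io.WriteToText("counts"))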
10. PTransforms
Transform: a step in the pipeline that takes PCollections as input and produces PCollections as output.
● Core transforms: common transformations provided by Beam (ParDo, GroupByKey, etc.).
● Composite transforms: combine multiple transforms into a single step, such as counting or combining elements in a collection.
● IO transforms: the endpoints of a pipeline, which create PCollections from external sources or write PCollections out of the pipeline to external sinks.
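A short Python SDK sketch covering all three kinds: ParDo and GroupByKey as core transforms, a composite transform that bundles them into one step, and Create standing in for an IO source; the class names and data are illustrative only.

import apache_beam as beam

class SplitWords(beam.DoFn):
    # A user-defined function applied with ParDo, a core transform.
    def process(self, element):
        for word in element.split():
            yield (word, 1)

class WordCount(beam.PTransform):
    # A composite transform: several transforms packaged as one named step.
    def expand(self, lines):
        return (lines
                | "Split" >> beam.ParDo(SplitWords())
                | "Group" >> beam.GroupByKey()  # core transform
                | "Sum" >> beam.MapTuple(lambda word, ones: (word, sum(ones))))

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(["to be or not to be"])  # stands in for an IO read
     | "Count" >> WordCount()
     | "Print" >> beam.Map(print))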