Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16)

•

11 likes•6,760 views

This document discusses dynamic scaling in Apache Flink. It describes Flink's approach to dynamically scaling stateful jobs to adapt to changing workloads. Key points include: repartitioning of keyed and non-keyed state when scaling workers, supporting manual rescaling through savepoints and restarts currently, and future work on scaling operators without restarts and implementing automatic scaling policies.

Technology

Till Rohrmann
trohrmann@apache.org
@stsffap
Dynamic Scaling: How Apache
Flink® Adapts to Changing
Workloads

Resource Adaption
3
+
time
Workload
Resources
time
Workload
Resources

What Is This Talk About?
§ Flink’s approach to dynamic scaling
§ Current state with demo
§ Outlook on next development steps
4

Basic Idea
6
• Spread work across more workers to decrease workload

Scaling Stateless Jobs
7
Scale Up Scale Down
Source
Mapper
Sink
• Scale up: Deploy new tasks
• Scale down: Cancel running tasks

Scaling Stateful Jobs
8
?
• Problem: Which state to assign to new task?

Keyed vs. Non-keyed State
10
• State bound to a key
• E.g. Keyed UDF and window state
• State bound to a subtask
• E.g. Source state
Keyed Non-keyed

Repartitioning Keyed State
§ Similar to consistent
hashing
§ Split key space into
key groups
§ Assign key groups to
tasks
11
Key space
Key group #1 Key group #2
Key group #3Key group #4

Repartitioning Keyed State contd.
§ Rescaling changes
key group
assignment
§ Maximum parallelism
defined by #key
groups
12

Repartitioning Non-keyed state
§ User defined merge and
split functions
• Most general approach
§ Breaking non-keyed state
up into finer granularity
• State has to contain
multiple entries
• Automatic repartitioning
wrt granularity
13
#1 #2
#3

Repartitioning Non-keyed State contd.
§ Non-keyed state entries gathered at the
job manager
§ Repartitioning schemes
• Repartition & send
• Union & broadcast
14

Example: Kafka Source
15
partitionId: 1, offset: 42
partitionId: 3, offset: 10
partitionId: 6, offset: 27
• Store offset for each partition
• Individual entries are repartitionable
partitionId: 1, offset: 42
partitionId: 3, offset: 10
partitionId: 6, offset: 27

Rescaling: Why is That so Hard?
§ Handling of state
§ Repartitioning of keyed & non-keyed
state
§ Unique among open source stream
processors, afaik
16

Demo Topology
18
Kafka Source Counter
KeyBy

Current State
§ Manual rescaling
1. Take savepoint
2. Stop the job
3. Restart job with adjusted parallelism and
savepoint
20

Next Steps
§ Integrate savepoint with stop signal
§ Rescaling individual operators w/o restart
§ Dynamic container de-/allocation
• “Running Flink Everywhere” by Stephan
Ewen, 16:45 at Kesselhaus
21

Auto Scaling Policies
22
• Latency
• Throughput
• Resource utilization
• Kubernetes on GCE, EC2 and Mesos (marathon-
autoscale) already support auto-scaling

Conclusion
§ Scaling of keyed and non-keyed state
§ Flink supports manual rescaling with
restart
(WIP branch: https://github.com/tillrohrmann/flink/tree/partitionable-op-state)
§ Future versions might support scaling on
the fly and automatic rescaling policies
23

What's hot

(Jason Gustafson, Confluent) Kafka Summit SF 2018 Kafka has a well-designed replication protocol, but over the years, we have found some extremely subtle edge cases which can, in the worst case, lead to data loss. We fixed the cases we were aware of in version 0.11.0.0, but shortly after that, another edge case popped up and then another. Clearly we needed a better approach to verify the correctness of the protocol. What we found is Leslie Lamport’s specification language TLA+. In this talk I will discuss how we have stepped up our testing methodology in Apache Kafka to include formal specification and model checking using TLA+. I will cover the following: 1. How Kafka replication works 2. What weaknesses we have found over the years 3. How these problems have been fixed 4. How we have used TLA+ to verify the fixed protocol. This talk will give you a deeper understanding of Kafka replication internals and its semantics. The replication protocol is a great case study in the complex behavior of distributed systems. By studying the faults and how they were fixed, you will have more insight into the kinds of problems that may lurk in your own designs. You will also learn a little bit of TLA+ and how it can be used to verify distributed algorithms.

Hardening Kafka Replication

confluent

Most of the time, We forget that GC exists because it handles memory on its own. But, unfortunately, it is often involved in production incidents. This is at that moment it reminds you it exists and not everything is magic! Moreover, OpenJDK brings a handful of GCs with different characteristics and the default one (well not always...) is not the easiest to understand. Though, this choice of GCs allows the JVM to adapt to different workloads and applications in terms of latency or throughput. I will explain how to tame those beasts and how to take advantage of them to improve your applications and resources.

Mastering GC.pdf

Jean-Philippe BEMPEL

Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.

Building a fully managed stream processing platform on Flink at scale for Lin...

Flink Forward

HBase and HDFS: Understanding FileSystem Usage in HBase

enissoz

kafka

Amikam Snir

RocksDB compaction

MIJIN AN

During the past 10 years, big-data storage layers mainly focus on analytical use cases. When it comes to analytical cases, users usually offload data onto Hadoop cluster and perform queries on HDFS files. People struggle dealing with modifications on append only storage and maintain fragile ETL pipelines. On the other hand, although Spark SQL has been proven effective parallel query processing engine, some tricks common in traditional databases are not available due to characteristics of storage underneath. TiSpark sits directly on top of a distributed database (TiDB)’s storage engine, expand Spark SQL’s planning with its own extensions and utilizes unique features of database storage engine to achieve functions not possible for Spark SQL on HDFS. With TiSpark, users are able to perform queries directly on changing / fresh data in real time. The takeaways from this two are twofold: — How to integrate Spark SQL with a distributed database engine and the benefit of it — How to leverage Spark SQL’s experimental methods to extend its capacity.

When Apache Spark Meets TiDB with Xiaoyu Ma

Databricks

Kafka At Scale in the Cloud

confluent

Netflix Data Pipeline With Kafka

Allen (Xiaozhong) Wang

Cassandra at Instagram (August 2013)

Rick Branson

Flink Forward San Francisco 2022. When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment. by Jun Qin & Karl Friedrich

Evening out the uneven: dealing with skew in Flink

Flink Forward

Apache Kafka Introduction

Amita Mirajkar

File Format Benchmark - Avro, JSON, ORC & Parquet

DataWorks Summit/Hadoop Summit

Apache ZooKeeper

Scott Leberknight

Raffi Krikorian, Twitter Timelines at Scale

Mariano Amartino

Apache Flink and what it is used for

Aljoscha Krettek

Node Labels in YARN

DataWorks Summit

During last two major versions (1.9 & 1.10), Apache Flink community spent lots of effort to improve the architecture for further unified batch & streaming processing. One example for that is Flink SQL added the ability to support multiple SQL planners under the same API. This talk will first discuss the motivation behind these movements, but more importantly will have a deep dive into Flink SQL. The presentation shows the unified architecture to handle streaming and batch queries and explain how Flink translates queries into the relational expressions, leverages Apache Calcite to optimize them, and generates efficient runtime code for execution. Besides, this talk will also describe the lifetime of a query in detail, how optimizer improve the plan based on relational node patterns, how Flink leverages binary data format for its basic data structure, and how does certain operator works. This would give audience better understanding of Flink SQL internals.

Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu

Flink Forward

Flink Forward San Francisco 2022. Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopt Flink as the unified streaming processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarding critical use cases to the platform. Pinterest has supported 90+near realtime streaming applications. We will cover the problem statement, how we evaluate potential solutions and our decision to build the framework. by Rainie Li & Kanchi Masalia

Flink powered stream processing platform at Pinterest

Flink Forward

"Have you ever had your stateful Kafka Streams app killed by Kubernetes with the termination reason ""OOMKilled""? Even if you did set up JVM heap limit, the pod still got killed? This is likely due to your RocksDB off-heap memory usage. This talk will explore ways of diagnosing the problem, including: - analysis of heap and off-heap memory; - diving into RocksDB metrics; - looking at k8s java pod measurements. It will also show a possible solution to the problem with RocksDB and pod memory tuning. Attendants will get hands-on experience of live debugging Kafka Streams app together with useful tips and tricks on its production usage in Kubernetes. As a Streaming Platform Owner at Raiffeisen Bank and a former Data Engineer with 5+ years of experience, I will share some insights on building scalable and reliable Kafka Streams stateful applications."

What to do if Your Kafka Streams App Gets OOMKilled? with Andrey Serebryanskiy

HostedbyConfluent

What's hot (20)

Hardening Kafka Replication

Mastering GC.pdf

Building a fully managed stream processing platform on Flink at scale for Lin...

HBase and HDFS: Understanding FileSystem Usage in HBase

kafka

RocksDB compaction

When Apache Spark Meets TiDB with Xiaoyu Ma

Kafka At Scale in the Cloud

Netflix Data Pipeline With Kafka

Cassandra at Instagram (August 2013)

Evening out the uneven: dealing with skew in Flink

Apache Kafka Introduction

File Format Benchmark - Avro, JSON, ORC & Parquet

Apache ZooKeeper

Raffi Krikorian, Twitter Timelines at Scale

Apache Flink and what it is used for

Node Labels in YARN

Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu

Flink powered stream processing platform at Pinterest

What to do if Your Kafka Streams App Gets OOMKilled? with Andrey Serebryanskiy

Viewers also liked

Apache Flink features two APIs which are based on relational algebra, a SQL interface and the so-called Table API, which is a LINQ-style API available for Scala and Java. Relational APIs are interesting because they are easy to use and queries can be automatically optimized and translated into efficient runtime code. Flink offers both APIs for streaming and batch data sources. This talk takes a look under the hood of Flink’s relational APIs. The presentation shows the unified architecture to handle streaming and batch queries and explain how Flink translates queries of both APIs into the same representation, leverages Apache Calcite to optimize them, and generates runtime code for efficient execution. Finally, the slides discuss potential improvements and give an outlook for future extensions and features.

Taking a look under the hood of Apache Flink's relational APIs.

Fabian Hueske

In recent years, the generated and collected data is increasing at an almost exponential rate. At the same time, the data’s value has been identified in terms of insights that can be provided. However, retrieving the value requires powerful analysis tools, since valuable insights are buried deep in large amounts of noise. Unfortunately, analytic capacities did not scale well with the growing data. Many existing tools run only on a single computer and are limited in terms of data size by its memory. A very promising solution to deal with large-scale data is scaling systems and exploiting parallelism. In this presentation, we propose Gilbert, a distributed sparse linear algebra system, to decrease the imminent lack of analytic capacities. Gilbert offers a MATLAB-like programming language for linear algebra programs, which are automatically executed in parallel. Transparent parallelization is achieved by compiling the linear algebra operations first into an intermediate representation. This language-independent form enables high-level algebraic optimizations. Different optimization strategies are evaluated and the best one is chosen by a cost-based optimizer. The optimized result is then transformed into a suitable format for parallel execution. Gilbert generates execution plans for Apache Spark and Apache Flink, two massively parallel dataflow systems. Distributed matrices are represented by square blocks to guarantee a well-balanced trade-off between data parallelism and data granularity. An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes. Two well known machine learning (ML) algorithms, namely PageRank and Gaussian non-negative matrix factorization (GNMF), are implemented with Gilbert. The performance of these algorithms is compared to optimized implementations based on Spark and Flink. Even though Gilbert is not as fast as the optimized algorithms, it simplifies the development process significantly due to its high-level programming abstraction.

Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...

Till Rohrmann

The talk explains how Apache Flink checkpoints stateful jobs using the asynchronous barrier snapshotting algorithm to give exactly once semantics in streaming. Furthermore, Flink's approach to master high availability (HA) is described which solves the problem of the JobManager being the single point of failure. Job checkpointing in combination with HA is the basis for Flink's fault tolerance mechanism to recover from occurring failures.

Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015

Till Rohrmann

http://flink-forward.org/kb_sessions/running-apache-flink-everywhere-standalone-yarn-mesos-docker-kubernetes-etc/ The world of cluster managers and deployment frameworks is getting complicated. There is zoo of tools to deploy and manage data processing jobs, all of which have different resource management and fault tolerance slightly different. Some tools have a only per-job processes (Yarn, Docker/Kubernetes), while others require some long running processes (Mesos, Standalone). In some frameworks, streaming jobs control their own resource allocation (Yarn, Mesos), while for other frameworks, resource management is handled by external tools (Kubernetes). To be broadly usable in a variety of setups, Flink needs to play well with all these frameworks and their paradigms. This talk describes Flink’s new proposed process and deployment model that will make it work together well with the above mentioned frameworks. The new abstraction is designed to cover a variety of use cases, like isolated single job deployments, sessions of multiple short jobs, and multi-tenant setups.

Stephan Ewen - Running Flink Everywhere

Flink Forward

Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin

Till Rohrmann

Data Stream Processing with Apache Flink

Fabian Hueske

Stream Analytics with SQL on Apache Flink

Fabian Hueske

Apache Calcite overview

Julian Hyde

SQL is undoubtedly the most widely used language for data analytics for many good reasons. It is declarative, many database systems and query processors feature advanced query optimizers and highly efficient execution engines, and last but not least it is the standard that everybody knows and uses. With stream processing technology becoming mainstream a question arises: “Why isn’t SQL widely supported by open source stream processors?”. One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap. One approach is to support standard SQL which is known by users and tools but comes at the cost of cumbersome workarounds for many common streaming computations. Other approaches are to design custom SQL-inspired stream analytics languages or to extend SQL with streaming-specific keywords. While such solutions tend to result in more intuitive syntax, they suffer from not being established standards and thereby exclude many users and tools. Apache Flink is a distributed stream processing system with very good support for streaming analytics. Flink features two relational APIs, the Table API and SQL. The Table API is a language-integrated relational API with stream-specific features. Flink’s SQL interface implements the plain SQL standard. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. In this talk we present the future of Apache Flink’s relational APIs for stream analytics, discuss their conceptual model, and showcase their usage. The central concept of these APIs are dynamic tables. We explain how streams are converted into dynamic tables and vice versa without losing information due to the stream-table duality. Relational queries on dynamic tables behave similar to materialized view definitions and produce new dynamic tables. We show how dynamic tables are converted back into changelog streams or are written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and are updated in place with low latency. We conclude our talk demonstrating the power and expressiveness of Flink’s relational APIs by presenting how common stream analytics use cases can be realized.

Fabian Hueske - Stream Analytics with SQL on Apache Flink

Ververica

Click-Through Example for Flink’s KafkaConsumer Checkpointing

Robert Metzger

This a talk that I gave at the 2nd Apache Flink meetup in Washington DC Area hosted and sponsored by Capital One on November 19, 2015. You will quickly learn in step-by-step way: How to setup and configure your Apache Flink environment? How to use Apache Flink tools? 3. How to run the examples in the Apache Flink bundle? 4. How to set up your IDE (IntelliJ IDEA or Eclipse) for Apache Flink? 5. How to write your Apache Flink program in an IDE?

Step-by-Step Introduction to Apache Flink

Slim Baltagi

Flink vs. Spark

Slim Baltagi

Apache Flink: Streaming Done Right @ FOSDEM 2016

Till Rohrmann

Catalogo luminarias-distribuicao-nov10

Marcello Pellegrina

Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015

Till Rohrmann

Keynote, PNW Scala 2013

Paul Phillips

ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing

Fabian Hueske

Towards sql for streams

Radu Tudoran

Juggling with Bits and Bytes - How Apache Flink operates on binary data

Fabian Hueske

The microprocessor OMAP 3430 contains a dual-core architecture consisting on an ARM Cortex-A8 (General-Purpose Processor) and a DSP C64x (as an Image-Video-Audio Accelerator). The dspbridge is a kernel driver for the OMAP3 architecture. It provides a interface to control and communicate with the DSP, enabling parallel processing in embedded devices. This talk will be about how to build a system with the dspbridge driver and how to use it for multimedia applications.

The DSP/BIOS Bridge - OMAP3

vjaquez

Viewers also liked (20)

Taking a look under the hood of Apache Flink's relational APIs.

Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...

Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015

Stephan Ewen - Running Flink Everywhere

Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin

Data Stream Processing with Apache Flink

Stream Analytics with SQL on Apache Flink

Apache Calcite overview

Fabian Hueske - Stream Analytics with SQL on Apache Flink

Click-Through Example for Flink’s KafkaConsumer Checkpointing

Step-by-Step Introduction to Apache Flink

Flink vs. Spark

Apache Flink: Streaming Done Right @ FOSDEM 2016

Catalogo luminarias-distribuicao-nov10

Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015

Keynote, PNW Scala 2013

ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing

Towards sql for streams

Juggling with Bits and Bytes - How Apache Flink operates on binary data

The DSP/BIOS Bridge - OMAP3

Similar to Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16)

Apache Apex Fault Tolerance and Processing Semantics

Apache Apex

This talk shares experiences from deploying and tuning Flink steam processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large scale Flink job, and outline what the Flink community is working on to make the out-of-the-box for experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters

Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...

Flink Forward

Zurich Flink Meetup

Konstantinos Kloudas

The streaming space is evolving at an ever increasing pace. This trend is also reflected in Apache Flink whose latest major release included again many new features. For streaming practitioners it is essential to learn about Flink's newest capabilities because often they enable completely new use cases and applications. In this talk, I want to give a brief overview about Apache Flink and its latest feature additions, including the integration of CEP with streaming SQL, proper support for state evolution, temporal joins and many more. Furthermore, I want to put them in perspective with respect to Flink's future direction by giving some insights into ongoing development threads in the community. Thereby, I intend to give attendees a better picture about Flink's current and future capabilities.

Apache flink 1.7 and Beyond

Till Rohrmann

Functional? Reactive? Why?

Aleksandr Tavgen

Stream Processing @ Lyft

Jamie Grier

Introduction to Stateful Stream Processing with Apache Flink.

Konstantinos Kloudas

[262] netflix 빅데이터 플랫폼

NAVER D2

Until recently, developers have had to deal with some serious tradeoffs when picking a database technology. One could pick a SQL database and deal with their eventual scaling problems or pick a NoSQL database and have to work around their lack of transactions, strong consistency, and/or secondary indexes. However, a new class of distributed database engines is emerging that combines the transactional consistency guarantees of traditional relational databases with the horizontal scalability and high availability of popular NoSQL databases. In this talk, we'll examine the history of databases to see how we got here, covering the motivations for this new class of systems and why developers should care about them. We'll then take a deep dive into the key design choices behind one open source distributed SQL database, CockroachDB, that enable it to offer such properties and compare them to past SQL and NoSQL designs. We will look specifically at how to achieve the easy deployment and management of a scalable, self-healing, strongly-consistent database with techniques such as dynamic sharding and rebalancing, consensus protocols, lock-free transactions, and more.

The Hows and Whys of a Distributed SQL Database - Strange Loop 2017

Alex Robinson

Apache Kafka Streams

Apache Kafka TLV

Version control 101

Michelangelo van Dam

Flink Forward San Francisco 2022. The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples." by Thomas Weise

Introducing the Apache Flink Kubernetes Operator

Flink Forward

Apache Airflow (incubating) NL HUG Meetup 2016-07-19

Bolke de Bruin

2017 big data landscape and cutting edge innovations public

Evans Ye

Kafka streams fifth elephant 2018

Giridhar Addepalli

Moving from Lambda and Kappa Architectures to Kappa+ at Uber Kappa+ is a new approach developed at Uber to overcome the limitations of the Lambda and Kappa architectures. Whether your realtime infrastructure processes data at Uber scale (well over a trillion messages daily) or only a fraction of that, chances are you will need to reprocess old data at some point. There can be many reasons for this. Perhaps a bug fix in the realtime code needs to be retroactively applied (aka backfill), or there is a need to train realtime machine learning models on last few months of data before bringing the models online. Kafka's data retention is limited in practice and generally insufficient for such needs. So data must be processed from archives. Aside from addressing such situations, enabling efficient stream processing on archived as well as realtime data also broadens the applicability of stream processing. This talk introduces the Kappa+ architecture which enables the reuse of streaming realtime logic (stateful and stateless) to efficiently process any amounts of historic data without requiring it to be in Kafka. We shall discuss the complexities involved in such kind of processing and the specific techniques employed in Kappa+ to tackle them.

Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...

Flink Forward

SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...

Lucidworks

The Fn project is a container-native Apache 2.0 licensed serverless platform that you can run anywhere – on any cloud or on-premise. It’s easy to use, supports every programming language, and is extensible and performant. This YourStory-Oracle Developer Meetup covers various design aspects of Serverless for polyglot programming, implementation of Saga pattern, etc. It also emphasizes on the monitoring aspect of Fn project using Prometheus and Grafana

Serverless design with Fn project

Siva Rama Krishna Chunduru

When Streaming Needs Batch With Konstantin Knauf | Current 2022 A streaming application is started once and then continuously ingests endless, fairly steady streams of events. That's as far as the theory goes. Unfortunately, reality is more complicated. Over time your application's ability to process large historical data sets robustly, efficiently and correctly will be critical: - for exploratory data analysis during development - for bootstrapping the initial state of an application - for back-filling following an outage or bugfix - for keeping up with bursty input streams These scenarios call for batch processing techniques. Apache Flink is as streaming-first as it gets. Yet over the last releases, the community has invested significant resources into unifying stream- and batch processing on all layers of the stack: scheduler to APIs. In this talk, I'll introduce Apache Flink's approach to unified stream and batch processing and discuss - by example - how these scenarios can already be addressed today and what might be possible in the future.

When Streaming Needs Batch With Konstantin Knauf | Current 2022

HostedbyConfluent

Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms. Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data. Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex features with Spark Streaming. We will discuss how these differences effect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion. We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.

Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming

Apache Apex

Similar to Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16) (20)

Apache Apex Fault Tolerance and Processing Semantics

Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...

Zurich Flink Meetup

Apache flink 1.7 and Beyond

Functional? Reactive? Why?

Stream Processing @ Lyft

Introduction to Stateful Stream Processing with Apache Flink.

[262] netflix 빅데이터 플랫폼

The Hows and Whys of a Distributed SQL Database - Strange Loop 2017

Apache Kafka Streams

Version control 101

Introducing the Apache Flink Kubernetes Operator

Apache Airflow (incubating) NL HUG Meetup 2016-07-19

2017 big data landscape and cutting edge innovations public

Kafka streams fifth elephant 2018

Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...

SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ...

Serverless design with Fn project

When Streaming Needs Batch With Konstantin Knauf | Current 2022

Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming

More from Till Rohrmann

Container technology experiences an ever increasing adoption throughout many industries. Not only does this technology make your applications portable across different machines and operating systems, it also allows to scale applications in a matter of seconds. Moreover, it significantly simplifies and speeds up deployments which decreases development and operation costs. Consequently, more and more Flink deployments run in containerized environments which poses new challenges for Flink. In this talk, we will take a look at Flink's current and future container support which will make it a first class citizen of the container world. First of all, we will explain how the new reactive execution mode will solve the problem of seamless application scaling and how it blends in with any environment. Complementary to the reactive mode, the active execution mode demonstrates its strengths when it comes to changing workloads such as batch jobs. Last but not least, we will take a look beyond Flink's own nose and investigate how Flink can be used together with Kubernetes operators or data Artisans' Application Manager. We will conclude the talk with a short demo of Flink's native Kubernetes support and giving an outlook on future developments in the container realm.

Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...

Till Rohrmann

One of the big operational challenges when running streaming applications is to cope with varying workloads. Variations, e.g. daily cycles, seasonal spikes or sudden events, require that allocated resources are constantly adapted. Otherwise, service quality deteriorates or money is wasted. Apache Flink 1.5 includes a lot of enhancements to support full resource elasticity on cluster management frameworks such as Apache Mesos. With the latest version, it is now possible to build elastic applications which can be programmatically scaled up or down in order to react to changing workloads. In this talk, we will discuss recent improvements to Flink's deployment model which also enables full resource elasticity. In particular, we will discuss how Flink leverages cluster management frameworks, e.g. Mesos, and already-introduced features like scalable state to support elastic streaming applications. We will conclude the presentation with a short demo showing how a stateful Flink application can be rescaled on top of Mesos.

Elastic Streams at Scale @ Flink Forward 2018 Berlin

Till Rohrmann

Extracting insights out of continuously generated data requires a stream processor with powerful data analytics features such as Apache Flink. A stream data pipeline with Flink typically includes a storage component to ingest and serve the data. Pravega is a stream store that ingests and stores stream data permanently, making the data available for tail, catch-up, and historical reads. One important challenge for such stream data pipelines is coping with the variations in the workload. Daily cycles and seasonal spikes might require the provisioning of the application to adapt accordingly. Pravega has a feature called stream scaling, which enables the capacity offered for the ingestion of events of a stream to grow and shrink over time according to workload. Such a feature is useful when the application downstream has the ability of accommodating such changes and also scale its provisioning accordingly. In this presentation, we introduce stream scaling in Pravega and how Flink jobs leverage this feature to rescale stateful jobs according to variations in the workload.

Scaling stream data pipelines with Pravega and Apache Flink

Till Rohrmann

In our fast moving world it becomes more and more important for companies to gain near real-time insights from their data to make faster decisions. These insights do not only provide a competitve edge over ones rivals but also enable a company to create completely new services and products. Amongst others, predictive user interfaces and online recommendation can be implemented when being able to process large amounts of data in real-time. Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with milliseconds latency. Moreover, its expressive APIs allow you to quickly solve your problems, ranging from classical analytical workloads to distributed event-driven applications. In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink’s basic concepts of fault-tolerance, statefulness and event-time aware processing, we will take a look at the different APIs and what they allow us to do. The talk will be concluded by demonstrating how we can use Flink’s higher level abstractions such as FlinkCEP and StreamSQL to do declarative stream processing.

Modern Stream Processing With Apache Flink @ GOTO Berlin 2017

Till Rohrmann

Apache Mesos allows operators to run distributed applications across an entire datacenter and is attracting ever increasing interest. As much as distributed applications see increased use enabled by Mesos, Mesos also sees increasing use due to a growing ecosystem of well integrated applications. One of the latest additions to the Mesos family is Apache Flink. Flink is one of the most popular open source systems for real-time high scale data processing and allows users to deal with low-latency streaming analytical workloads on Mesos. In this talk we explain the challenges solved while integrating Flink with Mesos, including how Flink’s distributed architecture can be modeled as a Mesos framework, and how Flink was integrated with Fenzo. Next, we describe how Flink was packaged to easily run on DC/OS.

Apache Flink Meets Apache Mesos And DC/OS @ Mesos Meetup Berlin

Till Rohrmann

Apache Flink® Meets Apache Mesos® and DC/OS

Till Rohrmann

With Flink 1.3 being released, the Flink community is already working towards the upcoming release 1.4. Given Flink's high development pace, which manifested in Flink 1.3 being one of the feature-wise biggest releases in its recent history, it becomes more and more difficult to keep track of all development threads. Moreover, it requires more effort to learn about newly added features and which value they provide for your application. In this talk, I want to present and explain some of Flink's latest features, including incremental checkpointing, fine grained recovery, side outputs and many more. Furthermore, I want to put them in perspective with respect to Flink's future direction by giving some insights into ongoing development threads in the community. Thereby, I intend to give attendees a better picture about Flink's current and future capabilities.

From Apache Flink® 1.3 to 1.4

Till Rohrmann

Apache Mesos allows operators to run distributed applications across an entire datacenter and is attracting ever increasing interest. As much as distributed applications see increased use enabled by Mesos, Mesos also sees increasing use due to a growing ecosystem of well-integrated applications. One of the latest additions to the Mesos family is Apache Flink. Flink is one of the most popular open source systems for real-time high scale data processing and allows users to deal with low-latency streaming analytical workloads on Mesos. In this talk, we explain the challenges solved while integrating Flink with Mesos, including how Flink’s distributed architecture can be modeled as a Mesos framework, and how Flink was integrated with Fenzo. Next, we describe how Flink was packaged to easily run on DC/OS.

Apache Flink and More @ MesosCon Asia 2017

Till Rohrmann

As stream processing engines become more and more popular and are used in different environments, the demand to support different deployment scenarios increases. Depending on the user's infrastructure, a stream processor might be run on a bare metal cluster in standalone mode, deployed via Apache Yarn and Mesos, or run in a containerized environment. In order to fulfill the requirements of different deployment options and to provide enough flexibility for the future, the Flink community has recently started to redesign Flink's distributed architecture. This talk will explain the limitations of the old architecture and how they are solved with the new design. We will present the new building blocks of a Flink cluster and demonstrate, using the example of Flink's Mesos and Docker support, how they can be combined to run Flink nearly everywhere.

Redesigning Apache Flink's Distributed Architecture @ Flink Forward 2017

Till Rohrmann

Streaming Analytics & CEP - Two sides of the same coin?

Till Rohrmann

Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015

Till Rohrmann

Machine Learning with Apache Flink at Stockholm Machine Learning Group

Till Rohrmann

This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it an unique system in the world of Big Data processing.

Introduction to Apache Flink - Fast and reliable big data processing

Till Rohrmann

More from Till Rohrmann (13)

Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...

Elastic Streams at Scale @ Flink Forward 2018 Berlin

Scaling stream data pipelines with Pravega and Apache Flink

Modern Stream Processing With Apache Flink @ GOTO Berlin 2017

Apache Flink Meets Apache Mesos And DC/OS @ Mesos Meetup Berlin

Apache Flink® Meets Apache Mesos® and DC/OS

From Apache Flink® 1.3 to 1.4

Apache Flink and More @ MesosCon Asia 2017

Redesigning Apache Flink's Distributed Architecture @ Flink Forward 2017

Streaming Analytics & CEP - Two sides of the same coin?

Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015

Machine Learning with Apache Flink at Stockholm Machine Learning Group

Introduction to Apache Flink - Fast and reliable big data processing

Recently uploaded

Real Time Object Detection Using Open CV

Khem

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...

Neo4j

Scaling API-first – The story of a global engineering organization Ian Reasor, Senior Computer Scientist - Adobe Radu Cotescu, Senior Computer Scientist - Adobe Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

apidays

Tata AIG General Insurance Company - Insurer Innovation Award 2024

The Digital Insurer

Discord is a free app offering voice, video, and text chat functionalities, primarily catering to the gaming community. It serves as a hub for users to create and join servers tailored to their interests. Discord’s ecosystem comprises servers, each functioning as a distinct online community with its own channels dedicated to specific topics or activities. Users can engage in text-based discussions, voice calls, or video chats within these channels. Understanding Discord Servers Discord servers are virtual spaces where users congregate to interact, share content, and build communities. Servers may revolve around gaming, hobbies, interests, or fandoms, providing a platform for like-minded individuals to connect. Communication Features Discord offers a range of communication tools, including text channels for messaging, voice channels for real-time audio conversations, and video channels for face-to-face interactions. These features facilitate seamless communication and collaboration. What Does NSFW Mean? The acronym NSFW stands for “Not Safe For Work,” indicating content that may be inappropriate for professional or public settings. NSFW Content NSFW content encompasses material that is sexually explicit, violent, or otherwise graphic in nature. It often includes nudity, profanity, or depictions of sensitive topics.

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf

UK Journal

How to Troubleshoot Apps for the Modern Connected Worker

ThousandEyes

This presentation explores the impact of HTML injection attacks on web applications, detailing how attackers exploit vulnerabilities to inject malicious code into web pages. Learn about the potential consequences of such attacks and discover effective mitigation strategies to protect your web applications from HTML injection vulnerabilities. for more information visit https://bostoninstituteofanalytics.org/category/cyber-security-ethical-hacking/

HTML Injection Attacks: Impact and Mitigation Strategies

Boston Institute of Analytics

How to Troubleshoot Apps for the Modern Connected Worker

ThousandEyes

The value of a flexible API Management solution for Open Banking Steve Melan, Manager for IT Innovation and Architecture - State's and Saving's Bank of Luxembourg Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - The value of a flexible API Management solution for O...

apidays

As privacy and data protection regulations evolve rapidly, organizations operating in multiple jurisdictions face mounting challenges to ensure compliance and safeguard customer data. With state-specific privacy laws coming up in multiple states this year, it is essential to understand what their unique data protection regulations will require clearly. How will data privacy evolve in the US in 2024? How to stay compliant? Our panellists will guide you through the intricacies of these states' specific data privacy laws, clarifying complex legal frameworks and compliance requirements. This webinar will review: - The essential aspects of each state's privacy landscape and the latest updates - Common compliance challenges faced by organizations operating in multiple states and best practices to achieve regulatory adherence - Valuable insights into potential changes to existing regulations and prepare your organization for the evolving landscape

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

TrustArc

Join our latest Connector Corner webinar to discover how UiPath Integration Service revolutionizes API-centric automation in a 'Quote to Cash' process—and how that automation empowers businesses to accelerate revenue generation. A comprehensive demo will explore connecting systems, GenAI, and people, through powerful pre-built connectors designed to speed process cycle times. Speakers: James Dickson, Senior Software Engineer Charlie Greenberg, Host, Product Marketing Manager

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

DianaGray10

The presentation explores the development and application of artificial intelligence (AI) from its inception to its current status in the modern world. The term "artificial intelligence" was first coined by John McCarthy in 1956 to describe efforts to develop computer programs capable of performing tasks that typically require human intelligence. This concept was first introduced at a conference held at Dartmouth College, where programs demonstrated capabilities such as playing chess, proving theorems, and interpreting texts. In the early stages, Alan Turing contributed to the field by defining intelligence as the ability of a being to respond to certain questions intelligently, proposing what is now known as the Turing Test to evaluate the presence of intelligent behavior in machines. As the decades progressed, AI evolved significantly. The 1980s focused on machine learning, teaching computers to learn from data, leading to the development of models that could improve their performance based on their experiences. The 1990s and 2000s saw further advances in algorithms and computational power, which allowed for more sophisticated data analysis techniques, including data mining. By the 2010s, the proliferation of big data and the refinement of deep learning techniques enabled AI to become mainstream. Notable milestones included the success of Google's AlphaGo and advancements in autonomous vehicles by companies like Tesla and Waymo. A major theme of the presentation is the application of generative AI, which has been used for tasks such as natural language text generation, translation, and question answering. Generative AI uses large datasets to train models that can then produce new, coherent pieces of text or other media. The presentation also discusses the ethical implications and the need for regulation in AI, highlighting issues such as privacy, bias, and the potential for misuse. These concerns have prompted calls for comprehensive regulations to ensure the safe and equitable use of AI technologies. Artificial intelligence has also played a significant role in healthcare, particularly highlighted during the COVID-19 pandemic, where it was used in drug discovery, vaccine development, and analyzing the spread of the virus. The capabilities of AI in healthcare are vast, ranging from medical diagnostics to personalized medicine, demonstrating the technology's potential to revolutionize fields beyond just technical or consumer applications. In conclusion, AI continues to be a rapidly evolving field with significant implications for various aspects of society. The development from theoretical concepts to real-world applications illustrates both the potential benefits and the challenges that come with integrating advanced technologies into everyday life. The ongoing discussion about AI ethics and regulation underscores the importance of managing these technologies responsibly to maximize their their benefits while minimizing potential harms.

Artificial Intelligence: Facts and Myths

Joaquim Jorge

Handwritten Text Recognition for manuscripts and early printed texts

Maria Levchenko

What are drone anti-jamming systems? The drone anti-jamming systems and anti-spoof technology protect against interference, jamming, and spoofing of the UAVs. To protect their security, countries are beginning to research drone anti-jamming systems, also known as drone strike weapons. The anti-jam and anti-spoof technology protects against interference, jamming and spoofing. A drone strike weapon is a drone attack weapon that can attack and destroy enemy drones. So what is so unique about this amazing system?

What Are The Drone Anti-jamming Systems Technology?

Antenna Manufacturer Coco

Finology Group – Insurtech Innovation Award 2024

The Digital Insurer

presentation ICT roal in 21st century education

jfdjdjcjdnsjd

Scaling API-first – The story of a global engineering organization

Radu Cotescu

Data Cloud, More than a CDP by Matt Robison

Anna Loughnan Colquhoun

What is a good lead in your organisation? Which leads are priority? What happens to leads? When sales and marketing give different answers to these questions, or perhaps aren't sure of the answers at all, frustrations build and opportunities are left on the table. Join us for an illuminating session with Cian McLoughlin, HubSpot Principal Customer Success Manager, as we look at that crucial piece of the customer journey in which leads are transferred from marketing to sales.

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx

HampshireHUG

Partners Life - Insurer Innovation Award 2024

The Digital Insurer

Recently uploaded (20)

Real Time Object Detection Using Open CV

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Tata AIG General Insurance Company - Insurer Innovation Award 2024

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf

How to Troubleshoot Apps for the Modern Connected Worker

HTML Injection Attacks: Impact and Mitigation Strategies

How to Troubleshoot Apps for the Modern Connected Worker

Apidays New York 2024 - The value of a flexible API Management solution for O...

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Artificial Intelligence: Facts and Myths

Handwritten Text Recognition for manuscripts and early printed texts

What Are The Drone Anti-jamming Systems Technology?

Finology Group – Insurtech Innovation Award 2024

presentation ICT roal in 21st century education

Scaling API-first – The story of a global engineering organization

Data Cloud, More than a CDP by Matt Robison

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx

Partners Life - Insurer Innovation Award 2024

Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16)

1. Till Rohrmann trohrmann@apache.org @stsffap Dynamic Scaling: How Apache Flink® Adapts to Changing Workloads

2. Changing Workloads And SLAs 2

3. Resource Adaption 3 + time Workload Resources time Workload Resources

4. What Is This Talk About? § Flink’s approach to dynamic scaling § Current state with demo § Outlook on next development steps 4

5. Dynamic Scaling 5

6. Basic Idea 6 • Spread work across more workers to decrease workload

7. Scaling Stateless Jobs 7 Scale Up Scale Down Source Mapper Sink • Scale up: Deploy new tasks • Scale down: Cancel running tasks

8. Scaling Stateful Jobs 8 ? • Problem: Which state to assign to new task?

9. State in Apache Flink 9

10. Keyed vs. Non-keyed State 10 • State bound to a key • E.g. Keyed UDF and window state • State bound to a subtask • E.g. Source state Keyed Non-keyed

11. Repartitioning Keyed State § Similar to consistent hashing § Split key space into key groups § Assign key groups to tasks 11 Key space Key group #1 Key group #2 Key group #3Key group #4

12. Repartitioning Keyed State contd. § Rescaling changes key group assignment § Maximum parallelism defined by #key groups 12

13. Repartitioning Non-keyed state § User defined merge and split functions • Most general approach § Breaking non-keyed state up into finer granularity • State has to contain multiple entries • Automatic repartitioning wrt granularity 13 #1 #2 #3

14. Repartitioning Non-keyed State contd. § Non-keyed state entries gathered at the job manager § Repartitioning schemes • Repartition & send • Union & broadcast 14

15. Example: Kafka Source 15 partitionId: 1, offset: 42 partitionId: 3, offset: 10 partitionId: 6, offset: 27 • Store offset for each partition • Individual entries are repartitionable partitionId: 1, offset: 42 partitionId: 3, offset: 10 partitionId: 6, offset: 27

16. Rescaling: Why is That so Hard? § Handling of state § Repartitioning of keyed & non-keyed state § Unique among open source stream processors, afaik 16

17. Demo Time 17

18. Demo Topology 18 Kafka Source Counter KeyBy

19. Current State and next Steps 19

20. Current State § Manual rescaling 1. Take savepoint 2. Stop the job 3. Restart job with adjusted parallelism and savepoint 20

21. Next Steps § Integrate savepoint with stop signal § Rescaling individual operators w/o restart § Dynamic container de-/allocation • “Running Flink Everywhere” by Stephan Ewen, 16:45 at Kesselhaus 21

22. Auto Scaling Policies 22 • Latency • Throughput • Resource utilization • Kubernetes on GCE, EC2 and Mesos (marathon- autoscale) already support auto-scaling

23. Conclusion § Scaling of keyed and non-keyed state § Flink supports manual rescaling with restart (WIP branch: https://github.com/tillrohrmann/flink/tree/partitionable-op-state) § Future versions might support scaling on the fly and automatic rescaling policies 23

Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16)

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16)

Similar to Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16) (20)

More from Till Rohrmann

More from Till Rohrmann (13)

Recently uploaded

Recently uploaded (20)

Dynamic Scaling: How Apache Flink Adapts to Changing Workloads (at FlinkForward 2016 09/12/16)