Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. And while the DataStream API can handle batch use cases, it is much less efficient at them than the DataSet API. The Table API was built as a unified API on top of both, covering batch and streaming with the same API and delegating under the hood to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for a better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like.
Time in a data stream must be quasi-monotonic to produce time progress (watermarks)
Always have close-to-latest incremental results
Resource requirements change over time
Recovery must catch up very fast
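To make "quasi-monotonic time" concrete, here is a toy bounded-out-of-orderness watermark generator in plain Java (a sketch with names of our own choosing, not Flink's API): the watermark trails the highest timestamp seen so far by a fixed delay, so time still progresses when records arrive slightly out of order.

```java
// Toy watermark generator (hypothetical, not Flink's API): the watermark trails
// the highest timestamp seen so far by a fixed delay, so event time progresses
// even when individual records arrive slightly out of order.
class BoundedOutOfOrdernessWatermarks {
    private final long maxDelayMillis;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedOutOfOrdernessWatermarks(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Observe one record's timestamp and return the current watermark. */
    long onEvent(long timestamp) {
        // Taking the max means the watermark never moves backwards.
        maxTimestampSeen = Math.max(maxTimestampSeen, timestamp);
        return maxTimestampSeen - maxDelayMillis;
    }
}
```

With a delay of 10, timestamps 100, 105, 103 yield watermarks 90, 95, 95: the late record 103 does not pull time backwards.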
Order of time in data does not matter (parallel unordered reads)
Bulk operations (two-phase hash/sort)
Longer time for recovery (no low latency SLA)
Resource requirements change fast throughout the execution of a single job
Understanding this difference will help later, when we discuss scheduling changes.
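A minimal plain-Java sketch of the contrast (no Flink involved; names are ours): batch computes one result over the complete, finite input, while streaming maintains incrementally updated state and has a close-to-latest result after every record.

```java
import java.util.List;

// Toy contrast: batch produces the result once over the whole bounded input;
// streaming keeps running state and emits an incremental result per record.
class BatchVsStreaming {
    // Batch style: the complete data set is available up front.
    static long batchSum(List<Long> boundedInput) {
        return boundedInput.stream().mapToLong(Long::longValue).sum();
    }

    // Streaming style: incrementally updated state, one intermediate result
    // per arriving record (the input may never end).
    static long[] streamingSums(List<Long> arrivingRecords) {
        long[] incrementalResults = new long[arrivingRecords.size()];
        long runningSum = 0;
        for (int i = 0; i < arrivingRecords.size(); i++) {
            runningSum += arrivingRecords.get(i);
            incrementalResults[i] = runningSum; // would be emitted downstream immediately
        }
        return incrementalResults;
    }
}
```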
Different requirements
Optimization potential for batch and streaming
Also: historic developments and slow-changing organizations
You have to decide between DataSet and DataStream when writing a job
Two (slightly) different APIs, with different capabilities
Different set of supported connectors: no Kafka DataSet connector, no HBase DataStream connector
Different performance characteristics
Different fault-tolerance behavior
Different scheduling logic
With Table API, you only have to learn one API
Still, the set of supported connectors depends on the underlying execution API
Feature set depends on whether there is an implementation for your underlying API
You cannot combine more batch-y with more stream-y sources/sinks
A “soft problem”: with two stacks of everything, less developer power goes into each individual stack: fewer features, worse performance, more bugs that are fixed more slowly
Recall the earlier processing-styles slide:
batch wants step by step
streaming is all at once
This has been mentioned a lot.
Lyft gave a talk about this at the last Flink Forward
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Batch:
Random reads
Coordinated by the JobManager (JM)
Streaming:
Sequential reads
No coordination between sources
This must support both batch and streaming use cases, allow Flink to be clever, deal with event time, watermarks, and source idiosyncrasies, and enable snapshotting
This should enable new features: generic idleness detection, event-time alignment*
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Talk about how this will enable event-time alignment for sources in a generic way
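A hypothetical sketch of what such a unified, split-based source could look like (all names are ours, not Flink's interfaces): splits (files, partitions, shards) are read by pull-based readers that report per-split progress. A bounded set of splits gives batch, an unbounded set gives streaming, and per-split watermarks are what make generic event-time alignment possible: the framework simply stops polling a reader that is too far ahead.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical reader over one "split" of a source (a file, partition, shard, ...).
interface SplitReader {
    /** Returns the next record, or null when the split is currently drained. */
    String pollNext();
    /** A toy notion of event-time progress within this split. */
    long currentWatermark();
}

class InMemorySplitReader implements SplitReader {
    private final Queue<String> records;
    private long watermark = 0;

    InMemorySplitReader(List<String> splitContents) {
        this.records = new ArrayDeque<>(splitContents);
    }

    @Override public String pollNext() {
        String record = records.poll();
        if (record != null) {
            watermark++;                 // toy: time advances one tick per record
        } else {
            watermark = Long.MAX_VALUE;  // drained: stop constraining alignment
        }
        return record;
    }

    @Override public long currentWatermark() { return watermark; }
}

class AlignedPolling {
    // Event-time alignment: only poll readers whose watermark does not exceed
    // the minimum across all splits, so no split races ahead of the others.
    static String pollAligned(List<SplitReader> readers) {
        long minWatermark = readers.stream()
                .mapToLong(SplitReader::currentWatermark).min().orElse(0L);
        for (SplitReader reader : readers) {
            if (reader.currentWatermark() <= minWatermark) {
                String record = reader.pollNext();
                if (record != null) return record;
            }
        }
        return null; // all aligned readers are currently drained
    }
}
```

Note how alignment falls out of the pull model for free: the framework decides which reader to poll, instead of sources pushing records at their own pace.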
Mention here that you can basically build your job jar to include flink-runtime and execute it any way you want: put it in Docker, embed it in Spring Boot, or just start multiple instances.
As-a-library mode
Note that this nicely jibes with the pull-based model. Enables the things we need for batch.
Mention the dog with the hose. Sources just keep spitting out records as fast as they can.
Possibly put these on separate slides, with fewer words. Or even some graphics.
There are some quirks when you use DataStream for batch:
A groupReduce would be a window operation with a GlobalWindow
MapPartition would have to finalize things in close()
Joins would have to specify a global window
Of course, state requirements are bad for the naïve approach, i.e., large state and inefficient access patterns
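A toy illustration (plain Java, not Flink's operator API) of the "finalize at end-of-input" pattern behind these quirks: everything is buffered in state, and results can only be produced in close(), which is exactly why the naive approach implies large state and unfavorable access patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy "groupReduce as GlobalWindow" operator: on bounded input, a streaming-style
// operator must hold ALL records in state and can only emit once the input ends.
class BufferingGroupReduce {
    // Everything is buffered in (potentially huge) state until end-of-input.
    private final Map<String, List<Integer>> buffered = new TreeMap<>();

    void processElement(String key, int value) {
        buffered.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** End of the bounded input: run the reduce over each fully buffered group. */
    List<String> close() {
        List<String> output = new ArrayList<>();
        buffered.forEach((key, values) -> output.add(
                key + "=" + values.stream().mapToInt(Integer::intValue).sum()));
        return output;
    }
}
```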
Joins and grouping can be a lot faster with specific algorithms
Hash join, merge join, etc.
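For intuition, a toy in-memory hash join in plain Java (a sketch of the classic algorithm, not Flink's implementation): build a hash table over one bounded side in a single pass, then stream the other side past it. A generic streaming join cannot do this, since either side may still grow and both must therefore be kept in state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy hash join over (key, payload) pairs: one pass to build, one pass to probe.
class HashJoin {
    static List<String> join(List<String[]> buildSide, List<String[]> probeSide) {
        // Build phase: hash table over the (ideally smaller) bounded build side.
        Map<String, List<String>> table = new HashMap<>();
        for (String[] kv : buildSide) {
            table.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        // Probe phase: a single sequential pass over the other input.
        List<String> joined = new ArrayList<>();
        for (String[] kv : probeSide) {
            for (String payload : table.getOrDefault(kv[0], List.of())) {
                joined.add(kv[0] + ":" + payload + "," + kv[1]);
            }
        }
        return joined;
    }
}
```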
For example:
Different window operator
Different join implementations
The scheduling and networking changes would be a whole talk on their own; memory management is another topic.
A pull-based operator model is how most databases were/are implemented.
Note how the pull model enables hash join, merge join, …
Side inputs benefit from a pull-based model
Bring the dog-drinking-from-hose example, also for the join operator
This will allow porting batch operators/algorithms to StreamOperator
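A minimal sketch of the pull-based ("Volcano"-style) operator model in plain Java (illustrative names, not Flink's StreamOperator API): each operator asks its input for the next record, so it controls when and from which input it consumes. That control is precisely what hash/merge joins and side inputs need.

```java
// Toy pull-based operator: downstream asks upstream for records, instead of
// upstream pushing records down as fast as it can.
interface PullOperator<T> {
    /** Returns the next record, or null once the input is exhausted. */
    T next();
}

// A bounded source producing the integers [start, end).
class RangeSource implements PullOperator<Integer> {
    private int current;
    private final int end;
    RangeSource(int start, int end) { this.current = start; this.end = end; }
    @Override public Integer next() { return current < end ? current++ : null; }
}

// A transforming operator composed on top of another pull operator.
class DoubleOperator implements PullOperator<Integer> {
    private final PullOperator<Integer> input;
    DoubleOperator(PullOperator<Integer> input) { this.input = input; }
    @Override public Integer next() {
        // The operator decides WHEN to consume its input; a hash or merge join
        // would use the same freedom to fully drain one side before the other.
        Integer record = input.next();
        return record == null ? null : record * 2;
    }
}
```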