The Flink Table API was initially created to address the relational query use case. It has been a good addition to the DataStream and DataSet APIs, letting users write declarative queries. Moreover, the Table API provides a unified API for batch and stream processing. We have been exploring how to extend the Table API beyond classical relational queries, and with this work we are establishing an ecosystem on top of it. This talk introduces the following enhancements we have made to the Table API to expand its horizon. Most of this work has been, or will be, contributed back to Apache Flink. We will also share our experience of building an ecosystem around the Flink Table API, and our vision for the Table API in the future.
Non-relational processing API
Relational queries are natively supported by the Table API, and it is very powerful for expressing complicated computation logic. However, non-relational APIs come in handy for general-purpose computation. We have introduced a set of non-relational methods, such as map() and flatMap(), to the Table API in a systematic manner to improve the overall user experience.
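As a minimal sketch of what these non-relational methods look like in user code (the table, column names, and user-defined functions here are hypothetical; concrete signatures appear in the slides later in the deck):

```scala
// Sketch only: assumes a Table `tab` with columns 'name and 'score,
// a ScalarFunction `fun` that returns a Row, and a TableFunction `split`.
val mapped = tab
  .map(fun('name, 'score)).as('a, 'b, 'c)  // one row in, one row out
  .select('a, 'c)

val flatMapped = tab
  .flatMap(split('name)).as('word)         // one row in, N rows out
  .select('word)
```

The point is that a single UDF call transforms the whole row, rather than being repeated per column inside a relational select() or expressed as a join against a table function.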
Interactive programming
Ad-hoc queries are a common use case for processing engines, especially in batch processing. To meet the requirements of such use cases, we introduced interactive programming to the Table API, which allows users to cache intermediate results. We envision that the underlying service, which caches the intermediate Flink Table, will grow significantly to provide more sophisticated capabilities.
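As a sketch of the caching pattern (mirroring the sample program shown later in the deck; a TableEnvironment `tEnv` and an input collection `data` are assumed to exist):

```scala
// Sketch: assumes a TableEnvironment `tEnv` and a collection `data`.
val orders = tEnv.fromCollection(data).as('country, 'color, 'quantity)
val smallOrders = orders.filter('quantity < 100)
smallOrders.cache()  // later jobs reuse this intermediate result
                     // instead of recomputing the source and filter
smallOrders.select('country).distinct().print()
smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg).print()
```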
Iterative processing
Compared with DataSet and DataStream, one thing missing from Table is native iteration support. Instead of naively copying the iteration APIs from DataSet and DataStream, we designed a new API that addresses the caveats we have seen in their existing iteration support.
ML on Table API
One important part of the Flink ecosystem is ML. We have proposed to build an ML library on top of the Table API, so that algorithm engineers can also benefit from the optimizations provided by Flink, in both batch and stream jobs.
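To make the direction concrete, here is a heavily hedged sketch of what an ML algorithm on the Table API could look like; the class and method names below are illustrative assumptions, not the final API:

```scala
import org.apache.flink.table.api.Table

// Hypothetical fit/transform-style sketch; all names are illustrative.
// An algorithm consumes a Table and produces a Table (or a model), so
// the same code goes through the Table API optimizer in both batch and
// streaming jobs.
class LogisticRegression {
  def fit(training: Table): LogisticRegressionModel = {
    // iterate over the training Table to learn model weights
    ???
  }
}

class LogisticRegressionModel {
  def transform(input: Table): Table = {
    // append a 'prediction column to the input Table
    ???
  }
}
```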
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosystem - Shaoxuan Wang
1. Build a Table-centric Apache Flink Ecosystem
Shaoxuan Wang
Director, Senior Staff Engineer at Alibaba
2019 Flink Forward San Francisco
2. Shaoxuan Wang
Director, Senior Staff Engineer at Alibaba
shaoxuan.wsx@alibaba-inc.com / shaoxuan@apache.org
Flink Committer since 2017

Career:
- Broadcom, High-Perf Platform, Senior Software Engineer
- Facebook, Core Infra, Software Engineer
- Alibaba Group, Data Infra / Cloud, Senior Staff Engineer

Education:
- Peking University, EECS
- University of California at San Diego, PhD, Computer Engineering
3. Apache Flink Ecosystem - Present
Apache Flink is the most sophisticated open-source stream processor.
(Diagram: Flink deployed on YARN, reading from and writing to messaging queues, with Apache SAMOA in the surrounding ecosystem.)
5. Build Intelligent Big Data Platform with Apache Flink
(Diagram: Apache Flink as a unified compute engine spanning streaming compute, where Samza and Apex also sit, batch compute, and AI compute.)
6. Apache Flink Ecosystem - Future
(Diagram: the present ecosystem, Apache SAMOA, YARN, and messaging queues, extended with Flink ML and Flink DL for AI processing, plus batch processing, enhanced user experience, and container-centric management.)
7. Table API: declarative API with natural optimization
Stack (bottom to top): Runtime; DAG API & Operators; Query Processor (Query Optimizer & Query Executor); SQL & Table API (relational).
Deployment targets: Local (single JVM); Cluster (Standalone, YARN); Cloud (GCE, EC2).
The same SQL query runs in batch mode and in stream mode and produces the same results.
The Table API is similar to SQL: batch and stream unified, declarative, with a natural optimization framework.
8. Table API is more than SQL

Feature                    Table API   SQL   e.g.
Stream and batch unified   Y           Y     SELECT/AGG/WINDOW etc.
Functional scalability     Y           N     flatAgg/column operations etc.
Rich expression            Y           N     map/flatMap/intersect etc.
Compile check              Y           N     Java/Scala IDE

(Diagram: SQL is a subset of the Table API, which additionally offers map, flatMap, aggregate, flatAggregate, iteration, ...)
9. What Else Is Needed on the Table API
- Functionality and productivity of the API
- Interactive programming
- Multi-language support (e.g. Python)
- Iteration
- Integration of MLlib and deep learning engines
- Data storage
(Diagram: the future ecosystem with Flink ML, Flink DL, Apache SAMOA, YARN, and messaging queues.)
14. Map Operator in Table API

Method signature:
def map(scalarFunction: Expression): Table

Pros: one UDF call over the whole row instead of one call per column:
table.select(udf('c1), udf('c2), udf('c3)) vs table.map(udf('c1, 'c2, 'c3))

Input/Output: 1 row in -> 1 row out.

class MyMap extends ScalarFunction {
  var param: String = ""
  override def open(context: FunctionContext): Unit = {
    param = context.getJobParameter("paramKey", "")
  }
  def eval(index: Int): Row = {
    val result = new Row(3)
    // business processing based on data and parameters
    result
  }
  override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = {
    Types.ROW(Types.STRING, Types.INT, Types.LONG)
  }
}

Example:
val res = tab
  .map(fun('e)).as('a, 'b, 'c)
  .select('a, 'c)
15. FlatMap Operator in Table API

Method signature:
def flatMap(tableFunction: Expression): Table

Pros:
table.join(udtf) vs table.flatMap(udtf())

Input/Output: 1 row in -> N rows out (N >= 0).

case class User(name: String, age: Int)
class MyFlatMap extends TableFunction[User] {
  def eval(user: String): Unit = {
    if (user.contains("#")) {
      val splits = user.split("#")
      collect(User(splits(0), splits(1).toInt))
    }
  }
}

Example:
val res = tab
  .flatMap(fun('e, 'f)).as('name, 'age)
  .select('name, 'age)
16. Aggregate Operator in Table API

Method signature:
def aggregate(aggregateFunction: Expression): AggregatedTable
class AggregatedTable(table: Table, groupKeys: Seq[Expression], aggFunction: Expression)

Pros:
table.select(agg1(), agg2(), agg3(), ...) vs table.aggregate(agg())

Input/Output: N rows in (N >= 0) -> 1 row out.

class CountAccumulator extends JTuple1[Long] {
  f0 = 0L // count
}
class CountAgg extends AggregateFunction[JLong, CountAccumulator] {
  def accumulate(acc: CountAccumulator): Unit = {
    acc.f0 += 1L
  }
  override def getValue(acc: CountAccumulator): JLong = {
    acc.f0
  }
  // ... retract()/merge()
}

Example:
val res = groupedTab
  .groupBy('a)
  .aggregate(aggFun('e, 'f) as ('a, 'b, 'c))
  .select('a, 'c)
17. FlatAggregate Operator in Table API

Method signature:
def flatAggregate(tableAggregateFunction: Expression): GroupedFlatAggregateTable
class GroupedFlatAggTable(table: Table, groupKey: Seq[Expression], tableAggFun: Expression)

Pros: a completely new feature.

Input/Output: N rows in (N >= 0) -> M rows out (M >= 0).

class TopNAcc {
  var data: MapView[JInt, JLong] = _ // (rank -> value)
  ...
}
class TopN(n: Int) extends TableAggregateFunction[(Int, Long), TopNAcc] {
  def accumulate(acc: TopNAcc, category: Int, value: Long): Unit = {
    ...
  }
  def emitValue(acc: TopNAcc, out: Collector[(Int, Long)]): Unit = {
    ...
  }
  // ... retract/merge
}

Example:
val res = groupedTab
  .groupBy('a)
  .flatAggregate(flatAggFun('e, 'f) as ('a, 'b, 'c))
  .select('a, 'c)
18. Examples of Max (aggregate) and Top2 (flatAggregate)

Input table:
ID  NAME   PRICE
1   Latte  6
2   Milk   3
3   Breve  5
4   Mocha  8
5   Tea    4

Max (aggregate): tab.aggregate(maxAgg(PRICE))
The MAX accumulator is created with createAccumulator(), updated row by row with accumulate(acc, PRICE), and the single result is read with getValue(acc):
PRICE
8

Top2 (flatAggregate): tab.flatAggregate(topTableAgg(PRICE))
The Top2 accumulator keeps the two largest prices seen so far, and emitValue(acc, out) emits multiple rows:
ID  NAME   PRICE
4   Mocha  8
1   Latte  6
21. An example code snippet

{
  val orders = tEnv.fromCollection(data).as('country, 'color, 'quantity)
  val smallOrders = orders.filter('quantity < 100)
  val countriesOfSmallOrders = smallOrders.select('country).distinct()
  countriesOfSmallOrders.print()
  val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg)
  smallOrdersByColor.print()
}
22. Interactive Programming without Cache or External Storage
(Diagram: the user program submits job1, job2, and job3 through separate execution environments; each job reads the "orders" source from scratch.)
Cons:
- Jobs are independent
- Redundant computation: job1's work is recomputed by job2 and again by job3
23. Interactive Programming with External Storage
(Diagram: job1 writes "smallOrders" to external storage; job2 and job3 load "smallOrders" directly instead of recomputing "orders".)
Pros:
- No redundant computation
Cons:
- External storage required
- Explicit source/sink creation
24. Interactive Programming with Cached Result (FLIP-36)
(Diagram: the user program runs job1, job2, and job3 against a single Table Environment on a Flink cluster; a Cache Table Service holds the cached "smallOrders" table.)
Pros:
- Jobs run in the same context
- No external storage
- Better performance
25. Introducing Table.cache() for Interactive Programming

{
  val orders = tEnv.fromCollection(data).as('country, 'color, 'quantity)
  val smallOrders = orders.filter('quantity < 100)
  smallOrders.cache()
  val countriesOfSmallOrders = smallOrders.select('country).distinct()
  countriesOfSmallOrders.print()
  val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg)
  smallOrdersByColor.print()
}
27. Flink Python Table API
(Diagram: the Python Table API in a Python VM talks to a Java gateway server over RPC (Py4j) via a Python gateway; the Java process builds the DAG graph; Python UDFs run in a separate Python process between upstream input and downstream output.)
28. Multi-Language for Flink Table API
(Diagram: Python, Go, and R Table APIs, each with its own gateway in a different VM, connect to the Java gateway server in the JVM, which submits the DAG graph to a distributed Flink cluster.)
31. Deep Learning Pipeline on Flink
Data Ingestion -> Data Process and Transformation -> Model Training -> Testing and Validation -> Deployment
(Feedback loop: change the model or parameters, then retrain.)
32. Find More Details in Another Session
"When Table meets AI: Build Flink AI Ecosystem on Table API"
Shaoxuan Wang, Director, Senior Staff Engineer at Alibaba
4:30pm - 5:10pm, Nikko II & III
(Covers iteration and executing MLlib and deep learning engines on Flink.)
Our answer is yes. The foundation of AI computing is a big data computing engine. As long as Apache Flink offers both high-performance batch computing and stream computing, and provides good interfaces for algorithm users to implement algorithm libraries or to integrate a variety of machine learning training engines on Flink, it will be very promising to use Flink to build an intelligent big data platform.