The Flink Table API was initially created to address the relational query use case. It has been a good addition to the DataStream and DataSet APIs, letting users write declarative queries. Moreover, the Table API provides a unified API for batch and stream processing. We have been exploring how to extend the Table API beyond classical relational queries, and with this work we are establishing an ecosystem on top of it. This talk introduces the following enhancements we have made to the Table API to expand its horizon. Most of this work has been, or will be, contributed back to Apache Flink. We will also share our experience of building an ecosystem around the Flink Table API, and our vision for the Table API in the future.
Non-relational processing API
Relational queries are natively supported by the Table API, and it is very powerful for expressing complicated computation logic. However, non-relational APIs come in handy for general-purpose computation. We have introduced a set of non-relational methods, such as map() and flatMap(), to the Table API in a systematic manner to improve the overall user experience.
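As a minimal sketch of what these non-relational methods look like in user code (the table, column names, and user-defined functions here are hypothetical; concrete signatures appear in the slides later in the deck):

```scala
// Sketch only: assumes a Table `tab` with columns 'name and 'score,
// a ScalarFunction `fun` that returns a Row, and a TableFunction `split`.
val mapped = tab
  .map(fun('name, 'score)).as('a, 'b, 'c)  // one row in, one row out
  .select('a, 'c)

val flatMapped = tab
  .flatMap(split('name)).as('word)         // one row in, N rows out
  .select('word)
```

The point is that a single UDF call transforms the whole row, rather than being repeated per column inside a relational select() or expressed as a join against a table function.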
Interactive programming
Ad-hoc queries are a common use case for processing engines, especially in batch processing. To meet the requirements of such use cases, we introduced interactive programming to the Table API, which allows users to cache intermediate results. We envision that the underlying service, which caches the intermediate Flink Table, will grow significantly to provide more sophisticated capabilities.
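As a sketch of the caching pattern (mirroring the sample program shown later in the deck; a TableEnvironment `tEnv` and an input collection `data` are assumed to exist):

```scala
// Sketch: assumes a TableEnvironment `tEnv` and a collection `data`.
val orders = tEnv.fromCollection(data).as('country, 'color, 'quantity)
val smallOrders = orders.filter('quantity < 100)
smallOrders.cache()  // later jobs reuse this intermediate result
                     // instead of recomputing the source and filter
smallOrders.select('country).distinct().print()
smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg).print()
```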
Iterative processing
Compared with DataSet and DataStream, one thing missing from Table is native iteration support. Instead of naively copying the iteration APIs from DataSet and DataStream, we designed a new API that addresses the caveats we have seen in their existing iteration support.
ML on Table API
One important part of the Flink ecosystem is ML. We have proposed to build an ML library on top of the Table API, so that algorithm engineers can also benefit from the optimizations provided by Flink, in both batch and stream jobs.
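To make the direction concrete, here is a heavily hedged sketch of what an ML algorithm on the Table API could look like; the class and method names below are illustrative assumptions, not the final API:

```scala
import org.apache.flink.table.api.Table

// Hypothetical fit/transform-style sketch; all names are illustrative.
// An algorithm consumes a Table and produces a Table (or a model), so
// the same code goes through the Table API optimizer in both batch and
// streaming jobs.
class LogisticRegression {
  def fit(training: Table): LogisticRegressionModel = {
    // iterate over the training Table to learn model weights
    ???
  }
}

class LogisticRegressionModel {
  def transform(input: Table): Table = {
    // append a 'prediction column to the input Table
    ???
  }
}
```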
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosystem - Shaoxuan Wang
1. Build a Table-centric Apache Flink Ecosystem
Shaoxuan Wang
Director, Senior Staff Engineer at Alibaba
2019 Flink Forward San Francisco
2. Shaoxuan Wang
Director, Senior Staff Engineer at Alibaba
shaoxuan.wsx@alibaba-inc.com / shaoxuan@apache.org
Flink Committer since 2017

Career:
- Broadcom, High-Perf Platform, Senior Software Engineer
- Facebook, Core Infra, Software Engineer
- Alibaba Group, Data Infra / Cloud, Senior Staff Engineer

Education:
- Peking University, EECS
- University of California at San Diego, PhD, Computer Engineering
3. Apache Flink Ecosystem - Present
Apache Flink is the most sophisticated open-source stream processor.
(Diagram: Flink deployed on YARN, reading from and writing to messaging queues, with Apache SAMOA in the surrounding ecosystem.)
5. Build Intelligent Big Data Platform with Apache Flink
(Diagram: Apache Flink as a unified compute engine spanning streaming compute, where Samza and Apex also sit, batch compute, and AI compute.)
6. Apache Flink Ecosystem - Future
(Diagram: the present ecosystem, Apache SAMOA, YARN, and messaging queues, extended with Flink ML and Flink DL for AI processing, plus batch processing, enhanced user experience, and container-centric management.)
7. Table API: declarative API with natural optimization
Stack (bottom to top): Runtime; DAG API & Operators; Query Processor (Query Optimizer & Query Executor); SQL & Table API (relational).
Deployment targets: Local (single JVM); Cluster (Standalone, YARN); Cloud (GCE, EC2).
The same SQL query runs in batch mode and in stream mode and produces the same results.
The Table API is similar to SQL: batch and stream unified, declarative, with a natural optimization framework.
8. Table API is more than SQL

Feature                    Table API   SQL   e.g.
Stream and batch unified   Y           Y     SELECT/AGG/WINDOW etc.
Functional scalability     Y           N     flatAgg/column operations etc.
Rich expression            Y           N     map/flatMap/intersect etc.
Compile check              Y           N     Java/Scala IDE

(Diagram: SQL is a subset of the Table API, which additionally offers map, flatMap, aggregate, flatAggregate, iteration, ...)
9. What Else Is Needed on the Table API
- Functionality and productivity of the API
- Interactive programming
- Multi-language support (e.g. Python)
- Iteration
- Integration of MLlib and deep learning engines
- Data storage
(Diagram: the future ecosystem with Flink ML, Flink DL, Apache SAMOA, YARN, and messaging queues.)
14. Map Operator in Table API

Method signature:
def map(scalarFunction: Expression): Table

Pros: one UDF call over the whole row instead of one call per column:
table.select(udf('c1), udf('c2), udf('c3)) vs table.map(udf('c1, 'c2, 'c3))

Input/Output: 1 row in -> 1 row out.

class MyMap extends ScalarFunction {
  var param: String = ""
  override def open(context: FunctionContext): Unit = {
    param = context.getJobParameter("paramKey", "")
  }
  def eval(index: Int): Row = {
    val result = new Row(3)
    // business processing based on data and parameters
    result
  }
  override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = {
    Types.ROW(Types.STRING, Types.INT, Types.LONG)
  }
}

Example:
val res = tab
  .map(fun('e)).as('a, 'b, 'c)
  .select('a, 'c)
15. FlatMap Operator in Table API

Method signature:
def flatMap(tableFunction: Expression): Table

Pros:
table.join(udtf) vs table.flatMap(udtf())

Input/Output: 1 row in -> N rows out (N >= 0).

case class User(name: String, age: Int)
class MyFlatMap extends TableFunction[User] {
  def eval(user: String): Unit = {
    if (user.contains("#")) {
      val splits = user.split("#")
      collect(User(splits(0), splits(1).toInt))
    }
  }
}

Example:
val res = tab
  .flatMap(fun('e, 'f)).as('name, 'age)
  .select('name, 'age)
16. Aggregate Operator in Table API

Method signature:
def aggregate(aggregateFunction: Expression): AggregatedTable
class AggregatedTable(table: Table, groupKeys: Seq[Expression], aggFunction: Expression)

Pros:
table.select(agg1(), agg2(), agg3(), ...) vs table.aggregate(agg())

Input/Output: N rows in (N >= 0) -> 1 row out.

class CountAccumulator extends JTuple1[Long] {
  f0 = 0L // count
}
class CountAgg extends AggregateFunction[JLong, CountAccumulator] {
  def accumulate(acc: CountAccumulator): Unit = {
    acc.f0 += 1L
  }
  override def getValue(acc: CountAccumulator): JLong = {
    acc.f0
  }
  // ... retract()/merge()
}

Example:
val res = groupedTab
  .groupBy('a)
  .aggregate(aggFun('e, 'f) as ('a, 'b, 'c))
  .select('a, 'c)
17. FlatAggregate Operator in Table API

Method signature:
def flatAggregate(tableAggregateFunction: Expression): GroupedFlatAggregateTable
class GroupedFlatAggTable(table: Table, groupKey: Seq[Expression], tableAggFun: Expression)

Pros: a completely new feature.

Input/Output: N rows in (N >= 0) -> M rows out (M >= 0).

class TopNAcc {
  var data: MapView[JInt, JLong] = _ // (rank -> value)
  ...
}
class TopN(n: Int) extends TableAggregateFunction[(Int, Long), TopNAcc] {
  def accumulate(acc: TopNAcc, category: Int, value: Long): Unit = {
    ...
  }
  def emitValue(acc: TopNAcc, out: Collector[(Int, Long)]): Unit = {
    ...
  }
  // ... retract/merge
}

Example:
val res = groupedTab
  .groupBy('a)
  .flatAggregate(flatAggFun('e, 'f) as ('a, 'b, 'c))
  .select('a, 'c)
18. Examples of Max (aggregate) and Top2 (flatAggregate)

Input table:
ID  NAME   PRICE
1   Latte  6
2   Milk   3
3   Breve  5
4   Mocha  8
5   Tea    4

Max (aggregate): tab.aggregate(maxAgg(PRICE))
The MAX accumulator is created with createAccumulator(), updated row by row with accumulate(acc, PRICE), and the single result is read with getValue(acc):
PRICE
8

Top2 (flatAggregate): tab.flatAggregate(topTableAgg(PRICE))
The Top2 accumulator keeps the two largest prices seen so far, and emitValue(acc, out) emits multiple rows:
ID  NAME   PRICE
4   Mocha  8
1   Latte  6
21. An example code snippet

{
  val orders = tEnv.fromCollection(data).as('country, 'color, 'quantity)
  val smallOrders = orders.filter('quantity < 100)
  val countriesOfSmallOrders = smallOrders.select('country).distinct()
  countriesOfSmallOrders.print()
  val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg)
  smallOrdersByColor.print()
}
22. Interactive Programming without Cache or External Storage
(Diagram: the user program submits job1, job2, and job3 through separate execution environments; each job reads the "orders" source from scratch.)
Cons:
- Jobs are independent
- Redundant computation: job1's work is recomputed by job2 and again by job3
23. Interactive Programming with External Storage
(Diagram: job1 writes "smallOrders" to external storage; job2 and job3 load "smallOrders" directly instead of recomputing "orders".)
Pros:
- No redundant computation
Cons:
- External storage required
- Explicit source/sink creation
24. Interactive Programming with Cached Result (FLIP-36)
(Diagram: the user program runs job1, job2, and job3 against a single Table Environment on a Flink cluster; a Cache Table Service holds the cached "smallOrders" table.)
Pros:
- Jobs run in the same context
- No external storage
- Better performance
25. Introducing Table.cache() for Interactive Programming

{
  val orders = tEnv.fromCollection(data).as('country, 'color, 'quantity)
  val smallOrders = orders.filter('quantity < 100)
  smallOrders.cache()
  val countriesOfSmallOrders = smallOrders.select('country).distinct()
  countriesOfSmallOrders.print()
  val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg)
  smallOrdersByColor.print()
}
27. Flink Python Table API
(Diagram: the Python Table API in a Python VM talks to a Java gateway server over RPC (Py4j) via a Python gateway; the Java process builds the DAG graph; Python UDFs run in a separate Python process between upstream input and downstream output.)
28. Multi-Language for Flink Table API
(Diagram: Python, Go, and R Table APIs, each with its own gateway in a different VM, connect to the Java gateway server in the JVM, which submits the DAG graph to a distributed Flink cluster.)
31. Deep Learning Pipeline on Flink
Data Ingestion -> Data Process and Transformation -> Model Training -> Testing and Validation -> Deployment
(Feedback loop: change the model or parameters, then retrain.)
32. Find More Details in Another Session
"When Table meets AI: Build Flink AI Ecosystem on Table API"
Shaoxuan Wang, Director, Senior Staff Engineer at Alibaba
4:30pm - 5:10pm, Nikko II & III
(Covers iteration and executing MLlib and deep learning engines on Flink.)
Our answer is yes. The foundation of AI computing is a big data computing engine. As long as Apache Flink offers both high-performance batch computing and stream computing, and provides good interfaces for algorithm users to implement algorithm libraries or to integrate a variety of machine learning training engines on Flink, it will be very promising to use Flink to build an intelligent big data platform.