SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Build a Table-centric Apache
Flink Ecosystem
Shaoxuan Wang
Director, Senior Staff Engineer at Alibaba
2019 Flink Forward San Francisco
Broadcom
High-Perf Platform
Senior Software Engineer
Facebook
Core Infra
Software Engineer
Alibaba Group
Data Infra / Cloud
Senior Staff Engineer
Peking University
EECS
University of California
at San Diego
PhD, Computer Engineer
Flink Committer
Since 2017
Shaoxuan Wang
Director, Senior Staff Engineer at
Alibaba
shaoxuan.wsx@alibaba-inc.com
shaoxuan@apache.org
Apache SAMOA
Yarn
Apache Flink is the most sophisticated open-source Stream Processor
Messaging
Queue
Messaging
Queue
Apache Flink Ecosystem - Present
Business
Intelligence Artificial
Intelligence
Big Data
Infrastructure
Can Apache Flink become an unified engine for intelligent big data computing?
Intelligent Big Data Computing
Batch & stream
processing
AI Processing
Streaming Compute
samza Apex
Batch Compute
Apache Flink
Unified Compute
Engine
AI Compute
Build Intelligent Big Data Platform with Apache Flink
Apache SAMOA
Yarn
Flink DL
Flink ML
Messaging
Queue
Messaging
Queue
AI
Processing
Enhance User
Experience Container-centric
Management
Apache Flink Ecosystem - Future
Batch
Processing
Runtime
DAG API & Operators
Query Processor
Query Optimizer & Query Executor
SQL & Table API
Relational
Local
Single JVM
Cloud
GCE, EC2
Cluster
Standalone, YARN
Same Results
Batch mode
Same SQL Query
Stream Mode
TableAPI: declaritive API with nature Optimization
similar as SQL, batch&stream unified, declarative API with
nature optimization framework
TableAPI is more than SQL
TableAPI SQL e.g.
Stream and batch unified Y Y
SELECT/AGG/WIN
DOW etc.
Functional scalability Y N
flatAgg/Column
operations etc.
Rich expression Y N
map/flatMap/intersec
t etc.
Compile check Y N Java/Scala IDE
SQL
Table API
map
flatMap
Iteration
flatAggregate
…
Aggregate
Apache SAMOA
Yarn
Flink DL
Flink ML
Messaging
Queue
Messaging
Queue
Interactive
Programming
Multi-Language
(e.x. Python)
Iteration
Integration of Mllib and
Deep Learning Engine
Functionality and
productivity of API
What Else are Needed on TableAPI
Data Storage
Multi-languageInteractive
programming
Functionality
and Productivity
Enhancement
Iteration Execute MLlib
and DL engine
What Else are Needed on TableAPI
Agenda
TableAPI Enhancement
Row-based data
processing API
HintColumn
operation
Functionality,
productivity
Productivity
optimization instruction,
Resource configuration
Introduce Row-based TableAPI (FLIP29)
Map Operator in Table API
Method signature
def map(scalaFunction: Expression): Table
Pros
table.select(udf(‘c1), udf(’c2), udf(’c3))
VS
table.map(udf(’c1,’c2,’c3))
INPUT(Row) OUTPUT(Row)
1 1
class MyMap extends ScalarFunction {
var param : String = ""
override def open(context: FunctionContext): Unit
= param = context.getJobParameter("paramKey","")
def eval(index: Int): Row = {
val result = new Row(3)
// Business processing based on data and parameters
result
}
override def getResultType(signature: Array[Class[_]]): TypeInformation[_]
= {
Types.ROW(Types.STRING, Types.INT, Types.LONG)
}
}
Example
val res = tab
.map(fun(‘e)).as(‘a, ‘b, ‘c)
.select(‘a, ‘c)
FatMap Operator in Table API
Method signature
def flatMap(tableFunction: Expression): Table
Pros
table.join(udtf) VS table.flatMap(udtf())
INPUT(Row) OUTPUT(Row)
1 N(N>=0)
case class User(name: String, age: Int)
class MyFlatMap extends TableFunction[User] {
def eval(user: String): Unit = {
if (user.contains("#")) {
val splits = user.split("#")
collect(User(splits(0),splits(1).toInt))
}
}
}
Example
val res = tab
.flatMap(fun(‘e,’f)).as(‘name, ‘age)
.select(‘name, ‘age)
Aggregate Operator in Table API
Method signature
def aggregate(aggregateFunction: Expression): AggregatedTable
class AggregatedTable(table: Table, groupKeys: Seq[Expression], aggFunction: Expression)
Pros
table.select(agg1(), agg2(), agg3()….)
VS
table.aggregate(agg())
INPUT(Row) OUTPUT(Row)
N(N>=0) 1
class CountAccumulator extends JTuple1[Long] {
f0 = 0L //count
}
class CountAgg extends AggregateFunction[JLong, CountAccumulator] {
def accumulate(acc: CountAccumulator): Unit = {
acc.f0 += 1L
}
override def getValue(acc: CountAccumulator): JLong = {
acc.f0
}
... retract()/merge()
}
Example
val res = groupedTab
.groupBy(‘a)
.aggregate(
aggFun(‘e,’f) as (‘a, ‘b, ‘c))
.select(‘a, ‘c)
FlatAggregate Operator in Table API
Method signature
def flatAggregate(tableAggregateFunction: Expression): GroupedFlatAggregateTable
class GroupedFlatAggTable(table: Table, groupKey: Seq[Expression], tableAggFun: Expression)
Pros
A completely new feature
INPUT(Row) OUTPUT(Row)
N(N>=0) M(M>=0)
class TopNAcc {
var data: MapView[JInt, JLong] = _ // (rank -> value)
...
}
class TopN(n: Int) extends TableAggregateFunction[(Int, Long), TopNAccum] {
def accumulate(acc: TopNAcc, category: Int, value: Long) {
...
}
def emitValue(acc: TopNAcc, out: Collector[(Int, Long)]): Unit = {
...
}
...retract/merge
}
Example
val res = groupedTab
.groupBy(‘a)
.faltAggregate(
flatAggFun(‘e,’f) as (‘a, ‘b, ‘c))
.select(‘a, ‘c)
Examples of Aggregate and FlatAggregate
ID NAME PRICE
1 Latte 6
2 Milk 3
3 Breve 5
4 Mocha 8
5 Tea 4
Accumulator
(MAX-ACC)
createAccumulator()
accumulate(acc, PRICE)
tab.flatAggregate(topTableAgg(PRICE))
Examples of Max (aggregate) and Top2 (flatAggregate)
tab.aggregate(maxAgg(PRICE))
getValue(ACC acc)
PRICE
8
8 8 6 6 6
Accumulator
(Top2-ACC)
ID NAME PRICE
4 Mocha 8
1 Latte 6
66,36,58,68,6
emitValue(acc, out)
1
2
3 3
Summary
Single Row Input Multiple Row Input
Single Row Output
ScalarFunction
(select/map)
AggregateFunction
(select/aggregate)
Multiple Row Output
TableFunction
(cross join/flatmap)
TableAggregateFunction
(flatAggregate)
Agenda
A example code snippet
{
val orders = tEnv.fromCollection(data).as ('country, 'color, 'quantity)
val smallOrders = orders.filter('quantity < 100)
val countriesOfSmallOrders = smallOrders.select('country).distinct()
countriesOfSmallOrders.print()
val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg)
smallOrdersByColor.print()
}
Interactive Programming without Cache or External
Storage
Execution EnvironmentUser Program
job1
Execution Environment
job2
Execution Environment
job3
orders
Source
Cons:
Jobs are independent
Redundant computation
Job1 is
recomputed
Job1 is
recomputed
Interactive Programming with External Storage
Execution EnvironmentUser Program
job1
External Storage
Execution Environment
job2
Execution Environment
job3
smallOrders
orders
Source
Pros:
No redundant computation
Cons:
External storage required
Explicit source/sink creation
Instead of recomputing “orders”,
directly load “smallOrders”
Interactive Programming with Cached Result
(FLIP36)
SourceTable Environment
User Program Flink Cluster
job1
Cached Tables
smallOrders
job2
job3
orders
Pros:
Jobs in the same context
No external storage
Better performance
Cache Table
Service
{
val orders = tEnv.fromCollection(data).as ('country, 'color, 'quantity)
val smallOrders = orders.filter('quantity < 100)
smallOrders.cache()
val countriesOfSmallOrders = smallOrders.select('country).distinct()
countriesOfSmallOrders.print()
val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg)
smallOrdersByColor.print()
}
Introducing Table.cache() for Interactive Programming
Agenda
Flink Python TableAPI
Python
process
Java process
input
Python TableAPI Python UDF
Python
TableAPI
Java gateway
Server
RPC (Py4j)
Python
gateway
Python VM
DAGGragh
upstream input
downstream output
output
Multi-Language for Flink TableAPI
Python
TableAPI
Java gateway
Server
Python
gateway
Go
TableAPI
Go
gateway
Flink Cluster
R
TableAPI
R
gateway
different VM JVM distributed
cluster
DAGGragh
Agenda
PipelineStage
EstimatorTransformerModel
K-Means
NaiveBayes
Linearregression
DecisionTree
RandomForest
GBDT
Table based ML Pipeline
EstimatorTransformer
table2=Transformer.tr
ansform(table1) Estimator.fit(table2)
ML Lib Developers ML Lib Users
ML Pipeline - Overview (Target for Flink 1.9)
…… Input
Table
Output
Table
Deep Learning Pipeline on Flink
Data Ingestion
Data Process and
Transformation
Model
Training
Testing and
Validation
Deployment
Change model or
parameters
When Table meets AI: Build
Flink AI Ecosystem on
Table API
Shaoxuan Wang,
Director, Senior Staff Engineer at
Alibaba
4:30pm - 5:10pm
Nikko II & III
Iteration Execute MLlib
and DL engine
Find More Details in Another Session
Summary
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosystem - Shaoxuan Wang

Weitere ähnliche Inhalte

Was ist angesagt?

Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Stephan Ewen
 
Extending Flux - Writing Your Own Functions by Adam Anthony
Extending Flux - Writing Your Own Functions by Adam AnthonyExtending Flux - Writing Your Own Functions by Adam Anthony
Extending Flux - Writing Your Own Functions by Adam AnthonyInfluxData
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Flink Forward
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...ucelebi
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark Anyscale
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and IterationsSameer Wadkar
 
Python Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkPython Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkAljoscha Krettek
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Sigmoid
 
Streaming sql w kafka and flink
Streaming sql w  kafka and flinkStreaming sql w  kafka and flink
Streaming sql w kafka and flinkKenny Gorman
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedDatabricks
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
 
Flux and InfluxDB 2.0
Flux and InfluxDB 2.0Flux and InfluxDB 2.0
Flux and InfluxDB 2.0InfluxData
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep DiveVasia Kalavri
 

Was ist angesagt? (20)

Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Extending Flux - Writing Your Own Functions by Adam Anthony
Extending Flux - Writing Your Own Functions by Adam AnthonyExtending Flux - Writing Your Own Functions by Adam Anthony
Extending Flux - Writing Your Own Functions by Adam Anthony
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Intro to Akka Streams
Intro to Akka StreamsIntro to Akka Streams
Intro to Akka Streams
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
 
Meetup spark structured streaming
Meetup spark structured streamingMeetup spark structured streaming
Meetup spark structured streaming
 
Python Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkPython Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on Flink
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1
 
Streaming sql w kafka and flink
Streaming sql w  kafka and flinkStreaming sql w  kafka and flink
Streaming sql w kafka and flink
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Flux and InfluxDB 2.0
Flux and InfluxDB 2.0Flux and InfluxDB 2.0
Flux and InfluxDB 2.0
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 

Ähnlich wie Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosystem - Shaoxuan Wang

Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...confluent
 
1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml1.5.recommending music with apache spark ml
1.5.recommending music with apache spark mlleorick lin
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveTools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveJeff Smith
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Samir Bessalah
 
Beginning Scala Svcc 2009
Beginning Scala Svcc 2009Beginning Scala Svcc 2009
Beginning Scala Svcc 2009David Pollak
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flinkmxmxm
 
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapperSF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapperChester Chen
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...DataWorks Summit
 
How to make the fastest Router in Python
How to make the fastest Router in PythonHow to make the fastest Router in Python
How to make the fastest Router in Pythonkwatch
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsMatt Stubbs
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
Gdg almaty. Функциональное программирование в Java 8
Gdg almaty. Функциональное программирование в Java 8Gdg almaty. Функциональное программирование в Java 8
Gdg almaty. Функциональное программирование в Java 8Madina Kamzina
 
How to ship customer value faster with step functions
How to ship customer value faster with step functionsHow to ship customer value faster with step functions
How to ship customer value faster with step functionsYan Cui
 
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemWprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemSages
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopSages
 
PHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPatrick Allaert
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Anyscale
 

Ähnlich wie Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosystem - Shaoxuan Wang (20)

Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
Kafka Summit NYC 2017 - Easy, Scalable, Fault-tolerant Stream Processing with...
 
1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml1.5.recommending music with apache spark ml
1.5.recommending music with apache spark ml
 
Tools for Making Machine Learning more Reactive
Tools for Making Machine Learning more ReactiveTools for Making Machine Learning more Reactive
Tools for Making Machine Learning more Reactive
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
Beginning Scala Svcc 2009
Beginning Scala Svcc 2009Beginning Scala Svcc 2009
Beginning Scala Svcc 2009
 
Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapperSF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
 
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...
 
How to make the fastest Router in Python
How to make the fastest Router in PythonHow to make the fastest Router in Python
How to make the fastest Router in Python
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Gdg almaty. Функциональное программирование в Java 8
Gdg almaty. Функциональное программирование в Java 8Gdg almaty. Функциональное программирование в Java 8
Gdg almaty. Функциональное программирование в Java 8
 
How to ship customer value faster with step functions
How to ship customer value faster with step functionsHow to ship customer value faster with step functions
How to ship customer value faster with step functions
 
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data EcosystemWprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
Wprowadzenie do technologii Big Data / Intro to Big Data Ecosystem
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
 
What is new in Java 8
What is new in Java 8What is new in Java 8
What is new in Java 8
 
Java 8 Workshop
Java 8 WorkshopJava 8 Workshop
Java 8 Workshop
 
PHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & PinbaPHP applications/environments monitoring: APM & Pinba
PHP applications/environments monitoring: APM & Pinba
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 

Mehr von Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

Mehr von Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Kürzlich hochgeladen

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Kürzlich hochgeladen (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosystem - Shaoxuan Wang

  • 1. Build a Table-centric Apache Flink Ecosystem Shaoxuan Wang Director, Senior Staff Engineer at Alibaba 2019 Flink Forward San Francisco
  • 2. Broadcom High-Perf Platform Senior Software Engineer Facebook Core Infra Software Engineer Alibaba Group Data Infra / Cloud Senior Staff Engineer Peking University EECS University of California at San Diego PhD, Computer Engineer Flink Committer Since 2017 Shaoxuan Wang Director, Senior Staff Engineer at Alibaba shaoxuan.wsx@alibaba-inc.com shaoxuan@apache.org
  • 3. Apache SAMOA Yarn Apache Flink is the most sophisticated open-source Stream Processor Messaging Queue Messaging Queue Apache Flink Ecosystem - Present
  • 4. Business Intelligence Artificial Intelligence Big Data Infrastructure Can Apache Flink become an unified engine for intelligent big data computing? Intelligent Big Data Computing Batch & stream processing AI Processing
  • 5. Streaming Compute samza Apex Batch Compute Apache Flink Unified Compute Engine AI Compute Build Intelligent Big Data Platform with Apache Flink
  • 6. Apache SAMOA Yarn Flink DL Flink ML Messaging Queue Messaging Queue AI Processing Enhance User Experience Container-centric Management Apache Flink Ecosystem - Future Batch Processing
  • 7. Runtime DAG API & Operators Query Processor Query Optimizer & Query Executor SQL & Table API Relational Local Single JVM Cloud GCE, EC2 Cluster Standalone, YARN Same Results Batch mode Same SQL Query Stream Mode TableAPI: declaritive API with nature Optimization similar as SQL, batch&stream unified, declarative API with nature optimization framework
  • 8. TableAPI is more than SQL TableAPI SQL e.g. Stream and batch unified Y Y SELECT/AGG/WIN DOW etc. Functional scalability Y N flatAgg/Column operations etc. Rich expression Y N map/flatMap/intersec t etc. Compile check Y N Java/Scala IDE SQL Table API map flatMap Iteration flatAggregate … Aggregate
  • 9. Apache SAMOA Yarn Flink DL Flink ML Messaging Queue Messaging Queue Interactive Programming Multi-Language (e.x. Python) Iteration Integration of Mllib and Deep Learning Engine Functionality and productivity of API What Else are Needed on TableAPI Data Storage
  • 12. TableAPI Enhancement Row-based data processing API HintColumn operation Functionality, productivity Productivity optimization instruction, Resource configuration
  • 14. Map Operator in Table API Method signature def map(scalaFunction: Expression): Table Pros table.select(udf(‘c1), udf(’c2), udf(’c3)) VS table.map(udf(’c1,’c2,’c3)) INPUT(Row) OUTPUT(Row) 1 1 class MyMap extends ScalarFunction { var param : String = "" override def open(context: FunctionContext): Unit = param = context.getJobParameter("paramKey","") def eval(index: Int): Row = { val result = new Row(3) // Business processing based on data and parameters result } override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = { Types.ROW(Types.STRING, Types.INT, Types.LONG) } } Example val res = tab .map(fun(‘e)).as(‘a, ‘b, ‘c) .select(‘a, ‘c)
  • 15. FatMap Operator in Table API Method signature def flatMap(tableFunction: Expression): Table Pros table.join(udtf) VS table.flatMap(udtf()) INPUT(Row) OUTPUT(Row) 1 N(N>=0) case class User(name: String, age: Int) class MyFlatMap extends TableFunction[User] { def eval(user: String): Unit = { if (user.contains("#")) { val splits = user.split("#") collect(User(splits(0),splits(1).toInt)) } } } Example val res = tab .flatMap(fun(‘e,’f)).as(‘name, ‘age) .select(‘name, ‘age)
  • 16. Aggregate Operator in Table API Method signature def aggregate(aggregateFunction: Expression): AggregatedTable class AggregatedTable(table: Table, groupKeys: Seq[Expression], aggFunction: Expression) Pros table.select(agg1(), agg2(), agg3()….) VS table.aggregate(agg()) INPUT(Row) OUTPUT(Row) N(N>=0) 1 class CountAccumulator extends JTuple1[Long] { f0 = 0L //count } class CountAgg extends AggregateFunction[JLong, CountAccumulator] { def accumulate(acc: CountAccumulator): Unit = { acc.f0 += 1L } override def getValue(acc: CountAccumulator): JLong = { acc.f0 } ... retract()/merge() } Example val res = groupedTab .groupBy(‘a) .aggregate( aggFun(‘e,’f) as (‘a, ‘b, ‘c)) .select(‘a, ‘c)
  • 17. FlatAggregate Operator in Table API Method signature def flatAggregate(tableAggregateFunction: Expression): GroupedFlatAggregateTable class GroupedFlatAggTable(table: Table, groupKey: Seq[Expression], tableAggFun: Expression) Pros A completely new feature INPUT(Row) OUTPUT(Row) N(N>=0) M(M>=0) class TopNAcc { var data: MapView[JInt, JLong] = _ // (rank -> value) ... } class TopN(n: Int) extends TableAggregateFunction[(Int, Long), TopNAccum] { def accumulate(acc: TopNAcc, category: Int, value: Long) { ... } def emitValue(acc: TopNAcc, out: Collector[(Int, Long)]): Unit = { ... } ...retract/merge } Example val res = groupedTab .groupBy(‘a) .faltAggregate( flatAggFun(‘e,’f) as (‘a, ‘b, ‘c)) .select(‘a, ‘c)
  • 18. Examples of Aggregate and FlatAggregate ID NAME PRICE 1 Latte 6 2 Milk 3 3 Breve 5 4 Mocha 8 5 Tea 4 Accumulator (MAX-ACC) createAccumulator() accumulate(acc, PRICE) tab.flatAggregate(topTableAgg(PRICE)) Examples of Max (aggregate) and Top2 (flatAggregate) tab.aggregate(maxAgg(PRICE)) getValue(ACC acc) PRICE 8 8 8 6 6 6 Accumulator (Top2-ACC) ID NAME PRICE 4 Mocha 8 1 Latte 6 66,36,58,68,6 emitValue(acc, out) 1 2 3 3
  • 19. Summary Single Row Input Multiple Row Input Single Row Output ScalarFunction (select/map) AggregateFunction (select/aggregate) Multiple Row Output TableFunction (cross join/flatmap) TableAggregateFunction (flatAggregate)
  • 21. A example code snippet { val orders = tEnv.fromCollection(data).as ('country, 'color, 'quantity) val smallOrders = orders.filter('quantity < 100) val countriesOfSmallOrders = smallOrders.select('country).distinct() countriesOfSmallOrders.print() val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg) smallOrdersByColor.print() }
  • 22. Interactive Programming without Cache or External Storage Execution EnvironmentUser Program job1 Execution Environment job2 Execution Environment job3 orders Source Cons: Jobs are independent Redundant computation Job1 is recomputed Job1 is recomputed
  • 23. Interactive Programming with External Storage Execution EnvironmentUser Program job1 External Storage Execution Environment job2 Execution Environment job3 smallOrders orders Source Pros: No redundant computation Cons: External storage required Explicit source/sink creation Instead of recomputing “orders”, directly load “smallOrders”
  • 24. Interactive Programming with Cached Result (FLIP36) SourceTable Environment User Program Flink Cluster job1 Cached Tables smallOrders job2 job3 orders Pros: Jobs in the same context No external storage Better performance Cache Table Service
  • 25. { val orders = tEnv.fromCollection(data).as ('country, 'color, 'quantity) val smallOrders = orders.filter('quantity < 100) smallOrders.cache() val countriesOfSmallOrders = smallOrders.select('country).distinct() countriesOfSmallOrders.print() val smallOrdersByColor = smallOrders.groupBy('color).select('color, 'quantity.avg as 'avg) smallOrdersByColor.print() } Introducing Table.cache() for Interactive Programming
  • 27. Flink Python TableAPI Python process Java process input Python TableAPI Python UDF Python TableAPI Java gateway Server RPC (Py4j) Python gateway Python VM DAGGragh upstream input downstream output output
  • 28. Multi-Language for Flink TableAPI Python TableAPI Java gateway Server Python gateway Go TableAPI Go gateway Flink Cluster R TableAPI R gateway different VM JVM distributed cluster DAGGragh
  • 30. PipelineStage EstimatorTransformerModel K-Means NaiveBayes Linearregression DecisionTree RandomForest GBDT Table based ML Pipeline EstimatorTransformer table2=Transformer.tr ansform(table1) Estimator.fit(table2) ML Lib Developers ML Lib Users ML Pipeline - Overview (Target for Flink 1.9) …… Input Table Output Table
  • 31. Deep Learning Pipeline on Flink Data Ingestion Data Process and Transformation Model Training Testing and Validation Deployment Change model or parameters
  • 32. When Table meets AI: Build Flink AI Ecosystem on Table API Shaoxuan Wang, Director, Senior Staff Engineer at Alibaba 4:30pm - 5:10pm Nikko II & III Iteration Execute MLlib and DL engine Find More Details in Another Session

Hinweis der Redaktion

  1. Our answer is Yes. The foundation of AI computing is the big data computing engine. As long as Apache flink has both high-performance batch computing and stream computing capabilities, and provides a good interface for algorithm users to implement the algorithm library, or integrate a variety of machine learning training engines on flink, it will be very promising to use the flink to build an Intelligent Big Data Platform. 我们的答案是Yes AI computing的基础是大数据计算引擎。 只要让Apache flink同时具有高性能的批计算和流计算的能力,再提供很好的接口让算法用户实现算法library,或者在flink上集成各式各样的机器学习训练引擎,就能很好的实现一套完备的Intelligent Big Data Platform。
  2. 下来,我介绍一下interactive programming on tableAPI.