SlideShare ist ein Scribd-Unternehmen logo
1 von 20
Downloaden Sie, um offline zu lesen
Apache Flink Tutorial
DataSet API
Who Am I?
● Newbie in Apache Flink
● BlueWell Technology
○ Big Data Architect
○ Focuses
■ Open DC/OS
■ CoreOS
■ Kubernetes
■ Apache Flink
■ Data Science
Agenda
● Apache Flink Type System
○ Atomic
○ Composite
○ Tuple
○ POJO
○ Scala case class
● Transformations
○ Transformations on DataSet
○ Rich Functions
○ Accumulators & Counters
○ Annotations
Agenda
● Iterations
○ Bulk iterations
○ Delta iterations
Basic Structure Of Apache Flink Programs
● For each Apache Flink Program, the basic structure is listed as follows.
○ Obtain an execution environment.
■ ExecutionEnvironment.getExecutionEnvironment()
○ Load/create DataSets from data sources.
■ readFile(), readTextFile(), readCsvFile(), etc.
■ fromElements(), fromCollection(), etc.
○ Execute some transformations on the DataSets.
■ filter(), map(), reduce(), etc.
○ Specify where to save results of the computations.
■ write(), writeAsCsv(), etc.
■ collect(), print(), etc.
○ Trigger the program execution.
Hands-on
BasicStructure
Apache Flink Type System
● Flink attempts to support all data types
○ Facilitate programming
○ Seamlessly integrate with legacy code
● Flink analyzes applications before execution for
○ Identifying used data types
○ Determining serializers and comparators
● Data types could be
○ Atomic data types
○ Composite data types
Composite Data Types
● Composite Data Types include
○ Tuples
■ In Java
■ In Scala
○ POJOs
○ Scala case class
Tuple Data Types
● Flink supports Tuple in
○ Java: org.apache.flink.api.java.tuple.Tuple<n>
■ n = 1, …, 25
○ Scala: premitive Tuple<n>
■ n = 1, …, 22
● Key declarations
○ Field index
○ E.g., dataset.groupBy(0).sum(1)
○ E.g., dataset.groupBy(“_1”).sum(“_2”)
POJOs
● POJOs – A java class with
○ A default constructor without parameters.
○ All fields are
■ public or
■ private but have getter & setter
■ Ex.
public class Car {
public int id;
public String brand;
public Car() {};
public Car(int id, String brand) {…};
}
POJOs
● Key Declarations
○ Field name as a string
○ E.g., cars.groupBy(“brand”)
Scala Case Class
● Primitive Scala case classes are supported
○ E.g., case class Car(id: Int, brand: String)
● Key declarations
○ Field name as a string
■ E.g., cars.groupBy(“brand”)
○ Field name
■ E.g., cars.groupBy(_.brand)
Hands-on
TypeSystem
DataSet Transformations
Hands-on
Transformation
Introduction To Rich Functions
● Purpose
○ Implement complicated user-defined functions
○ Broadcast variables
○ Access to Accumulators & Counters
● Structure
○ open(): initialization, one-shot setup work.
○ close(): tear-down, clean-up work.
○ get/setRuntimeContext(): access to RuntimeContext.
○ corresponding transformation method, e.g., map, join, etc.
Broadcast Variables
● Register
○ dataset.map(new RichMapFunction())
.withBroadcastSet(toBroadcast, “varName”)
● Access in Rich Functions
○ Initialize the broadcasted variables in open() via
○ getRuntimeContext().getBroadcastVariable(“varName”)
○ Access them in the whole class scope.
Accumulators & Counters
● Purpose
○ Debugging
○ First glance of DataSets
● Counters are kinds of accumulator
● Structure
○ An add operation
○ A final accumulated result (available after the job ended)
● Flink will automatically sum up all partial results.
Accumulators & Counters
● Built-in Accumulators
○ IntCounter, LongCounter, DoubleCounter
○ Histogram: map from integer to integer, distributions
● Register
○ new IntCounter()
○ getRuntimeContext().addAccumulator(“accuName”, counter)
● Access
○ In Rich Functions
■ getRuntimeContext().getAccumulator(“accuName”)
○ In the end of job
■ JobExecutionResult.getAccumulatorResult(“accuName")
Semantic Annotations
● Give Flink hints about the behavior of a function
○ A powerful means to speed up execution
○ Reusing sort orders or partitions across multiple operations
○ Prevent programs from unnecessary data shuffling or unnecessary sorts
● Types of Annotation
○ Forwarded fields annotations (@ForwardedFields)
○ Non-forwarded fields annotations (@NonForwardedFields)
■ Black or White in place
○ Read fields annotations (@ReadFields)
○ Fields to be read and evaluated
Iterations
● Bulk iterations
○ Partial Solution
○ Iteration Result
Iterations
● Delta Iterations
○ Workset / Update Solution Set
○ Iteration Result
The End
Thanks!!

Weitere ähnliche Inhalte

Was ist angesagt?

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)PingCAP
 
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Hakka Labs
 
Modern Java Features
Modern Java Features Modern Java Features
Modern Java Features Florian Hopf
 
Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLAnant Corporation
 
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...Caner Ünal
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Corley S.r.l.
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streamsMikhail Girkin
 
Scale Relational Database with NewSQL
Scale Relational Database with NewSQLScale Relational Database with NewSQL
Scale Relational Database with NewSQLPingCAP
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniquesLars Albertsson
 
Building an Observability platform with ClickHouse
Building an Observability platform with ClickHouseBuilding an Observability platform with ClickHouse
Building an Observability platform with ClickHouseAltinity Ltd
 
Monitoring in a scalable world
Monitoring in a scalable worldMonitoring in a scalable world
Monitoring in a scalable worldTechExeter
 
Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...PingCAP
 

Was ist angesagt? (20)

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
InfluxDB & Grafana
InfluxDB & GrafanaInfluxDB & Grafana
InfluxDB & Grafana
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
 
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
 
Tick
TickTick
Tick
 
Modern Java Features
Modern Java Features Modern Java Features
Modern Java Features
 
Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETL
 
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
What Reika Taught us
What Reika Taught usWhat Reika Taught us
What Reika Taught us
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2
 
Javantura v3 - ELK – Big Data for DevOps – Maarten Mulders
Javantura v3 - ELK – Big Data for DevOps – Maarten MuldersJavantura v3 - ELK – Big Data for DevOps – Maarten Mulders
Javantura v3 - ELK – Big Data for DevOps – Maarten Mulders
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streams
 
Scale Relational Database with NewSQL
Scale Relational Database with NewSQLScale Relational Database with NewSQL
Scale Relational Database with NewSQL
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
 
Building an Observability platform with ClickHouse
Building an Observability platform with ClickHouseBuilding an Observability platform with ClickHouse
Building an Observability platform with ClickHouse
 
Monitoring in a scalable world
Monitoring in a scalable worldMonitoring in a scalable world
Monitoring in a scalable world
 
Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...
 

Andere mochten auch

Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Flink Taiwan User Group
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Apache Flink Taiwan User Group
 
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream ProcessingApache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream ProcessingApache Flink Taiwan User Group
 
Yarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine LearningYarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine Learningojavajava
 
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)Jing-Doo Wang
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentAnna Yen
 
2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform SecurityJazz Yao-Tsung Wang
 
2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈晨揚 施
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰Wayne Chen
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiMakoto Yui
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudScott Miao
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic StackHadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic StackLen Chang
 

Andere mochten auch (15)

Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
 
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream ProcessingApache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
 
Yarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine LearningYarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine Learning
 
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environment
 
2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security
 
2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
BI in Xuenn
BI in XuennBI in Xuenn
BI in Xuenn
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic StackHadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
 

Ähnlich wie Apache Flink Tutorial: DataSet API Transformations and Iterations

How to make data available for analytics ASAP
How to make data available for analytics ASAPHow to make data available for analytics ASAP
How to make data available for analytics ASAPMariaDB plc
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Semmle Codeql
Semmle Codeql Semmle Codeql
Semmle Codeql M. S.
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataJihoon Son
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 Linaro
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Gearpump akka streams
Gearpump akka streamsGearpump akka streams
Gearpump akka streamsKam Kasravi
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
M|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data AdaptersM|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data AdaptersMariaDB plc
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAwaretwilmes
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
from Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Worksfrom Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu WorksZhen Wei
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Taro L. Saito
 
What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3MariaDB plc
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB plc
 

Ähnlich wie Apache Flink Tutorial: DataSet API Transformations and Iterations (20)

How to make data available for analytics ASAP
How to make data available for analytics ASAPHow to make data available for analytics ASAP
How to make data available for analytics ASAP
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Semmle Codeql
Semmle Codeql Semmle Codeql
Semmle Codeql
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Gearpump akka streams
Gearpump akka streamsGearpump akka streams
Gearpump akka streams
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
M|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data AdaptersM|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data Adapters
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
from Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Worksfrom Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Works
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
 
What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Google V8 engine
Google V8 engineGoogle V8 engine
Google V8 engine
 
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
 

Kürzlich hochgeladen

Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 

Kürzlich hochgeladen (16)

Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 

Apache Flink Tutorial: DataSet API Transformations and Iterations

  • 2. Who Am I? ● Newbie in Apache Flink ● BlueWell Technology ○ Big Data Architect ○ Focuses ■ Open DC/OS ■ CoreOS ■ Kubernetes ■ Apache Flink ■ Data Science
  • 3. Agenda ● Apache Flink Type System ○ Atomic ○ Composite ○ Tuple ○ POJO ○ Scala case class ● Transformations ○ Transformations on DataSet ○ Rich Functions ○ Accumulators & Counters ○ Annotations
  • 4. Agenda ● Iterations ○ Bulk iterations ○ Delta iterations
  • 5. Basic Structure Of Apache Flink Programs ● For each Apache Flink Program, the basic structure is listed as follows. ○ Obtain an execution environment. ■ ExecutionEnvironment.getExecutionEnvironment() ○ Load/create DataSets from data sources. ■ readFile(), readTextFile(), readCsvFile(), etc. ■ fromElements(), fromCollection(), etc. ○ Execute some transformations on the DataSets. ■ filter(), map(), reduce(), etc. ○ Specify where to save results of the computations. ■ write(), writeAsCsv(), etc. ■ collect(), print(), etc. ○ Trigger the program execution. Hands-on BasicStructure
  • 6. Apache Flink Type System ● Flink attempts to support all data types ○ Facilitate programming ○ Seamlessly integrate with legacy code ● Flink analyzes applications before execution for ○ Identifying used data types ○ Determining serializers and comparators ● Data types could be ○ Atomic data types ○ Composite data types
  • 7. Composite Data Types ● Composite Data Types include ○ Tuples ■ In Java ■ In Scala ○ POJOs ○ Scala case class
  • 8. Tuple Data Types ● Flink supports Tuple in ○ Java: org.apache.flink.api.java.tuple.Tuple<n> ■ n = 1, …, 25 ○ Scala: premitive Tuple<n> ■ n = 1, …, 22 ● Key declarations ○ Field index ○ E.g., dataset.groupBy(0).sum(1) ○ E.g., dataset.groupBy(“_1”).sum(“_2”)
  • 9. POJOs ● POJOs – A java class with ○ A default constructor without parameters. ○ All fields are ■ public or ■ private but have getter & setter ■ Ex. public class Car { public int id; public String brand; public Car() {}; public Car(int id, String brand) {…}; }
  • 10. POJOs ● Key Declarations ○ Field name as a string ○ E.g., cars.groupBy(“brand”)
  • 11. Scala Case Class ● Primitive Scala case classes are supported ○ E.g., case class Car(id: Int, brand: String) ● Key declarations ○ Field name as a string ■ E.g., cars.groupBy(“brand”) ○ Field name ■ E.g., cars.groupBy(_.brand) Hands-on TypeSystem
  • 13. Introduction To Rich Functions ● Purpose ○ Implement complicated user-defined functions ○ Broadcast variables ○ Access to Accumulators & Counters ● Structure ○ open(): initialization, one-shot setup work. ○ close(): tear-down, clean-up work. ○ get/setRuntimeContext(): access to RuntimeContext. ○ corresponding transformation method, e.g., map, join, etc.
  • 14. Broadcast Variables ● Register ○ dataset.map(new RichMapFunction()) .withBroadcastSet(toBroadcast, “varName”) ● Access in Rich Functions ○ Initialize the broadcasted variables in open() via ○ getRuntimeContext().getBroadcastVariable(“varName”) ○ Access them in the whole class scope.
  • 15. Accumulators & Counters ● Purpose ○ Debugging ○ First glance of DataSets ● Counters are kinds of accumulator ● Structure ○ An add operation ○ A final accumulated result (available after the job ended) ● Flink will automatically sum up all partial results.
  • 16. Accumulators & Counters ● Built-in Accumulators ○ IntCounter, LongCounter, DoubleCounter ○ Histogram: map from integer to integer, distributions ● Register ○ new IntCounter() ○ getRuntimeContext().addAccumulator(“accuName”, counter) ● Access ○ In Rich Functions ■ getRuntimeContext().getAccumulator(“accuName”) ○ In the end of job ■ JobExecutionResult.getAccumulatorResult(“accuName")
  • 17. Semantic Annotations ● Give Flink hints about the behavior of a function ○ A powerful means to speed up execution ○ Reusing sort orders or partitions across multiple operations ○ Prevent programs from unnecessary data shuffling or unnecessary sorts ● Types of Annotation ○ Forwarded fields annotations (@ForwardedFields) ○ Non-forwarded fields annotations (@NonForwardedFields) ■ Black or White in place ○ Read fields annotations (@ReadFields) ○ Fields to be read and evaluated
  • 18. Iterations ● Bulk iterations ○ Partial Solution ○ Iteration Result
  • 19. Iterations ● Delta Iterations ○ Workset / Update Solution Set ○ Iteration Result