June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 1
Flink Meetup #8
Data flow vs. procedural
programming: How to put your
algorithms into Flink
June 23, 2015
Mikio L. Braun
@mikiobraun
Programming how we're used to
● Computing a sum
● Tools at our disposal:
– variables
– control flow (loops, if)
– function calls as basic piece of abstraction
def computeSum(a):
    sum = 0
    for i in range(len(a)):
        sum += a[i]
    return sum
Data Analysis Algorithms
Let's consider centering
$x_i \leftarrow x_i - \frac{1}{n} \sum_{j=1}^{n} x_j$
becomes
def centerPoints(xs):
    sum = xs[0].copy()
    for i in range(1, len(xs)):
        sum += xs[i]
    mean = sum / len(xs)
    for i in range(len(xs)):
        xs[i] -= mean
    return xs
or even just
xs - xs.mean(axis=0)
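As a quick check that the loop version and the one-liner agree, here is a minimal sketch. It assumes NumPy and points stored as rows of a 2-D float array; the name center_points_loop and the sample data are made up for this illustration.

```python
import numpy as np

def center_points_loop(xs):
    """Center the rows of xs with explicit loops (procedural style)."""
    xs = xs.astype(float)            # astype returns a copy, so the input stays untouched
    total = xs[0].copy()
    for i in range(1, len(xs)):
        total += xs[i]
    mean = total / len(xs)
    for i in range(len(xs)):
        xs[i] -= mean
    return xs

xs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centered_loop = center_points_loop(xs)
centered_vec = xs - xs.mean(axis=0)  # the vectorized one-liner
```

Both results have column means of zero, which is what centering is supposed to achieve.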
Don't use for-loops
● Put your data into a matrix
● Don't use for loops
Least Squares Regression
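● Compute $w = (X^\top X + \lambda I)^{-1} X^\top y$
● Becomes
import numpy as np

def lsr(X, y, lam):
    d = X.shape[1]
    C = X.T.dot(X) + lam * np.eye(d)    # regularized X'X
    w = np.linalg.solve(C, X.T.dot(y))  # solve C w = X'y instead of inverting C
    return w
What you learn is thinking in matrices: breaking
down computations in terms of matrix algebra.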
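A small usage sketch for the regression routine, under the assumption of synthetic, noise-free data (the weights in w_true and the random design matrix are invented for this example). With a tiny regularization constant the solver should recover the weights almost exactly.

```python
import numpy as np

def lsr(X, y, lam):
    """Regularized least squares: solve (X'X + lam * I) w = X'y."""
    d = X.shape[1]
    C = X.T.dot(X) + lam * np.eye(d)
    return np.linalg.solve(C, X.T.dot(y))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # 200 made-up data points with 3 features
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical ground-truth weights
y = X.dot(w_true)                    # noise-free targets

w = lsr(X, y, lam=1e-8)
```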
Basic tools
● Basic procedural programming paradigm
● Variables
● Ordered arrays and efficient functions on those
Advantage
– very familiar
– close to math
Disadvantage
– hard to scale
Parallel Data Flow
Often you have stuff like
for i in someSet:
    map x[i] to y[i]
which is inherently easy to scale.
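The reason this pattern scales is that no element depends on any other, so the work can be handed out in any order. A minimal sketch using Python's standard thread pool (the function square and the sample set are placeholders, and a thread pool only stands in for a real cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """Stand-in for the per-element computation x[i] -> y[i]."""
    return x * x

some_set = {1, 2, 3, 4}
items = list(some_set)               # fix an iteration order once

# Each element is processed independently, so the pool may run them concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    ys = list(pool.map(square, items))

mapping = dict(zip(items, ys))
```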
New Paradigm
● Basic building block is an (unordered) set.
● Basic operations inherently parallel
Computing, Data Flow Style
Computing a sum
sum(xs) = xs.reduce((x, y) => x + y)
Computing a mean
mean(xs) = xs.map(x => (x, 1))
             .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
             .map(xc => xc._1 / xc._2)
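The same map/reduce shape can be tried out locally with Python's built-in map and functools.reduce. This is only a single-machine sketch with made-up numbers; the point is that the reduce function is associative and commutative, so a data flow system is free to combine partial results in any order.

```python
from functools import reduce

xs = [2.0, 4.0, 6.0, 8.0]

# Sum: a single reduce.
total = reduce(lambda x, y: x + y, xs)

# Mean: pair every element with a count of 1, add sums and counts
# component-wise, then divide once at the end.
pairs = map(lambda x: (x, 1), xs)
s, n = reduce(lambda xc, yc: (xc[0] + yc[0], xc[1] + yc[1]), pairs)
mean = s / n
```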
Apache Flink
● Data Flow system
● Basic building block is a DataSet[X]
● For execution, sets up all computing nodes,
streams through data
Apache Flink: Getting Started
● Use Scala API
● Minimal project with Maven (build tool) or
Gradle
● Use an IDE like IntelliJ
● Always import
org.apache.flink.api.scala._
Centering (First Try)
def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)
def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.map(x => x - mean)  // does not work: mean is itself a DataSet
}
You cannot nest DataSet operations!
Sorry, restrictions apply.
● Variables hold (lazy) computations
● You can't work with sets within the operations
● Even if result is just a single element, it's a
DataSet[Elem].
● So what to do?
– cross joins
– broadcast variables
Centering (Second Try)
Works, but seems excessive because the mean
is copied to each data element.
def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)
def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  // pair every point with the single mean element, then subtract
  xs.crossWithTiny(mean).map(xm => xm._1 - xm._2)
}
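What the cross join does can be imitated on one machine with itertools.product. This is a rough sketch, not Flink code: plain Python lists stand in for DataSets and the sample points are invented. The mean is kept as a one-element collection, and the cross product attaches it to every point.

```python
from itertools import product
import numpy as np

points = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]

# The mean is a "set" with exactly one element, as in the data flow version.
mean_set = [sum(points) / len(points)]

# Cross join: every point is paired with (a reference to) the mean.
pairs = list(product(points, mean_set))
centered = [x - m for x, m in pairs]
```

The list of pairs has one entry per data point, which is the copying overhead the slide complains about.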
Broadcast Variables
● Side information sent to all worker nodes
● Can be a DataSet
● Gets accessed as a Java collection
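The idea behind a broadcast variable can be sketched locally as follows. This is a toy model, not the Flink API: the two sublists pretend to be partitions living on different workers, and map_partition_with_broadcast is a hypothetical helper that receives the side information once per partition instead of once per element.

```python
import numpy as np

def map_partition_with_broadcast(partition, broadcast, fun):
    """Apply fun(element, broadcast) to every element of one partition."""
    return [fun(x, broadcast) for x in partition]

points = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
          np.array([5.0, 6.0]), np.array([7.0, 8.0])]
partitions = [points[:2], points[2:]]   # pretend each list lives on its own worker

# Step 1: compute the side information once.
mean = sum(points) / len(points)

# Step 2: hand it to each worker once and use it inside the map.
centered = [map_partition_with_broadcast(p, mean, lambda x, m: x - m)
            for p in partitions]
flat = [x for part in centered for x in part]
```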
Broadcast Variables
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
    extends RichMapFunction[T, O] {
  var broadcastVariable: B = _

  // runs once per task before any call to map: fetch the single broadcast element
  @throws(classOf[Exception])
  override def open(configuration: Configuration): Unit = {
    broadcastVariable = getRuntimeContext
      .getBroadcastVariable[B]("broadcastVariable")
      .get(0)
  }

  override def map(value: T): O = {
    fun(value, broadcastVariable)
  }
}
Centering (Third Try)
def computeMean(xs: DataSet[DenseVector]) =
  xs.map(x => (x, 1))
    .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
    .map(xc => xc._1 / xc._2)
def centerPoints(xs: DataSet[DenseVector]) = {
  val mean = computeMean(xs)
  xs.mapWithBcVar(mean)((x, m) => x - m)
}
Intermediate Results pattern
Procedural:
x = someComputation()
y = someOtherComputation()
z = someComputationOn(dataSet, x)
result = moreComputationOn(y, z)
Data flow:
val x = someDataSetComputation()
val y = someOtherDataSetComputation()
val z = dataSet.mapWithBcVar(x)((d, x) => …)
val result = anotherDataSet.mapWithBcVar((y, z)) {
  (d, yz) =>
    val (y, z) = yz
    …
}
Matrix Algebra
● No ordered sets per se in Data Flow context.
Vector operations by explicit joins
● Encode vector (a1, a2, …, an) with
{(1, a1), (2, a2), …, (n, an)}
● Addition:
– a.join(b).where(0).equalTo(0)
    .map(ab => (ab._1._1, ab._1._2 + ab._2._2))
after join: {((1, a1), (1, b1)), ((2, a2), (2, b2)), … }
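The index-value encoding and the join-based addition can be sketched with plain Python sets. The helper names encode, join_on_index and add are invented for this illustration, and the dictionary lookup is just one simple way to emulate an equi-join on the first tuple field.

```python
def encode(v):
    """Encode a vector as an unordered set of (index, value) pairs (1-based)."""
    return {(i + 1, x) for i, x in enumerate(v)}

def join_on_index(a, b):
    """Equi-join on the index field: pairs up entries with the same index."""
    b_by_index = dict(b)
    return {((i, x), (i, b_by_index[i])) for i, x in a if i in b_by_index}

def add(a, b):
    """Vector addition: join on the index, then add the two values."""
    return {(ab[0][0], ab[0][1] + ab[1][1]) for ab in join_on_index(a, b)}

a = encode([1.0, 2.0, 3.0])
b = encode([10.0, 20.0, 30.0])
c = add(a, b)
```

Because the index travels with each value, the result is correct no matter in which order the elements are stored or processed.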
Back to Least Squares Regression
Two operations: computing X'X and X'Y
def lsr(xys: DataSet[(DenseVector, Double)]) = {
  val XTX = xys.map(xy => xy._1.outer(xy._1)).reduce(_ + _)
  val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)
  XTX.mapWithBcVar(XTY) { (xtx, xty) =>
    xtx \ xty  // solve (X'X) w = X'Y for the weight vector
  }
}
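The decomposition into two reductions can be checked locally with NumPy. This is a single-machine sketch on made-up, noise-free data (w_true is hypothetical): X'X is the sum of per-row outer products and X'Y is the sum of per-row feature vectors scaled by their target, so both can be computed with one map and one reduce.

```python
from functools import reduce
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([0.5, -1.0, 2.0])             # hypothetical ground-truth weights
xys = [(x, float(x.dot(w_true))) for x in X]    # one (features, target) pair per element

# X'X as a sum of outer products, X'Y as a sum of scaled feature vectors.
xtx = reduce(lambda a, b: a + b, map(lambda xy: np.outer(xy[0], xy[0]), xys))
xty = reduce(lambda a, b: a + b, map(lambda xy: xy[0] * xy[1], xys))

# Final step on the two small intermediate results.
w = np.linalg.solve(xtx, xty)
```

Only the d-by-d matrix and the d-vector have to be brought together at the end, regardless of how many data points there are.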
Summary and Outlook
● Procedural vs. Data Flow
– basic building blocks are elementwise operations on
unordered sets
– can't be nested
– combine intermediate results via broadcast vars
● Iterations
● Beware of TypeInformation implicits.

 
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
The Ultimate Guide to Performance Testing in Low-Code, No-Code Environments (...
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4jGraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
GraphSummit Madrid - Product Vision and Roadmap - Luis Salvador Neo4j
 

Data flow vs. procedural programming: How to put your algorithms into Flink

  • 1. Berlin Flink Meetup #8, June 23, 2015. Data flow vs. procedural programming: How to put your algorithms into Flink. Mikio L. Braun, @mikiobraun
  • 2. Programming how we're used to
       ● Computing a sum
       ● Tools at our disposal:
         – variables
         – control flow (loops, if)
         – function calls as basic piece of abstraction

           def computeSum(a):
               sum = 0
               for i in range(len(a)):
                   sum += a[i]
               return sum
  • 3. Data Analysis Algorithms
       Let's consider centering (subtracting the mean from every point). This becomes

           def centerPoints(xs):
               sum = xs[0].copy()
               for i in range(1, len(xs)):
                   sum += xs[i]
               mean = sum / len(xs)
               for i in range(len(xs)):
                   xs[i] -= mean
               return xs

       or even just

           xs - xs.mean(axis=0)
  • 4. Don't use for-loops
       ● Put your data into a matrix
       ● Don't use for loops
  • 5. Least Squares Regression
       ● Compute w = (XᵀX + λI)⁻¹ Xᵀy
       ● Becomes

           def lsr(X, y, lam):
               d = X.shape[1]
               C = X.T.dot(X) + lam * np.eye(d)
               w = np.linalg.solve(C, X.T.dot(y))
               return w

       What you learn is thinking in matrices, breaking down computations in terms of matrix algebra.
  • 6. Basic tools
       ● Basic procedural programming paradigm
       ● Variables
       ● Ordered arrays and efficient functions on those
       Advantage
         – very familiar
         – close to math
       Disadvantage
         – hard to scale
  • 7. Parallel Data Flow
       Often you have stuff like

           for i in someSet:
               map x[i] to y[i]

       which is inherently easy to scale.
  • 8. New Paradigm
       ● Basic building block is an (unordered) set.
       ● Basic operations are inherently parallel.
  • 9. Computing, Data Flow Style
       Computing a sum:

           sum(x) = xs.reduce((x,y) => x + y)

       Computing a mean:

           mean(x) = xs.map(x => (x,1))
                       .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
                       .map(xc => xc._1 / xc._2)
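
The sum and mean above can be mimicked in plain Python to see why the pattern parallelizes. This is only a sketch: `dataflow_sum` and `dataflow_mean` are hypothetical names, `functools.reduce` stands in for a distributed reduce, and nothing here is Flink API. The point is that the mean is carried as a (sum, count) pair, so partial results can be combined pairwise in any order.

```python
from functools import reduce

def dataflow_sum(xs):
    # Pairwise combination with an associative, commutative operator:
    # the order in which elements meet does not change the result.
    return reduce(lambda x, y: x + y, xs)

def dataflow_mean(xs):
    # map: attach a count of 1 to every element
    pairs = map(lambda x: (x, 1), xs)
    # reduce: add sums and counts componentwise
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)
    # final map: turn the single (sum, count) pair into the mean
    return total / count

values = [1.0, 2.0, 3.0, 4.0]
print(dataflow_sum(values))
print(dataflow_mean(values))
```

Carrying the count alongside the sum is what keeps the reduce step free of any global state such as `len(xs)`.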
  • 10. Apache Flink
       ● Data Flow system
       ● Basic building block is a DataSet[X]
       ● For execution, sets up all computing nodes, streams through data
  • 11. Apache Flink: Getting Started
       ● Use Scala API
       ● Minimal project with Maven (build tool) or Gradle
       ● Use an IDE like IntelliJ
       ● Always import org.apache.flink.api.scala._
  • 12. Centering (First Try)

           def computeMean(xs: DataSet[DenseVector]) =
             xs.map(x => (x,1))
               .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
               .map(xc => xc._1 / xc._2)

           def centerPoints(xs: DataSet[DenseVector]) = {
             val mean = computeMean(xs)
             xs.map(x => x - mean)
           }

       You cannot nest DataSet operations!
  • 13. Sorry, restrictions apply.
       ● Variables hold (lazy) computations
       ● You can't work with sets within the operations
       ● Even if the result is just a single element, it's a DataSet[Elem].
       ● So what to do?
         – cross joins
         – broadcast variables
  • 14. Centering (Second Try)

           def computeMean(xs: DataSet[DenseVector]) =
             xs.map(x => (x,1))
               .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
               .map(xc => xc._1 / xc._2)

           def centerPoints(xs: DataSet[DenseVector]) = {
             val mean = computeMean(xs)
             xs.crossWithTiny(mean).map(xm => xm._1 - xm._2)
           }

       Works, but seems excessive because the mean is copied to each data element.
  • 15. Broadcast Variables
       ● Side information sent to all worker nodes
       ● Can be a DataSet
       ● Gets accessed as a Java collection
  • 16. Broadcast Variables

           class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
               extends RichMapFunction[T, O] {
             var broadcastVariable: B = _

             @throws(classOf[Exception])
             override def open(configuration: Configuration): Unit = {
               broadcastVariable = getRuntimeContext
                 .getBroadcastVariable[B]("broadcastVariable")
                 .get(0)
             }

             override def map(value: T): O = {
               fun(value, broadcastVariable)
             }
           }
  • 17. Centering (Third Try)

           def computeMean(xs: DataSet[DenseVector]) =
             xs.map(x => (x,1))
               .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
               .map(xc => xc._1 / xc._2)

           def centerPoints(xs: DataSet[DenseVector]) = {
             val mean = computeMean(xs)
             xs.mapWithBcVar(mean).map((x, m) => x - m)
           }
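
The difference between the second and third centering attempts can be illustrated with a small plain-Python simulation. It is a sketch under loud assumptions: partitions are modelled as nested lists, vectors as tuples, and `map_with_broadcast` is a hypothetical stand-in for the idea of a broadcast variable, not the Flink API or the `mapWithBcVar` helper from the slides. What it shows is that the mean is handed to each partition once, instead of being paired with every element as in a cross join.

```python
from functools import reduce

def add(u, v):
    # Componentwise vector addition on tuples.
    return tuple(a + b for a, b in zip(u, v))

def compute_mean(points):
    # Same (sum, count) reduce as in the data-flow mean.
    total, count = reduce(
        lambda a, b: (add(a[0], b[0]), a[1] + b[1]),
        ((p, 1) for p in points),
    )
    return tuple(t / count for t in total)

def map_with_broadcast(partitions, broadcast, fun):
    # Each partition (a stand-in for one worker) receives the broadcast
    # value once and applies fun to each of its own elements.
    return [[fun(x, broadcast) for x in part] for part in partitions]

def center_points(partitions):
    all_points = [p for part in partitions for p in part]
    mean = compute_mean(all_points)
    return map_with_broadcast(
        partitions, mean, lambda x, m: tuple(a - b for a, b in zip(x, m))
    )

partitions = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)]]
centered = center_points(partitions)
print(centered)
```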
  • 18. Intermediate Results pattern

           val x = someDataSetComputation()
           val y = someOtherDataSetComputation()
           val z = dataSet.mapWithBcVar(x)((d, x) => …)
           val result = anotherDataSet.mapWithBcVar((y,z)) {
             (d, yz) => val (y,z) = yz
             …
           }

           x = someComputation()
           y = someOtherComputation()
           z = someComputationOn(dataSet, x)
           result = moreComputationOn(y, z)
  • 19. Matrix Algebra
       ● No ordered sets per se in Data Flow context.
  • 20. Vector operations by explicit joins
       ● Encode vector (a1, a2, …, an) with {(1, a1), (2, a2), … (n, an)}
       ● Addition:

           a.join(b).where(0).equalTo(0)
            .map((ab) => (ab._1._1, ab._1._2 + ab._2._2))

         after join: {((1, a1), (1, b1)), ((2, a2), (2, b2)), … }
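
The index-keyed encoding can be tried out in plain Python. This is a sketch, not Flink code: `join_on_index` and `add_vectors` are hypothetical helpers, and a dictionary lookup stands in for the distributed join on field 0. Vectors are unordered sets of (index, value) pairs, so the result does not depend on element order.

```python
def join_on_index(a, b):
    # Equi-join on the first field: pair up entries that share an index.
    b_by_index = {i: v for i, v in b}
    return [((i, va), (i, b_by_index[i])) for i, va in a if i in b_by_index]

def add_vectors(a, b):
    # After the join, each element is ((i, a_i), (i, b_i));
    # the map keeps the index and adds the two values.
    return {(ab[0][0], ab[0][1] + ab[1][1]) for ab in join_on_index(a, b)}

a = {(1, 10.0), (2, 20.0), (3, 30.0)}
b = {(3, 3.0), (1, 1.0), (2, 2.0)}
result = add_vectors(a, b)
print(sorted(result))
```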
  • 21. Back to Least Squares Regression
       Two operations: computing X'X and X'Y

           def lsr(xys: DataSet[(DenseVector, Double)]) = {
             val XTX = xys.map(xy => xy._1.outer(xy._1)).reduce(_ + _)
             val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)
             val C = XTX.mapWithBcVar(XTY) { vars =>
               val XTX = vars._1
               val XTY = vars._2
               val weight = XTX \ XTY
               weight
             }
           }
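
The decomposition into two reductions can be checked on a toy problem in plain Python. This is a dependency-free sketch restricted to two features: `lsr_2d` and its helpers are hypothetical names, the 2×2 system is solved with the closed-form inverse rather than a general solver, and the `lam` ridge term is borrowed from the procedural `lsr` earlier in the deck. X'X is accumulated as a sum of per-point outer products, and X'y as a sum of per-point scaled vectors.

```python
from functools import reduce

def outer(x):
    # Outer product x x^T of one data point.
    return [[xi * xj for xj in x] for xi in x]

def mat_add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def vec_add(a, b):
    return [u + v for u, v in zip(a, b)]

def lsr_2d(xys, lam=0.0):
    # X'X = sum over points of x x^T, X'y = sum over points of x * y
    xtx = reduce(mat_add, (outer(x) for x, _ in xys))
    xty = reduce(vec_add, ([xi * y for xi in x] for x, y in xys))
    # Solve (X'X + lam * I) w = X'y with the closed-form 2x2 inverse.
    a, b = xtx[0][0] + lam, xtx[0][1]
    c, d = xtx[1][0], xtx[1][1] + lam
    det = a * d - b * c
    return [(d * xty[0] - b * xty[1]) / det, (a * xty[1] - c * xty[0]) / det]

# Toy data generated exactly by y = 2 * x1 + 3 * x2
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0), ((1.0, 1.0), 5.0)]
w = lsr_2d(data)
print(w)
```

Both reductions only ever combine small d×d or d-sized partial results, which is why this formulation scales with the number of data points.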
  • 22. Summary and Outlook
       ● Procedural vs. Data Flow
         – basic building blocks: elementwise operations on unordered sets
         – can't be nested
         – combine intermediate results via broadcast vars
       ● Iterations
       ● Beware of TypeInformation implicits.