This presentation covers Apache Flink's approach to scalable machine learning: composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
2. What is Flink
§ Large-scale data processing engine
§ Easy and powerful APIs for batch and real-time streaming analysis (Java / Scala)
§ Backed by a very robust execution backend
• with true streaming capabilities,
• custom memory manager,
• native iteration execution,
• and a cost-based optimizer.
3. Technology inside Flink
§ Technology inspired by compilers + MPP databases + distributed systems
§ For ease of use, reliable performance, and scalability
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}
[Architecture diagram: pre-flight (client), master, and workers; internals include a cost-based optimizer, type extraction stack, memory manager, out-of-core algorithms, real-time streaming, task scheduling, recovery metadata, data serialization stack, streaming network stack, ...]
5. Example: WordCount
case class Word(word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment()
val lines = env.readTextFile(...)

lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

env.execute()
Flink has mirrored Java and Scala APIs that offer the same functionality, including by-name addressing.
6. Flink API in a Nutshell
§ map, flatMap, filter, groupBy, reduce, reduceGroup, aggregate, join, coGroup, cross, project, distinct, union, iterate, iterateDelta, ...
§ All Hadoop input formats are supported
§ API similar for data sets and data streams with slightly different operator semantics
§ Window functions for data streams
§ Counters, accumulators, and broadcast variables
10. Machine learning pipelines
§ Pipelining inspired by scikit-learn
§ Transformer: Modify data
§ Learner: Train a model
§ Reusable components
§ Lets you quickly build ML pipelines
§ Model inherits pipeline of learner
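The transformer/learner split can be sketched in a few lines of Python (the `ScaleFeatures`, `MeanLearner`, and `Pipeline` classes below are invented for illustration; Flink-ML's actual Scala interfaces differ):

```python
# Minimal sketch of the Transformer/Learner pipeline pattern
# (hypothetical classes for illustration; not the Flink-ML API).

class ScaleFeatures:
    """Transformer: modifies the data (here, scales each value)."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x * self.factor for x in data]

class MeanLearner:
    """Learner: trains a model (here, simply the mean of the data)."""
    def fit(self, data):
        return sum(data) / len(data)  # the trained "model"

class Pipeline:
    """Chaining: transform first, then train on the transformed data."""
    def __init__(self, transformer, learner):
        self.transformer = transformer
        self.learner = learner

    def fit(self, data):
        return self.learner.fit(self.transformer.transform(data))

model = Pipeline(ScaleFeatures(2.0), MeanLearner()).fit([1.0, 2.0, 3.0])
print(model)  # → 4.0 (mean of the scaled data [2.0, 4.0, 6.0])
```

In Flink-ML the chaining is expressed with `chain`, and the fitted model keeps the transformation pipeline of its learner, so prediction applies the same preprocessing.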
11. Linear regression in polynomial space
val polynomialBase = PolynomialBase()
val learner = MultipleLinearRegression()
val pipeline = polynomialBase.chain(learner)

val trainingDS = env.fromCollection(trainingData)

val parameters = ParameterMap()
  .add(PolynomialBase.Degree, 3)
  .add(MultipleLinearRegression.Stepsize, 0.002)
  .add(MultipleLinearRegression.Iterations, 100)

val model = pipeline.fit(trainingDS, parameters)
[Pipeline diagram: Input Data → Polynomial Base Mapper → Multiple Linear Regression → Linear Model]
12. Current state of Flink-ML
§ Existing learners
• Multiple linear regression
• Alternating least squares
• Communication efficient distributed dual coordinate ascent (PR pending)
§ Feature transformer
• Polynomial base feature mapper
§ Tooling
13. Distributed linear algebra
§ Linear algebra is the universal language for data analysis
§ High-level abstraction
§ Fast prototyping
§ Pre- and post-processing step
14. Example: Gaussian non-negative matrix factorization
§ Given input matrix V, find W and H such that V ≈ WH
§ Iterative approximation via multiplicative updates (∗ and / element-wise):

H_{t+1} = H_t ∗ (W_t^T V) / (W_t^T W_t H_t)
W_{t+1} = W_t ∗ (V H_{t+1}^T) / (W_t H_{t+1} H_{t+1}^T)
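As a sanity check, the multiplicative update rules can be run locally with NumPy (a dense, single-machine sketch; the point of Flink is to distribute exactly these matrix operations):

```python
import numpy as np

# Local NumPy sketch of the Gaussian NMF multiplicative updates
# (illustration only; shapes and iteration count chosen arbitrarily).
rng = np.random.default_rng(0)
m, n, k = 20, 15, 4

V = rng.random((m, n))   # input matrix to factorize, V ≈ W H
W = rng.random((m, k))   # random non-negative initial factors
H = rng.random((k, n))

err0 = np.linalg.norm(V - W @ H)   # initial reconstruction error
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H)   # H <- H * (W^T V) / (W^T W H)
    W *= (V @ H.T) / (W @ H @ H.T)   # W <- W * (V H^T) / (W H H^T)

print(err0, np.linalg.norm(V - W @ H))  # error shrinks over the iterations
```

The updates keep W and H non-negative by construction, since they only multiply by ratios of non-negative terms.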
var i = 0
var H: CheckpointedDrm[Int] = randomMatrix(k, V.numCols)
var W: CheckpointedDrm[Int] = randomMatrix(V.numRows, k)

while (i < maxIterations) {
  H = H * (W.t %*% V / W.t %*% W %*% H)
  W = W * (V %*% H.t / W %*% H %*% H.t)
  i += 1
}
16. Flink’s features
§ Stateful iterations
• Keep state across iterations
§ Delta iterations
• Limit computation to elements which matter
§ Pipelining
• Avoiding materialization of large intermediate state
23. Collaborative Filtering
§ Recommend items based on users with similar preferences
§ Latent factor models capture underlying characteristics of items and preferences of users
§ Predicted preference: r̂_{u,i} = x_u^T y_i
25. Alternating least squares
§ Fixing one matrix gives a quadratic form
§ The solution is guaranteed to decrease the overall cost function
§ To calculate x_u, all rated item vectors and ratings are needed:

x_u = (Y S^u Y^T + λ n_u I)^{-1} Y r_u^T

S^u_{ii} = 1 if r_{u,i} ≠ 0, else 0
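The x_u update can be sketched locally with NumPy: collect the item vectors the user rated, build the regularized normal equation, and solve it (variable names and data are mine, not Flink-ML's):

```python
import numpy as np

# Local sketch of one ALS user-vector update:
#   x_u = (Y S^u Y^T + lambda * n_u * I)^{-1} Y r_u^T
rng = np.random.default_rng(0)
k, n_items, lam = 3, 10, 0.1

Y = rng.random((k, n_items))   # item factor matrix, one column per item
# user u's ratings; zero entries mean "unrated" (made-up data)
r_u = np.array([4.0, 0.0, 3.5, 0.0, 0.0, 5.0, 0.0, 1.0, 0.0, 2.0])

rated = r_u != 0               # the diagonal selector S^u as a boolean mask
n_u = rated.sum()              # number of items user u rated
Yr = Y[:, rated]               # only the rated item vectors are needed

A = Yr @ Yr.T + lam * n_u * np.eye(k)   # Y S^u Y^T + lambda * n_u * I
b = Y @ r_u                             # Y r_u^T (unrated entries contribute 0)
x_u = np.linalg.solve(A, b)
print(x_u.shape)  # → (3,)
```

This makes the data requirement on the slide concrete: the solve touches only the item vectors of rated items, which is exactly what has to be shipped to wherever x_u is computed.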
27. Naïve ALS
case class Rating(userID: Int, itemID: Int, rating: Double)
case class ColumnVector(columnIndex: Int, vector: Array[Double])

val items: DataSet[ColumnVector] = _
val ratings: DataSet[Rating] = _

// Generate tuples of items with their ratings
val uVA = items.join(ratings).where(0).equalTo(1) {
  (item, ratingEntry) => {
    val Rating(uID, _, rating) = ratingEntry
    (uID, rating, item.vector)
  }
}
28. Naïve ALS contd.
uVA.groupBy(0).reduceGroup {
  vectors => {
    var uID = -1
    val matrix = FloatMatrix.zeros(factors, factors)
    val vector = FloatMatrix.zeros(factors)
    var n = 0

    for ((id, rating, v) <- vectors) {
      uID = id
      vector += rating * v
      matrix += outerProduct(v, v)
      n += 1
    }

    for (idx <- 0 until factors) {
      matrix(idx, idx) += lambda * n
    }

    new ColumnVector(uID, Solve(matrix, vector))
  }
}
29. Problems of naïve ALS
§ Problem:
• Item vectors are sent redundantly → high network load
§ Solution:
• Blocking of user and item vectors to share common data
• Avoids blown-up intermediate state
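A toy count illustrates the saving (bookkeeping only, not Flink code; the rating pattern and block count are made up): naively an item vector travels once per rating, while with user blocks it travels at most once per (item, user-block) pair.

```python
# Toy bookkeeping: how often must each item vector cross the network?
# Made-up rating pattern: 100 users, 20 items, user u rated item i
# whenever (u + i) is divisible by 3.
ratings = [(u, i) for u in range(100) for i in range(20) if (u + i) % 3 == 0]

# Naive ALS: one copy of the item vector is shipped per rating.
naive_sends = len(ratings)

# Blocked ALS: users are grouped into blocks; an item vector is shipped
# at most once per (item, user-block) pair, however many users in the
# block rated it.
num_blocks = 4
blocked_sends = len({(i, u % num_blocks) for u, i in ratings})

print(naive_sends, blocked_sends)  # → 667 80
```

The ratio grows with the number of ratings per (item, block) pair, which is why blocking pays off on dense rating matrices.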
33. Why is streaming ML important?
§ Spam detection in emails
§ Patterns might change over time
§ Retraining of the model becomes necessary
§ Best solution: online models
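Such an online model can be sketched as a tiny perceptron that adjusts its weights after every example instead of being retrained in batch (toy code with invented features and data; not part of Flink-ML):

```python
# Sketch of an online spam model: a perceptron updated one email at a time
# (features and data are made up for illustration).
def predict(w, x):
    """Classify a feature vector: 1 = spam, 0 = ham."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def update(w, x, label, lr=0.1):
    """Online update: adjust weights immediately after one example,
    so the model can track patterns that drift over time."""
    error = label - predict(w, x)
    return [wi + lr * error * xi for wi, xi in zip(w, x)]

# Each element: (feature vector [bias, has_link, all_caps], spam label)
stream = [([1, 1, 0], 1), ([1, 0, 1], 0), ([1, 1, 1], 1), ([1, 0, 0], 0)] * 10

w = [0.0, 0.0, 0.0]
for x, label in stream:          # consume the stream one email at a time
    w = update(w, x, label)

print([predict(w, x) for x, _ in stream[:4]])  # → [1, 0, 1, 0]
```

The point is that there is no separate "retrain" phase: the model is always as fresh as the last example it saw, which is the behavior a streaming ML integration would want.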
38. Flink-ML Outlook
§ Support more algorithms
§ Support for distributed linear algebra
§ Integration with streaming machine learning
§ Interactive programs and Zeppelin