17. 1. Easy
scalable
ML
development
(ML
Developers)
2. User-‐friendly
ML
at
scale
(End
Users)
ML
Experts Systems
ExpertsMLbase
18. 1. Easy
scalable
ML
development
(ML
Developers)
2. User-‐friendly
ML
at
scale
(End
Users)
Along
the
way,
we
gain
insight
into
data
intensive
compu2ng
ML
Experts Systems
ExpertsMLbase
22. Lapack
Matlab Interface
Matlab
Stack
Single Machine
✦ Lapack:
low-‐level
Fortran
linear
algebra
library
✦ Matlab
Interface
✦ Higher-‐level
abstrac2ons
for
data
access
/
processing
✦ More
extensive
func2onality
than
Lapack
✦ Leverages
Lapack
whenever
possible
23. Lapack
Matlab Interface
Matlab
Stack
Single Machine
✦ Lapack:
low-‐level
Fortran
linear
algebra
library
✦ Matlab
Interface
✦ Higher-‐level
abstrac2ons
for
data
access
/
processing
✦ More
extensive
func2onality
than
Lapack
✦ Leverages
Lapack
whenever
possible
✦ Similar
stories
for
R
and
Python
27. MLbase
Stack
Runtime(s)
MLlib
Spark
Spark:
cluster
compu=ng
system
designed
for
itera=ve
computa=on
MLlib:
low-‐level
ML
library
in
Spark
✦ Callable
from
Scala,
Java
Lapack
Matlab Interface
Single Machine
28. MLbase
Stack
Runtime(s)
MLlib
MLI
Spark
Spark:
cluster
compu=ng
system
designed
for
itera=ve
computa=on
MLlib:
low-‐level
ML
library
in
Spark
✦ Callable
from
Scala,
Java
MLI:
API
/
plaHorm
for
feature
extrac=on
and
algorithm
development
✦ Includes
higher-‐level
func2onality
with
faster
dev
cycle
than
MLlib
Lapack
Matlab Interface
Single Machine
29. MLbase
Stack
Runtime(s)
MLlib
MLI
ML Optimizer
Spark
Spark:
cluster
compu=ng
system
designed
for
itera=ve
computa=on
MLlib:
low-‐level
ML
library
in
Spark
✦ Callable
from
Scala,
Java
MLI:
API
/
plaHorm
for
feature
extrac=on
and
algorithm
development
✦ Includes
higher-‐level
func2onality
with
faster
dev
cycle
than
MLlib
ML
OpAmizer:
automates
model
selec=on
✦ Solves
a
search
problem
over
feature
extractors
and
algorithms
in
MLI
Lapack
Matlab Interface
Single Machine
30. MLlib
MLI
ML Optimizer
End
User
MLbase
Stack
Status
Spark
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
31. MLlib
MLI
ML Optimizer
End
User
MLbase
Stack
Status
Spark
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
32. MLlib
MLI
ML Optimizer
End
User
MLbase
Stack
Status
Goal 1:
Summer Release
Spark
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
33. MLlib
MLI
ML Optimizer
End
User
MLbase
Stack
Status
Goal 1:
Summer Release
Goal 2:
Winter Release
Spark
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
36. Example:
MLlib
✦ Goal:
Classifica2on
of
text
file
✦ Featurize
data
manually
8 val classes = rawTextTable(??, "class")
9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)
11
12 //Classify the data using Logistic Regression.
13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=1
14 }
1 def main(args: Array[String]) {
2 val sc = new SparkContext("local", "SparkLR")
3
4 //Load data from HDFS
5 val data = sc.textFile(args(0)) //RDD[String]
6
7 //User is responsible for formatting/featurizing/normalizing their RDD!
8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)
9
10 //Train the model using MLlib.
11 val model = new LogisticRegressionLocalRandomSGD()
12 .setStepSize(0.1)
13 .setNumIterations(50)
14 .train(featurizedData)
15 }
Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and ML
37. Example:
MLlib
✦ Goal:
Classifica2on
of
text
file
✦ Featurize
data
manually
✦ Calls
MLlib’s
LR
func2on
8 val classes = rawTextTable(??, "class")
9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)
11
12 //Classify the data using Logistic Regression.
13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=1
14 }
1 def main(args: Array[String]) {
2 val sc = new SparkContext("local", "SparkLR")
3
4 //Load data from HDFS
5 val data = sc.textFile(args(0)) //RDD[String]
6
7 //User is responsible for formatting/featurizing/normalizing their RDD!
8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)
9
10 //Train the model using MLlib.
11 val model = new LogisticRegressionLocalRandomSGD()
12 .setStepSize(0.1)
13 .setNumIterations(50)
14 .train(featurizedData)
15 }
Fig. 15: Matrix Factorization via ALS code in MATLAB (top) and ML
39. Example:
MLI
✦ Use
built-‐in
feature
extrac2on
func2onality
1 def main(args: Array[String]) {
2 val mc = new MLContext("local", "MLILR")
3
4 //Read in file from HDFS
5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))
6
7 //Run feature extraction
8 val classes = rawTextTable(??, "class")
9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)
11
12 //Classify the data using Logistic Regression.
13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
14 }
1 def main(args: Array[String]) {
2 val sc = new SparkContext("local", "SparkLR")
3
4 //Load data from HDFS
5 val data = sc.textFile(args(0)) //RDD[String]
6
7 //User is responsible for formatting/featurizing/normalizing their RDD!
8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)
9
10 //Train the model using MLlib.
11 val model = new LogisticRegressionLocalRandomSGD()
12 .setStepSize(0.1)
13 .setNumIterations(50)
40. Example:
MLI
✦ Use
built-‐in
feature
extrac2on
func2onality
✦ MLI
Logis2c
Regression
leverages
MLlib
1 def main(args: Array[String]) {
2 val mc = new MLContext("local", "MLILR")
3
4 //Read in file from HDFS
5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))
6
7 //Run feature extraction
8 val classes = rawTextTable(??, "class")
9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)
11
12 //Classify the data using Logistic Regression.
13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
14 }
1 def main(args: Array[String]) {
2 val sc = new SparkContext("local", "SparkLR")
3
4 //Load data from HDFS
5 val data = sc.textFile(args(0)) //RDD[String]
6
7 //User is responsible for formatting/featurizing/normalizing their RDD!
8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)
9
10 //Train the model using MLlib.
11 val model = new LogisticRegressionLocalRandomSGD()
12 .setStepSize(0.1)
13 .setNumIterations(50)
41. Example:
MLI
✦ Use
built-‐in
feature
extrac2on
func2onality
✦ MLI
Logis2c
Regression
leverages
MLlib
✦ Extensions:
✦ Embed
in
cross-‐valida2on
rou2ne
✦ Use
different
feature
extractors
/
algorithms
or
write
new
ones
1 def main(args: Array[String]) {
2 val mc = new MLContext("local", "MLILR")
3
4 //Read in file from HDFS
5 val rawTextTable = mc.csvFile(args(0), Seq("class","text"))
6
7 //Run feature extraction
8 val classes = rawTextTable(??, "class")
9 val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
10 val featureizedTable = classes.zip(ngrams)
11
12 //Classify the data using Logistic Regression.
13 val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
14 }
1 def main(args: Array[String]) {
2 val sc = new SparkContext("local", "SparkLR")
3
4 //Load data from HDFS
5 val data = sc.textFile(args(0)) //RDD[String]
6
7 //User is responsible for formatting/featurizing/normalizing their RDD!
8 val featurizedData: RDD[(Double,Array[Double])] = processData(data)
9
10 //Train the model using MLlib.
11 val model = new LogisticRegressionLocalRandomSGD()
12 .setStepSize(0.1)
13 .setNumIterations(50)
42. Example:
ML
Op2mizer
var
X
=
load(”text_file”,
2
to
10)
var
y
=
load(”text_file”,
1)
var
(fn-‐model,
summary)
=
doClassify(X,
y)
✦ User
declara2vely
specifies
task
✦ ML
Op2mizer
searches
through
MLI
47. Matlab,
R
x
Ease
of
use
Performance,
Scalability
GraphLab,
VW
x
Mahout
x
Lay
of
the
Land
48. Matlab,
R
x
Ease
of
use
Performance,
Scalability
GraphLab,
VW
x
Mahout
x
Lay
of
the
Land
MLlib
x
49. Logis2c
Regression,
Linear
SVM
(+L1,
L2)
Linear
Regression
(+Lasso,
Ridge)
Alterna2ng
Least
Squares
K-‐Means
SGD,
Parallel
Gradient
MLlib
ClassificaAon:
Regression:
CollaboraAve
Filtering:
Clustering:
OpAmizaAon
PrimiAves:
50. Logis2c
Regression,
Linear
SVM
(+L1,
L2)
Linear
Regression
(+Lasso,
Ridge)
Alterna2ng
Least
Squares
K-‐Means
SGD,
Parallel
Gradient
MLlib
ClassificaAon:
Regression:
CollaboraAve
Filtering:
Clustering:
OpAmizaAon
PrimiAves:
Included
within
Spark
codebase
✦ Unlike
Mahout/Hadoop
✦ Part
of
Spark
0.8
release
✦ Con2nued
support
via
Spark
project
53. ✦ WallAme:
elapsed
2me
to
execute
task
✦ Weak
scaling
✦ fix
problem
size
per
processor
✦ ideally:
constant
wall2me
as
we
grow
cluster
MLlib
Performance
54. ✦ WallAme:
elapsed
2me
to
execute
task
✦ Weak
scaling
✦ fix
problem
size
per
processor
✦ ideally:
constant
wall2me
as
we
grow
cluster
✦ Strong
scaling
✦ fix
total
problem
size
✦ ideally:
linear
speed
up
as
we
grow
cluster
MLlib
Performance
55. ✦ WallAme:
elapsed
2me
to
execute
task
✦ Weak
scaling
✦ fix
problem
size
per
processor
✦ ideally:
constant
wall2me
as
we
grow
cluster
✦ Strong
scaling
✦ fix
total
problem
size
✦ ideally:
linear
speed
up
as
we
grow
cluster
✦ EC2
Experiments
✦ m2.4xlarge
instances,
up
to
32
machine
clusters
MLlib
Performance
63. Logis2c
Regression
-‐
Strong
Scaling
✦ Fixed
Dataset:
50K
images,
160K
dense
features
✦ MLlib
exhibits
beTer
scaling
proper=es
✦ MLlib
faster
than
VW
with
16
and
32
machines
MLbase VW Matlab
0
1000
wa
Fig. 5: Walltime for weak scaling for logistic regressi
MLbase VW Matlab
0
200
400
600
800
1000
1200
1400
walltime(s)
1 Machine
2 Machines
4 Machines
8 Machines
16 Machines
32 Machines
Fig. 7: Walltime for strong scaling for logistic regress
with respect to computation. In practice, we see comp
scaling results as more machines are added.
In MATLAB, we implement gradient descent inste
SGD, as gradient descent requires roughly the same nu
of numeric operations as SGD but does not require an
loop to pass over the data. It can thus be implemented
MLlib
0 5 10 15 20 25 30
0
# machines
ig. 6: Weak scaling for logistic regression
0 5 10 15 20 25 30
0
5
10
15
20
25
30
35
# machines
speedup
MLbase
VW
Ideal
8: Strong scaling for logistic regression
System Lines of Code
MLbase 32
GraphLab 383
MLlib
65. ALS
-‐
Wall2me
✦ Dataset:
Scaled
version
of
NeHlix
data
(9X
in
size)
✦ Cluster:
9
machines
66. ALS
-‐
Wall2me
✦ Dataset:
Scaled
version
of
NeHlix
data
(9X
in
size)
✦ Cluster:
9
machines
System WallAme
(seconds)
Matlab 15443
Mahout 4206
GraphLab 291
MLlib 481
67. ALS
-‐
Wall2me
✦ Dataset:
Scaled
version
of
NeHlix
data
(9X
in
size)
✦ Cluster:
9
machines
System WallAme
(seconds)
Matlab 15443
Mahout 4206
GraphLab 291
MLlib 481
68. ALS
-‐
Wall2me
✦ Dataset:
Scaled
version
of
NeHlix
data
(9X
in
size)
✦ Cluster:
9
machines
✦ MLlib
an
order
of
magnitude
faster
than
Mahout
✦ MLlib
within
factor
of
2
of
GraphLab
System WallAme
(seconds)
Matlab 15443
Mahout 4206
GraphLab 291
MLlib 481
71. Deployment
Considera2ons
Vowpal
Wabbit,
GraphLab
✦ Data
prepara=on
specific
to
each
program
✦ Non-‐trivial
setup
on
cluster
✦ No
fault
tolerance
MLlib
✦ Reads
files
from
HDFS
✦ Launch/compile/run
on
cluster
with
a
few
commands
✦ RDD’s
are
fault
tolerance
76. Current
Op2ons
+
Easy
(Resembles
math,
limited
/
no
set
up
cost)
+
Sufficient
for
prototyping
/
wri2ng
papers
—
Ad-‐hoc,
non-‐scalable
scripts
—
Loss
of
transla2on
upon
re-‐implementa2on
77. Current
Op2ons
+
Easy
(Resembles
math,
limited
/
no
set
up
cost)
+
Sufficient
for
prototyping
/
wri2ng
papers
—
Ad-‐hoc,
non-‐scalable
scripts
—
Loss
of
transla2on
upon
re-‐implementa2on
78. Current
Op2ons
+
Easy
(Resembles
math,
limited
/
no
set
up
cost)
+
Sufficient
for
prototyping
/
wri2ng
papers
—
Ad-‐hoc,
non-‐scalable
scripts
—
Loss
of
transla2on
upon
re-‐implementa2on
+
Scalable
and
(some2mes)
fast
+
Exis2ng
open-‐source
library
of
ML
algorithms
—
Difficult
to
set
up,
extend
84. Insight:
Programming
Abstrac2ons
✦ Shield
ML
Developers
from
low-‐details:
provide
familiar
mathema2cal
operators
in
distributed
sejng
✦ ML
Developer
API
(MLI)
85. Insight:
Programming
Abstrac2ons
✦ Shield
ML
Developers
from
low-‐details:
provide
familiar
mathema2cal
operators
in
distributed
sejng
✦ ML
Developer
API
(MLI)
✦ Table
Computa2on:
MLTable
✦ Linear
Algebra:
MLSubMatrix
✦ Op2miza2on
Primi2ves:
MLSolve
86. Insight:
Programming
Abstrac2ons
✦ Shield
ML
Developers
from
low-‐details:
provide
familiar
mathema2cal
operators
in
distributed
sejng
✦ ML
Developer
API
(MLI)
✦ Table
Computa2on:
MLTable
✦ Linear
Algebra:
MLSubMatrix
✦ Op2miza2on
Primi2ves:
MLSolve
✦ MLI
Examples:
✦ DFC:
~50
lines
of
code
87. Insight:
Programming
Abstrac2ons
✦ Shield
ML
Developers
from
low-‐details:
provide
familiar
mathema2cal
operators
in
distributed
sejng
✦ ML
Developer
API
(MLI)
✦ Table
Computa2on:
MLTable
✦ Linear
Algebra:
MLSubMatrix
✦ Op2miza2on
Primi2ves:
MLSolve
✦ MLI
Examples:
✦ DFC:
~50
lines
of
code
✦ ALS:
early
stopping
in
3
lines;
<
40
lines
total
95. MLI
Details
OLD
val
x:
RDD[Array[Double]]
val
x:
RDD[spark.u=l.Vector]
val
x:
RDD[breeze.linalg.Vector]
96. MLI
Details
OLD
val
x:
RDD[Array[Double]]
val
x:
RDD[spark.u=l.Vector]
val
x:
RDD[breeze.linalg.Vector]
val
x:
RDD[BIDMat.SMat]
97. MLI
Details
OLD
val
x:
RDD[Array[Double]]
val
x:
RDD[spark.u=l.Vector]
val
x:
RDD[breeze.linalg.Vector]
val
x:
RDD[BIDMat.SMat]
98. MLI
Details
OLD
val
x:
RDD[Array[Double]]
val
x:
RDD[spark.u=l.Vector]
val
x:
RDD[breeze.linalg.Vector]
val
x:
RDD[BIDMat.SMat]
NEW
val
x:
MLTable
99. MLI
Details
OLD
val
x:
RDD[Array[Double]]
val
x:
RDD[spark.u=l.Vector]
val
x:
RDD[breeze.linalg.Vector]
val
x:
RDD[BIDMat.SMat]
NEW
val
x:
MLTable
✦ Generic
interface
for
feature
extrac2on
✦ Common
interface
to
support
an
op2mizer
✦ Abstract
interface
for
arbitrary
backends
100. MLTable
✦ Flexibility
when
loading
data
✦ e.g.,
CSV,
JSON,
XML
✦ Heterogenous
data
across
columns
✦ Missing
Data
✦ Feature
extrac2on
✦ Common
Interface
✦ Supports
MapReduce
and
Rela2onal
Operators
✦ Inspired
by
DataFrames
(R)
and
Pandas
(Python)
101. Feature
Extrac2on
where a ke
matrixBatchMap MLSubMatrix ) MLSubMatrix MLNumericTable Execute a
data. Outpu
table.
numRows None Long Returns nu
numCols None Long Returns the
Fig. 2: MLTable API Illustration. This table captures core operations of th
1 def main(args: Array[String]) {
2 val mc = new MLContext("local")
3
4 //Read in table from file on HDFS.
5 val rawTextTable = mc.textFile(args(0))
6
7 //Run feature extraction on the raw text - get the top 30000 bigrams.
8 val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))
9
10 //Cluster the data using K-Means.
11 val kMeansModel = KMeans(featurizedTable, k=50)
12 }
Fig. 3: Loading, featurizing, and learning clusters on a corpu
Family Example Uses Returns
Shape dims(mat), mat.numRows, mat.numCols Int or (Int,Int)
102. MLSubMatrix
✦ Linear
algebra
on
local
parAAons
✦ E.g.,
matrix-‐vector
opera2ons
for
mini-‐batch
logis2c
regression
✦ E.g.,
solving
linear
system
of
equa2ons
for
Alterna2ng
Least
Squares
✦ Sparse
and
Dense
Matrix
Support
103. Alterna2ng
Least
Squares
19 parfor q=1:n
20 Uq = U(Uinds{q},:);
21 V(q,:) = (Uq’*Uq + lambI) (Uq’ * M(Uinds{q},q));
22 end
23 end
24 end
1 object BroadcastALS extends Algorithm {
2 def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
3 m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
4 val lambI = MLSubMatrix.eye(k).mul(lambda)
5 var U = MLSubMatrix.rand(m, k)
6 var V = MLSubMatrix.rand(n, k)
7 var U_b = trainData.context.broadcast(U)
8 var V_b = trainData.context.broadcast(V)
9 for (iter <- 0 until maxIter) {
10 U = trainData.matrixBatchMap(localALS(_, U_b.value, lambI, k))
11 U_b = trainData.context.broadcast(U)
12 V = trainDataTrans.matrixBatchMap(localALS(_, V_b.value, lambI, k))
13 V_b = trainData.context.broadcast(V)
14 }
15 new ALSModel(U, V)
16 }
17
18 def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix, lambI: MLSubMatrix, k: Int){
19 var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
20 for (i <- 0 until trainDataPart.numRows) {
21 val q = trainDataPart.rowID(i)
22 val nz_inds = trainDataPart.nzCols(q)
23 val Yq = Y(trainDataPart.nzCols(q), ??)
24 localX(i, ??) = ((Yq.transpose times Yq) + lambI)
25 .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
26 }
27 return localX
28 }
29 }
104. MLSolve
✦ Distributed
implementaAons
of
common
opAmizaAon
paZerns
✦ E.g.,
Stochas2c
Gradient
Descent:
Applicable
to
summable
ML
losses
✦ E.g.,
LBFGS:
An
approximate
2nd-‐
order
op2miza2on
method
✦ E.g.,
ADMM:
Decomposi2on
/
coordina2on
procedure
105. Logis2c
Regression
5 grad = X’ * (sigmoid(X * w) - y);
6 w = w - learning_rate * grad;
7 end
8 end
9
10 % applies sigmoid function component-wise on the vector x
11 function s = sigmoid(x)
12 s = 1 ./ (1 + exp(-1 .* x));
13 end
1 object LogisticRegression extends Algorithm {
2 def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))
3
4 def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
5 x.transpose * (sigmoid(x dot w) - y)
6 }
7
8 def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
9 val d = data.numCols
10 val params = SGDParams(initweights = MLSubMatrix.zeros(d, 1),
11 maxIterations = p.maxIter, learningRate = p.learningRate,
12 gradientFunction = gradientFunction)
13 val weights = SGD(data, params)
14 new LogRegModel(weights)
15 }
16 }
1 object StochasticGradientDescent extends Optimizer {
2
3 def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
4 gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatr
5 var localWeights = weights
6 for (i <- 0 to x.numRows) {
106. Linear
Regression
(+Lasso,
Ridge)
Alterna2ng
Least
Squares,
[DFC]
K-‐Means,
[DP-‐Means]
Logis2c
Regression,
Linear
SVM
(+L1,
L2),
Mul2nomial
Regression,
[Naive
Bayes,
Decision
Trees]
SGD,
Parallel
Gradient,
Local
SGD,
[L-‐BFGS,
ADMM,
Adagrad]
Principal
Component
Analysis
(PCA),
N-‐grams,
feature
cleaning
/
normaliza2on
Cross
Valida2on,
Evalua2on
Metrics
MLI
Func2onality
Regression:
CollaboraAve
Filtering:
Clustering:
ClassificaAon:
OpAmizaAon
PrimiAves:
Feature
ExtracAon:
ML
Tools:
107. Linear
Regression
(+Lasso,
Ridge)
Alterna2ng
Least
Squares,
[DFC]
K-‐Means,
[DP-‐Means]
Logis2c
Regression,
Linear
SVM
(+L1,
L2),
Mul2nomial
Regression,
[Naive
Bayes,
Decision
Trees]
SGD,
Parallel
Gradient,
Local
SGD,
[L-‐BFGS,
ADMM,
Adagrad]
Principal
Component
Analysis
(PCA),
N-‐grams,
feature
cleaning
/
normaliza2on
Cross
Valida2on,
Evalua2on
Metrics
MLI
Func2onality
Regression:
CollaboraAve
Filtering:
Clustering:
ClassificaAon:
OpAmizaAon
PrimiAves:
Feature
ExtracAon:
ML
Tools:
110. Build
a
Classifier
for
X
What
you
want
to
do What
you
have
to
do
✦ Learn
the
internals
of
ML
classificaAon
algorithms,
sampling,
feature
selecAon,
X-‐validaAon,….
✦ PotenAally
learn
Spark/Hadoop/…
✦ Implement
3-‐4
algorithms
✦ Implement
grid-‐search
to
find
the
right
algorithm
parameters
✦ Implement
validaAon
algorithms
✦ Experiment
with
different
sampling-‐
sizes,
algorithms,
features
✦ ….
111. Build
a
Classifier
for
X
What
you
want
to
do What
you
have
to
do
✦ Learn
the
internals
of
ML
classificaAon
algorithms,
sampling,
feature
selecAon,
X-‐validaAon,….
✦ PotenAally
learn
Spark/Hadoop/…
✦ Implement
3-‐4
algorithms
✦ Implement
grid-‐search
to
find
the
right
algorithm
parameters
✦ Implement
validaAon
algorithms
✦ Experiment
with
different
sampling-‐
sizes,
algorithms,
features
✦ ….
and
in
the
end
Ask
For
Help
112. Insight:
A
Declara2ve
Approach
SQL Result
✦ End
Users
tell
the
system
what
they
want,
not
how
to
get
it
113. Insight:
A
Declara2ve
Approach
SQL Result MQL Model
✦ End
Users
tell
the
system
what
they
want,
not
how
to
get
it
114. var
X
=
load(”als_clinical”,
2
to
10)
var
y
=
load(”als_clinical”,
1)
var
(fn-‐model,
summary)
=
doClassify(X,
y)
Example:
Supervised
ClassificaAon
✦ End
Users
tell
the
system
what
they
want,
not
how
to
get
it
Insight:
A
Declara2ve
Approach
115. var
X
=
load(”als_clinical”,
2
to
10)
var
y
=
load(”als_clinical”,
1)
var
(fn-‐model,
summary)
=
doClassify(X,
y)
Example:
Supervised
ClassificaAon
Algorithm
Independent
✦ End
Users
tell
the
system
what
they
want,
not
how
to
get
it
Insight:
A
Declara2ve
Approach
116. ML
Op2mizer:
A
Search
Problem
5min
Boosting
SVM
✦ System
is
responsible
for
searching
through
model
space
✦ Opportuni2es
for
physical
op2miza2on
117. Systems
Op2miza2on
of
Model
Search
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
118. Systems
Op2miza2on
of
Model
Search
✦ Idea
from
databases
–
shared
cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
119. Systems
Op2miza2on
of
Model
Search
✦ Idea
from
databases
–
shared
cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
120. Systems
Op2miza2on
of
Model
Search
✦ Idea
from
databases
–
shared
cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
QueryB
121. Systems
Op2miza2on
of
Model
Search
✦ Idea
from
databases
–
shared
cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
QueryB
✦ Single
pass
over
the
data,
many
models
trained
122. Systems
Op2miza2on
of
Model
Search
✦ Idea
from
databases
–
shared
cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
QueryB
✦ Single
pass
over
the
data,
many
models
trained
✦ Example
–
Logis2c
Regression
via
SGD
124. Spark
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Rela2onship
with
MLI
✦ MLI
provides
common
interface
for
all
algorithms
MQL
125. Spark
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Rela2onship
with
MLI
✦ MLI
provides
common
interface
for
all
algorithms
✦ Contracts:
Meta-‐data
for
algorithms
writen
against
MLI
MQL
126. Spark
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Rela2onship
with
MLI
✦ MLI
provides
common
interface
for
all
algorithms
✦ Contracts:
Meta-‐data
for
algorithms
writen
against
MLI
✦ Type
(e.g.,
classifica2on)
✦ Parameters
✦ Run2me
(e.g.,
O(n))
✦ Input-‐Specifica2on
✦ Output-‐Specifica2on
✦ …
MQL
128. Contributors
✦ John
Duchi
✦ Michael
Franklin
✦ Joseph
Gonzalez
✦ Rean
Griffith
✦ Michael
Jordan
✦ Tim
Kraska
✦ Xinghao
Pan
✦ Virginia
Smith
✦ Shivaram
Venkarataram
✦ Matei
Zaharia
129. Contributors
✦ John
Duchi
✦ Michael
Franklin
✦ Joseph
Gonzalez
✦ Rean
Griffith
✦ Michael
Jordan
✦ Tim
Kraska
✦ Xinghao
Pan
✦ Virginia
Smith
✦ Shivaram
Venkarataram
✦ Matei
Zaharia
*
*
*
*
130. First
Release
(Summer)
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Spark
131. First
Release
(Summer)
✦ MLlib:
low-‐level
ML
library
and
underlying
kernels
✦ Callable
from
Scala,
Java
✦ Included
as
part
of
Spark
✦ MLI:
API
for
feature
extrac2on
and
ML
algorithms
✦ Plaworm
for
ML
development
✦ Includes
more
extensive
library
and
with
faster
dev-‐cycle
than
MLlib
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Spark
132. Second
Release
(Winter)
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Spark
133. Second
Release
(Winter)
✦ ML
OpAmizer:
automated
model
selec2on
✦ Search
problem
over
feature
extractors
and
algorithms
in
MLI
✦ Contracts
✦ Restricted
query
language
(MQL)
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Spark
134. Second
Release
(Winter)
✦ ML
OpAmizer:
automated
model
selec2on
✦ Search
problem
over
feature
extractors
and
algorithms
in
MLI
✦ Contracts
✦ Restricted
query
language
(MQL)
✦ Feature
extracAon
for
image
data
MLlib
MLI
ML Optimizer
ML Developer
Meta-Data
Statistics
User
Declarative
ML Task
ML Contract +
Code
Master Server
….
result
(e.g., fn-model & summary)
Optimizer
Parser
Executor/Monitoring
ML Library
DMX
Runtime
DMX
Runtime
DMX
Runtime
DMX
Runtime
LLP
PLP
MasterSlaves
End
User
Spark
137. Future
Direc2ons
✦ IdenAfy
minimal
set
of
ML
operators
✦ Expose
internals
of
ML
algorithms
to
op2mizer
✦ Unified
language
for
End
Users
and
ML
Developers
138. Future
Direc2ons
✦ IdenAfy
minimal
set
of
ML
operators
✦ Expose
internals
of
ML
algorithms
to
op2mizer
✦ Unified
language
for
End
Users
and
ML
Developers
✦ Plug-‐ins
to
Python,
R
139. Future
Direc2ons
✦ IdenAfy
minimal
set
of
ML
operators
✦ Expose
internals
of
ML
algorithms
to
op2mizer
✦ Unified
language
for
End
Users
and
ML
Developers
✦ Plug-‐ins
to
Python,
R
✦ VisualizaAon
for
unsupervised
learning
and
explora2on
140. Future
Direc2ons
✦ IdenAfy
minimal
set
of
ML
operators
✦ Expose
internals
of
ML
algorithms
to
op2mizer
✦ Unified
language
for
End
Users
and
ML
Developers
✦ Plug-‐ins
to
Python,
R
✦ VisualizaAon
for
unsupervised
learning
and
explora2on
✦ Advanced
ML
capabiliAes
✦ Time-‐series
algorithms
✦ Graphical
models
✦ Advanced
Op2miza2on
(e.g.,
asynchronous
computa2on)
✦ Online
updates
✦ Sampling
for
efficiency