MLbase
Evan Sparks and Ameet Talwalkar
UC Berkeley
Three Converging Trends
✦ Big Data
✦ Distributed Computing
✦ Machine Learning
MLbase
Vision
MLlib
MLI
ML Optimizer
Release Plan
Problem: Scalable implementations difficult for ML Developers…
(Diagram: ML Developer supplying ML Contract + Code)
Problem: ML is difficult for End Users…
✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn’t scale…
Reliable, Fast, Accurate, Provable
ML Experts | MLbase | Systems Experts
1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)
Along the way, we gain insight into data intensive computing
Matlab Stack
Layers (bottom to top): Single Machine, Lapack, Matlab Interface
✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
  ✦ Higher-level abstractions for data access / processing
  ✦ More extensive functionality than Lapack
  ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python
MLbase Stack
Matlab stack (bottom to top): Single Machine, Lapack, Matlab Interface
MLbase stack (bottom to top): Runtime(s) (Spark), MLlib, MLI, ML Optimizer
Spark: cluster computing system designed for iterative computation
MLlib: low-level ML library in Spark
✦ Callable from Scala, Java
MLI: API / platform for feature extraction and algorithm development
✦ Includes higher-level functionality with faster dev cycle than MLlib
ML Optimizer: automates model selection
✦ Solves a search problem over feature extractors and algorithms in MLI
MLbase Stack Status
Stack layers: Spark, MLlib, MLI, ML Optimizer
Goal 1: Summer Release
Goal 2: Winter Release
(Architecture diagram. Labels: End User, ML Developer, User, Declarative ML Task, ML Contract + Code, Master Server, Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics, LLP, PLP, Master, Slaves, four DMX Runtime boxes, result (e.g., fn-model & summary))
Example: MLlib
✦ Goal: Classification of text file
✦ Featurize data manually
✦ Calls MLlib’s LR function
def main(args: Array[String]) {
  val sc = new SparkContext("local", "SparkLR")

  //Load data from HDFS
  val data = sc.textFile(args(0)) //RDD[String]

  //User is responsible for formatting/featurizing/normalizing their RDD!
  val featurizedData: RDD[(Double,Array[Double])] = processData(data)

  //Train the model using MLlib.
  val model = new LogisticRegressionLocalRandomSGD()
    .setStepSize(0.1)
    .setNumIterations(50)
    .train(featurizedData)
}
Example: MLI
✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
  ✦ Embed in cross-validation routine
  ✦ Use different feature extractors / algorithms or write new ones
def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  //Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class","text"))

  //Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
  val featurizedTable = classes.zip(ngrams)

  //Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
}
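The "embed in cross-validation routine" extension can be sketched in plain Scala. This is a minimal illustration under stated assumptions: `KFold`, `splits`, and `crossValidate` are hypothetical names, not MLI's cross-validation API, and folds are assigned by index modulo k rather than by random shuffling.

```scala
// Hypothetical k-fold cross-validation helper; not MLI's cross-validation API.
object KFold {
  // Splits `data` into k folds by index modulo k and returns, for each fold,
  // the pair (training examples, held-out examples).
  def splits[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      val (heldOut, training) = indexed.partition { case (_, i) => i % k == fold }
      (training.map(_._1), heldOut.map(_._1))
    }
  }

  // Averages a score (e.g. accuracy) over the k train/validate rounds.
  def crossValidate[A](data: Seq[A], k: Int)(score: (Seq[A], Seq[A]) => Double): Double =
    splits(data, k).map { case (train, test) => score(train, test) }.sum / k
}
```

A caller would pass a `score` function that trains a model (for instance a logistic regression) on the first argument and evaluates it on the second.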
Example: ML Optimizer
var X = load("text_file", 2 to 10)
var y = load("text_file", 1)
var (fn-model, summary) = doClassify(X, y)
✦ User declaratively specifies task
✦ ML Optimizer searches through MLI
Vision
MLlib
MLI
ML Optimizer
Release Plan
Lay of the Land
(Scatter plot. Axes: "Ease of use" and "Performance, Scalability". Points: Matlab, R; Mahout; GraphLab, VW; MLlib)
MLlib
Classification: Logistic Regression, Linear SVM (+L1, L2)
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares
Clustering: K-Means
Optimization Primitives: SGD, Parallel Gradient
Included within Spark codebase
✦ Unlike Mahout/Hadoop
✦ Part of Spark 0.8 release
✦ Continued support via Spark project
MLlib Performance
✦ Walltime: elapsed time to execute task
✦ Weak scaling
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow cluster
✦ Strong scaling
  ✦ fix total problem size
  ✦ ideally: linear speed up as we grow cluster
✦ EC2 Experiments
  ✦ m2.4xlarge instances, up to 32 machine clusters
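The two scaling regimes can be made concrete with a small calculation. The helpers below are a hypothetical sketch (`ScalingMetrics` and its method names are illustrative and not part of MLlib); they only restate the definitions on the slide as arithmetic.

```scala
// Hypothetical helpers for reading scaling experiments; not part of MLlib.
object ScalingMetrics {
  // Strong scaling: total problem size is fixed.
  // Speedup on p machines = walltime on 1 machine / walltime on p machines.
  // The ideal value is p (linear speedup).
  def speedup(walltimeOnOne: Double, walltimeOnP: Double): Double =
    walltimeOnOne / walltimeOnP

  // Weak scaling: problem size per machine is fixed, so the total grows with p.
  // Relative walltime = walltime on p machines / walltime on 1 machine.
  // The ideal value is 1.0 (constant walltime).
  def relativeWalltime(walltimeOnOne: Double, walltimeOnP: Double): Double =
    walltimeOnP / walltimeOnOne

  // Parallel efficiency under strong scaling: speedup divided by machine count.
  def efficiency(walltimeOnOne: Double, walltimeOnP: Double, machines: Int): Double =
    speedup(walltimeOnOne, walltimeOnP) / machines
}
```

With made-up timings of 1200 s on one machine and 150 s on 16 machines, the speedup is 8 and the efficiency is 0.5, i.e. half of the ideal linear speedup.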
Logistic Regression - Weak Scaling
✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLlib within a factor of 2 of VW’s walltime
(Fig. 5: Walltime for weak scaling for logistic regression. Bars: walltime (s) for MLbase, VW, and Matlab at n=6K, 12.5K, 25K, 50K, 100K, and 200K, all with d=160K.)
(Fig. 6: Weak scaling for logistic regression. Axes: relative walltime vs. # machines. Series: MLbase, VW, Ideal.)
Logistic Regression - Strong Scaling
✦ Fixed Dataset: 50K images, 160K dense features
✦ MLlib exhibits better scaling properties
✦ MLlib faster than VW with 16 and 32 machines
(Fig. 7: Walltime for strong scaling for logistic regression. Bars: walltime (s) for MLbase, VW, and Matlab on 1, 2, 4, 8, 16, and 32 machines.)
(Fig. 8: Strong scaling for logistic regression. Axes: speedup vs. # machines. Series: MLbase, VW, Ideal.)
(Text visible beside the figures, cropped at the right edge: in MATLAB, gradient descent is implemented instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an … loop to pass over the data.)
ALS - Walltime
✦ Dataset: Scaled version of Netflix data (9X in size)
✦ Cluster: 9 machines
✦ MLlib an order of magnitude faster than Mahout
✦ MLlib within factor of 2 of GraphLab
System    Walltime (seconds)
Matlab    15443
Mahout    4206
GraphLab  291
MLlib     481
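The two comparative claims can be checked directly against the walltime table. The snippet below is only arithmetic on the four numbers shown; the object and method names are illustrative.

```scala
// Walltimes (seconds) copied from the ALS table on this slide.
object AlsWalltimeRatios {
  val walltimeSeconds: Map[String, Double] = Map(
    "Matlab" -> 15443.0,
    "Mahout" -> 4206.0,
    "GraphLab" -> 291.0,
    "MLlib" -> 481.0
  )

  // How many times slower `system` is than MLlib (values below 1.0 mean faster).
  def slowdownVersusMLlib(system: String): Double =
    walltimeSeconds(system) / walltimeSeconds("MLlib")
}
```

Mahout comes out at roughly 8.7 times MLlib's walltime, which is what "an order of magnitude" refers to, and MLlib takes about 1.65 times as long as GraphLab, which is within the stated factor of 2.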
Deployment Considerations
Vowpal Wabbit, GraphLab
✦ Data preparation specific to each program
✦ Non-trivial setup on cluster
✦ No fault tolerance
MLlib
✦ Reads files from HDFS
✦ Launch/compile/run on cluster with a few commands
✦ RDDs are fault tolerant
Vision
MLlib
MLI
ML Optimizer
Release Plan
Lay of the Land
(Scatter plot. Axes: "Ease of use" and "Performance, Scalability". Points: Matlab, R; Mahout; GraphLab, VW; MLlib; MLI)
Current Options
+ Easy (Resembles math, limited / no set up cost)
+ Sufficient for prototyping / writing papers
- Ad-hoc, non-scalable scripts
- Loss of translation upon re-implementation
+ Scalable and (sometimes) fast
+ Existing open-source library of ML algorithms
- Difficult to set up, extend
Examples
(Diagram: ML Developer, Code)
‘Distributed’ Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (Not distributed)
✦ Distributed prototype involving compiled MATLAB
Mahout ALS with Early Stopping
✦ Theory: simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code
Insight: Programming Abstractions
✦ Shield ML Developers from low-level details: provide familiar mathematical operators in distributed setting
✦ ML Developer API (MLI)
  ✦ Table Computation: MLTable
  ✦ Linear Algebra: MLSubMatrix
  ✦ Optimization Primitives: MLSolve
✦ MLI Examples:
  ✦ DFC: ~50 lines of code
  ✦ ALS: early stopping in 3 lines; < 40 lines total
Lines of Code
Logistic Regression
System         Lines of Code
Matlab         11
Vowpal Wabbit  721
MLI            55
Alternating Least Squares
System    Lines of Code
Matlab    20
Mahout    865
GraphLab  383
MLI       32
MLI Details
OLD
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]
NEW
val x: MLTable
✦ Generic interface for feature extraction
✦ Common interface to support an optimizer
✦ Abstract interface for arbitrary backends
MLTable
✦ Flexibility when loading data
  ✦ e.g., CSV, JSON, XML
✦ Heterogeneous data across columns
✦ Missing Data
✦ Feature extraction
✦ Common Interface
✦ Supports MapReduce and Relational Operators
✦ Inspired by DataFrames (R) and Pandas (Python)
Feature Extraction
(Fig. 2: MLTable API Illustration. This table captures core operations of th… Cropped in the slide; visible rows: matrixBatchMap, numRows, numCols.)
def main(args: Array[String]) {
  val mc = new MLContext("local")

  //Read in table from file on HDFS.
  val rawTextTable = mc.textFile(args(0))

  //Run feature extraction on the raw text - get the top 30000 bigrams.
  val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

  //Cluster the data using K-Means.
  val kMeansModel = KMeans(featurizedTable, k=50)
}
Fig. 3: Loading, featurizing, and learning clusters on a corpus…
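To make the `tfIdf(nGrams(...))` step less opaque, here is a single-machine sketch of bigram extraction followed by TF-IDF weighting. It is an illustration only: `BigramTfIdf` is a hypothetical helper that does not reproduce MLI's `nGrams` or `tfIdf`, it assumes whitespace tokenization, raw counts for term frequency, and `ln(numDocs / docFrequency)` for inverse document frequency, and it omits the `top=30000` vocabulary cut.

```scala
// A minimal, single-machine sketch of bigram extraction followed by TF-IDF.
// Hypothetical helper; it does not reproduce MLI's nGrams or tfIdf.
object BigramTfIdf {
  // Lower-case, split on whitespace, and pair up adjacent tokens.
  def bigrams(text: String): Seq[String] = {
    val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty).toList
    if (tokens.length < 2) Nil
    else tokens.sliding(2).map(_.mkString(" ")).toList
  }

  // Returns one sparse feature map per document: bigram -> tf * idf,
  // with tf = raw count in the document and idf = ln(numDocs / docFrequency).
  def tfIdf(docs: Seq[String]): Seq[Map[String, Double]] = {
    val perDoc: Seq[Map[String, Int]] =
      docs.map(d => bigrams(d).groupBy(identity).map { case (g, occ) => g -> occ.size })
    val docFreq: Map[String, Int] =
      perDoc.flatMap(_.keys).groupBy(identity).map { case (g, occ) => g -> occ.size }
    val n = docs.size.toDouble
    perDoc.map(_.map { case (g, count) => g -> count * math.log(n / docFreq(g)) })
  }
}
```

A bigram that occurs in every document gets weight 0 under this weighting, while a bigram unique to one document gets the largest weight, which is the usual motivation for TF-IDF features before clustering.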
Family  Example Uses                         Returns
Shape   dims(mat), mat.numRows, mat.numCols  Int or (Int,Int)
MLSubMatrix
✦ Linear algebra on local partitions
  ✦ E.g., matrix-vector operations for mini-batch logistic regression
  ✦ E.g., solving linear system of equations for Alternating Least Squares
✦ Sparse and Dense Matrix Support
Alternating Least Squares
MATLAB (tail of the listing; earlier lines are cropped in the slide):
    parfor q=1:n
      Uq = U(Uinds{q},:);
      V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
    end
  end
end
MLI:
object BroadcastALS extends Algorithm {
  def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
      m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
    val lambI = MLSubMatrix.eye(k).mul(lambda)
    var U = MLSubMatrix.rand(m, k)
    var V = MLSubMatrix.rand(n, k)
    var U_b = trainData.context.broadcast(U)
    var V_b = trainData.context.broadcast(V)
    for (iter <- 0 until maxIter) {
      // Solve for U with V held fixed, then for V with U held fixed.
      U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
      U_b = trainData.context.broadcast(U)
      V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
      V_b = trainData.context.broadcast(V)
    }
    new ALSModel(U, V)
  }

  // Solves one regularized least-squares problem per row of the local partition.
  def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix, lambI: MLSubMatrix, k: Int): MLSubMatrix = {
    var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
    for (i <- 0 until trainDataPart.numRows) {
      val q = trainDataPart.rowID(i)
      val nz_inds = trainDataPart.nzCols(q)
      val Yq = Y(nz_inds, ??)
      localX(i, ??) = ((Yq.transpose times Yq) + lambI)
        .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
    }
    localX
  }
}
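Written as an equation, the per-row solve inside `localALS` is a regularized least-squares problem. The symbols below are introduced here for illustration and are assumptions about notation: $Y_q$ is the sub-matrix of $Y$ whose rows are selected by the non-zero columns of training row $q$, $m_q$ is the row vector of those observed entries, and $\lambda I_k$ corresponds to `lambI`.

```latex
x_q \;=\; \arg\min_{x \in \mathbb{R}^{k}} \; \lVert m_q^{\top} - Y_q x \rVert_2^2 + \lambda \lVert x \rVert_2^2
    \;=\; \left( Y_q^{\top} Y_q + \lambda I_k \right)^{-1} Y_q^{\top} m_q^{\top}
```

The MATLAB line `(Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q))` and the MLI call `.solve(...)` both compute the right-hand side by solving the linear system rather than forming the inverse explicitly.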
MLSolve
✦ Distributed implementations of common optimization patterns
  ✦ E.g., Stochastic Gradient Descent: Applicable to summable ML losses
  ✦ E.g., LBFGS: An approximate 2nd-order optimization method
  ✦ E.g., ADMM: Decomposition / coordination procedure
Logistic Regression
MATLAB (tail of the listing; earlier lines are cropped in the slide):
    grad = X' * (sigmoid(X * w) - y);
    w = w - learning_rate * grad;
  end
end

% applies sigmoid function component-wise on the vector x
function s = sigmoid(x)
  s = 1 ./ (1 + exp(-1 .* x));
end
MLI:
object LogisticRegression extends Algorithm {
  def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

  def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
    x.transpose * (sigmoid(x dot w) - y)
  }

  def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
    val d = data.numCols
    val params = SGDParams(initweights = MLSubMatrix.zeros(d, 1),
      maxIterations = p.maxIter, learningRate = p.learningRate,
      gradientFunction = gradientFunction)
    val weights = SGD(data, params)
    new LogRegModel(weights)
  }
}
MLI optimizer (listing cropped in the slide):
object StochasticGradientDescent extends Optimizer {

  def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
      gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatr…
    var localWeights = weights
    // Visit each local row once; row indices run from 0 to numRows - 1.
    for (i <- 0 until x.numRows) {
    …
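Because the optimizer listing is cut off, the following self-contained sketch shows what a per-example SGD loop for logistic regression can look like on a single machine. It is not the MLI implementation: `LocalLogisticSgd` is a hypothetical name, plain `Array[Double]` stands in for `MLSubMatrix`, and there is no regularization or distributed averaging.

```scala
// Single-machine sketch of logistic regression trained by SGD.
// Hypothetical code; it does not use MLI's MLSubMatrix or SGD optimizer.
object LocalLogisticSgd {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Gradient of the log loss for one example: x * (sigmoid(x . w) - y).
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val err = sigmoid(dot(x, w)) - y
    x.map(_ * err)
  }

  // One weight update per example, repeated for maxIter passes over the data.
  def train(data: Seq[(Double, Array[Double])], learningRate: Double,
            maxIter: Int): Array[Double] = {
    val d = data.head._2.length
    var w = Array.fill(d)(0.0)
    for (_ <- 0 until maxIter; (y, x) <- data) {
      val g = gradient(w, x, y)
      w = w.indices.map(i => w(i) - learningRate * g(i)).toArray
    }
    w
  }
}
```

The `gradient` function mirrors `gradientFunction` in the MLI listing; only the loop around it is an assumption.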
MLI Functionality
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, [DFC]
Clustering: K-Means, [DP-Means]
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
ML Tools: Cross Validation, Evaluation Metrics
Vision
MLlib
MLI
ML Optimizer
Release Plan
Build a Classifier for X
What you want to do
What you have to do
✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation,….
✦ Potentially learn Spark/Hadoop/…
✦ Implement 3-4 algorithms
✦ Implement grid-search to find the right algorithm parameters
✦ Implement validation algorithms
✦ Experiment with different sampling-sizes, algorithms, features
✦ ….
and in the end
Ask For Help
Insight: A Declarative Approach
✦ End Users tell the system what they want, not how to get it
SQL → Result
MQL → Model
Example: Supervised Classification
var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = doClassify(X, y)
Algorithm Independent
ML Optimizer: A Search Problem
✦ System is responsible for searching through model space
✦ Opportunities for physical optimization
(Diagram labels: 5min, Boosting, SVM)
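One simple way to picture "searching through model space" is exhaustive search: train every candidate (an algorithm plus a setting of its knobs), score each on held-out data, and keep the best. The sketch below assumes exactly that; it is not the ML Optimizer's actual strategy, which the slides do not spell out, and `ModelSearch`, `Candidate`, and `search` are hypothetical names.

```scala
// Hypothetical sketch of model search as exhaustive search over candidates.
// Not the ML Optimizer's actual strategy, which the slides do not detail.
object ModelSearch {
  // A candidate couples a label (e.g. algorithm + knob settings) with a
  // training function that returns a predictor.
  final case class Candidate[X, Y](name: String, train: Seq[(X, Y)] => X => Y)

  // Fraction of validation examples the predictor gets right.
  def accuracy[X, Y](predict: X => Y, validation: Seq[(X, Y)]): Double =
    validation.count { case (x, y) => predict(x) == y }.toDouble / validation.size

  // Train every candidate, score it on held-out data, and keep the best
  // (ties go to the earlier candidate).
  def search[X, Y](candidates: Seq[Candidate[X, Y]], training: Seq[(X, Y)],
                   validation: Seq[(X, Y)]): (String, Double) =
    candidates
      .map(c => c.name -> accuracy(c.train(training), validation))
      .reduce((a, b) => if (b._2 > a._2) b else a)
}
```

The cost of this naive loop is one full training run per candidate, which is why the talk goes on to look at physical optimizations of the search.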
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
QueryB
Systems	
  Op2miza2on	
  of	
  Model	
  
Search
✦ Idea	
  from	
  databases	
  –	
  
shared	
  cursor!
35
Observation:
We tend to be I/O bound during model training.
A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
QueryA
QueryB
✦ Single	
  pass	
  over	
  the	
  
data,	
  many	
  models	
  
trained
Systems Optimization of Model Search

Observation: We tend to be I/O bound during model training.

✦ Idea from databases: shared cursor!
✦ Single pass over the data, many models trained
✦ Example: Logistic Regression via SGD

A  B  C
1  a  Dog
1  b  Cat
2  c  Cat
2  d  Cat
3  e  Dog
3  f  Horse
4  g  Monkey

[Figure: QueryA and QueryB over the table above]
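
The shared-cursor idea can be sketched in Python: several logistic regression models, differing only in learning rate, are updated by SGD from the same scan, so each row is read once per epoch no matter how many models are trained. The `CountingScan` wrapper, the learning rates, and the toy data are assumptions made for this illustration and do not come from MLbase.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features including a bias term, label)


class CountingScan:
    """Wraps a dataset and counts rows read, standing in for I/O cost."""

    def __init__(self, rows: List[Example]) -> None:
        self.rows = rows
        self.rows_read = 0

    def __iter__(self) -> Iterator[Example]:
        for row in self.rows:
            self.rows_read += 1
            yield row


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_shared(scan: CountingScan, learning_rates: List[float],
                 dim: int, epochs: int) -> List[List[float]]:
    """Train one model per learning rate; every model consumes each row as it is read."""
    weights = [[0.0] * dim for _ in learning_rates]
    for _ in range(epochs):
        for x, y in scan:  # one scan per epoch, shared by all models
            for w, lr in zip(weights, learning_rates):
                p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
                for j in range(dim):
                    w[j] += lr * (y - p) * x[j]  # SGD step on the log-likelihood
    return weights


# Toy data (made up): first feature is a constant bias input.
data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
scan = CountingScan(data)
weights = train_shared(scan, learning_rates=[0.01, 0.1, 1.0], dim=2, epochs=5)
print(scan.rows_read, len(weights))
```

Training the three models separately would read 60 rows over 5 epochs; the shared scan reads 20, at the cost of holding all models' weights in memory at once.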
Relationship with MLI

[Figure: MLbase architecture diagram. Labels: Spark, MLlib, MLI, ML Optimizer, MQL; End User, User, Declarative ML Task, result (e.g., fn-model & summary); ML Developer, ML Contract + Code; Master Server, Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics, LLP, PLP; Master, Slaves, DMX Runtime (×4)]

✦ MLI provides common interface for all algorithms
✦ Contracts: Meta-data for algorithms written against MLI
  ✦ Type (e.g., classification)
  ✦ Parameters
  ✦ Runtime (e.g., O(n))
  ✦ Input-Specification
  ✦ Output-Specification
  ✦ …
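
One way to picture a contract is as a small metadata record that an optimizer can filter on. The sketch below is hypothetical throughout: the `Contract` class, its field names, the registry entries, and the runtime strings are placeholders chosen to mirror the bullet list above, not MLbase's actual contract format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Contract:
    """Metadata declared alongside an algorithm's code (field names are hypothetical)."""
    name: str
    task_type: str                      # e.g. "classification"
    parameters: Dict[str, List[float]]  # tunable parameters and candidate values
    runtime: str                        # cost as declared by the author, e.g. "O(n)"
    input_spec: str
    output_spec: str


def candidates_for(task_type: str, registry: List[Contract]) -> List[Contract]:
    """Return the algorithms whose declared type matches the requested task."""
    return [c for c in registry if c.task_type == task_type]


# Placeholder entries: parameter grids and runtime strings are illustrative only.
registry = [
    Contract("logistic_regression_sgd", "classification",
             {"learning_rate": [0.01, 0.1, 1.0]}, "O(n)",
             "numeric features with labels", "fn-model"),
    Contract("svm", "classification",
             {"regularization": [0.1, 1.0]}, "O(n)",
             "numeric features with labels", "fn-model"),
    Contract("kmeans", "clustering",
             {"k": [2.0, 5.0, 10.0]}, "O(n)",
             "numeric features", "cluster assignments"),
]

matches = candidates_for("classification", registry)
print([c.name for c in matches])
```

With metadata like this, a declarative request for classification can be narrowed to matching algorithms and their parameter grids without inspecting any algorithm's code.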
Vision
MLlib
MLI
ML Optimizer
Release Plan
Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia
First Release (Summer)
✦ MLlib: low-level ML library and underlying kernels
  ✦ Callable from Scala, Java
  ✦ Included as part of Spark
✦ MLI: API for feature extraction and ML algorithms
  ✦ Platform for ML development
  ✦ Includes a more extensive library, with a faster dev-cycle than MLlib
[Figure: MLbase architecture diagram]
Second Release (Winter)
✦ ML Optimizer: automated model selection
  ✦ Search problem over feature extractors and algorithms in MLI
  ✦ Contracts
  ✦ Restricted query language (MQL)
✦ Feature extraction for image data
[Figure: MLbase architecture diagram]
Future Directions
✦ Identify minimal set of ML operators
  ✦ Expose internals of ML algorithms to optimizer
✦ Unified language for End Users and ML Developers
✦ Plug-ins to Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities
  ✦ Time-series algorithms
  ✦ Graphical models
  ✦ Advanced Optimization (e.g., asynchronous computation)
  ✦ Online updates
  ✦ Sampling for efficiency
Contributions encouraged!
Berkeley, CA
August 29-30
www.mlbase.org
MLbase


MLlib sparkmeetup_8_6_13_final_reduced

  • 1. Evan  Sparks  and  Ameet  Talwalkar UC  Berkeley UC Berkeley baseML baseML M ML M
  • 5. Distributed   Compu2ng Big  Data Three  Converging  Trends Machine   Learning
  • 6. Distributed   Compu2ng Big  Data Three  Converging  Trends Machine   Learning MLbase
  • 8. Problem: Scalable implementations difficult for ML Developers… [Diagram labels: ML Developer, ML Contract + Code]
  • 15. Problem: ML is difficult for End Users… Too many algorithms… Too many knobs… Difficult to debug… Doesn't scale… Reliable, Fast, Accurate, Provable.
  • 18. MLbase (ML Experts, Systems Experts): 1. Easy scalable ML development (ML Developers) 2. User-friendly ML at scale (End Users). Along the way, we gain insight into data-intensive computing.
  • 23. Matlab Stack (Single Machine): Lapack, Matlab Interface
      ✦ Lapack: low-level Fortran linear algebra library
      ✦ Matlab Interface
          ✦ Higher-level abstractions for data access / processing
          ✦ More extensive functionality than Lapack
          ✦ Leverages Lapack whenever possible
      ✦ Similar stories for R and Python
  • 29. MLbase Stack: Runtime(s) (Spark), MLlib, MLI, ML Optimizer, shown beside the single-machine stack (Lapack, Matlab Interface)
      ✦ Spark: cluster computing system designed for iterative computation
      ✦ MLlib: low-level ML library in Spark
          ✦ Callable from Scala, Java
      ✦ MLI: API / platform for feature extraction and algorithm development
          ✦ Includes higher-level functionality with faster dev cycle than MLlib
      ✦ ML Optimizer: automates model selection
          ✦ Solves a search problem over feature extractors and algorithms in MLI
  • 33. MLbase Stack Status: Spark, MLlib, MLI, ML Optimizer. Goal 1: Summer Release. Goal 2: Winter Release.
      [Diagram: MLbase architecture with labels End User, ML Developer, Declarative ML Task, ML Contract + Code, Master Server (Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics, LLP, PLP), Slaves (four DMX Runtimes), result (e.g., fn-model & summary)]
  • 37. Example: MLlib
      ✦ Goal: Classification of text file
      ✦ Featurize data manually
      ✦ Calls MLlib's LR function

        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "SparkLR")

          //Load data from HDFS
          val data = sc.textFile(args(0)) //RDD[String]

          //User is responsible for formatting/featurizing/normalizing their RDD!
          val featurizedData: RDD[(Double,Array[Double])] = processData(data)

          //Train the model using MLlib.
          val model = new LogisticRegressionLocalRandomSGD()
            .setStepSize(0.1)
            .setNumIterations(50)
            .train(featurizedData)
        }
  • 41. Example: MLI
      ✦ Use built-in feature extraction functionality
      ✦ MLI Logistic Regression leverages MLlib
      ✦ Extensions:
          ✦ Embed in cross-validation routine
          ✦ Use different feature extractors / algorithms or write new ones

        def main(args: Array[String]): Unit = {
          val mc = new MLContext("local", "MLILR")

          //Read in file from HDFS
          val rawTextTable = mc.csvFile(args(0), Seq("class","text"))

          //Run feature extraction
          val classes = rawTextTable(??, "class")
          val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
          val featurizedTable = classes.zip(ngrams)

          //Classify the data using Logistic Regression.
          val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
        }
  • 42. Example: ML Optimizer
      ✦ User declaratively specifies task
      ✦ ML Optimizer searches through MLI

        var X = load("text_file", 2 to 10)
        var y = load("text_file", 1)
        var (fn-model, summary) = doClassify(X, y)
  • 48. Lay of the Land [Chart: Ease of use vs. Performance, Scalability; points plotted for Matlab, R; Mahout; GraphLab, VW; MLlib]
  • 50. MLlib
      Classification: Logistic Regression, Linear SVM (+L1, L2)
      Regression: Linear Regression (+Lasso, Ridge)
      Collaborative Filtering: Alternating Least Squares
      Clustering: K-Means
      Optimization Primitives: SGD, Parallel Gradient
      Included within Spark codebase
      ✦ Unlike Mahout/Hadoop
      ✦ Part of Spark 0.8 release
      ✦ Continued support via Spark project
  • 55. MLlib Performance
      ✦ Walltime: elapsed time to execute task
      ✦ Weak scaling
          ✦ fix problem size per processor
          ✦ ideally: constant walltime as we grow cluster
      ✦ Strong scaling
          ✦ fix total problem size
          ✦ ideally: linear speedup as we grow cluster
      ✦ EC2 Experiments
          ✦ m2.4xlarge instances, up to 32 machine clusters
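The weak- and strong-scaling definitions on this slide reduce to two ratios. The sketch below is only an illustration in plain Scala: the object name, function names, and the walltimes in `main` are made-up placeholders, not measurements from the talk.

```scala
// Scaling metrics for a cluster experiment; all timings used in main are hypothetical.
object ScalingMetrics {
  // Strong scaling: total problem size is fixed, so the ideal speedup on p machines is p.
  def speedup(baselineSeconds: Double, seconds: Double): Double =
    baselineSeconds / seconds

  // Weak scaling: problem size per machine is fixed, so the ideal relative walltime is 1.0.
  def relativeWalltime(baselineSeconds: Double, seconds: Double): Double =
    seconds / baselineSeconds

  // Fraction of the ideal linear speedup actually achieved on `machines` machines.
  def strongScalingEfficiency(baselineSeconds: Double, seconds: Double, machines: Int): Double =
    speedup(baselineSeconds, seconds) / machines

  def main(args: Array[String]): Unit = {
    // Placeholder (machines -> walltime in seconds) pairs for a fixed-size problem.
    val hypotheticalWalltimes = Seq(1 -> 1200.0, 2 -> 640.0, 4 -> 350.0, 8 -> 200.0)
    val baseline = hypotheticalWalltimes.head._2
    for ((machines, seconds) <- hypotheticalWalltimes) {
      val s = speedup(baseline, seconds)
      val e = strongScalingEfficiency(baseline, seconds, machines)
      println(f"$machines%2d machines: speedup $s%.2f, efficiency $e%.2f")
    }
  }
}
```

An efficiency near 1.0 corresponds to the "ideal" line in a strong-scaling plot; a relative walltime near 1.0 plays the same role for weak scaling.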
  • 59. Logistic Regression - Weak Scaling
      ✦ Full dataset: 200K images, 160K dense features
      ✦ Similar weak scaling
      ✦ MLlib within a factor of 2 of VW's walltime
      [Figure: walltime (s) for MLbase (MLlib), VW, and Matlab at n = 6K, 12.5K, 25K, 50K, 100K, 200K with d = 160K]
      [Figure: Fig. 6: Weak scaling for logistic regression; relative walltime vs. # machines for MLbase (MLlib), VW, Ideal]
  • 63. Logistic Regression - Strong Scaling
      ✦ Fixed dataset: 50K images, 160K dense features
      ✦ MLlib exhibits better scaling properties
      ✦ MLlib faster than VW with 16 and 32 machines
      [Figure: Fig. 7: Walltime for strong scaling for logistic regression; walltime (s) for MLbase (MLlib), VW, and Matlab; groups for 1, 2, 4, 8, 16, and 32 machines]
      [Figure: Fig. 8: Strong scaling for logistic regression; speedup vs. # machines for MLbase (MLlib), VW, Ideal]
  • 68. ALS - Walltime
      ✦ Dataset: Scaled version of Netflix data (9X in size)
      ✦ Cluster: 9 machines
      ✦ MLlib an order of magnitude faster than Mahout
      ✦ MLlib within factor of 2 of GraphLab

        System      Walltime (seconds)
        Matlab      15443
        Mahout      4206
        GraphLab    291
        MLlib       481
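The two comparison bullets on this slide can be checked directly against the walltime table. A small sketch follows; the walltimes are the ones on the slide, while the object and function names are illustrative only.

```scala
object AlsWalltimeRatios {
  // Walltimes in seconds, copied from the table on the slide.
  val walltimeSeconds: Map[String, Double] = Map(
    "Matlab" -> 15443.0,
    "Mahout" -> 4206.0,
    "GraphLab" -> 291.0,
    "MLlib" -> 481.0)

  // How many times slower `system` is than `reference`.
  def slowdown(system: String, reference: String): Double =
    walltimeSeconds(system) / walltimeSeconds(reference)

  def main(args: Array[String]): Unit = {
    val mahoutVsMllib = slowdown("Mahout", "MLlib")     // roughly 8.7
    val mllibVsGraphLab = slowdown("MLlib", "GraphLab") // roughly 1.65
    println(f"Mahout / MLlib   = $mahoutVsMllib%.2f")
    println(f"MLlib / GraphLab = $mllibVsGraphLab%.2f")
  }
}
```

So "an order of magnitude" here means a little under 9x, and MLlib's walltime is about 1.65x GraphLab's, inside the stated factor of 2.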
  • 71. Deployment Considerations
      Vowpal Wabbit, GraphLab
      ✦ Data preparation specific to each program
      ✦ Non-trivial setup on cluster
      ✦ No fault tolerance
      MLlib
      ✦ Reads files from HDFS
      ✦ Launch/compile/run on cluster with a few commands
      ✦ RDDs are fault tolerant
  • 74. Lay of the Land [Chart: Ease of use vs. Performance, Scalability; points plotted for Matlab, R; Mahout; GraphLab, VW; MLlib; MLI]
  • 78. Current Options
      + Easy (resembles math, limited / no setup cost)
      + Sufficient for prototyping / writing papers
      - Ad-hoc, non-scalable scripts
      - Loss of translation upon re-implementation

      + Scalable and (sometimes) fast
      + Existing open-source library of ML algorithms
      - Difficult to set up, extend
  • 82. Examples (ML Developer, Code)
      'Distributed' Divide-Factor-Combine (DFC)
      ✦ Initial studies in MATLAB (not distributed)
      ✦ Distributed prototype involving compiled MATLAB
      Mahout ALS with Early Stopping
      ✦ Theory: simple if-statement (3 lines of code)
      ✦ Practice: sift through 7 files, nearly 1K lines of code
  • 87. Insight: Programming Abstractions
      ✦ Shield ML Developers from low-level details: provide familiar mathematical operators in distributed setting
      ✦ ML Developer API (MLI)
          ✦ Table Computation: MLTable
          ✦ Linear Algebra: MLSubMatrix
          ✦ Optimization Primitives: MLSolve
      ✦ MLI Examples:
          ✦ DFC: ~50 lines of code
          ✦ ALS: early stopping in 3 lines; < 40 lines total
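To make the "early stopping in 3 lines" point concrete, here is a minimal sketch of an iterative training loop with an early-stopping check. It is written in plain Scala and does not use MLI; `iterate`, `step`, `loss`, and `tol` are hypothetical names chosen for the illustration, and the stopping rule (loss change below a tolerance) is an assumption about what the if-statement might test.

```scala
object EarlyStopping {
  // Runs `step` up to maxIter times; stops once the loss changes by less than `tol`.
  // Returns the final state and the number of iterations executed.
  def iterate[S](init: S, maxIter: Int, tol: Double)(step: S => S)(loss: S => Double): (S, Int) = {
    var state = init
    var prevLoss = loss(state)
    var iters = 0
    while (iters < maxIter) {
      state = step(state)
      iters += 1
      val curLoss = loss(state)
      // The early-stopping check: the "simple if-statement".
      if (math.abs(prevLoss - curLoss) < tol) return (state, iters)
      prevLoss = curLoss
    }
    (state, iters)
  }

  def main(args: Array[String]): Unit = {
    // Toy problem: halve x each step and measure loss x^2.
    val (x, iters) = iterate(1.0, 100, 1e-6)(_ * 0.5)(v => v * v)
    println(s"stopped after $iters iterations at x = $x")
  }
}
```

When the training loop is this directly visible, the check is a local change; the slide's contrast is with systems where the loop is spread across many files.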
  • 89. Lines of Code

        Logistic Regression
        System           Lines of Code
        Matlab           11
        Vowpal Wabbit    721
        MLI              55

        Alternating Least Squares
        System           Lines of Code
        Matlab           20
        Mahout           865
        GraphLab         383
        MLI              32
  • 99. MLI Details
      OLD
        val x: RDD[Array[Double]]
        val x: RDD[spark.util.Vector]
        val x: RDD[breeze.linalg.Vector]
        val x: RDD[BIDMat.SMat]
      NEW
        val x: MLTable
      ✦ Generic interface for feature extraction
      ✦ Common interface to support an optimizer
      ✦ Abstract interface for arbitrary backends
  • 100. MLTable
      ✦ Flexibility when loading data
          ✦ e.g., CSV, JSON, XML
          ✦ Heterogeneous data across columns
          ✦ Missing data
          ✦ Feature extraction
      ✦ Common Interface
          ✦ Supports MapReduce and Relational Operators
      ✦ Inspired by DataFrames (R) and Pandas (Python)
  • 101. Feature Extraction

        def main(args: Array[String]): Unit = {
          val mc = new MLContext("local")

          //Read in table from file on HDFS.
          val rawTextTable = mc.textFile(args(0))

          //Run feature extraction on the raw text - get the top 30000 bigrams.
          val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

          //Cluster the data using K-Means.
          val kMeansModel = KMeans(featurizedTable, k=50)
        }

      [Figure: Fig. 3: Loading, featurizing, and learning clusters on a corpus (caption truncated on the slide)]
      [Figure: Fig. 2: MLTable API Illustration (cropped table; visible entries: matrixBatchMap, numRows, numCols)]
      [Figure: cropped table row, family "Shape": dims(mat), mat.numRows, mat.numCols; returns Int or (Int,Int)]
  • 102. MLSubMatrix
      ✦ Linear algebra on local partitions
          ✦ E.g., matrix-vector operations for mini-batch logistic regression
          ✦ E.g., solving linear system of equations for Alternating Least Squares
      ✦ Sparse and Dense Matrix Support
  • 103. Alternating Least Squares

      MATLAB (end of listing):

            parfor q=1:n
              Uq = U(Uinds{q},:);
              V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
            end
          end
        end

      MLI:

        object BroadcastALS extends Algorithm {
          def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
              m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
            val lambI = MLSubMatrix.eye(k).mul(lambda)
            var U = MLSubMatrix.rand(m, k)
            var V = MLSubMatrix.rand(n, k)
            var U_b = trainData.context.broadcast(U)
            var V_b = trainData.context.broadcast(V)
            for (iter <- 0 until maxIter) {
              U = trainData.matrixBatchMap(localALS(_, U_b.value, lambI, k))
              U_b = trainData.context.broadcast(U)
              V = trainDataTrans.matrixBatchMap(localALS(_, V_b.value, lambI, k))
              V_b = trainData.context.broadcast(V)
            }
            new ALSModel(U, V)
          }

          // Solves one regularized least-squares problem per row of the local partition.
          def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix, lambI: MLSubMatrix, k: Int): MLSubMatrix = {
            var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
            for (i <- 0 until trainDataPart.numRows) {
              val q = trainDataPart.rowID(i)
              val nz_inds = trainDataPart.nzCols(q)
              val Yq = Y(trainDataPart.nzCols(q), ??)
              localX(i, ??) = ((Yq.transpose times Yq) + lambI)
                .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
            }
            localX
          }
        }
  • 104. MLSolve
      ✦ Distributed implementations of common optimization patterns
          ✦ E.g., Stochastic Gradient Descent: applicable to summable ML losses
          ✦ E.g., L-BFGS: an approximate 2nd-order optimization method
          ✦ E.g., ADMM: decomposition / coordination procedure
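One common way to distribute SGD for a summable loss is to run a local SGD pass on each data partition and average the resulting weights. The sketch below illustrates that pattern for logistic regression in plain Scala collections. It is not MLSolve's implementation: the object, function names, and the averaging scheme are assumptions made for illustration, and partitions are processed sequentially here rather than on a cluster.

```scala
object LocalSgdSketch {
  type Example = (Array[Double], Double) // (features, label in {0.0, 1.0})

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Per-example logistic-loss gradient: x * (sigmoid(x . w) - y).
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val err = sigmoid(dot(x, w)) - y
    x.map(_ * err)
  }

  // One sequential SGD pass over a single partition, starting from `weights`.
  def localSgd(part: Seq[Example], weights: Array[Double], lr: Double): Array[Double] =
    part.foldLeft(weights) { case (w, (x, y)) =>
      val g = gradient(w, x, y)
      w.indices.map(i => w(i) - lr * g(i)).toArray
    }

  // Each round: run localSgd on every partition, then average the resulting weights.
  def train(parts: Seq[Seq[Example]], dim: Int, lr: Double, rounds: Int): Array[Double] = {
    var w = Array.fill(dim)(0.0)
    for (_ <- 0 until rounds) {
      val locals = parts.map(localSgd(_, w, lr)) // would run in parallel on a cluster
      w = Array.tabulate(dim)(j => locals.map(_(j)).sum / locals.size)
    }
    w
  }
}
```

Because the loss is a sum over examples, each partition can make progress independently between averaging steps, which is what makes the pattern a reasonable fit for a distributed setting.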
  • 105. Logistic Regression

      MATLAB (excerpt):

            grad = X' * (sigmoid(X * w) - y);
            w = w - learning_rate * grad;
          end
        end

        % applies sigmoid function component-wise on the vector x
        function s = sigmoid(x)
          s = 1 ./ (1 + exp(-1 .* x));
        end

      MLI:

        object LogisticRegression extends Algorithm {
          def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

          def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
            x.transpose * (sigmoid(x dot w) - y)
          }

          def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
            val d = data.numCols
            val params = SGDParams(initweights = MLSubMatrix.zeros(d, 1),
              maxIterations = p.maxIter, learningRate = p.learningRate,
              gradientFunction = gradientFunction)
            val weights = SGD(data, params)
            new LogRegModel(weights)
          }
        }

        object StochasticGradientDescent extends Optimizer {

          def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
              gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatr…
            var localWeights = weights
            for (i <- 0 to x.numRows) {
            …
  • 106. MLI Functionality
      Regression: Linear Regression (+Lasso, Ridge)
      Collaborative Filtering: Alternating Least Squares, [DFC]
      Clustering: K-Means, [DP-Means]
      Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
      Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
      Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
      ML Tools: Cross Validation, Evaluation Metrics
  • 111. What you want to do: Build a Classifier for X
      What you have to do
      ✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, …
      ✦ Potentially learn Spark/Hadoop/…
      ✦ Implement 3-4 algorithms
      ✦ Implement grid-search to find the right algorithm parameters
      ✦ Implement validation algorithms
      ✦ Experiment with different sampling-sizes, algorithms, features
      ✦ …
      and in the end: Ask For Help
  • 115. Insight: A Declarative Approach
      ✦ End Users tell the system what they want, not how to get it
      SQL → Result; MQL → Model
      Example: Supervised Classification (Algorithm Independent)

        var X = load("als_clinical", 2 to 10)
        var y = load("als_clinical", 1)
        var (fn-model, summary) = doClassify(X, y)
  • 116. ML Optimizer: A Search Problem
      ✦ System is responsible for searching through model space
      ✦ Opportunities for physical optimization
      [Diagram labels: 5min, Boosting, SVM]
  • 122. Systems Optimization of Model Search
      Observation: We tend to be I/O bound during model training.
      ✦ Idea from databases: shared cursor!
      ✦ Single pass over the data, many models trained
      ✦ Example: Logistic Regression via SGD

        A   B   C
        1   a   Dog
        1   b   Cat
        2   c   Cat
        2   d   Cat
        3   e   Dog
        3   f   Horse
        4   g   Monkey

      [Diagram labels: QueryA, QueryB]
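The "single pass, many models" idea can be sketched as one scan in which each example updates every candidate model before the scan advances. The code below is an illustrative plain-Scala sketch, not the ML Optimizer's implementation; the object and function names are hypothetical, and the candidate models differ only in learning rate as a stand-in for a real model search.

```scala
object SharedScanSketch {
  type Example = (Array[Double], Double) // (features, label in {0.0, 1.0})

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // One SGD update of a single logistic-regression model on a single example.
  def sgdStep(w: Array[Double], x: Array[Double], y: Double, lr: Double): Array[Double] = {
    val z = w.indices.map(i => w(i) * x(i)).sum
    val err = sigmoid(z) - y
    w.indices.map(i => w(i) - lr * err * x(i)).toArray
  }

  // Reads the data once; every example updates all candidate models before the
  // scan moves on, so the I/O cost is paid once rather than once per model.
  def trainAll(data: Iterator[Example], dim: Int, learningRates: Seq[Double]): Seq[Array[Double]] = {
    var models: Seq[Array[Double]] = learningRates.map(_ => Array.fill(dim)(0.0))
    for ((x, y) <- data) {
      models = models.zip(learningRates).map { case (w, lr) => sgdStep(w, x, y, lr) }
    }
    models
  }
}
```

Each model ends up with the same weights it would get from its own private pass over the data, so when training is I/O bound the extra models cost mainly compute, not additional reads.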
  • 126. Relationship with MLI
      ✦ MLI provides common interface for all algorithms
      ✦ Contracts: Meta-data for algorithms written against MLI
          ✦ Type (e.g., classification)
          ✦ Parameters
          ✦ Runtime (e.g., O(n))
          ✦ Input-Specification
          ✦ Output-Specification
          ✦ …
      [Diagram: MLbase architecture (End User, MQL, ML Developer, Master Server, Slaves) over Spark, MLlib, MLI, ML Optimizer]
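As a rough picture of what a contract might carry, the sketch below models the fields listed on the slide as a Scala case class and shows two things an optimizer could do with them. Everything here is hypothetical: the type names, field names, and helper functions are invented for illustration and are not the MLbase contract format.

```scala
object ContractSketch {
  sealed trait TaskType
  case object Classification extends TaskType
  case object Regression extends TaskType
  case object Clustering extends TaskType

  // Hypothetical contract record; the fields mirror the bullet list on the slide.
  final case class Contract(
      name: String,
      taskType: TaskType,
      parameters: Map[String, Seq[Double]], // candidate values per tunable parameter
      runtime: String,                      // e.g. "O(n)"
      inputSpec: String,
      outputSpec: String)

  // An optimizer could use contracts to narrow the search to algorithms that fit the task.
  def candidates(contracts: Seq[Contract], task: TaskType): Seq[Contract] =
    contracts.filter(_.taskType == task)

  // Number of parameter settings a grid search over one contract would enumerate.
  def gridSize(c: Contract): Int = c.parameters.values.map(_.size).product
}
```

With metadata like this, the search space for a declarative task is the union of the parameter grids of all matching algorithms, which is what turns model selection into a search problem.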
  • 128. Contributors
      ✦ John Duchi
      ✦ Michael Franklin
      ✦ Joseph Gonzalez
      ✦ Rean Griffith
      ✦ Michael Jordan
      ✦ Tim Kraska
      ✦ Xinghao Pan
      ✦ Virginia Smith
      ✦ Shivaram Venkataraman
      ✦ Matei Zaharia
  • 131. First Release (Summer)
      ✦ MLlib: low-level ML library and underlying kernels
          ✦ Callable from Scala, Java
          ✦ Included as part of Spark
      ✦ MLI: API for feature extraction and ML algorithms
          ✦ Platform for ML development
          ✦ Includes a more extensive library with a faster dev-cycle than MLlib
      [Diagram: MLbase architecture (End User, ML Developer, Master Server, Slaves) over Spark, MLlib, MLI, ML Optimizer]
  • 134. Second Release (Winter) ✦ ML Optimizer: automated model selection ✦ Search problem over feature extractors and algorithms in MLI ✦ Contracts ✦ Restricted query language (MQL) ✦ Feature extraction for image data [Figure: MLbase architecture diagram showing Spark, MLlib, MLI, and ML Optimizer]
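Framing model selection as a search over feature extractors and algorithms can be illustrated with a very small example. The sketch below uses plain exhaustive search on toy data and is not the ML Optimizer's actual strategy; the data, the `identity` and `square` extractors, the `Threshold` and `Majority` learners, and `select_model` are all hypothetical names introduced for illustration.

```python
from itertools import product

# Toy 1-D data (illustrative only): the label is 1 when |x| > 1.
TRAIN = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (1.5, 1), (2.0, 1)]
VALID = [(-1.8, 1), (-0.2, 0), (0.3, 0), (1.7, 1)]


# Hypothetical feature extractors.
def identity(x):
    return x


def square(x):
    return x * x


class Threshold:
    """Predict 1 when the feature exceeds a threshold chosen on training data."""

    def fit(self, feats, labels):
        best = None
        for t in sorted(feats):
            correct = sum((f > t) == bool(y) for f, y in zip(feats, labels))
            if best is None or correct > best[0]:
                best = (correct, t)
        self.t = best[1]
        return self

    def predict(self, f):
        return int(f > self.t)


class Majority:
    """Baseline that always predicts the most common training label."""

    def fit(self, feats, labels):
        self.label = int(sum(labels) * 2 >= len(labels))
        return self

    def predict(self, f):
        return self.label


def accuracy(model, extractor, data):
    return sum(model.predict(extractor(x)) == y for x, y in data) / len(data)


def select_model(extractors, algorithms, train, valid):
    """Try every (extractor, algorithm) pair; keep the best validation score."""
    labels = [y for _, y in train]
    best = None
    for extractor, algo in product(extractors, algorithms):
        feats = [extractor(x) for x, _ in train]
        model = algo().fit(feats, labels)
        score = accuracy(model, extractor, valid)
        if best is None or score > best[0]:
            best = (score, extractor.__name__, algo.__name__)
    return best


if __name__ == "__main__":
    print(select_model([identity, square], [Threshold, Majority], TRAIN, VALID))
```

Exhaustive search is only workable here because the space has four combinations; a larger space of extractors, algorithms, and parameters would call for a smarter search strategy.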
  • 140. Future Directions ✦ Identify minimal set of ML operators ✦ Expose internals of ML algorithms to optimizer ✦ Unified language for End Users and ML Developers ✦ Plug-ins to Python, R ✦ Visualization for unsupervised learning and exploration ✦ Advanced ML capabilities ✦ Time-series algorithms ✦ Graphical models ✦ Advanced optimization (e.g., asynchronous computation) ✦ Online updates ✦ Sampling for efficiency
  • 141. Contributions encouraged! Berkeley, CA, August 29-30. www.mlbase.org