Apache Spark MLlib provides scalable implementations of popular machine learning algorithms, which let users train models on large datasets and iterate quickly. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications, such as Ads CTR prediction and deep neural networks, require solving problems with billions of parameters over huge amounts of data. This requirement far exceeds the capacity of existing MLlib algorithms, many of which use L-BFGS as the underlying solver. To fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters within the Spark SQL framework, where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with real-world datasets and requirements. We will also discuss how this approach can be applied to other ML algorithms.
8. MLlib logistic regression
Per-iteration workflow (Driver/Controller orchestrates, Executors/Workers compute):
1. Driver: initialize the coefficients.
2. Driver: broadcast the coefficients to the executors.
3. Each executor: compute the loss and gradient for each instance, and sum them up locally.
4. Reduce from the executors to get lossSum and gradientSum on the driver.
5. Driver: handle regularization and use L-BFGS/OWL-QN to find the next step.
6. Loop steps 2-5 until convergence; the driver then holds the final model coefficients.
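For concreteness, the loop above can be sketched as follows. This is a minimal sketch under assumptions (Instance, train, and the plain RDD.aggregate call are illustrative helpers, not MLlib's actual source); MLlib aggregates the per-instance losses and gradients in essentially this way and hands the summed result to breeze's L-BFGS/OWL-QN on the driver.

import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.{DiffFunction, LBFGS}
import org.apache.spark.rdd.RDD

case class Instance(label: Double, features: Array[Double])

def train(data: RDD[Instance], numFeatures: Int, maxIter: Int = 100): BDV[Double] = {
  val sc = data.sparkContext
  val numInstances = data.count()

  // Loss and gradient of the (unregularized) binary logistic loss,
  // aggregated across all executors.
  val costFun = new DiffFunction[BDV[Double]] {
    override def calculate(coef: BDV[Double]): (Double, BDV[Double]) = {
      val bcCoef = sc.broadcast(coef.toArray)              // broadcast coefficients
      val zero = (0.0, new Array[Double](numFeatures))
      val (lossSum, gradientSum) = data.aggregate(zero)(
        // per-instance contribution, summed locally on each executor
        seqOp = (acc: (Double, Array[Double]), inst: Instance) => {
          val (loss, grad) = acc
          val w = bcCoef.value
          var margin = 0.0
          var i = 0
          while (i < numFeatures) { margin += w(i) * inst.features(i); i += 1 }
          val prob = 1.0 / (1.0 + math.exp(-margin))
          i = 0
          while (i < numFeatures) { grad(i) += (prob - inst.label) * inst.features(i); i += 1 }
          (loss + math.log1p(math.exp(margin)) - inst.label * margin, grad)
        },
        // single flat reduce of the executors' partial sums on the driver
        combOp = (a: (Double, Array[Double]), b: (Double, Array[Double])) => {
          var i = 0
          while (i < numFeatures) { a._2(i) += b._2(i); i += 1 }
          (a._1 + b._1, a._2)
        })
      bcCoef.destroy()
      (lossSum / numInstances, new BDV(gradientSum) / numInstances.toDouble)
    }
  }

  // Driver-side L-BFGS (regularization omitted); loops until convergence.
  val lbfgs = new LBFGS[BDV[Double]](maxIter = maxIter, m = 10, tolerance = 1e-6)
  lbfgs.minimize(costFun, BDV.zeros[Double](numFeatures))
}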
9. Bottleneck 1
(Same per-iteration workflow diagram as slide 8. The bottleneck highlighted here is the single flat reduce: every executor's lossSum and gradientSum is sent straight to the driver, which becomes expensive when the gradient has a massive number of dimensions.)
10. Solution 1
Replace the single flat reduce with a multi-level reduction: each executor still computes the loss and gradient for each of its instances and sums them up locally; those local results are first reduced into partial lossSum and gradientSum values across groups of executors, and the partial results are then reduced to obtain the final lossSum and gradientSum.
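In Spark, this multi-level reduction is what RDD.treeAggregate provides. A minimal sketch, assuming the same data, numFeatures, seqOp, and combOp as in the sketch after slide 8; only the aggregation call changes:

// depth = 2 (the default) inserts a round of partial reduces on the executors,
// matching the "partial lossSum / gradientSum" boxes above, before the final
// combine on the driver; larger depths add more rounds and keep every single
// reduce step small.
val zero = (0.0, new Array[Double](numFeatures))
val (lossSum, gradientSum) = data.treeAggregate(zero)(seqOp, combOp, depth = 2)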
11. Bottleneck 2
(Same per-iteration workflow diagram as slide 8. The remaining bottleneck is the driver-side step: L-BFGS/OWL-QN must keep the full coefficient vector and its update history in a single machine's memory, which does not scale to billions of parameters.)
12. Outline
• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work
15. Distributed Vector
• Definition:
class DistributedVector(
    val values: RDD[Vector],
    val sizePerPart: Int,
    val numPartitions: Int,
    val size: Long)
• Linear algebra operations:
– a * x + y
– a * x + b * y + c * z + …
– x.dot(y)
– x.norm
– …
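A minimal sketch (not the actual spark-vlbfgs implementation) of how such operations can be carried out block by block on the values: RDD[Vector] that backs a DistributedVector. Corresponding blocks of two identically partitioned vectors are zipped and combined locally, so a full-length vector never has to fit on a single machine:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// a * x + y, block by block (assumes x and y have identical partitioning)
def axpy(a: Double, x: RDD[Vector], y: RDD[Vector]): RDD[Vector] =
  x.zip(y).map { case (xb, yb) =>
    val out = yb.toArray.clone()
    val xs = xb.toArray
    var i = 0
    while (i < out.length) { out(i) += a * xs(i); i += 1 }
    Vectors.dense(out)
  }

// x.dot(y): local dot products per block, then a scalar-only reduce to the driver
def dot(x: RDD[Vector], y: RDD[Vector]): Double =
  x.zip(y).map { case (xb, yb) =>
    val xs = xb.toArray
    val ys = yb.toArray
    var s = 0.0
    var i = 0
    while (i < xs.length) { s += xs(i) * ys(i); i += 1 }
    s
  }.sum()

// x.norm (Euclidean), reusing dot
def norm(x: RDD[Vector]): Double = math.sqrt(dot(x, x))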
35. Outline
• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work
36. APIs
val dataset: Dataset[_] =
spark.read.format("libsvm").load("data/a9a")
val trainer = new VLogisticRegression()
.setColsPerBlock(100)
.setRowsPerBlock(10)
.setColPartitions(3)
.setRowPartitions(3)
.setRegParam(0.5)
val model = trainer.fit(dataset)
println(s"Vector-free logistic regression coefficients:
${model.coefficients}")
37. Adaptive logistic regression
Number of features                          Optimization method
less than 4096                              WLS/IRLS
more than 4096, but less than 10 million    L-BFGS
more than 10 million                        VL-BFGS
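A minimal sketch of the dispatch the table describes (the method name and returned strings are illustrative, not an MLlib API):

// Pick the optimization method from the number of features,
// mirroring the thresholds in the table above.
def chooseSolver(numFeatures: Long): String = {
  if (numFeatures < 4096) "wls"                  // normal-equation WLS / IRLS
  else if (numFeatures < 10000000L) "l-bfgs"     // coefficients fit on the driver
  else "vl-bfgs"                                 // vector-free L-BFGS, distributed coefficients
}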
39. APIs
val dataset: Dataset[_] =
spark.read.format("libsvm").load("data/a9a")
val trainer = new LogisticRegression()
.setColsPerBlock(100)
.setRowsPerBlock(10)
.setColPartitions(3)
.setRowPartitions(3)
.setRegParam(0.5)
.setSolver("vl-bfgs")
val model = trainer.fit(dataset)
println(s"Vector-free logistic regression coefficients:
${model.coefficients}")
40. Outline
• Background
• Vector-free L-BFGS on Spark
• Logistic regression on vector-free L-BFGS
• Performance
• Integrate with existing MLlib
• Future work
41. Future work
• Reduce stages for each iteration.
• Performance improvements.
• Implement LinearRegression, SoftmaxRegression and MultilayerPerceptronClassifier based on vector-free L-BFGS/OWL-QN.
• Real-world use case for Ads CTR prediction with billions of parameters; we will share the experiences and lessons we learned:
https://github.com/yanboliang/spark-vlbfgs
42. Key takeaways
• Fully distributed computation and a fully distributed model; runs successfully without OOM.
• The VL-BFGS API is consistent with breeze L-BFGS.
• VL-BFGS outputs exactly the same solution as breeze L-BFGS.
• Does not require special components such as parameter servers.
• Pure library; it can be deployed and used easily.
• Benefits from existing Spark cluster operations and development experience.
• Keep using the same platform as the data size increases.