This presentation was developed for a course project at the Technical University of Madrid, in the course Massively Parallel Machine Learning, supervised by Alberto Mozo and Bruno Ordozgoiti.
5. Linear regression
• Data set - UCI YearPredictionMSD, text file
• 515,345 songs (90 numerical audio features, year)
• Core computation - normal equation terms and RMSE
Implemented as an outer product + vector addition per row.
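A minimal sketch of this computation (names are hypothetical; rows are assumed to be breeze vectors): the normal equation terms X^T X and X^T y are accumulated in one pass, each row contributing one outer product and one scaled vector.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.rdd.RDD

// Accumulate the normal-equation terms X^T X and X^T y in one pass:
// each row contributes the outer product x * x.t and the vector x * y.
def normalEquationTerms(data: RDD[(DenseVector[Double], Double)], d: Int)
    : (DenseMatrix[Double], DenseVector[Double]) =
  data.treeAggregate((DenseMatrix.zeros[Double](d, d), DenseVector.zeros[Double](d)))(
    seqOp = { case ((xtx, xty), (x, y)) =>
      (xtx += x * x.t, xty += x * y)   // outer product + vector addition
    },
    combOp = { case ((a1, b1), (a2, b2)) =>
      (a1 += a2, b1 += b2)             // merge per-partition partial sums
    })
```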
6. Workflow
Pipeline: read file (Spark SQL text) → RegexTokenizer → StandardScaler (center data) → solve normal equation (add L2 regularization; LAPACK) → evaluation (RMSE).
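The solve step could look like the following minimal sketch, assuming the aggregated terms from the previous slide; breeze's `\` operator hands the dense solve to LAPACK (via netlib) under the hood, matching the LAPACK box in the workflow.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// Solve the regularized normal equation (X^T X + lambda * I) theta = X^T y.
// breeze's \ operator delegates the dense solve to LAPACK.
def solveNormalEquation(xtx: DenseMatrix[Double],
                        xty: DenseVector[Double],
                        lambda: Double): DenseVector[Double] =
  (xtx + DenseMatrix.eye[Double](xtx.rows) * lambda) \ xty
```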
7. Validation
Spark ML linear regression with the normal equation solver vs. my implementation (both with 0.1 L2 regularization).
Randomly split the data set into 70% train + 30% test. The RMSEs on the test set are also nearly identical, with less than 0.5% difference.
9. Logistic regression
• Data set - UCI HIGGS, CSV file
• 11 million instances (21 low-level + 7 high-level numerical features, binary label)
• Core computation - gradient and Hessian matrix
treeReduce can reduce the pressure of the final aggregation on the driver.
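A sketch of that aggregation (hypothetical names; breeze rows assumed), using treeAggregate, the aggregate analogue of treeReduce: partial gradients and Hessians are merged in log-depth rounds on the executors, so the driver only combines a few pre-reduced results.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid
import org.apache.spark.rdd.RDD

// One pass computes the gradient and Hessian of the logistic loss.
def gradientAndHessian(data: RDD[(DenseVector[Double], Double)],
                       theta: DenseVector[Double], d: Int)
    : (DenseVector[Double], DenseMatrix[Double]) =
  data.treeAggregate((DenseVector.zeros[Double](d), DenseMatrix.zeros[Double](d, d)))(
    seqOp = { case ((g, h), (x, y)) =>
      val p = sigmoid(theta dot x)                    // predicted probability
      (g += x * (p - y), h += x * x.t * (p * (1.0 - p)))
    },
    combOp = { case ((g1, h1), (g2, h2)) => (g1 += g2, h1 += h2) },
    depth = 2)
```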
11. Validation
Spark ML logistic regression with L-BFGS vs. my implementation of Newton's method.
Randomly split the data set into 70% train + 30% test. The learned thetas are almost identical; the last component is the bias.
12. Hyper-parameter optimization
• Grid search to find the optimal hyper-parameters with the best generalization error
• Estimate generalization error
• k-Fold cross validation
A hyper-parameter is a parameter used in the training process but not part of the classifier itself. It controls what kind of parameters can or tend to be selected. For example, polynomial expansion makes it possible to learn a non-linear relationship between the label and the features.
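For reference (the project implements its own grid search, described on the next slides), Spark ML's built-in tuning utilities express the same grid-search-plus-k-fold idea; a minimal sketch, where trainDF is a hypothetical DataFrame with label/features columns:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Grid search with k-fold cross validation over the regularization strength.
val lr = new LogisticRegression().setMaxIter(100)
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.001, 0.01, 0.1, 0.5))
  .build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)              // 3-fold cross validation
// val model = cv.fit(trainDF) // picks the best grid point by CV error
```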
13. Grid search
• Grid - [polynomial expansion degree] × [L2 regularization]
• Polynomial expansion is a memory killer
• Degree 3 on 7 features results in 119 features
• Be careful when exploiting parallelism
To increase temporal locality, accesses to a data frame should be clustered in time. Note that polynomial expansion does not include the constant column.
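A minimal sketch of this step using Spark ML's PolynomialExpansion (the column names are placeholders):

```scala
import org.apache.spark.ml.feature.PolynomialExpansion

// Degree-3 expansion of 7 input features yields 119 output features
// (all monomials up to degree 3, without the constant column).
val poly = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(3)
// val expanded = poly.transform(df) // df has a "features" vector column
```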
16. Experiments
Setup: Spark 2.0.2 in standalone mode, 3 cores + 5 GB memory, with an exact copy of the read-in file on each worker (see http://spark.apache.org/docs/latest/cluster-overview.html).
In total, we have 3 physical machines with 12 GB memory + 8 cores.
Driver - executes the Scala program.
Worker - executes tasks.
Executor - each application runs one or more executor processes on a worker node.
Job - triggered by an action.
Task - a unit of work executed on an executor; the number of tasks is related to the number of partitions, which is >= the number of blocks (128 MB each). If set manually, use 2-4 partitions per CPU in your cluster.
Stage - a set of tasks.
Local file - must have the same path + content on each worker node.
17. Performance test
• ML settings
  • Logistic regression on HIGGS
  • Train-test split, 70% + 30%
  • Only the 7 high-level features were used
• Test unit 1 - 100 iterations of full gradient descent + training error on the training set, initial learning rate 0.001, L2 regularization 0.1
• Test unit 2 - compute the confusion matrix on the test set and make predictions
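A minimal sketch of the update performed in test unit 1 (names are hypothetical; whether the bias component is also regularized is glossed over here):

```scala
import breeze.linalg.DenseVector

// One full-batch gradient-descent step with L2 regularization:
// theta <- theta - alpha * (grad + lambda * theta).
// `grad` would come from a cluster-wide aggregation as sketched earlier.
def step(theta: DenseVector[Double], grad: DenseVector[Double],
         alpha: Double, lambda: Double): DenseVector[Double] =
  theta - (grad + theta * lambda) * alpha
```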
18. Performance and speedup curve
[Figure: training time (s) and training speed-up vs. number of executors (local, 1 executor, ..., 5 executors); speed-up values: 1, 1.822, 2.372, 2.693, 3.641, 4.43.]
Running time vs. #executors (average of 2 runs). Except for local, all tests have enough memory.
Local mode does not have enough memory, so the data cannot be persisted in memory; thus its running time is much higher. Adding more executors reduces the running time roughly linearly.
19. Grid search
• 10% of the original data, i.e., 1.1 million instances, 7 high-level features only
• Grid
  • Polynomial degrees - 1, 2, 3
  • L2 regularization - 0, 0.001, 0.01, 0.1, 0.5
• 3-fold cross validation
• 100 iterations of gradient descent with initial learning rate 0.01
• 2 executors with 10 GB memory + 5 cores each
• Result - 4400 s training time, final test accuracy 62.4%
Confusion matrix: true positives 117,605; true negatives 88,664; false positives 66,529; false negatives 57,786.
20. Conclusion
• Persist data that is used more than once, incl. when the lineage has branches (see the sketch after this list)
• Change default cluster settings, e.g., the default memory per executor is 1 GB
• Make use of the Spark UI to find bottlenecks
• Use Spark built-in functions if possible; they are good examples when implementing missing functions
• Don't use accumulators in a transformation unless you only need approximate values
• Always start from small data to debug faster
• Future work - obey the train-test split
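To illustrate the persist advice above, a minimal hypothetical helper (not the project's actual code) that pins a reused DataFrame in memory and releases it afterwards:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Persist data that is read more than once (e.g., across gradient-descent
// iterations or both branches of a train/test split), then release it.
def withPersisted[T](df: DataFrame)(body: DataFrame => T): T = {
  df.persist(StorageLevel.MEMORY_AND_DISK)
  try body(df) finally df.unpersist()
}
```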
23. Training time vs. # executors
[Figure: training time (s) and test accuracy vs. number of executors (local, 1 executor, ..., 5 executors).]