Parallel implementation of ML algorithms on Spark
Dalei Li
EIT Digital
https://github.com/lidalei/LinearLogisticRegSpark
Overview
• Linear regression + l2 regularization
• Normal equation
• Logistic regression + l2 regularization
• Gradient descent
• Newton’s method
• Hyper-parameter optimization
• Experiments
Tools
• IntelliJ + sbt
• Scala 2.11.8 + Spark 2.0.1
Linear regression
• Problem formulation
• Closed-form solution
• Computation reformulation
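The formulas behind these three bullets are not present in the text. As a sketch under the usual ridge-regression conventions (the symbols X, y, theta, lambda and n are assumptions, and the scaling of the regularization term may differ from the author's), they could read:

```latex
\begin{aligned}
\text{Problem formulation:}\quad & \min_{\theta}\; \lVert X\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \\
\text{Closed-form solution:}\quad & \theta = \left( X^\top X + \lambda I \right)^{-1} X^\top y \\
\text{Computation reformulation:}\quad & X^\top X = \sum_{i=1}^{n} x_i x_i^\top, \qquad X^\top y = \sum_{i=1}^{n} y_i\, x_i
\end{aligned}
```

Both sums run over instances, so each partition can compute a partial sum and the partial sums can be merged by plain addition.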
Linear regression
• Data set - UCI YearPredictionMSD, text file
• 515,345 songs, each with 90 numerical audio features and the year as label
• Core computation - normal equation terms and RMSE
Note: implemented with outer product + vector addition.
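A minimal sketch of how the outer product + vector addition could accumulate the two normal-equation terms. It uses plain Scala collections as a stand-in for an RDD (each inner Seq plays the role of one partition); the names `Terms`, `add`, `merge` and `compute` are hypothetical and not taken from the author's repository.

```scala
object NormalEquationTerms {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Partial sums of the two normal-equation terms: X^T X and X^T y.
  final case class Terms(xtx: Mat, xty: Vec)

  def zero(d: Int): Terms = Terms(Array.fill(d, d)(0.0), Array.fill(d)(0.0))

  // Fold one instance into the partial sums:
  // outer product x * x^T added to X^T X, and y * x added to X^T y.
  def add(t: Terms, x: Vec, y: Double): Terms = {
    for (i <- x.indices) {
      t.xty(i) += y * x(i)
      for (j <- x.indices) t.xtx(i)(j) += x(i) * x(j)
    }
    t
  }

  // Merge the partial sums of two partitions element-wise.
  def merge(a: Terms, b: Terms): Terms = {
    for (i <- a.xty.indices) {
      a.xty(i) += b.xty(i)
      for (j <- a.xty.indices) a.xtx(i)(j) += b.xtx(i)(j)
    }
    a
  }

  // Assumption: a Seq of Seqs stands in for a partitioned RDD of (features, label).
  def compute(partitions: Seq[Seq[(Vec, Double)]], d: Int): Terms =
    partitions
      .map(_.foldLeft(zero(d)) { case (t, (x, y)) => add(t, x, y) })
      .foldLeft(zero(d))(merge)
}
```

On Spark, `add` and `merge` have the shape of the `seqOp` and `combOp` arguments of `RDD.treeAggregate`, so the same two functions could be passed there instead of the local folds.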
Workflow
• Read file - Spark SQL text
• RegexTokenizer
• StandardScaler - center data
• Solve normal equation - add l2 regularization, LAPACK
• Evaluation - RMSE
Validation
Spark ML linear regression with the normal-equation solver vs. my implementation (both with l2 regularization 0.1).
The data set was randomly split into train 70% + test 30%. The RMSEs on the test set are nearly identical, with less than 0.5% difference.
Logistic regression
• Problem formulation
• Gradient descent
• Newton’s method
• Computation reformulation - gradient and Hessian matrix
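The formulas for this slide are not present in the text. As a sketch under standard conventions (labels y_i in {0, 1}, learning rate alpha, regularization strength lambda; these symbols and the 1/2 scaling of the penalty are assumptions), the four bullets could be written as:

```latex
\begin{aligned}
h_i &= \sigma(\theta^\top x_i) = \frac{1}{1 + e^{-\theta^\top x_i}} \\
J(\theta) &= -\sum_{i=1}^{n} \left[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \right] + \frac{\lambda}{2} \lVert \theta \rVert_2^2 \\
\nabla J(\theta) &= \sum_{i=1}^{n} (h_i - y_i)\, x_i + \lambda \theta \\
H(\theta) &= \sum_{i=1}^{n} h_i (1 - h_i)\, x_i x_i^\top + \lambda I \\
\text{Gradient descent:}\quad \theta &\leftarrow \theta - \alpha \nabla J(\theta) \\
\text{Newton's method:}\quad \theta &\leftarrow \theta - H(\theta)^{-1} \nabla J(\theta)
\end{aligned}
```

As with linear regression, the gradient and the Hessian are sums over instances, which is what makes them easy to compute per partition and merge.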
Logistic regression
• Data set - UCI HIGGS, csv file
• 11 million instances, each with 21 low-level + 7 high-level numerical features and a binary label
• Core computation - gradient and Hessian matrix
Note: treeReduce can reduce the load of the final merge operations on the driver.
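A minimal sketch of the gradient computation with a tree-style merge, in plain Scala. Each instance contributes (h - y) * x, partitions are summed locally, and the partial sums are combined pairwise level by level, which is the idea behind treeReduce. The names are hypothetical, a Seq of Seqs stands in for a partitioned RDD, and regularizing every component (including a possible bias term) is an assumption.

```scala
object LogisticGradient {
  type Vec = Array[Double]

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum

  def plus(a: Vec, b: Vec): Vec = a.indices.map(i => a(i) + b(i)).toArray

  // Contribution of one instance to the gradient: (h - y) * x.
  def pointGradient(theta: Vec, x: Vec, y: Double): Vec = {
    val err = sigmoid(dot(theta, x)) - y
    x.map(_ * err)
  }

  // Combine partial results pairwise, level by level, instead of
  // merging all of them in one place at once.
  @annotation.tailrec
  def treeCombine(parts: Seq[Vec]): Vec = {
    require(parts.nonEmpty, "need at least one partial result")
    if (parts.size == 1) parts.head
    else treeCombine(parts.grouped(2).map(_.reduce(plus)).toVector)
  }

  // Assumption: every partition is non-empty.
  def gradient(partitions: Seq[Seq[(Vec, Double)]], theta: Vec, lambda: Double): Vec = {
    val partial = partitions.map(
      _.map { case (x, y) => pointGradient(theta, x, y) }.reduce(plus))
    // l2 term: lambda * theta, added once after the data term is merged.
    plus(treeCombine(partial), theta.map(_ * lambda))
  }
}
```

On an RDD of instances, the local `reduce` plus `treeCombine` would correspond to a single `map(...).treeReduce(plus)` call.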
Workflow
• Read file - Spark SQL csv
• VectorAssembler
• DF to RDD - Scala case class Instance(features, label)
• Gradient descent / Newton’s method - gradient descent: add l2 regularization; Newton’s method: append all-one column
• Evaluation - cross entropy, confusion matrix
Validation
Spark ML logistic regression with L-BFGS vs. my implementation of Newton’s method.
The data set was randomly split into train 70% + test 30%. The learned parameter vectors (theta) are almost identical; the last component is the bias.
Hyper-parameter optimization
• Grid search to find the hyper-parameters with the best generalization error
• Estimate generalization error
• k-Fold cross validation
Note: a hyper-parameter is a parameter used in the training process that is not part of the classifier itself. It controls which parameters can be, or tend to be, selected. For example, polynomial expansion makes it possible to learn a non-linear relationship between the label and the features.
Grid search
• Grid - [polynomial expansion degree] x [l2 regularization]
• Polynomial expansion is a memory killer
• Degree 3 on 7 features results in 119 features
• Be careful when exploiting parallelism
Notes: to increase temporal locality, accesses to a data frame are clustered in time. Polynomial expansion does not include a constant column.
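The 119 figure can be checked by counting monomials: with n features and maximum degree d there are C(n + d, d) monomials of total degree 0..d, and dropping the constant term leaves C(n + d, d) - 1. A small sketch (the object and function names are hypothetical):

```scala
object PolyFeatureCount {
  // Binomial coefficient C(n, k); after step i the accumulator holds
  // C(n - k + i, i), so every division is exact.
  def choose(n: Int, k: Int): BigInt =
    (1 to k).foldLeft(BigInt(1)) { (acc, i) => acc * (n - k + i) / i }

  // Number of monomials of total degree 1..degree in `features` variables
  // (the constant term is excluded, hence the -1).
  def expandedSize(features: Int, degree: Int): BigInt =
    choose(features + degree, degree) - 1
}
```

`PolyFeatureCount.expandedSize(7, 3)` gives C(10, 3) - 1 = 119, matching the slide. The same formula gives 4,494 columns for degree 3 on all 28 HIGGS features, which illustrates why the expansion is a memory killer.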
K-Fold
• DF (Spark SQL data frame) - persist, then randomSplit
• map => [([train_i], test)], i.e., [([DF], DF)]
• map => [(train, test)], i.e., [(union[DF], DF)]
[Figure: k-Fold and PE]
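The k-fold construction can be sketched on a local collection: split the data into k folds, then pair each fold (as test set) with the union of the remaining folds (as training set). This is a plain-Scala illustration with hypothetical names; it shuffles a Seq instead of calling randomSplit on a persisted data frame, so fold sizes are balanced rather than random.

```scala
object KFold {
  // Returns k (train, test) pairs; every element appears in exactly one test set.
  def split[A](data: Seq[A], k: Int, seed: Long): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2, "k-fold needs at least 2 folds")
    val rng = new scala.util.Random(seed)
    val shuffled = rng.shuffle(data)

    // Fold i holds the shuffled elements whose position is congruent to i mod k.
    val folds: IndexedSeq[Seq[A]] = (0 until k).map { i =>
      shuffled.zipWithIndex.collect { case (a, idx) if idx % k == i => a }
    }

    // For each fold: test = that fold, train = union of all other folds.
    (0 until k).map { i =>
      val train = (0 until k).filter(_ != i).flatMap(j => folds(j))
      (train, folds(i))
    }
  }
}
```

With data frames, the union step would be a chain of `union` calls over the k - 1 training folds, which is why persisting the source data frame before splitting matters: every fold is read k times.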
Experiments
• Spark 2.0.2 standalone mode
• In total, we have 3 physical machines with 12GB mem + 8 cores.
[Figure: cluster diagram, see http://spark.apache.org/docs/latest/cluster-overview.html; annotations: 3 cores + 5GB mem, exact copy of read-in file]
Notes:
• Driver - executes the Scala program
• Worker - executes tasks
• Executor - each application runs one or more executor processes on a worker node
• Job - triggered by an action
• Stage - a set of tasks
• Task - a unit of work executed on an executor. The number of tasks is tied to the number of partitions, which is >= the number of blocks (128MB each). If set manually, use 2-4 partitions for each CPU in your cluster.
• Local file - must have the same path + content on each worker node.
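The two partition rules of thumb above reduce to simple arithmetic. A small sketch with hypothetical names; the file size and core count passed in are whatever applies to your own cluster, not values from the slides:

```scala
object PartitionHint {
  // Block size quoted on the slide.
  val BlockBytes: Long = 128L * 1024 * 1024

  // Lower bound from the input: one partition per 128MB block, rounded up.
  def blocks(fileBytes: Long): Long = (fileBytes + BlockBytes - 1) / BlockBytes

  // Rule of thumb from the slide: 2-4 partitions per CPU core.
  def manualRange(totalCores: Int): (Int, Int) = (2 * totalCores, 4 * totalCores)
}
```

For example, a hypothetical 8 GB input spans 64 blocks, and a hypothetical 8 cores would suggest between 16 and 32 partitions when set manually.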
Performance test
• ML settings
• Logistic regression on HIGGS
• Train-test split, 70% + 30%
• Only the 7 high-level features were used
• Test unit 1 - 100 iterations of full gradient descent + training error on the training set, initial learning rate 0.001, l2 regularization 0.1
• Test unit 2 - make predictions on the test set and compute the confusion matrix
Performance and speedup curve
[Figure: training time (s) and training speedup for local mode and 1-5 executors]
Training speedup:
• local - 1
• 1 executor - 1.822
• 2 executors - 2.372
• 3 executors - 2.693
• 4 executors - 3.641
• 5 executors - 4.43
Running time vs. #executors (average of 2 runs). Except for local mode, all tests have enough memory.
Local mode does not have enough memory, so the data cannot be persisted in memory. Thus, its running time is much higher.
Adding executors reduces the running time roughly linearly.
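Since local mode is slowed down by the lack of memory, it can also be informative to re-baseline the speedup values listed on this slide on the 1-executor run. A sketch with hypothetical names; it assumes the six values map to the six configurations in chart order:

```scala
object Speedup {
  // Training speedups as listed on the slide, relative to local mode.
  // Assumption: the values map to the configurations in chart order.
  val vsLocal: Seq[(String, Double)] = Seq(
    "local" -> 1.0, "1 executor" -> 1.822, "2 executors" -> 2.372,
    "3 executors" -> 2.693, "4 executors" -> 3.641, "5 executors" -> 4.43)

  // Speedup of n executors relative to 1 executor, for n = 1..5.
  val vsOneExecutor: Seq[Double] = {
    val base = vsLocal(1)._2
    vsLocal.drop(1).map { case (_, s) => s / base }
  }

  // Parallel efficiency: relative speedup divided by the number of executors.
  val efficiency: Seq[Double] =
    vsOneExecutor.zipWithIndex.map { case (s, i) => s / (i + 1) }
}
```

`Speedup.vsOneExecutor` then shows how much of the gain comes from adding executors rather than from fitting the data in memory.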
Grid search
• 10% of the original data, i.e., 1.1 million instances, 7 high-level features only
• Grid
• Polynomial degrees - 1, 2, 3
• l2 regularization - 0, 0.001, 0.01, 0.1, 0.5
• 3-Fold cross validation
• 100 iterations of gradient descent with initial learning rate 0.01
• 2 executors with 10GB mem + 5 cores each
• Result - 4400s training time, final test accuracy 62.4%
Confusion matrix: truePositive: 117605, trueNegative: 88664, falsePositive: 66529, falseNegative: 57786
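The reported accuracy can be re-derived from the confusion matrix counts. A small sketch (the `Confusion` case class and its method names are hypothetical, the counts are the ones reported above):

```scala
object ConfusionMetrics {
  final case class Confusion(tp: Long, tn: Long, fp: Long, fn: Long) {
    def total: Long = tp + tn + fp + fn
    // Fraction of all instances that were classified correctly.
    def accuracy: Double = (tp + tn).toDouble / total
    // Fraction of predicted positives that are truly positive.
    def precision: Double = tp.toDouble / (tp + fp)
    // Fraction of true positives that were found.
    def recall: Double = tp.toDouble / (tp + fn)
  }

  // Counts reported on the slide.
  val reported = Confusion(tp = 117605L, tn = 88664L, fp = 66529L, fn = 57786L)
}
```

The counts sum to 330,584 instances, about 30% of 1.1 million, and (117605 + 88664) / 330584 is approximately 0.624, consistent with the reported 62.4% test accuracy.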
Conclusion
• Persist data that is used more than once (incl. when the computation branches)
• Change the default cluster settings, e.g., the default memory per executor is 1GB
• Make use of the Spark UI to find bottlenecks
• Use Spark built-in functions if possible
• They are good examples when implementing missing functions
• Don’t use accumulators in a transformation, unless approximations are all you need
• Always start from small data to debug faster
• Future work - obey train-test split
Q&A
• Thank you!
• Useful links
• Master - spark://ip:7077, e.g., spark://b2.lxd:7077
• Cluster - http://ip:8080/
• Spark UI - http://ip:4040/
• https://spark.apache.org/docs/latest/programming-guide.html
• http://spark.apache.org/docs/latest/submitting-applications.html, package a jar - sbt package
Backend slides
Training time vs. # executors
[Figure: training time (s) and test accuracy for local mode and 1-5 executors]
Spark UI - Jobs timeline
Spark UI - Executor summary
Numerical stability

Implementation of linear regression and logistic regression on Spark
