Automation and optimisation of machine learning pipelines on top of Apache Spark
Peter Rudenko
@peter_rud
peter.rudenko@datarobot.com
DataRobot data pipeline
● Data upload
● Exploratory data analysis
● Training models, selecting best models & hyperparameters
● Models leaderboard
● Prediction API
Our journey to Apache Spark
PySpark vs Scala API?
(Diagram: a Python process talks to the JVM Spark worker; instructions cross via py4j, data crosses via IPC/serde.)
● Sending instructions, e.g. df.agg({"age": "max"}): FAST!
● Sending data, e.g. data.map(lambda x: …) or data.filter(lambda x: …): SLOW!
Our journey to Apache Spark
RDD vs DataFrame
RDD[Row[(Double, String, Vector)]]
vs.
DataFrame with typed columns (DoubleType, nullable=true), (StringType, nullable=true), (VectorType, nullable=true), each carrying Attributes (in spark-1.4).
Attributes:
● NumericAttribute
● NominalAttribute (Ordinal)
● BinaryAttribute
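For context, here is a minimal sketch (assuming Spark 1.4+, a SQLContext named sqlContext, and hypothetical "age"/"gender" columns) of how such attributes can be attached to DataFrame columns as metadata:

import org.apache.spark.ml.attribute.{NumericAttribute, NominalAttribute}
import org.apache.spark.sql.functions.col

val df = sqlContext.createDataFrame(Seq(
  (39.0, "male"), (27.0, "female")
)).toDF("age", "gender")

// Describe each column: "age" is numeric, "gender" is nominal with two levels.
val ageAttr    = NumericAttribute.defaultAttr.withName("age")
val genderAttr = NominalAttribute.defaultAttr
  .withName("gender").withValues("male", "female")

// Values are unchanged; only column metadata is added, so downstream
// ML transformers can tell numeric, nominal and binary columns apart.
val withMeta = df.select(
  col("age").as("age", ageAttr.toMetadata()),
  col("gender").as("gender", genderAttr.toMetadata()))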
Our journey to Apache Spark
MLlib vs ML
MLlib:
● Low-level implementations of machine learning algorithms
● Based on RDDs
ML:
● High-level pipeline abstractions
● Based on DataFrames
● Uses MLlib under the hood.
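For contrast, a minimal sketch of the two APIs side by side (labeledRdd and trainDf are hypothetical inputs: an RDD[LabeledPoint] and a DataFrame with "features"/"label" columns):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.ml.classification.LogisticRegression

// mllib: low-level, trains directly on an RDD[LabeledPoint]
val mllibModel = new LogisticRegressionWithLBFGS().run(labeledRdd)

// ml: DataFrame-based estimator that can also be used as a pipeline stage
val mlModel = new LogisticRegression().setMaxIter(10).fit(trainDf)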
Columnar format
● Compression
● Scan optimization
● Null-imputer improvement:
- val na2mean = { value: Double =>
-   if (value.isNaN) meanValue else value
- }
- dataset.withColumn(map(outputCol),
-   callUDF(na2mean, DoubleType, dataset(map(inputCol))))
+ dataset.na.fill(map(inputCols).zip(meanValues).toMap)
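A minimal sketch of the na.fill-based imputation shown above (dataset and the column names are hypothetical): compute the column means once, then fill missing values in a single DataFrame operation instead of a per-row UDF:

import org.apache.spark.sql.functions.{col, mean}

val inputCols = Seq("age", "income")

// One aggregation over the dataset yields all the means.
val meanRow = dataset
  .select(inputCols.map(c => mean(col(c)).alias(c)): _*)
  .first()

val fillMap = inputCols.zipWithIndex
  .map { case (c, i) => c -> meanRow.getDouble(i) }
  .toMap

// Replaces nulls/NaNs in the listed numeric columns with their means.
val imputed = dataset.na.fill(fillMap)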
Typical machine learning pipeline
● Feature extraction
● Missing values imputation
● Variable encoding
● Dimensionality reduction
● Training the model (finding the optimal model parameters)
● Selecting hyperparameters (model evaluation on some metric: AUC, R2, RMSE, etc.)
(Diagram: train data (features + label) and test data (features) flow into the pipeline; the model state (parameters + hyperparameters) produces the prediction.)
Introducing Blueprint
Pipeline config
pipeline: {
  "1": {
    input: ["NUM"],
    class: "org.apache.spark.ml.feature.MeanImputor"
  },
  "2": {
    input: ["CAT"],
    class: "org.apache.spark.ml.feature.OneHotEncoder"
  },
  "3": {
    input: ["1", "2"],
    class: "org.apache.spark.ml.feature.VectorAssembler"
  },
  "4": {
    input: "3",
    class: "org.apache.spark.ml.classification.LogisticRegression",
    params: {
      optimizer: "LBFGS",
      regParam: [0.5, 0.1, 0.01, 0.001]
    }
  }
}
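To make the mapping concrete, a hedged sketch of what stages "3" and "4" of this config could expand to in plain spark.ml code (the MeanImputor and OneHotEncoder stages are omitted; "num_imputed" and "cat_encoded" are hypothetical output columns of stages "1" and "2"):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.ParamGridBuilder

// Stage "3": assemble the outputs of the upstream stages into one vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("num_imputed", "cat_encoded"))
  .setOutputCol("features")

// Stage "4": the regParam list becomes a search grid.
val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.5, 0.1, 0.01, 0.001))
  .build()

val pipeline = new Pipeline().setStages(Array(assembler, lr))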
Introducing Blueprint
(Diagram: Blueprint runs on a YARN cluster via the Spark jobserver.)
Transformer (pure function)
abstract class Transformer extends PipelineStage with Params {
/**
* Transforms the dataset with provided parameter map as additional parameters.
* @param dataset input dataset
* @param paramMap additional parameters, overwrite embedded params
* @return transformed dataset
*/
def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
}
Example:
(new HashingTF).
setInputCol("categorical_column").
setOutputCol("Hashing_tf_1").
setNumFeatures(1<<20).
transform(data)
Estimator
abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
/**
* Fits a single model to the input data with optional parameters.
*
* @param dataset input dataset
* @param paramPairs Optional list of param pairs.
* These values override any specified in this Estimator's
embedded ParamMap.
* @return fitted model
*/
@varargs
def fit(dataset: DataFrame, paramPairs: ParamPair[_]*): M = {
val map = ParamMap(paramPairs: _*)
fit(dataset, map)
}
}
Example:
val oneHotEncoderModel = (new OneHotEncoder).
setInputCol("vector_col").
fit(trainingData)
oneHotEncoderModel.transform(trainingData)
oneHotEncoderModel.transform(testData)
Estimator => Transformer
Predictor
An Estimator that predicts a value.
(Hierarchy: Predictor → Classifier → ProbabilisticClassifier; Predictor → Regressor.)
Evaluator
abstract class Evaluator extends Identifiable {
/**
* Evaluates the output.
*
* @param dataset a dataset that contains labels/observations and predictions.
* @param paramMap parameter map that specifies the input columns and output metrics
* @return metric
*/
def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
}
Example:
val areaUnderROC = (new BinaryClassificationEvaluator).
setScoreCol("prediction").
evaluate(data)
Pipeline
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
(Diagram: Input data → Tokenizer → HashingTF → LogisticRegression; fit produces a PipelineModel.)
A Pipeline is an Estimator that encapsulates other transformers / estimators.
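A short follow-up sketch (trainingData and testData are hypothetical DataFrames with "text" and "label" columns): fitting the pipeline yields a PipelineModel, itself a Transformer that can score new data:

// Runs tokenizer -> hashingTF -> lr.fit on the training data.
val model = pipeline.fit(trainingData)

// The fitted PipelineModel applies the same stages to unseen data.
val scored = model.transform(testData)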
CrossValidator
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
val cvModel = crossval.fit(training.toDF)
(Diagram: Input data → Tokenizer → HashingTF → LogisticRegression, cross-validated over numFeatures {10, 100, 1000} and regParam {0.1, 0.01} with 3 folds; fit produces a CrossValidatorModel.)
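A short follow-up sketch: the fitted CrossValidatorModel keeps the best of the six parameter combinations (refit on the full training data) and can be applied directly to new data (test is a hypothetical test set):

// The winning pipeline, refit with the best numFeatures/regParam pair.
val best = cvModel.bestModel

// CrossValidatorModel delegates scoring to that best model.
val predictions = cvModel.transform(test.toDF)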
Pluggable backend
● H2O
● Flink
● DeepLearning4J
● http://keystone-ml.org/
● etc.
Optimization
● Disable k-fold cross validation
● Minimize redundant pre-processing
● Parallel grid search
● Parallel DAG pipeline
● Pluggable optimizer
● Non-gridsearch hyperparameter optimization (Bayesian & hypergradient):
http://arxiv.org/pdf/1502.03492v2.pdf
http://arxiv.org/pdf/1206.2944.pdf
http://arxiv.org/pdf/1502.05700v1.pdf
Minimize redundant pre-processing
(Diagram: two grid points, regParam 0.1 and regParam 0.01, share the same pre-processing.)
val rdd1 = rdd.map(function)
val rdd2 = rdd.map(function)
rdd1 != rdd2
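A minimal sketch of the fix implied here (trainWithRegParam is a hypothetical helper): compute the shared pre-processing once, cache it, reuse it for every grid point, and unpersist when done:

// Shared pre-processing is materialized once instead of once per grid point.
val preprocessed = rdd.map(function).cache()

val model01  = trainWithRegParam(preprocessed, 0.1)
val model001 = trainWithRegParam(preprocessed, 0.01)

// Don't forget to clean up after yourself.
preprocessed.unpersist()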
Summary
● A good model != a good result. Feature engineering is the key.
● Spark provides a good abstraction, but you need to tune some parts to achieve good performance.
● The ml pipeline API gives you pluggable and reusable building blocks.
● Don't forget to clean up after yourself (unpersist cached data).
Thanks!
Demo & Q&A