The document discusses optimizing machine learning pipelines in Apache Spark. It introduces Blueprint, which provides a configurable pipeline API to string together transformers, estimators, predictors, and evaluators. This allows reusable machine learning components to be assembled into complete pipelines. The document also discusses opportunities to optimize pipelines, such as minimizing redundant preprocessing, enabling parallel grid search, and using more efficient hyperparameter optimization techniques.
3. Our journey to Apache Spark
PySpark vs Scala API?
[Diagram: each Spark worker runs a JVM alongside a Python process]
Sending instructions (crosses the boundary once, via py4j):
df.agg({"age": "max"})
FAST!
Sending data (every row crosses the boundary, via IPC/serde):
data.map(lambda x: …)
data.filter(lambda x: …)
SLOW!
Instructions travel over py4j; data travels over inter-process serialization, which is why row-by-row Python lambdas are slow.
4. Our journey to Apache Spark
RDD vs DataFrame
RDD: RDD[Row[(Double, String, Vector)]]
DataFrame (in spark-1.4): each column carries a type, nullability, and ML attributes:
● (DoubleType, nullable=true) + Attributes
● (StringType, nullable=true) + Attributes
● (VectorType, nullable=true) + Attributes
Attributes:
● NumericAttribute
● NominalAttribute (Ordinal)
● BinaryAttribute
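For illustration, a minimal sketch (not from the deck) of attaching one of these attributes to a column via DataFrame metadata; the column name "age" and its range are assumptions:

import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}
import org.apache.spark.sql.functions.col

// Given a DataFrame `df` with a numeric "age" column (hypothetical):
val ageAttr = NumericAttribute.defaultAttr.
  withName("age").
  withMin(0.0).
  withMax(120.0)

// Attach the attribute to the column through its metadata...
val withMeta = df.select(col("age").as("age", ageAttr.toMetadata()))

// ...and recover it from the schema later.
val recovered = Attribute.fromStructField(withMeta.schema("age"))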
5. Our journey to Apache Spark
MLlib vs ML
MLlib:
● Low-level implementations of machine learning algorithms
● Based on RDDs
ML:
● High-level pipeline abstractions
● Based on DataFrames
● Uses MLlib under the hood
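A rough sketch of the same model trained through each API (data and column names assumed):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// MLlib: low-level, operates on an RDD of LabeledPoint.
def trainWithMllib(training: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

// ML: high-level, operates on a DataFrame with "features"/"label" columns;
// the fitted model is itself a pipeline stage (a Transformer).
def trainWithMl(training: DataFrame) =
  new LogisticRegression().setMaxIter(10).fit(training)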
6. Columnar format
● Compression
● Scan optimization
● Null-imputer improvement
- val na2mean = { value: Double =>
-   if (value.isNaN) meanValue else value
- }
- dataset.withColumn(map(outputCol),
-   callUDF(na2mean, DoubleType, dataset(map(inputCol))))
+ dataset.na.fill(map(inputCols).zip(meanValues).toMap)
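A self-contained sketch of the new approach, mean imputation via DataFrameNaFunctions, with column names assumed and NaN masked out of the mean:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{isnan, mean, when}

// Replace null/NaN in the given numeric columns with the column mean,
// computed over non-NaN values only.
def imputeWithMeans(dataset: DataFrame, inputCols: Seq[String]): DataFrame = {
  val meansRow = dataset.
    select(inputCols.map(c => mean(when(!isnan(dataset(c)), dataset(c)))): _*).
    first()
  val meanValues = inputCols.indices.map(meansRow.getDouble)
  dataset.na.fill(inputCols.zip(meanValues).toMap)
}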
7. Typical machine learning pipeline
● Feature extraction
● Missing-values imputation
● Variable encoding
● Dimensionality reduction
● Training the model (finding the optimal model parameters)
● Selecting hyperparameters
[Diagram: train data (features + label) and test data (features) feed the model state (parameters + hyperparameters), which produces a prediction; the model is evaluated on some metric (AUC, R2, RMSE, etc.) — see the sketch below]
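A minimal sketch of this flow using the Pipeline API shown later in the deck; the split ratio, seed, and names are illustrative assumptions:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.sql.DataFrame

// Fit on train data (features + label), predict on held-out features,
// then score the predictions on some metric (AUC, R2, RMSE, ...).
def trainAndScore(data: DataFrame, pipeline: Pipeline, evaluator: Evaluator): Double = {
  val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
  val model: PipelineModel = pipeline.fit(train) // model state: parameters + hyperparameters
  val predictions = model.transform(test)
  evaluator.evaluate(predictions)
}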
11. Transformer (pure function)
abstract class Transformer extends PipelineStage with Params {
/**
* Transforms the dataset with provided parameter map as additional parameters.
* @param dataset input dataset
* @param paramMap additional parameters, overwrite embedded params
* @return transformed dataset
*/
def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
}
Example:
(new HashingTF).
  setInputCol("categorical_column").
  setOutputCol("Hashing_tf_1").
  setNumFeatures(1 << 20).
  transform(data)
12. Estimator
abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
/**
* Fits a single model to the input data with optional parameters.
*
* @param dataset input dataset
* @param paramPairs Optional list of param pairs.
* These values override any specified in this Estimator's embedded ParamMap.
* @return fitted model
*/
@varargs
def fit(dataset: DataFrame, paramPairs: ParamPair[_]*): M = {
val map = ParamMap(paramPairs: _*)
fit(dataset, map)
}
}
Example:
val oneHotEncoderModel = (new OneHotEncoder).
  setInputCol("vector_col").
  fit(trainingData)
oneHotEncoderModel.transform(trainingData)
oneHotEncoderModel.transform(testData)
Estimator => Transformer
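Another Estimator => Transformer example, sketched with StandardScaler (column names assumed): fit learns state from the training data, and the resulting model is a Transformer.

import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}

// Fitting learns per-feature statistics from the training data...
val scalerModel: StandardScalerModel = (new StandardScaler).
  setInputCol("features").
  setOutputCol("scaled_features").
  fit(trainingData)

// ...and the fitted model applies the same state to any dataset.
scalerModel.transform(trainingData)
scalerModel.transform(testData)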
14. Evaluator
abstract class Evaluator extends Identifiable {
/**
* Evaluates the output.
*
* @param dataset a dataset that contains labels/observations and predictions.
* @param paramMap parameter map that specifies the input columns and output metrics
* @return metric
*/
def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
}
Example:
val areaUnderROC = (new BinaryClassificationEvaluator).
  setScoreCol("prediction").
  evaluate(data)
15. Pipeline
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
[Diagram: input data → Tokenizer → HashingTF → LogisticRegression; fit produces a PipelineModel]
A Pipeline is itself an Estimator that encapsulates other transformers / estimators.
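Fitting and reusing the pipeline above then comes down to two calls (training/test DataFrames assumed):

// Fitting the whole pipeline returns a PipelineModel (a Transformer)...
val model = pipeline.fit(training)

// ...which runs every fitted stage in order on new data.
val predictions = model.transform(test)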
16. CrossValidator
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
val cvModel = crossval.fit(training.toDF)
[Diagram: input data → Tokenizer → HashingTF → LogisticRegression; CrossValidator fits this pipeline across folds over numFeatures ∈ {10, 100, 1000} and regParam ∈ {0.1, 0.01}, producing a CrossValidatorModel]
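The fitted CrossValidatorModel wraps the best of the 3 × 2 = 6 parameter combinations (each evaluated over 3 folds) and can be used directly; a sketch, with a test dataset assumed:

import org.apache.spark.ml.PipelineModel

// Predict with the best configuration found by the grid search:
val predictions = cvModel.transform(test.toDF)

// Or inspect the winning pipeline itself:
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]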
20. Summary
● A good model != a good result. Feature engineering is the key.
● Spark provides good abstractions, but some parts need tuning to achieve good performance.
● The ml pipeline API gives pluggable and reusable building blocks.
● Don’t forget to clean up after yourself (unpersist cached data).
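A minimal sketch of that last point (names as in the CrossValidator example):

// Cache data that is reused across pipeline stages or CV folds...
training.cache()
val cvModel = crossval.fit(training.toDF)

// ...and release executor memory once it is no longer needed.
training.unpersist()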