The document discusses optimizing machine learning pipelines in Apache Spark. It introduces Blueprint, which provides a configurable pipeline API to string together transformers, estimators, predictors, and evaluators. This allows reusable machine learning components to be assembled into complete pipelines. The document also discusses opportunities to optimize pipelines, such as minimizing redundant preprocessing, enabling parallel grid search, and using more efficient hyperparameter optimization techniques.
3. Our journey to Apache Spark
PySpark vs Scala API?
[Diagram: each Spark worker runs a JVM alongside a Python process]
Sending instructions (crosses the boundary once, via py4j):
df.agg({"age": "max"})
FAST!
Sending data (every row crosses the boundary, via IPC/serde):
data.map(lambda x: …)
data.filter(lambda x: …)
SLOW!
Instructions travel over py4j; data travels over inter-process serialization, which is why row-by-row Python lambdas are slow.
4. Our journey to Apache Spark
RDD vs DataFrame
RDD: RDD[Row[(Double, String, Vector)]]
DataFrame (in spark-1.4): each column carries a type, nullability, and ML attributes:
● (DoubleType, nullable=true) + Attributes
● (StringType, nullable=true) + Attributes
● (VectorType, nullable=true) + Attributes
Attributes:
● NumericAttribute
● NominalAttribute (Ordinal)
● BinaryAttribute
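For illustration, a minimal sketch (not from the deck) of attaching one of these attributes to a column via DataFrame metadata; the column name "age" and its range are assumptions:

import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}
import org.apache.spark.sql.functions.col

// Given a DataFrame `df` with a numeric "age" column (hypothetical):
val ageAttr = NumericAttribute.defaultAttr.
  withName("age").
  withMin(0.0).
  withMax(120.0)

// Attach the attribute to the column through its metadata...
val withMeta = df.select(col("age").as("age", ageAttr.toMetadata()))

// ...and recover it from the schema later.
val recovered = Attribute.fromStructField(withMeta.schema("age"))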
5. Our journey to Apache Spark
MLlib vs ML
MLlib:
● Low-level implementations of machine learning algorithms
● Based on RDDs
ML:
● High-level pipeline abstractions
● Based on DataFrames
● Uses MLlib under the hood
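A rough sketch of the same model trained through each API (data and column names assumed):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// MLlib: low-level, operates on an RDD of LabeledPoint.
def trainWithMllib(training: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

// ML: high-level, operates on a DataFrame with "features"/"label" columns;
// the fitted model is itself a pipeline stage (a Transformer).
def trainWithMl(training: DataFrame) =
  new LogisticRegression().setMaxIter(10).fit(training)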
6. Columnar format
● Compression
● Scan optimization
● Null-imputer improvement
- val na2mean = { value: Double =>
-   if (value.isNaN) meanValue else value
- }
- dataset.withColumn(map(outputCol),
-   callUDF(na2mean, DoubleType, dataset(map(inputCol))))
+ dataset.na.fill(map(inputCols).zip(meanValues).toMap)
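A self-contained sketch of the new approach, mean imputation via DataFrameNaFunctions, with column names assumed and NaN masked out of the mean:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{isnan, mean, when}

// Replace null/NaN in the given numeric columns with the column mean,
// computed over non-NaN values only.
def imputeWithMeans(dataset: DataFrame, inputCols: Seq[String]): DataFrame = {
  val meansRow = dataset.
    select(inputCols.map(c => mean(when(!isnan(dataset(c)), dataset(c)))): _*).
    first()
  val meanValues = inputCols.indices.map(meansRow.getDouble)
  dataset.na.fill(inputCols.zip(meanValues).toMap)
}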
7. Typical machine learning pipeline
● Feature extraction
● Missing-values imputation
● Variable encoding
● Dimensionality reduction
● Training the model (finding the optimal model parameters)
● Selecting hyperparameters
[Diagram: train data (features + label) and test data (features) feed the model state (parameters + hyperparameters), which produces a prediction; the model is evaluated on some metric (AUC, R2, RMSE, etc.) — see the sketch below]
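A minimal sketch of this flow using the Pipeline API shown later in the deck; the split ratio, seed, and names are illustrative assumptions:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.Evaluator
import org.apache.spark.sql.DataFrame

// Fit on train data (features + label), predict on held-out features,
// then score the predictions on some metric (AUC, R2, RMSE, ...).
def trainAndScore(data: DataFrame, pipeline: Pipeline, evaluator: Evaluator): Double = {
  val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
  val model: PipelineModel = pipeline.fit(train) // model state: parameters + hyperparameters
  val predictions = model.transform(test)
  evaluator.evaluate(predictions)
}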
11. Transformer (pure function)
abstract class Transformer extends PipelineStage with Params {
/**
* Transforms the dataset with provided parameter map as additional parameters.
* @param dataset input dataset
* @param paramMap additional parameters, overwrite embedded params
* @return transformed dataset
*/
def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame
}
Example:
(new HashingTF).
  setInputCol("categorical_column").
  setOutputCol("Hashing_tf_1").
  setNumFeatures(1 << 20).
  transform(data)
12. Estimator
abstract class Estimator[M <: Model[M]] extends PipelineStage with Params {
/**
* Fits a single model to the input data with optional parameters.
*
* @param dataset input dataset
* @param paramPairs Optional list of param pairs.
* These values override any specified in this Estimator's embedded ParamMap.
* @return fitted model
*/
@varargs
def fit(dataset: DataFrame, paramPairs: ParamPair[_]*): M = {
val map = ParamMap(paramPairs: _*)
fit(dataset, map)
}
}
Example:
val oneHotEncoderModel = (new OneHotEncoder).
  setInputCol("vector_col").
  fit(trainingData)
oneHotEncoderModel.transform(trainingData)
oneHotEncoderModel.transform(testData)
Estimator => Transformer
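Another Estimator => Transformer example, sketched with StandardScaler (column names assumed): fit learns state from the training data, and the resulting model is a Transformer.

import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}

// Fitting learns per-feature statistics from the training data...
val scalerModel: StandardScalerModel = (new StandardScaler).
  setInputCol("features").
  setOutputCol("scaled_features").
  fit(trainingData)

// ...and the fitted model applies the same state to any dataset.
scalerModel.transform(trainingData)
scalerModel.transform(testData)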
14. Evaluator
abstract class Evaluator extends Identifiable {
/**
* Evaluates the output.
*
* @param dataset a dataset that contains labels/observations and predictions.
* @param paramMap parameter map that specifies the input columns and output metrics
* @return metric
*/
def evaluate(dataset: DataFrame, paramMap: ParamMap): Double
}
Example:
val areaUnderROC = (new BinaryClassificationEvaluator).
  setScoreCol("prediction").
  evaluate(data)
15. Pipeline
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
[Diagram: input data → Tokenizer → HashingTF → LogisticRegression; fit produces a PipelineModel]
A Pipeline is itself an Estimator that encapsulates other transformers / estimators.
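Fitting and reusing the pipeline above then comes down to two calls (training/test DataFrames assumed):

// Fitting the whole pipeline returns a PipelineModel (a Transformer)...
val model = pipeline.fit(training)

// ...which runs every fitted stage in order on new data.
val predictions = model.transform(test)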
16. CrossValidator
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
val cvModel = crossval.fit(training.toDF)
[Diagram: input data → Tokenizer → HashingTF → LogisticRegression; CrossValidator fits this pipeline across folds over numFeatures ∈ {10, 100, 1000} and regParam ∈ {0.1, 0.01}, producing a CrossValidatorModel]
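The fitted CrossValidatorModel wraps the best of the 3 × 2 = 6 parameter combinations (each evaluated over 3 folds) and can be used directly; a sketch, with a test dataset assumed:

import org.apache.spark.ml.PipelineModel

// Predict with the best configuration found by the grid search:
val predictions = cvModel.transform(test.toDF)

// Or inspect the winning pipeline itself:
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]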
20. Summary
● A good model != a good result. Feature engineering is the key.
● Spark provides good abstractions, but some parts need tuning to achieve good performance.
● The ml pipeline API gives pluggable and reusable building blocks.
● Don’t forget to clean up after yourself (unpersist cached data).
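A minimal sketch of that last point (names as in the CrossValidator example):

// Cache data that is reused across pipeline stages or CV folds...
training.cache()
val cvModel = crossval.fit(training.toDF)

// ...and release executor memory once it is no longer needed.
training.unpersist()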