R, Scikit-Learn and Apache Spark ML - What APIs to Use
1. R, Scikit-Learn and Apache Spark ML -
What difference does it make?
Villu Ruusmann
Openscoring OÜ
2. Overview
● Identifying long-standing, high-value opportunities in the
applied predictive analytics domain
● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools
+ A couple of tips if you're looking to buy or sell a VW Golf
7. Elements of reproducibility
Standardized, human- and machine-readable descriptions:
● Dataset
● Data pre- and post-processing steps:
○ From real-life input table (SQL, CSV) to model
○ From model to real-life output table
● Model
● Statistics
8. Calling R from within Apache Spark
1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R
model; download output and parse into result RDD
3. Destroy R runtime
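The three steps above can be sketched as a single partition-scoring function. This is a minimal sketch, not the actual integration code: the CSV wire format, the scoring-script contract (input path in, output path out) and the `Rscript` binary name are all assumptions. Each call pays the full R startup and teardown cost, which is why it belongs in mapPartitions (one R runtime per partition) rather than map (one per row).

```python
import csv
import os
import subprocess
import tempfile

def score_partition_with_r(rows, r_script, r_binary="Rscript"):
    """Score one RDD partition by round-tripping through an external R process."""
    # Step 2a: format and upload the input partition
    fd, in_path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    out_path = in_path + ".out"
    try:
        # Steps 1 + 2b: launch the R runtime and execute the R model script
        subprocess.check_call([r_binary, r_script, in_path, out_path])
        # Step 2c: download the output and parse it into result rows
        with open(out_path) as f:
            return list(csv.reader(f))
    finally:
        # Step 3: the R runtime dies with the subprocess; clean up scratch files
        os.remove(in_path)
        if os.path.exists(out_path):
            os.remove(out_path)

# Inside Spark: result_rdd = input_rdd.mapPartitions(
#     lambda part: score_partition_with_r(part, "score.R"))
```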
9. Calling Scikit-Learn from within Apache Spark
1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
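Unlike the R case, no subprocess round-trip is needed: the estimator is reachable in-process from a PySpark worker (or, from the JVM side, through the Python/C API). A hedged sketch of steps 1-3, assuming `model` is any fitted estimator exposing `predict()`:

```python
import numpy as np

def score_partition_with_sklearn(rows, model):
    """Score one RDD partition in-process with a fitted Scikit-Learn estimator."""
    # Step 1: format the input partition as a dense numpy.array
    X = np.array(list(rows), dtype=float)
    if X.size == 0:
        return iter([])
    # Step 2: invoke Scikit-Learn
    y = model.predict(X)
    # Step 3: parse the output numpy.array into result rows
    return iter(y.tolist())

# Inside Spark: result_rdd = input_rdd.mapPartitions(
#     lambda part: score_partition_with_sklearn(part, model))
```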
10. API prioritization
Training << Maintenance ~ Deployment
One-time activity << Repeated activities
Short-term << Long-term
11. JPMML - Java PMML API
● Conversion API
● Maintenance API
● Execution API
○ Interpreted mode
○ Translated + compiled ("Transpiled") mode
● Serving API
○ Integrations with popular Big Data frameworks
○ REST web service
12. Calling JPMML-Spark from within Apache Spark
org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
org.apache.spark.sql.Dataset<Row> input = ..;
org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
13. The case study
Predicting the price of VW Golf cars using GBT algorithms:
● 71 columns:
○ A continuous label: log(price)
○ Two string and four numeric categorical features
○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
○ 153'978 complete cases
○ 116'480 incomplete (i.e. with missing values) cases
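The complete/incomplete split above is just a drop-missing-rows count. A small pandas sketch (file name and NA markers taken from the training slides; the function itself is illustrative, not from the deck):

```python
import pandas as pd

def case_counts(df):
    """Split a data frame into complete cases and cases with missing values,
    as in the 153'978 / 116'480 breakdown above."""
    complete = int(df.dropna().shape[0])
    total = int(df.shape[0])
    return {"total": total, "complete": complete, "incomplete": total - complete}

# e.g. case_counts(pd.read_csv("cars.tsv", sep="\t", na_values=["N/A"]))
```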
15. R training and conversion API
#library("caret")
library("gbm")
library("r2pmml")
cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
cars[, factor_col] = as.factor(cars[, factor_col])
}
# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
r2pmml(cars.gbm, "gbm.pmml")
16. Scikit-Learn training and conversion API
import pandas
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline
cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
mapper = DataFrameMapper(..)
regressor = ..
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])
pipeline = PMMLPipeline([
("mapper", mapper),
("regressor", tuner.best_estimator_)
])
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
17. Dataset
| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
| --- | --- | --- | --- | --- | --- |
| Abstraction | data.frame | lgb.Dataset | xgb.DMatrix | numpy.array | RDD&lt;Vector&gt; |
| Memory layout | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse |
| Data type | Any | double | float | float or double | double |
| Categorical values | As-is (factor) | Encoded | Binarized | Binarized | Binarized |
| Missing values | Yes | Pseudo (NaN) | Pseudo (NaN) | No | No |
18. LightGBM via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor
mapper = DataFrameMapper(
[(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
[(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
regressor.fit(transformed_cars, cars["price"],
categorical_feature = list(range(0, len(factor_columns))))
19. XGBoost via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor
mapper = DataFrameMapper(
[(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
[(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])
20. GBT algorithm (training)
| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
| --- | --- | --- | --- | --- | --- |
| Abstraction | gbm | LGBMRegressor | XGBRegressor | GradientBoostingRegressor | GBTRegressor |
| Parameterizability | Medium | High | High | Medium | Medium |
| Split type | Multi-way | Binary | Binary | Binary | Binary |
| Categorical values | "set contains" | "equals" | Pseudo ("equals") | Pseudo ("equals") | "equals" |
| Missing values | First-class | Pseudo | Pseudo | No | No |
25. Model measurement using JPMML
org.dmg.pmml.tree.TreeModel treeModel = ..;
treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){

    private int count = 0; // Number of Node elements

    private int maxDepth = 0; // Max "nesting depth" of Node elements

    @Override
    public VisitorAction visit(org.dmg.pmml.tree.Node node){
        this.count++;
        int depth = 0;
        for(org.dmg.pmml.PMMLObject parent : getParents()){
            if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
            depth++;
        }
        this.maxDepth = Math.max(this.maxDepth, depth);
        return super.visit(node);
    }
});
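Since PMML is plain XML, the same measurement can be sketched outside the JVM by walking the document directly. A hedged Python equivalent of the visitor above (element names per the PMML schema; tags matched by local name so the PMML namespace version does not matter):

```python
import xml.etree.ElementTree as ET

def tree_stats(pmml_xml):
    """Count Node elements and their maximum nesting depth, mirroring
    the count/maxDepth bookkeeping of the Java AbstractVisitor above."""
    root = ET.fromstring(pmml_xml)
    stats = {"count": 0, "max_depth": 0}

    def walk(elem, node_depth):
        for child in elem:
            # Strip an "{namespace}" prefix, if any, and compare local names
            if child.tag.rsplit("}", 1)[-1] == "Node":
                stats["count"] += 1
                # node_depth = number of enclosing Node elements (0 for the root Node)
                stats["max_depth"] = max(stats["max_depth"], node_depth)
                walk(child, node_depth + 1)
            else:
                walk(child, node_depth)

    walk(root, 0)
    return stats
```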
29. GBT algorithm (interpretation)
| | R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML |
| --- | --- | --- | --- | --- | --- |
| Feature importances | Direct | Direct | Transformed | Transformed | Transformed |
| Decision path | No | No(?) | No(?) | Transformed | Transformed |
| Model persistence | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text) |
| Model reusability | Good | Fair(?) | Good | Fair | Fair |
| Java API | No | No | Pseudo | No | Yes |
30. LightGBM feature importances
Age 936
Mileage 887
Performance 738
[Category] 205
New? 179
[Type of fuel] 170
[Type of interior] 167
Airbags? 130
[Colour] 129
[Type of gearbox] 105
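Raw scores like the split counts above depend on the number of trees, so they are easier to compare across models after normalization. A trivial sketch (the function is illustrative, not a LightGBM API):

```python
def normalize_importances(raw):
    """Rescale raw importance scores (e.g. LightGBM split counts) so
    they sum to 1.0, making differently-sized models comparable."""
    total = float(sum(raw.values()))
    return {name: score / total for name, score in raw.items()}

# e.g. normalize_importances({"Age": 936, "Mileage": 887, "Performance": 738})
```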
31. Model execution using JPMML
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
    pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}

org.jpmml.evaluator.Evaluator evaluator =
    new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);

org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);

for(int value = min; value <= max; value += increment){
    Map<FieldName, FieldValue> arguments =
        Collections.singletonMap(inputField.getName(), inputField.prepare(value));
    Map<FieldName, ?> result = evaluator.evaluate(arguments);
    System.out.println(result.get(targetField.getName()));
}
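The Java loop above is a one-dimensional sensitivity sweep: vary a single input field, hold everything else fixed, collect the predictions. The same pattern, abstracted in Python with `predict` standing in for `evaluator.evaluate` (a hypothetical callable, not a JPMML API):

```python
def sweep(predict, base_arguments, field, values):
    """Vary one input field over a range of values while holding the other
    arguments fixed, returning the prediction for each value."""
    results = []
    for value in values:
        arguments = dict(base_arguments)  # copy, so the base arguments stay intact
        arguments[field] = value
        results.append(predict(arguments))
    return results
```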
34. Lessons (to be-) learned
● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
○ All capabilities on a single platform
○ Specialized capabilities on specialized platforms
● Ease-of-use and robustness beat raw performance in
most application scenarios
● "Conventions over configuration"