R, Scikit-Learn and Apache Spark ML -
What difference does it make?
Villu Ruusmann
Openscoring OÜ
Overview
● Identifying long-standing, high-value opportunities in the
applied predictive analytics domain
● Thinking about problems in API terms
● Providing solutions in API terms
● Developing and applying custom tools
+ A couple of tips if you're looking to buy or sell a VW Golf
The trade-off
"More data beats better algorithms"
The state of the art
Scaling out horizontally
Elements of reproducibility
Standardized, human- and machine-readable descriptions:
● Dataset
● Data pre- and post-processing steps:
○ From real-life input table (SQL, CSV) to model
○ From model to real-life output table
● Model
● Statistics
Calling R from within Apache Spark
1. Create and initialize R runtime
2. Format and upload input RDD; upload and execute R
model; download output and parse into result RDD
3. Destroy R runtime
Calling Scikit-Learn from within Apache Spark
1. Format input RDD (e.g. using Java NIO) as numpy.array
2. Invoke Scikit-Learn via Python/C API
3. Parse output numpy.array into result RDD
API prioritization
Training << Maintenance ~ Deployment
One-time activity << Repeated activities
Short-term << Long-term
JPMML - Java PMML API
● Conversion API
● Maintenance API
● Execution API
○ Interpreted mode
○ Translated + compiled ("Transpiled") mode
● Serving API
○ Integrations with popular Big Data frameworks
○ REST web service
Calling JPMML-Spark from within Apache Spark
org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
org.apache.spark.sql.Dataset<Row> input = ..;
org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
The case study
Predicting the price of VW Golf cars using GBT algorithms:
● 71 columns:
○ A continuous label: log(price)
○ Two string and four numeric categorical features
○ 64 binary-like (0/1) and numeric continuous features
● 270'458 rows:
○ 153'978 complete cases
○ 116'480 incomplete (i.e. with missing values) cases
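Because the label is log(price) rather than the price itself, a prediction has to be exponentiated before it can be read as a price. The following is a minimal Python sketch of that round trip. It assumes a natural logarithm (the slide only says "log"), and the prices are made up for illustration; they are not taken from the dataset.

```python
import math

def to_label(price):
    """Map a price to the training label, log(price). Natural log is an assumption."""
    return math.log(price)

def to_price(label):
    """Map a predicted label back to a price."""
    return math.exp(label)

# Hypothetical prices, for illustration only.
prices = [4500.0, 12990.0, 27500.0]
labels = [to_label(price) for price in prices]
restored = [to_price(label) for label in labels]

# The row counts on the slide add up: complete + incomplete = total.
total_rows = 153978 + 116480
```

Errors on the log scale are relative errors on the price scale, which is one common reason for choosing such a label.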
Gradient-Boosted Trees (GBTs)
R training and conversion API
#library("caret")
library("gbm")
library("r2pmml")
cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
for(factor_col in factor_cols){
  cars[, factor_col] = as.factor(cars[, factor_col])
}
# Doesn't work with factors with missing values
#cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
r2pmml(cars.gbm, "gbm.pmml")
Scikit-Learn training and conversion API
import pandas
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import sklearn2pmml, PMMLPipeline
cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
mapper = DataFrameMapper(..)
regressor = ..
tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
tuner.fit(mapper.fit_transform(cars), cars["price"])
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("regressor", tuner.best_estimator_)
])
sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
Dataset
                   | R                 | LightGBM             | XGBoost                  | Scikit-Learn             | Apache Spark ML
Abstraction        | data.frame        | lgb.Dataset          | xgb.DMatrix              | numpy.array              | RDD<Vector>
Memory layout      | Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
Data type          | Any               | double               | float                    | float or double          | double
Categorical values | As-is (factor)    | Encoded              | Binarized                | Binarized                | Binarized
Missing values     | Yes               | Pseudo (NaN)         | Pseudo (NaN)             | No                       | No
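The "Encoded" and "Binarized" entries in the categorical values row refer to two different ways of turning a string column into numbers. The sketch below illustrates both in plain Python on a made-up colour column. The helper names `label_encode` and `binarize` are hypothetical and are not part of any of the libraries compared here.

```python
def label_encode(values):
    """Map each distinct category to one integer code ("Encoded")."""
    categories = sorted(set(values))
    codes = {category: index for index, category in enumerate(categories)}
    return [codes[value] for value in values], categories

def binarize(values):
    """Map each distinct category to its own 0/1 column ("Binarized")."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories

# Hypothetical input column, for illustration only.
colours = ["Rot", "Blau", "Rot", "Orange"]
encoded, categories = label_encode(colours)
binarized, _ = binarize(colours)
```

Encoding keeps one column per feature, while binarizing produces one column per category level, which is why feature importances and decision paths later have to be mapped back to the original columns.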
LightGBM via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelEncoder
from lightgbm import LGBMRegressor
mapper = DataFrameMapper(
    [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
regressor.fit(transformed_cars, cars["price"],
    categorical_feature = list(range(0, len(factor_columns))))
XGBoost via Scikit-Learn
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import PMMLLabelBinarizer
from xgboost.sklearn import XGBRegressor
mapper = DataFrameMapper(
    [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
    [(continuous_columns, None)]
)
transformed_cars = mapper.fit_transform(cars)
regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
regressor.fit(transformed_cars, cars["price"])
GBT algorithm (training)
                   | R              | LightGBM      | XGBoost           | Scikit-Learn              | Apache Spark ML
Abstraction        | gbm            | LGBMRegressor | XGBRegressor      | GradientBoostingRegressor | GBTRegressor
Parameterizability | Medium         | High          | High              | Medium                    | Medium
Split type         | Multi-way      | Binary        | Binary            | Binary                    | Binary
Categorical values | "set contains" | "equals"      | Pseudo ("equals") | Pseudo ("equals")         | "equals"
Missing values     | First-class    | Pseudo        | Pseudo            | No                        | No
gbm-style splits
<Node id="9">
  <SimplePredicate field="interior_type" operator="isMissing"/>
  <Node id="12" score="3.0702062395803734E-4">
    <SimplePredicate field="colour" operator="isMissing"/>
  </Node>
  <Node id="10" score="-0.018950416258408962">
    <SimpleSetPredicate field="colour" booleanOperator="isIn">
      <Array type="string">Grün Rot Violett Weiß</Array>
    </SimpleSetPredicate>
  </Node>
  <Node id="11" score="-0.0017446280908351925">
    <SimpleSetPredicate field="colour" booleanOperator="isIn">
      <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
    </SimpleSetPredicate>
  </Node>
</Node>
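The fragment above is a three-way split on colour: one branch for a missing value and two "set contains" branches. A minimal Python sketch of how the children of Node 9 could be scored is given below. It assumes the first matching child wins and that `None` stands for a missing value; the function name `score_node_9_child` is hypothetical.

```python
# (predicate, score) pairs in document order, mirroring Nodes 12, 10 and 11.
CHILDREN = [
    (lambda colour: colour is None,
     3.0702062395803734e-4),
    (lambda colour: colour in {"Grün", "Rot", "Violett", "Weiß"},
     -0.018950416258408962),
    (lambda colour: colour in {"Beige", "Blau", "Braun", "Gelb", "Gold",
                               "Grau", "Orange", "Schwarz", "Silber"},
     -0.0017446280908351925),
]

def score_node_9_child(colour):
    """Return the score of the first child whose predicate matches, else None."""
    for predicate, score in CHILDREN:
        if predicate(colour):
            return score
    return None  # a colour that was not seen during training

missing_score = score_node_9_child(None)
red_score = score_node_9_child("Rot")
gold_score = score_node_9_child("Gold")
unseen_score = score_node_9_child("Pink")
```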
LightGBM- and XGBoost-style splits (1/3)
<Node id="39" defaultChild="76">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="76" score="0.0030283758">
    <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
  </Node>
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
</Node>
LightGBM- and XGBoost-style splits (2/3)
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
  <Node id="76" score="0.0030283758">
    <CompoundPredicate booleanOperator="or">
      <SimplePredicate field="colour" operator="isMissing"/>
      <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
    </CompoundPredicate>
  </Node>
  <!-- else if("Orange".equals(colour)) return 0.02483887 -->
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
  <!-- else return null -->
</Node>
LightGBM- and XGBoost-style splits (3/3)
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
  <Node id="77" score="0.02483887">
    <CompoundPredicate booleanOperator="and">
      <SimplePredicate field="colour" operator="isNotMissing"/>
      <SimplePredicate field="colour" operator="equal" value="Orange"/>
    </CompoundPredicate>
  </Node>
  <!-- else return 0.0030283758 -->
  <Node id="76" score="0.0030283758">
    <True/>
  </Node>
</Node>
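The last two encodings of Node 39 are meant to give the same answer for every colour, including a missing one. As a rough check, the pseudo-code from the XML comments can be transcribed into Python and compared directly. The function names are hypothetical, and `None` stands for a missing value.

```python
def score_or_variant(colour):
    """Encoding with the "or" predicate: missing or not Orange is tested first."""
    if colour is None or colour != "Orange":
        return 0.0030283758
    elif colour == "Orange":
        return 0.02483887
    return None  # unreachable in practice, kept to mirror "else return null"

def score_and_variant(colour):
    """Encoding with the "and" predicate: present and Orange, else <True/>."""
    if colour is not None and colour == "Orange":
        return 0.02483887
    return 0.0030283758

# One missing value, the special category, and an arbitrary other category.
samples = [None, "Orange", "Blau"]
agree = all(score_or_variant(c) == score_and_variant(c) for c in samples)
```

The second form needs one compound predicate and a catch-all child, so it never falls through to a null result.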
Model measurement using JPMML
org.dmg.pmml.tree.TreeModel treeModel = ..;
treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
    private int count = 0; // Number of Node elements
    private int maxDepth = 0; // Max "nesting depth" of Node elements
    @Override
    public VisitorAction visit(org.dmg.pmml.tree.Node node){
        this.count++;
        int depth = 0;
        for(org.dmg.pmml.PMMLObject parent : getParents()){
            if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
            depth++;
        }
        this.maxDepth = Math.max(this.maxDepth, depth);
        return super.visit(node);
    }
});
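The same two measurements (number of Node elements and their maximum nesting depth) can be approximated without JPMML by walking the PMML document as plain XML. The sketch below uses Python's standard `xml.etree.ElementTree` on a small hand-written fragment. The fragment and the `measure` helper are illustrative assumptions, and a real PMML file declares an XML namespace, so its tag names would carry a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment without a namespace, for illustration only.
TREE_XML = """
<TreeModel>
  <Node id="1">
    <True/>
    <Node id="2" score="0.1">
      <SimplePredicate field="age" operator="lessThan" value="5"/>
    </Node>
    <Node id="3">
      <SimplePredicate field="age" operator="greaterOrEqual" value="5"/>
      <Node id="4" score="0.2"><True/></Node>
      <Node id="5" score="0.3"><True/></Node>
    </Node>
  </Node>
</TreeModel>
"""

def measure(element, node_ancestors=0):
    """Return (node_count, max_depth) for all Node elements below `element`.

    Depth is the number of Node ancestors, so the root Node has depth 0.
    """
    count, max_depth = 0, 0
    for child in element:
        if child.tag != "Node":
            continue  # skip predicates and other non-Node children
        sub_count, sub_depth = measure(child, node_ancestors + 1)
        count += 1 + sub_count
        max_depth = max(max_depth, node_ancestors, sub_depth)
    return count, max_depth

node_count, max_depth = measure(ET.fromstring(TREE_XML))
```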
GBT algorithm (interpretation)
                    | R            | LightGBM           | XGBoost                    | Scikit-Learn    | Apache Spark ML
Feature importances | Direct       | Direct             | Transformed                | Transformed     | Transformed
Decision path       | No           | No(?)              | No(?)                      | Transformed     | Transformed
Model persistence   | RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
Model reusability   | Good         | Fair(?)            | Good                       | Fair            | Fair
Java API            | No           | No                 | Pseudo                     | No              | Yes
LightGBM feature importances
Age 936
Mileage 887
Performance 738
[Category] 205
New? 179
[Type of fuel] 170
[Type of interior] 167
Airbags? 130
[Colour] 129
[Type of gearbox] 105
Model execution using JPMML
org.dmg.pmml.PMML pmml;
try(InputStream is = ..){
    pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
}
org.jpmml.evaluator.Evaluator evaluator =
    new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
for(int value = min; value <= max; value += increment){
    Map<FieldName, FieldValue> arguments =
        Collections.singletonMap(inputField.getName(), inputField.prepare(value));
    Map<FieldName, ?> result = evaluator.evaluate(arguments);
    System.out.println(result.get(targetField.getName()));
}
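The loop above varies a single input field over a range and leaves all other fields missing, which traces how the prediction responds to that one field. A language-neutral sketch of the same sweep in Python follows. The `sweep` helper and the `toy_evaluate` stand-in are hypothetical; a real evaluator would be loaded from the PMML file.

```python
def sweep(evaluate, field, minimum, maximum, increment):
    """Evaluate the model once per value of `field`; other fields stay unset."""
    results = []
    value = minimum
    while value <= maximum:
        results.append((value, evaluate({field: value})))
        value += increment
    return results

def toy_evaluate(arguments):
    """Hypothetical stand-in model: the prediction falls linearly with age."""
    return 10.0 - 0.5 * arguments["age"]

curve = sweep(toy_evaluate, "age", 0, 4, 2)
```

A sweep like this only makes sense if the model tolerates missing values for the remaining fields, which is the case for the gbm-, LightGBM- and XGBoost-style trees shown earlier.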
Lessons (to be) learned
● Limits and limitations of individual APIs
● Vertical integration vs. horizontal integration:
○ All capabilities on a single platform
○ Specialized capabilities on specialized platforms
● Ease-of-use and robustness beat raw performance in
most application scenarios
● "Conventions over configuration"
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml

Weitere ähnliche Inhalte

Was ist angesagt?

Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
Full-stack Data Scientist
Full-stack Data ScientistFull-stack Data Scientist
Full-stack Data ScientistAlexey Grigorev
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowDatabricks
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
Expert Systems & Prolog
Expert Systems & PrologExpert Systems & Prolog
Expert Systems & PrologFatih Karatana
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of AlgorithmsSwapnil Agrawal
 
Delta Lake Cheat Sheet.pdf
Delta Lake Cheat Sheet.pdfDelta Lake Cheat Sheet.pdf
Delta Lake Cheat Sheet.pdfkaransharma62792
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfNancykumari47
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...Edureka!
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 

Was ist angesagt? (20)

Spark
SparkSpark
Spark
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs Pig
 
Full-stack Data Scientist
Full-stack Data ScientistFull-stack Data Scientist
Full-stack Data Scientist
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflow
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
Expert Systems & Prolog
Expert Systems & PrologExpert Systems & Prolog
Expert Systems & Prolog
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Design and Analysis of Algorithms
Design and Analysis of AlgorithmsDesign and Analysis of Algorithms
Design and Analysis of Algorithms
 
Delta Lake Cheat Sheet.pdf
Delta Lake Cheat Sheet.pdfDelta Lake Cheat Sheet.pdf
Delta Lake Cheat Sheet.pdf
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
 
Unsupervised Machine Learning
Unsupervised Machine LearningUnsupervised Machine Learning
Unsupervised Machine Learning
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark
SparkSpark
Spark
 

Andere mochten auch

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLVillu Ruusmann
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Timothy Spann
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsVillu Ruusmann
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317Nan Zhu
 
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017MLconf
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Chris Fregly
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Alluxio, Inc.
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 
Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017EDB
 

Andere mochten auch (20)

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
 
Yace 3.0
Yace 3.0Yace 3.0
Yace 3.0
 
Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms Ingesting Drone Data into Big Data Platforms
Ingesting Drone Data into Big Data Platforms
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Seattle spark-meetup-032317
Seattle spark-meetup-032317Seattle spark-meetup-032317
Seattle spark-meetup-032317
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
 
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
Instaclustr webinar 50,000 transactions per second with Apache Spark on Apach...
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017
 
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
Kafka Summit SF Apr 26 2016 - Generating Real-time Recommendations with NiFi,...
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3Getting Started with Alluxio + Spark + S3
Getting Started with Alluxio + Spark + S3
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017Product Update: EDB Postgres Platform 2017
Product Update: EDB Postgres Platform 2017
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 

Ähnlich wie R, Scikit-Learn and Apache Spark ML - What APIs to Use

TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databasesTomáš Drenčák
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project D. j Vicky
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project D. j Vicky
 
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Matt Raible
 
cbse 12 computer science IP
cbse 12 computer science IPcbse 12 computer science IP
cbse 12 computer science IPD. j Vicky
 
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019Matt Raible
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query OptimizationAnju Garg
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)JAINAM KAPADIYA
 
VSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLVSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLBigML, Inc
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into SwiftSarath C
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...Oleksandr Tarasenko
 

Ähnlich wie R, Scikit-Learn and Apache Spark ML - What APIs to Use (20)

TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
Semantic search in databases
Semantic search in databasesSemantic search in databases
Semantic search in databases
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
 
cbse 12 computer science investigatory project
cbse 12 computer science investigatory project  cbse 12 computer science investigatory project
cbse 12 computer science investigatory project
 
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019Use Angular Schematics to Simplify Your Life - Develop Denver 2019
Use Angular Schematics to Simplify Your Life - Develop Denver 2019
 
cbse 12 computer science IP
cbse 12 computer science IPcbse 12 computer science IP
cbse 12 computer science IP
 
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
A Gentle Introduction to Angular Schematics - Devoxx Belgium 2019
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)
 
VSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzMLVSSML18. Introduction to WhizzML
VSSML18. Introduction to WhizzML
 
Deep Dive Into Swift
Deep Dive Into SwiftDeep Dive Into Swift
Deep Dive Into Swift
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
R console
R consoleR console
R console
 
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
How to grow GraphQL and remove SQLAlchemy and REST API from a high-load Pytho...
 

R, Scikit-Learn and Apache Spark ML - What APIs to Use

  • 1. R, Scikit-Learn and Apache Spark ML - What difference does it make? Villu Ruusmann Openscoring OÜ
  • 2. Overview
      ● Identifying long-standing, high-value opportunities in the applied predictive analytics domain
      ● Thinking about problems in API terms
      ● Providing solutions in API terms
      ● Developing and applying custom tools
      + A couple of tips if you're looking to buy or sell a VW Golf
  • 4. "More data beats better algorithms"
  • 5. The state of the art
  • 7. Elements of reproducibility
      Standardized, human- and machine-readable descriptions:
      ● Dataset
      ● Data pre- and post-processing steps:
        ○ From real-life input table (SQL, CSV) to model
        ○ From model to real-life output table
      ● Model
      ● Statistics
  • 8. Calling R from within Apache Spark
      1. Create and initialize R runtime
      2. Format and upload input RDD; upload and execute R model; download output and parse into result RDD
      3. Destroy R runtime
  • 9. Calling Scikit-Learn from within Apache Spark
      1. Format input RDD (e.g. using Java NIO) as numpy.array
      2. Invoke Scikit-Learn via Python/C API
      3. Parse output numpy.array into result RDD
  • 10. API prioritization
      Training << Maintenance ~ Deployment
      One-time activity << Repeated activities
      Short-term << Long-term
  • 11. JPMML - Java PMML API
      ● Conversion API
      ● Maintenance API
      ● Execution API
        ○ Interpreted mode
        ○ Translated + compiled ("Transpiled") mode
      ● Serving API
        ○ Integrations with popular Big Data frameworks
        ○ REST web service
  • 12. Calling JPMML-Spark from within Apache Spark
      org.jpmml.spark.TransformerBuilder pmmlTransformerBuilder = ..;
      org.apache.spark.ml.Transformer pmmlTransformer = pmmlTransformerBuilder.build();
      org.apache.spark.sql.Dataset<Row> input = ..;
      org.apache.spark.sql.Dataset<Row> result = pmmlTransformer.transform(input);
  • 13. The case study
      Predicting the price of VW Golf cars using GBT algorithms:
      ● 71 columns:
        ○ A continuous label: log(price)
        ○ Two string and four numeric categorical features
        ○ 64 binary-like (0/1) and numeric continuous features
      ● 270'458 rows:
        ○ 153'978 complete cases
        ○ 116'480 incomplete (i.e. with missing values) cases
  • 15. R training and conversion API
      #library("caret")
      library("gbm")
      library("r2pmml")
      # Tab-separated input; "N/A" marks a missing value
      cars = read.csv("cars.tsv", sep = "\t", na.strings = "N/A")
      factor_cols = c("category", "colour", "ac", "fuel_type", "gearbox", "interior_color", "interior_type")
      for(factor_col in factor_cols){
        cars[, factor_col] = as.factor(cars[, factor_col])
      }
      # Doesn't work with factors with missing values
      #cars.gbm = train(price ~ ., data = cars, method = "gbm", na.action = na.pass, ..)
      cars.gbm = gbm(price ~ ., data = cars, n.trees = 100, shrinkage = 0.1, interaction.depth = 6)
      r2pmml(cars.gbm, "gbm.pmml")
  • 16. Scikit-Learn training and conversion API
      import pandas
      from sklearn_pandas import DataFrameMapper
      from sklearn.model_selection import GridSearchCV
      from sklearn2pmml import sklearn2pmml, PMMLPipeline
      # Tab-separated input; "N/A" and "NA" mark missing values
      cars = pandas.read_csv("cars.tsv", sep = "\t", na_values = ["N/A", "NA"])
      mapper = DataFrameMapper(..)
      regressor = ..
      tuner = GridSearchCV(regressor, param_grid = .., fit_params = ..)
      tuner.fit(mapper.fit_transform(cars), cars["price"])
      pipeline = PMMLPipeline([
        ("mapper", mapper),
        ("regressor", tuner.best_estimator_)
      ])
      sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)
  • 17. Dataset
      (columns: R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML)
      Abstraction: data.frame | lgb.Dataset | xgb.DMatrix | numpy.array | RDD<Vector>
      Memory layout: Contiguous, dense | Contiguous, dense(?) | Contiguous, dense/sparse | Contiguous, dense/sparse | Distributed, dense/sparse
      Data type: Any | double | float | float or double | double
      Categorical values: As-is (factor) | Encoded | Binarized | Binarized | Binarized
      Missing values: Yes | Pseudo (NaN) | Pseudo (NaN) | No | No
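The "Encoded" and "Binarized" entries in the dataset comparison can be illustrated with a minimal pure-Python sketch. The toy colour values, the function names, and the choice to map a missing value to NaN (encoding) or to all zeros (binarization) are assumptions made for illustration; they are not taken from the slides or from any library.

```python
# Illustrative toy values, not the actual cars dataset.
colours = ["Rot", "Blau", None, "Rot"]

# Stable category order, ignoring missing values.
categories = sorted({c for c in colours if c is not None})

def label_encode(value):
    """Map a category to a numeric code; a missing value becomes NaN."""
    if value is None:
        return float("nan")
    return float(categories.index(value))

def binarize(value):
    """One 0/1 indicator per category; a missing value yields all zeros here."""
    return [1 if value == category else 0 for category in categories]

encoded = [label_encode(c) for c in colours]
binarized = [binarize(c) for c in colours]
```

Encoding keeps one column per feature, while binarization produces one column per category level, which is why the two styles lead to differently shaped splits later on.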
  • 18. LightGBM via Scikit-Learn
      from sklearn_pandas import DataFrameMapper
      from sklearn2pmml.preprocessing import PMMLLabelEncoder
      from lightgbm import LGBMRegressor
      # Assumes cars (DataFrame), factor_columns and continuous_columns (lists of column names) are already defined
      mapper = DataFrameMapper(
        [(factor_column, PMMLLabelEncoder()) for factor_column in factor_columns] +
        [(continuous_columns, None)]
      )
      transformed_cars = mapper.fit_transform(cars)
      regressor = LGBMRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6, num_leaves = 64)
      # The label-encoded factor columns come first in the mapper output
      regressor.fit(transformed_cars, cars["price"], categorical_feature = list(range(0, len(factor_columns))))
  • 19. XGBoost via Scikit-Learn
      from sklearn_pandas import DataFrameMapper
      from sklearn2pmml.preprocessing import PMMLLabelBinarizer
      from xgboost.sklearn import XGBRegressor
      # Assumes cars (DataFrame), factor_columns and continuous_columns (lists of column names) are already defined
      mapper = DataFrameMapper(
        [(factor_column, PMMLLabelBinarizer()) for factor_column in factor_columns] +
        [(continuous_columns, None)]
      )
      transformed_cars = mapper.fit_transform(cars)
      regressor = XGBRegressor(n_estimators = 100, learning_rate = 0.1, max_depth = 6)
      regressor.fit(transformed_cars, cars["price"])
  • 20. GBT algorithm (training)
      (columns: R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML)
      Abstraction: gbm | LGBMRegressor | XGBRegressor | GradientBoostingRegressor | GBTRegressor
      Parameterizability: Medium | High | High | Medium | Medium
      Split type: Multi-way | Binary | Binary | Binary | Binary
      Categorical values: "set contains" | "equals" | Pseudo ("equals") | Pseudo ("equals") | "equals"
      Missing values: First-class | Pseudo | Pseudo | No | No
  • 21. gbm-style splits
      <Node id="9">
        <SimplePredicate field="interior_type" operator="isMissing"/>
        <Node id="12" score="3.0702062395803734E-4">
          <SimplePredicate field="colour" operator="isMissing"/>
        </Node>
        <Node id="10" score="-0.018950416258408962">
          <SimpleSetPredicate field="colour" booleanOperator="isIn">
            <Array type="string">Grün Rot Violett Weiß</Array>
          </SimpleSetPredicate>
        </Node>
        <Node id="11" score="-0.0017446280908351925">
          <SimpleSetPredicate field="colour" booleanOperator="isIn">
            <Array type="string">Beige Blau Braun Gelb Gold Grau Orange Schwarz Silber</Array>
          </SimpleSetPredicate>
        </Node>
      </Node>
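The multi-way gbm-style split shown in this node can be read as ordinary control flow. The following is a minimal Python sketch of that reading, assuming child nodes are tested in document order and the first match wins; the function name and the dict-based row are hypothetical, while the colour sets and scores are copied from the PMML fragment.

```python
# Colour sets and scores copied from the PMML fragment for node 9.
COLOURS_NODE_10 = {"Grün", "Rot", "Violett", "Weiß"}
COLOURS_NODE_11 = {"Beige", "Blau", "Braun", "Gelb", "Gold", "Grau",
                   "Orange", "Schwarz", "Silber"}

def score_node_9(row):
    """Score a row (dict) against node 9; None means 'no match'."""
    # Node 9 itself only applies when interior_type is missing.
    if row.get("interior_type") is not None:
        return None
    colour = row.get("colour")
    if colour is None:                 # node 12: missing value branch
        return 3.0702062395803734e-4
    if colour in COLOURS_NODE_10:      # node 10: "set contains" branch
        return -0.018950416258408962
    if colour in COLOURS_NODE_11:      # node 11: "set contains" branch
        return -0.0017446280908351925
    return None                        # colour not seen during training
```

The missing-value branch is a child like any other, which is what "first-class" missing value support amounts to in this representation.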
  • 22. LightGBM- and XGBoost-style splits (1/3)
      <Node id="39" defaultChild="76">
        <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
        <Node id="76" score="0.0030283758">
          <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
        </Node>
        <Node id="77" score="0.02483887">
          <SimplePredicate field="colour" operator="equal" value="Orange"/>
        </Node>
      </Node>
  • 23. LightGBM- and XGBoost-style splits (2/3)
      <Node id="39">
        <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
        <!-- if(colour == null || !"Orange".equals(colour)) return 0.0030283758 -->
        <Node id="76" score="0.0030283758">
          <CompoundPredicate booleanOperator="or">
            <SimplePredicate field="colour" operator="isMissing"/>
            <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
          </CompoundPredicate>
        </Node>
        <!-- else if("Orange".equals(colour)) return 0.02483887 -->
        <Node id="77" score="0.02483887">
          <SimplePredicate field="colour" operator="equal" value="Orange"/>
        </Node>
        <!-- else return null -->
      </Node>
  • 24. LightGBM- and XGBoost-style splits (3/3)
      <Node id="39">
        <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
        <!-- if(colour != null && "Orange".equals(colour)) return 0.02483887 -->
        <Node id="77" score="0.02483887">
          <CompoundPredicate booleanOperator="and">
            <SimplePredicate field="colour" operator="isNotMissing"/>
            <SimplePredicate field="colour" operator="equal" value="Orange"/>
          </CompoundPredicate>
        </Node>
        <!-- else return 0.0030283758 -->
        <Node id="76" score="0.0030283758">
          <True/>
        </Node>
      </Node>
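The three representations of node 39 are meant to produce the same prediction for every value of colour, including a missing one. A small Python sketch can check that claim. It assumes that, in the first variant, a missing colour leaves both child predicates undecided so that the defaultChild attribute selects node 76; the function names are hypothetical and the scores are copied from the fragments.

```python
SCORE_76 = 0.0030283758
SCORE_77 = 0.02483887

def split_default_child(colour):
    """First variant: plain predicates plus defaultChild="76"."""
    if colour is None:
        return SCORE_76  # assumed defaultChild behaviour on a missing value
    return SCORE_77 if colour == "Orange" else SCORE_76

def split_explicit_missing(colour):
    """Second variant: the missing-value check is folded into node 76."""
    if colour is None or colour != "Orange":
        return SCORE_76
    if colour == "Orange":
        return SCORE_77
    return None  # unreachable: every colour matches one of the two branches

def split_true_fallback(colour):
    """Third variant: test the specific branch first, then fall through."""
    if colour is not None and colour == "Orange":
        return SCORE_77
    return SCORE_76

for colour in (None, "Orange", "Rot"):
    results = {split_default_child(colour),
               split_explicit_missing(colour),
               split_true_fallback(colour)}
    assert len(results) == 1  # all three variants agree
```

The third form needs the fewest predicate evaluations per node, since the fallback branch is unconditional.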
  • 25. Model measurement using JPMML
      org.dmg.pmml.tree.TreeModel treeModel = ..;
      treeModel.accept(new org.jpmml.model.visitors.AbstractVisitor(){
        private int count = 0; // Number of Node elements
        private int maxDepth = 0; // Max "nesting depth" of Node elements
        @Override
        public VisitorAction visit(org.dmg.pmml.tree.Node node){
          this.count++;
          int depth = 0;
          for(org.dmg.pmml.PMMLObject parent : getParents()){
            if(!(parent instanceof org.dmg.pmml.tree.Node)) break;
            depth++;
          }
          this.maxDepth = Math.max(this.maxDepth, depth);
          return super.visit(node);
        }
      });
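The same two measurements (number of Node elements and their maximum nesting depth) can be sketched without JPMML by walking the XML directly. This is a Python approximation using the standard library's xml.etree.ElementTree, not the JPMML visitor API; the embedded fragment omits the PMML namespace for brevity, so a real PMML file would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

# Namespace-free fragment for illustration; real PMML tags are namespaced.
PMML_FRAGMENT = """
<Node id="39">
  <SimplePredicate field="category" operator="equal" value="Cabrio/Roadster"/>
  <Node id="76" score="0.0030283758">
    <SimplePredicate field="colour" operator="notEqual" value="Orange"/>
  </Node>
  <Node id="77" score="0.02483887">
    <SimplePredicate field="colour" operator="equal" value="Orange"/>
  </Node>
</Node>
"""

def measure(element, depth=0):
    """Return (node_count, max_depth) for Node elements at or below element.

    depth is the number of enclosing Node elements, so the root Node has depth 0.
    """
    count, max_depth = 0, 0
    if element.tag == "Node":
        count, max_depth = 1, depth
        depth += 1
    for child in element:
        child_count, child_depth = measure(child, depth)
        count += child_count
        max_depth = max(max_depth, child_depth)
    return count, max_depth

root = ET.fromstring(PMML_FRAGMENT)
node_count, max_node_depth = measure(root)
```

For this fragment the walk finds three Node elements with a maximum nesting depth of one, using the same depth convention as the visitor (count of enclosing Node parents).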
  • 29. GBT algorithm (interpretation)
      (columns: R | LightGBM | XGBoost | Scikit-Learn | Apache Spark ML)
      Feature importances: Direct | Direct | Transformed | Transformed | Transformed
      Decision path: No | No(?) | No(?) | Transformed | Transformed
      Model persistence: RDS (binary) | Proprietary (text) | Proprietary (binary, text) | Pickle (binary) | SER (binary) or JSON (text)
      Model reusability: Good | Fair(?) | Good | Fair | Fair
      Java API: No | No | Pseudo | No | Yes
  • 30. LightGBM feature importances
      Age: 936
      Mileage: 887
      Performance: 738
      [Category]: 205
      New?: 179
      [Type of fuel]: 170
      [Type of interior]: 167
      Airbags?: 130
      [Colour]: 129
      [Type of gearbox]: 105
  • 31. Model execution using JPMML
      org.dmg.pmml.PMML pmml;
      try(InputStream is = ..){
        pmml = org.jpmml.model.PMMLUtil.unmarshal(is);
      }
      org.jpmml.evaluator.Evaluator evaluator = new org.jpmml.evaluator.mining.MiningModelEvaluator(pmml);
      org.jpmml.evaluator.InputField inputField = selectField(evaluator.getInputFields(), ..);
      org.jpmml.evaluator.TargetField targetField = selectField(evaluator.getTargetFields(), ..);
      for(int value = min; value <= max; value += increment){
        Map<FieldName, FieldValue> arguments = Collections.singletonMap(inputField.getName(), inputField.prepare(value));
        Map<FieldName, ?> result = evaluator.evaluate(arguments);
        System.out.println(result.get(targetField.getName()));
      }
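The loop in this slide varies a single input field over a range and records the prediction at each step, leaving every other input unset. A minimal Python sketch of that pattern follows. The sweep function and the toy_model stand-in are hypothetical; toy_model is an arbitrary linear rule used only to make the example runnable, not a PMML evaluator or a result from the talk.

```python
def sweep(evaluate, field, start, stop, step):
    """Evaluate a model while varying one input field over an inclusive range.

    evaluate takes a dict of arguments and returns a prediction; as in the
    slide, only the swept field is supplied on each call.
    """
    results = []
    value = start
    while value <= stop:
        results.append((value, evaluate({field: value})))
        value += step
    return results

def toy_model(arguments):
    # Hypothetical stand-in for an evaluator: prediction falls with age.
    return 100 - 5 * arguments["age"]

curve = sweep(toy_model, "age", 0, 4, 2)
```

Plotting such a curve for one field at a time gives a quick view of how the model responds to that field in isolation.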
  • 34. Lessons (to be) learned
      ● Limits and limitations of individual APIs
      ● Vertical integration vs. horizontal integration:
        ○ All capabilities on a single platform
        ○ Specialized capabilities on specialized platforms
      ● Ease-of-use and robustness beat raw performance in most application scenarios
      ● "Conventions over configuration"