SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Downloaden Sie, um offline zu lesen
Converting Scikit-Learn to PMML
Villu Ruusmann
Openscoring OÜ
"Train once, deploy anywhere"
Scikit-Learn challenge
pipeline = Pipeline([...])
pipeline.fit(Xtrain
, ytrain
)
ytest
= pipeline.predict(Xtest
)
● Serializing and deserializing the fitted pipeline:
○ (C)Pickle
○ Joblib
● Creating Xtest
that is functionally equivalent to Xtrain
:
○ ???
(J)PMML solution
Evaluator evaluator = ...;
List<InputField> argumentFields = evaluator.getInputFields();
List<ResultField> resultFields =
Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields());
Map<FieldName, ?> arguments = readRecord(argumentFields);
Map<FieldName, ?> result = evaluator.evaluate(arguments);
writeRecord(result, resultFields);
● The fitted pipeline (data pre-and post-processing, model)
is represented using standardized PMML data structures
● Well-defined data entry and exit interfaces
Workflow
The PMML pipeline
A smart pipeline that collects supporting information about
"passed through" features and label(s):
● Name
● Data type (eg. string, float, integer, boolean)
● Operational type (eg. continuous, categorical, ordinal)
● Domain of valid values
● Missing and invalid value treatments
Making a PMML pipeline (1/2)
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
pipeline = PMMLPipeline([...])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "pipeline.pmml")
Making a PMML pipeline (2/2)
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
pipeline = Pipeline([...])
pipeline.fit(X, y)
pipeline = make_pmml_pipeline(
pipeline,
active_fields = ["x1", "x2", ...],
target_fields = ["y"]
)
sklearn2pmml(pipeline, "pipeline.pmml")
Pipeline setup (1/3)
1. Feature and label definition
Specific to (J)PMML
2. Feature engineering
3. Feature selection
Scikit-Learn persists all features, (J)PMML persists "surviving" features
4. Estimator fitting
5. Decision engineering
Specific to (J)PMML
Pipeline setup (2/3)
Workflow:
1. Column- and column set-oriented feature definition,
engineering and selection
2. Table-oriented feature engineering and selection
3. Estimator fitting
Full support for pipeline nesting, branching.
Pipeline setup (3/3)
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn_pandas import DataFrameMapper
pipeline = PMMLPipeline([
("stage1", DataFrameMapper([
(["x1"], [CategoricalDomain(), ...]),
(["x2", "x3"], [ContinuousDomain(), ...]),
])),
("stage2", SelectKBest(10)),
("stage3", LogisticRegression())
])
Feature definition
Continuous features (1/2)
Without missing values:
("Income", [ContinuousDomain(invalid_value_treatment = "return_invalid",
missing_value_treatment = "as_is", with_data = True,
with_statistics = True)])
With missing values (encoded as None/NaN):
("Income", [ContinuousDomain(), Imputer()])
With missing values (encoded using dummy values):
("Income", [ContinuousDomain(missing_values = -999),
Imputer(missing_values = -999)])
Continuous features (2/2)
<DataDictionary>
<DataField name="Income" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="1598.95" rightMargin="481259.5"/>
</DataField>
</DataDictionary>
<RegressionModel>
<MiningSchema>
<MiningField name="Income" invalidValueTreatment="returnInvalid"
missingValueTreatment="asMean" missingValueReplacement="84807.39297450424"/>
</MiningSchema>
<ModelStats>
<UnivariateStats field="Income">
<Counts totalFreq="1899.0" missingFreq="487.0" invalidFreq="0.0"/>
<NumericInfo minimum="1598.95" maximum="481259.5" mean="84807.39297450424" median="60596.35"
standardDeviation="69696.87637064351" interQuartileRange="81611.1225"/>
</UnivariateStats>
</ModelStats>
</RegressionModel>
Categorical features (1/3)
Without missing values:
("Education", [CategoricalDomain(invalid_value_treatment =
"return_invalid", missing_value_treatment = "as_is", with_data = True,
with_statistics = True), LabelBinarizer()])
With missing values, missing value-aware estimator:
("Education", [CategoricalDomain(), PMMLLabelBinarizer()])
With missing values, missing value-unaware estimator:
("Education", [CategoricalDomain(), CategoricalImputer(),
LabelBinarizer()])
Categorical features (2/3)
<DataDictionary>
<DataField name="Education" optype="categorical" dataType="string">
<Value value="Associate"/>
<Value value="Bachelor"/>
<Value value="College"/>
<Value value="Doctorate"/>
<Value value="HSgrad"/>
<Value value="Master"/>
<Value value="Preschool"/>
<Value value="Professional"/>
<Value value="Vocational"/>
<Value value="Yr10t12"/>
<Value value="Yr1t4"/>
<Value value="Yr5t9"/>
</DataField>
</DataDictionary>
Categorical features (3/3)
<RegressionModel>
<MiningSchema>
<MiningField name="Education" invalidValueTreatment="asIs", x-invalidValueReplacement="HSgrad"
missingValueTreatment="asMode" missingValueReplacement="HSgrad"/>
</MiningSchema>
<UnivariateStats field="Education">
<Counts totalFreq="1899.0" missingFreq="497.0" invalidFreq="0.0"/>
<DiscrStats>
<Array type="string">Associate Bachelor College Doctorate HSgrad Master Preschool
Professional Vocational Yr10t12 Yr1t4 Yr5t9</Array>
<Array type="int">55 241 309 22 462 78 6 15 55 95 4 60</Array>
</DiscrStats>
</UnivariateStats>
</RegressionModel>
Feature engineering
Continuous features (1/2)
from sklearn.preprocessing import Binarizer, FunctionTransformer
from sklearn2pmml.preprocessing import ExpressionTransformer
features = FeatureUnion([
("identity", DataFrameMapper([
(["Income", "Hours"], ContinuousDomain())
])),
("transformation", DataFrameMapper([
(["Income"], FunctionTransformer(numpy.log10)),
(["Hours"], Binarizer(threshold = 40)),
(["Income", "Hours"], ExpressionTransformer("X[:,0]/(X[:,1]*52)"))
]))
])
Continuous features (2/2)
<TransformationDictionary>
<DerivedField name="log10(Income)" optype="continuous" dataType="double">
<Apply function="log10"><FieldRef field="Income"/></Apply>
</DerivedField>
<DerivedField name="binarizer(Hours)" optype="continuous" dataType="double">
<Apply function="threshold"><FieldRef field="Hours"/><Constant>40</Constant></Apply>
</DerivedField>
<DerivedField name="eval(X[:,0]/(X[:,1]*52))" optype="continuous" dataType="double">
<Apply function="/">
<FieldRef field="Income"/>
<Apply function="*">
<FieldRef field="Hours"/>
<Constant dataType="integer">52</Constant>
</Apply>
</Apply>
</DerivedField>
</TransformationDictionary>
Categorical features
from sklearn preprocessing import LabelBinarizer, PolynomialFeatures
features = Pipeline([
("identity", DataFrameMapper([
("Education", [CategoricalDomain(), LabelBinarizer()]),
("Occupation", [CategoricalDomain(), LabelBinarizer()])
])),
("interaction", PolynomialFeatures())
])
Feature selection
Scikit-Learn challenge
class SelectPercentile(BaseTransformer, SelectorMixin):
def _get_support_mask(self):
scores = self.scores
threshold = stats.scoreatpercentile(scores, 100 - self.percentile)
return (scores > threshold)
Python methods don't have a persistent state:
Exception in thread "main": java.lang.IllegalArgumentException: The
selector object does not have a persistent '_get_support_mask' attribute
(J)PMML solution
"Hiding" a stateless selector behind a stateful meta-selector:
from sklearn2pmml import SelectorProxy, PMMLPipeline
pipeline = PMMLPipeline([
("stage1", DataFrameMapper([
(["x1"], [CategoricalDomain(), LabelBinarizer(),
SelectorProxy(SelectFromModel(DecisionTreeClassifier()))])
])),
("stage2", SelectorProxy(SelectPercentile(percentile = 50)))
])
Estimator fitting
Hyper-parameter tuning (1/2)
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
pipeline = PMMLPipeline([...])
tuner = GridSearchCV(pipeline, param_grid = {...})
tuner.fit(X, y)
# GridSearchCV.best_estimator_ is of type PMMLPipeline
sklearn2pmml(tuner.best_estimator_, "pipeline.pmml")
Hyper-parameter tuning (2/2)
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
pipeline = Pipeline([...])
tuner = GridSearchCV(pipeline, param_grid = {...})
tuner.fit(X, y)
# GridSearchCV.best_estimator_ is of type Pipeline
sklearn2pmml(make_pmml_pipeline(tuner.best_estimator_, active_fields =
[...], target_fields = [...]), "pipeline.pmml")
Algorithm tuning
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
from tpot import TPOTClassifier
# See https://github.com/rhiever/tpot
tpot = TPOTClassifier()
tpot.fit(X, y)
sklearn2pmml(make_pmml_pipeline(tpot.fitted_pipeline_, active_fields =
[...], target_fields = [...]), "pipeline.pmml")
Model customization (1/2)
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml
# Keep reference to the estimator
classifier = DecisionTreeClassifier()
pipeline = PMMLPipeline([..., ("classifier", classifier)])
pipeline.fit(X, y)
# Apply customization(s) to the fitted estimator
classifier.compact = True
sklearn2pmml(pipeline, "pipeline.pmml")
Model customization (2/2)
● org.jpmml.sklearn.HasOptions
○ lightgbm.sklearn.HasLightGBMOptions
■ compact
■ num_iteration
○ sklearn.tree.HasTreeOptions
■ compact
○ xgboost.sklearn.HasXGBoostOptions
■ compact
■ ntree_limit
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
Software (Nov 2017)
● The Python side:
○ sklearn 0.19(.0)
○ sklearn2pmml 0.26(.0)
○ sklearn_pandas 1.5(.0)
○ TPOT 0.9(.0)
● The Java side:
○ JPMML-SkLearn 1.4(.0)
○ JPMML-Evaluator 1.3(.10)

Weitere ähnliche Inhalte

Was ist angesagt?

Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
FS2 mongo reactivestreams
FS2 mongo reactivestreamsFS2 mongo reactivestreams
FS2 mongo reactivestreamsyann_s
 
The Best (and Worst) of Django
The Best (and Worst) of DjangoThe Best (and Worst) of Django
The Best (and Worst) of DjangoJacob Kaplan-Moss
 
RSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream ProcessingRSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream ProcessingRiccardo Tommasini
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Schemas Beyond The Edge
Schemas Beyond The EdgeSchemas Beyond The Edge
Schemas Beyond The Edgeconfluent
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraScyllaDB
 
Spring boot introduction
Spring boot introductionSpring boot introduction
Spring boot introductionRasheed Waraich
 
How to Design Resilient Odoo Crons
How to Design Resilient Odoo CronsHow to Design Resilient Odoo Crons
How to Design Resilient Odoo CronsOdoo
 
A Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking ProblemA Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking ProblemSease
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
Camel Desing Patterns Learned Through Blood, Sweat, and Tears
Camel Desing Patterns Learned Through Blood, Sweat, and TearsCamel Desing Patterns Learned Through Blood, Sweat, and Tears
Camel Desing Patterns Learned Through Blood, Sweat, and TearsBilgin Ibryam
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersSATOSHI TAGOMORI
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfAnya Bida
 
Functional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorldFunctional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorldJorge Vásquez
 

Was ist angesagt? (20)

NodeJS for Beginner
NodeJS for BeginnerNodeJS for Beginner
NodeJS for Beginner
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
FS2 mongo reactivestreams
FS2 mongo reactivestreamsFS2 mongo reactivestreams
FS2 mongo reactivestreams
 
The Best (and Worst) of Django
The Best (and Worst) of DjangoThe Best (and Worst) of Django
The Best (and Worst) of Django
 
RSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream ProcessingRSP4J: An API for RDF Stream Processing
RSP4J: An API for RDF Stream Processing
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Schemas Beyond The Edge
Schemas Beyond The EdgeSchemas Beyond The Edge
Schemas Beyond The Edge
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Spring Boot
Spring BootSpring Boot
Spring Boot
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache Cassandra
 
Spring boot introduction
Spring boot introductionSpring boot introduction
Spring boot introduction
 
How to Design Resilient Odoo Crons
How to Design Resilient Odoo CronsHow to Design Resilient Odoo Crons
How to Design Resilient Odoo Crons
 
A Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking ProblemA Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking Problem
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Camel Desing Patterns Learned Through Blood, Sweat, and Tears
Camel Desing Patterns Learned Through Blood, Sweat, and TearsCamel Desing Patterns Learned Through Blood, Sweat, and Tears
Camel Desing Patterns Learned Through Blood, Sweat, and Tears
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdfRun Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
 
Functional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorldFunctional Programming 101 with Scala and ZIO @FunctionalWorld
Functional Programming 101 with Scala and ZIO @FunctionalWorld
 

Andere mochten auch

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnSarah Guido
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPôle Systematic Paris-Region
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017Francesco Mosconi
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnAsim Jalis
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnMatt Hagy
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learnJeff Klukas
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanChetan Khatri
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnArnaud Joly
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnAWeber
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learnQingkai Kong
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learnAWeber
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnKan Ouivirach, Ph.D.
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...PyData
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learnodsc
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learnYoss Cohen
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnGilles Louppe
 

Andere mochten auch (20)

K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learn
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
 
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pand...
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 

Ähnlich wie Converting Scikit-Learn to PMML

Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Abhishek Thakur
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and RAkhilesh Joshi
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning AlgorithmsHichem Felouat
 
Assignment 6.2a.pdf
Assignment 6.2a.pdfAssignment 6.2a.pdf
Assignment 6.2a.pdfdash41
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedVivian S. Zhang
 
Deep learning study 3
Deep learning study 3Deep learning study 3
Deep learning study 3San Kim
 
Machine Learning on GCP
Machine Learning on GCPMachine Learning on GCP
Machine Learning on GCPAllCloud
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
 
20.1 Java working with abstraction
20.1 Java working with abstraction20.1 Java working with abstraction
20.1 Java working with abstractionIntro C# Book
 
Scikit learn cheat_sheet_python
Scikit learn cheat_sheet_pythonScikit learn cheat_sheet_python
Scikit learn cheat_sheet_pythonZahid Hasan
 
Scikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonDr. Volkan OBAN
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnKarlijn Willems
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...FarhanAhmade
 
생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트기룡 남
 
Advance Java Programs skeleton
Advance Java Programs skeletonAdvance Java Programs skeleton
Advance Java Programs skeletonIram Ramrajkar
 

Ähnlich wie Converting Scikit-Learn to PMML (20)

svm classification
svm classificationsvm classification
svm classification
 
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
 
logistic regression with python and R
logistic regression with python and Rlogistic regression with python and R
logistic regression with python and R
 
knn classification
knn classificationknn classification
knn classification
 
Machine Learning Algorithms
Machine Learning AlgorithmsMachine Learning Algorithms
Machine Learning Algorithms
 
Assignment 6.2a.pdf
Assignment 6.2a.pdfAssignment 6.2a.pdf
Assignment 6.2a.pdf
 
Xgboost
XgboostXgboost
Xgboost
 
Nyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expandedNyc open-data-2015-andvanced-sklearn-expanded
Nyc open-data-2015-andvanced-sklearn-expanded
 
Deep learning study 3
Deep learning study 3Deep learning study 3
Deep learning study 3
 
Xgboost
XgboostXgboost
Xgboost
 
Machine Learning on GCP
Machine Learning on GCPMachine Learning on GCP
Machine Learning on GCP
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
20.1 Java working with abstraction
20.1 Java working with abstraction20.1 Java working with abstraction
20.1 Java working with abstraction
 
Scikit learn cheat_sheet_python
Scikit learn cheat_sheet_pythonScikit learn cheat_sheet_python
Scikit learn cheat_sheet_python
 
Scikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-PythonScikit-learn Cheatsheet-Python
Scikit-learn Cheatsheet-Python
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
 
생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트생산적인 개발을 위한 지속적인 테스트
생산적인 개발을 위한 지속적인 테스트
 
Advance Java Programs skeleton
Advance Java Programs skeletonAdvance Java Programs skeleton
Advance Java Programs skeleton
 
Functional Programming
Functional ProgrammingFunctional Programming
Functional Programming
 

Kürzlich hochgeladen

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfIdiosysTechnologies1
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 

Kürzlich hochgeladen (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Best Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdfBest Web Development Agency- Idiosys USA.pdf
Best Web Development Agency- Idiosys USA.pdf
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 

Converting Scikit-Learn to PMML

  • 1. Converting Scikit-Learn to PMML Villu Ruusmann Openscoring OÜ
  • 3. Scikit-Learn challenge pipeline = Pipeline([...]) pipeline.fit(Xtrain , ytrain ) ytest = pipeline.predict(Xtest ) ● Serializing and deserializing the fitted pipeline: ○ (C)Pickle ○ Joblib ● Creating Xtest that is functionally equivalent to Xtrain : ○ ???
  • 4. (J)PMML solution Evaluator evaluator = ...; List<InputField> argumentFields = evaluator.getInputFields(); List<ResultField> resultFields = Lists.union(evaluator.getTargetFields(), evaluator.getOutputFields()); Map<FieldName, ?> arguments = readRecord(argumentFields); Map<FieldName, ?> result = evaluator.evaluate(arguments); writeRecord(result, resultFields); ● The fitted pipeline (data pre-and post-processing, model) is represented using standardized PMML data structures ● Well-defined data entry and exit interfaces
  • 6. The PMML pipeline A smart pipeline that collects supporting information about "passed through" features and label(s): ● Name ● Data type (eg. string, float, integer, boolean) ● Operational type (eg. continuous, categorical, ordinal) ● Domain of valid values ● Missing and invalid value treatments
  • 7. Making a PMML pipeline (1/2) from sklearn2pmml import PMMLPipeline from sklearn2pmml import sklearn2pmml pipeline = PMMLPipeline([...]) pipeline.fit(X, y) sklearn2pmml(pipeline, "pipeline.pmml")
  • 8. Making a PMML pipeline (2/2) from sklearn2pmml import make_pmml_pipeline, sklearn2pmml pipeline = Pipeline([...]) pipeline.fit(X, y) pipeline = make_pmml_pipeline( pipeline, active_fields = ["x1", "x2", ...], target_fields = ["y"] ) sklearn2pmml(pipeline, "pipeline.pmml")
  • 9. Pipeline setup (1/3) 1. Feature and label definition Specific to (J)PMML 2. Feature engineering 3. Feature selection Scikit-Learn persists all features, (J)PMML persists "surviving" features 4. Estimator fitting 5. Decision engineering Specific to (J)PMML
  • 10. Pipeline setup (2/3) Workflow: 1. Column- and column set-oriented feature definition, engineering and selection 2. Table-oriented feature engineering and selection 3. Estimator fitting Full support for pipeline nesting, branching.
  • 11. Pipeline setup (3/3) from sklearn2pmml import PMMLPipeline from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain from sklearn_pandas import DataFrameMapper pipeline = PMMLPipeline([ ("stage1", DataFrameMapper([ (["x1"], [CategoricalDomain(), ...]), (["x2", "x3"], [ContinuousDomain(), ...]), ])), ("stage2", SelectKBest(10)), ("stage3", LogisticRegression()) ])
  • 13. Continuous features (1/2) Without missing values: ("Income", [ContinuousDomain(invalid_value_treatment = "return_invalid", missing_value_treatment = "as_is", with_data = True, with_statistics = True)]) With missing values (encoded as None/NaN): ("Income", [ContinuousDomain(), Imputer()]) With missing values (encoded using dummy values): ("Income", [ContinuousDomain(missing_values = -999), Imputer(missing_values = -999)])
  • 14. Continuous features (2/2) <DataDictionary> <DataField name="Income" optype="continuous" dataType="double"> <Interval closure="closedClosed" leftMargin="1598.95" rightMargin="481259.5"/> </DataField> </DataDictionary> <RegressionModel> <MiningSchema> <MiningField name="Income" invalidValueTreatment="returnInvalid" missingValueTreatment="asMean" missingValueReplacement="84807.39297450424"/> </MiningSchema> <ModelStats> <UnivariateStats field="Income"> <Counts totalFreq="1899.0" missingFreq="487.0" invalidFreq="0.0"/> <NumericInfo minimum="1598.95" maximum="481259.5" mean="84807.39297450424" median="60596.35" standardDeviation="69696.87637064351" interQuartileRange="81611.1225"/> </UnivariateStats> </ModelStats> </RegressionModel>
  • 15. Categorical features (1/3) Without missing values: ("Education", [CategoricalDomain(invalid_value_treatment = "return_invalid", missing_value_treatment = "as_is", with_data = True, with_statistics = True), LabelBinarizer()]) With missing values, missing value-aware estimator: ("Education", [CategoricalDomain(), PMMLLabelBinarizer()]) With missing values, missing value-unaware estimator: ("Education", [CategoricalDomain(), CategoricalImputer(), LabelBinarizer()])
  • 16. Categorical features (2/3) <DataDictionary> <DataField name="Education" optype="categorical" dataType="string"> <Value value="Associate"/> <Value value="Bachelor"/> <Value value="College"/> <Value value="Doctorate"/> <Value value="HSgrad"/> <Value value="Master"/> <Value value="Preschool"/> <Value value="Professional"/> <Value value="Vocational"/> <Value value="Yr10t12"/> <Value value="Yr1t4"/> <Value value="Yr5t9"/> </DataField> </DataDictionary>
  • 17. Categorical features (3/3) <RegressionModel> <MiningSchema> <MiningField name="Education" invalidValueTreatment="asIs", x-invalidValueReplacement="HSgrad" missingValueTreatment="asMode" missingValueReplacement="HSgrad"/> </MiningSchema> <UnivariateStats field="Education"> <Counts totalFreq="1899.0" missingFreq="497.0" invalidFreq="0.0"/> <DiscrStats> <Array type="string">Associate Bachelor College Doctorate HSgrad Master Preschool Professional Vocational Yr10t12 Yr1t4 Yr5t9</Array> <Array type="int">55 241 309 22 462 78 6 15 55 95 4 60</Array> </DiscrStats> </UnivariateStats> </RegressionModel>
  • 19. Continuous features (1/2) from sklearn.preprocessing import Binarizer, FunctionTransformer from sklearn2pmml.preprocessing import ExpressionTransformer features = FeatureUnion([ ("identity", DataFrameMapper([ (["Income", "Hours"], ContinuousDomain()) ])), ("transformation", DataFrameMapper([ (["Income"], FunctionTransformer(numpy.log10)), (["Hours"], Binarizer(threshold = 40)), (["Income", "Hours"], ExpressionTransformer("X[:,0]/(X[:,1]*52)")) ])) ])
  • 20. Continuous features (2/2) <TransformationDictionary> <DerivedField name="log10(Income)" optype="continuous" dataType="double"> <Apply function="log10"><FieldRef field="Income"/></Apply> </DerivedField> <DerivedField name="binarizer(Hours)" optype="continuous" dataType="double"> <Apply function="threshold"><FieldRef field="Hours"/><Constant>40</Constant></Apply> </DerivedField> <DerivedField name="eval(X[:,0]/(X[:,1]*52))" optype="continuous" dataType="double"> <Apply function="/"> <FieldRef field="Income"/> <Apply function="*"> <FieldRef field="Hours"/> <Constant dataType="integer">52</Constant> </Apply> </Apply> </DerivedField> </TransformationDictionary>
  • 21. Categorical features from sklearn preprocessing import LabelBinarizer, PolynomialFeatures features = Pipeline([ ("identity", DataFrameMapper([ ("Education", [CategoricalDomain(), LabelBinarizer()]), ("Occupation", [CategoricalDomain(), LabelBinarizer()]) ])), ("interaction", PolynomialFeatures()) ])
  • 23. Scikit-Learn challenge class SelectPercentile(BaseTransformer, SelectorMixin): def _get_support_mask(self): scores = self.scores threshold = stats.scoreatpercentile(scores, 100 - self.percentile) return (scores > threshold) Python methods don't have a persistent state: Exception in thread "main": java.lang.IllegalArgumentException: The selector object does not have a persistent '_get_support_mask' attribute
  • 24. (J)PMML solution "Hiding" a stateless selector behind a stateful meta-selector: from sklearn2pmml import SelectorProxy, PMMLPipeline pipeline = PMMLPipeline([ ("stage1", DataFrameMapper([ (["x1"], [CategoricalDomain(), LabelBinarizer(), SelectorProxy(SelectFromModel(DecisionTreeClassifier()))]) ])), ("stage2", SelectorProxy(SelectPercentile(percentile = 50))) ])
  • 26. Hyper-parameter tuning (1/2) from sklearn.model_selection import GridSearchCV from sklearn2pmml import PMMLPipeline from sklearn2pmml import sklearn2pmml pipeline = PMMLPipeline([...]) tuner = GridSearchCV(pipeline, param_grid = {...}) tuner.fit(X, y) # GridSearchCV.best_estimator_ is of type PMMLPipeline sklearn2pmml(tuner.best_estimator_, "pipeline.pmml")
  • 27. Hyper-parameter tuning (2/2) from sklearn2pmml import make_pmml_pipeline, sklearn2pmml pipeline = Pipeline([...]) tuner = GridSearchCV(pipeline, param_grid = {...}) tuner.fit(X, y) # GridSearchCV.best_estimator_ is of type Pipeline sklearn2pmml(make_pmml_pipeline(tuner.best_estimator_, active_fields = [...], target_fields = [...]), "pipeline.pmml")
  • 28. Algorithm tuning from sklearn2pmml import make_pmml_pipeline, sklearn2pmml from tpot import TPOTClassifier # See https://github.com/rhiever/tpot tpot = TPOTClassifier() tpot.fit(X, y) sklearn2pmml(make_pmml_pipeline(tpot.fitted_pipeline_, active_fields = [...], target_fields = [...]), "pipeline.pmml")
  • 29. Model customization (1/2) from sklearn2pmml import PMMLPipeline from sklearn2pmml import sklearn2pmml # Keep reference to the estimator classifier = DecisionTreeClassifier() pipeline = PMMLPipeline([..., ("classifier", classifier)]) pipeline.fit(X, y) # Apply customization(s) to the fitted estimator classifier.compact = True sklearn2pmml(pipeline, "pipeline.pmml")
  • 30. Model customization (2/2) ● org.jpmml.sklearn.HasOptions ○ lightgbm.sklearn.HasLightGBMOptions ■ compact ■ num_iteration ○ sklearn.tree.HasTreeOptions ■ compact ○ xgboost.sklearn.HasXGBoostOptions ■ compact ■ ntree_limit
  • 32. Software (Nov 2017) ● The Python side: ○ sklearn 0.19(.0) ○ sklearn2pmml 0.26(.0) ○ sklearn_pandas 1.5(.0) ○ TPOT 0.9(.0) ● The Java side: ○ JPMML-SkLearn 1.4(.0) ○ JPMML-Evaluator 1.3(.10)