Representing TF and TF-IDF transformations in PMML
Villu Ruusmann
Openscoring OÜ
TF
Local Term Frequency (TF) - The frequency of the term in a document.
<TextIndex textField="documentField">
  <FieldRef field="termField"/>
</TextIndex>
sklearn.feature_extraction.text.CountVectorizer
org.apache.spark.ml.feature.CountVectorizer
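As a minimal sketch of what the TF value means, the plain-Python function below counts whole-token occurrences of a term in a document. The whitespace tokenization and the function name `term_frequency` are assumptions made for illustration; they are not taken from PMML, Scikit-Learn or Apache Spark.

```python
import re


def term_frequency(document: str, term: str) -> int:
    """Count how many whitespace-separated tokens of `document` equal `term`."""
    tokens = re.split(r"\s+", document.strip())
    return sum(1 for token in tokens if token == term)


print(term_frequency("to be or not to be", "be"))  # 2
```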
TF-IDF
Global Term Frequency (TF-IDF) - TF, weighted by the "significance" of the term
in the corpus of training documents.
<Apply function="*">
  <TextIndex textField="documentField">
    <FieldRef field="termField"/>
  </TextIndex>
  <FieldRef field="termWeightField"/>
</Apply>
sklearn.feature_extraction.text.TfidfTransformer
org.apache.spark.ml.feature.IDF
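The multiplication above can be sketched numerically as follows. The slide only says that the weight expresses the "significance" of the term; this sketch assumes the smoothed IDF formula that Scikit-Learn's TfidfTransformer uses by default, ln((1 + n) / (1 + df)) + 1. The function names and the document counts are hypothetical.

```python
import math


def smoothed_idf(n_documents: int, document_frequency: int) -> float:
    """Smoothed inverse document frequency (assumed weighting scheme)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1.0


def tf_idf(tf: float, weight: float) -> float:
    """TF-IDF is the local term frequency multiplied by the global term weight."""
    return tf * weight


# Hypothetical corpus: 3 training documents, the term occurs in 1 of them.
weight = smoothed_idf(n_documents=3, document_frequency=1)
print(tf_idf(tf=2, weight=weight))  # roughly 3.386
```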
PMML encoding (1/2)
The "centralized" TF-IDF function definition:
<DefineFunction name="tf-idf" dataType="double" optype="continuous">
  <ParamField name="document"/>
  <ParamField name="term"/>
  <ParamField name="weight"/>
  <Apply function="*">
    <TextIndex textField="document">
      <FieldRef field="term"/>
    </TextIndex>
    <FieldRef field="weight"/>
  </Apply>
</DefineFunction>
PMML encoding (2/2)
Many "centralized" TF-IDF function invocations:
<DerivedField name="tf-idf(2017)" dataType="float" optype="continuous">
  <Apply function="tf-idf">
    <FieldRef field="tweetField"/>
    <Constant dataType="string">2017</Constant>
    <Constant dataType="double">5.4132</Constant>
  </Apply>
</DerivedField>
Many "localized" TF-IDF usages:
<Node>
  <SimplePredicate field="tf-idf(2017)" operator="lessThan" value="7.25"/>
</Node>
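Because there is one DerivedField per vocabulary term, these invocations are usually generated rather than written by hand. The sketch below builds them with Python's standard `xml.etree.ElementTree` module. The helper name `tfidf_derived_field` and the second vocabulary entry are hypothetical; only the "2017" term and its weight come from the slide.

```python
import xml.etree.ElementTree as ET


def tfidf_derived_field(document_field: str, term: str, weight: float) -> ET.Element:
    """Build one DerivedField that invokes the shared "tf-idf" function."""
    field = ET.Element(
        "DerivedField",
        {"name": f"tf-idf({term})", "dataType": "float", "optype": "continuous"},
    )
    apply = ET.SubElement(field, "Apply", {"function": "tf-idf"})
    ET.SubElement(apply, "FieldRef", {"field": document_field})
    ET.SubElement(apply, "Constant", {"dataType": "string"}).text = term
    ET.SubElement(apply, "Constant", {"dataType": "double"}).text = repr(weight)
    return field


# "2017" is the term from the slide; "pmml" and its weight are made up.
vocabulary = {"2017": 5.4132, "pmml": 3.2}
for term, weight in vocabulary.items():
    element = tfidf_derived_field("tweetField", term, weight)
    print(ET.tostring(element, encoding="unicode"))
```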
PMML TF algorithm
1. Normalize the document.
2. Tokenize the term and the document. Trim tokens by removing leading and
trailing (but not continuation) punctuation characters.
3. Count the occurrences of term tokens in document tokens subject to the
following constraints:
3.1. Case-sensitivity
3.2. Max Levenshtein distance (as measured in the number of
single-character insertions, substitutions or deletions).
4. Transform the count to the final TF metric.
http://dmg.org/pmml/v4-3/Transformations.html#xsdElement_TextIndex
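Steps 2 and 4 can be sketched in a few lines of Python. This is an illustration under assumptions, not the reference algorithm: punctuation is limited to the ASCII set in `string.punctuation`, and the scheme names follow the `localTermWeights` values "termFrequency" and "binary" mentioned later in this deck, plus "logarithmic" (base-10 log of 1 + count) as recalled from the TextIndex specification. The function names are hypothetical.

```python
import math
import string


def trim_token(token: str) -> str:
    """Strip leading and trailing punctuation; punctuation inside the token stays."""
    return token.strip(string.punctuation)


def local_term_weight(count: int, scheme: str = "termFrequency") -> float:
    """Transform a raw match count into the final TF metric."""
    if scheme == "termFrequency":
        return float(count)
    if scheme == "binary":
        return 1.0 if count > 0 else 0.0
    if scheme == "logarithmic":
        return math.log10(1 + count)
    raise ValueError(f"Unsupported scheme: {scheme}")


tokens = [trim_token(token) for token in "It's 2017, isn't it?".split()]
print(tokens)  # ["It's", '2017', "isn't", 'it']
print(local_term_weight(3, "binary"))  # 1.0
```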
String normalization
Ensuring that the unlimited, free-form text input complies with the limited,
standardized vocabulary of the TextIndex element:
<TextIndexNormalization isCaseSensitive="false">
  <InlineTable>
    <Row>
      <string>[\u00c0-\u00c5]</string><stem>a</stem><regex>true</regex>
    </Row>
    <Row>
      <string>is|are|was|were</string><stem>be</stem><regex>true</regex>
    </Row>
  </InlineTable>
</TextIndexNormalization>
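A rough Python equivalent of this normalization table is shown below. It assumes that each row is applied in order as a regular-expression replacement over the whole text, and that `isCaseSensitive="false"` corresponds to the `re.IGNORECASE` flag; a PMML engine may differ in details. The names `RULES` and `normalize` are hypothetical.

```python
import re

# (pattern, replacement) pairs mirroring the two InlineTable rows.
RULES = [
    (r"[\u00c0-\u00c5]", "a"),
    (r"is|are|was|were", "be"),
]


def normalize(text: str, case_sensitive: bool = False) -> str:
    """Apply every rule in order as a regular-expression replacement."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text, flags=flags)
    return text


print(normalize("\u00c0lex is here"))   # alex be here
print(normalize("They WERE late"))      # They be late
```

Note that a pattern such as `is|are|was|were` carries no word boundaries, so it would also match inside longer words; adding `\b` anchors is one way to avoid that.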
String tokenization
Two approaches for string tokenization using regular expressions (REs):
1. Define a word separator RE and execute
Pattern.compile(wordSeparatorRE).split(string)
2. Define a word RE and collect the matches by calling find() repeatedly on
Pattern.compile(wordRE).matcher(string)
Popular ML frameworks support both approaches.
PMML 4.2 and 4.3 only support the first approach. Hopefully, PMML 4.4 will
support the second approach as well.
http://mantis.dmg.org/view.php?id=173
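The difference between the two approaches is easy to see in Python, where they correspond to `re.split` and `re.findall`. The default patterns below (`\s+` as the word separator, `\w+` as the word) are assumptions chosen for the example, and the function names are hypothetical.

```python
import re


def tokenize_by_separator(text: str, word_separator_re: str = r"\s+") -> list:
    """Approach 1: split the text wherever the separator RE matches."""
    return [token for token in re.split(word_separator_re, text) if token]


def tokenize_by_word(text: str, word_re: str = r"\w+") -> list:
    """Approach 2: collect every substring that matches the word RE."""
    return re.findall(word_re, text)


text = "TF-IDF in PMML, 2017!"
print(tokenize_by_separator(text))  # ['TF-IDF', 'in', 'PMML,', '2017!']
print(tokenize_by_word(text))       # ['TF', 'IDF', 'in', 'PMML', '2017']
```

With the separator approach, punctuation stays attached to the tokens, which is why the TF algorithm trims leading and trailing punctuation afterwards.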
Counting terms in a document
A "match" is a situation where the difference between term tokens [0, length] and
document tokens [i, i + length] (where i is the match position), is less than or equal
to the match threshold.
Match threshold is a function of TextIndex@isCaseSensitive and
TextIndex@maxLevenshteinDistance attribute values. During
case-insensitive matching (the default), the edit distance between two characters
that only differ by case is considered to be 0, whereas during case-sensitive
matching it is considered to be 1.
The matches may overlap if the "length" of term tokens is greater than one.
http://mantis.dmg.org/view.php?id=172
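The matching rule can be sketched as a sliding window over the document tokens. This is an illustration under assumptions: case-insensitive matching is implemented by lowercasing both sides, and the distance of a multi-token term is taken as the sum of the per-token Levenshtein distances. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, substitutions or deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def count_matches(term_tokens, document_tokens, case_sensitive=False, max_distance=0):
    """Count the window positions whose total edit distance is within the threshold."""
    if not case_sensitive:
        term_tokens = [token.lower() for token in term_tokens]
        document_tokens = [token.lower() for token in document_tokens]
    length = len(term_tokens)
    if length == 0:
        return 0
    count = 0
    for i in range(len(document_tokens) - length + 1):
        window = document_tokens[i:i + length]
        distance = sum(levenshtein(t, d) for t, d in zip(term_tokens, window))
        if distance <= max_distance:
            count += 1
    return count


document = "PMML is pmnl".split()
print(count_matches(["pmml"], document))                       # 1
print(count_matches(["pmml"], document, max_distance=1))       # 2
print(count_matches(["pmml"], document, case_sensitive=True))  # 0
print(count_matches(["a", "a"], ["a", "a", "a"]))              # 2 (overlapping)
```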
Interoperability with Scikit-Learn (1/2)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(..,
    strip_accents = .., # If not None, handle using text normalization
    analyzer = "word", # Set to "word"
    preprocessor = .., # If not None, handle using text normalization
    tokenizer = .., # If not None, handle using text tokenization
    token_pattern = None, # Set to None. Use the "tokenizer" attribute instead
    lowercase = .., # If True, convert the document to lowercase String and
                    # perform term matching in a case-insensitive manner
    binary = .., # Determines the transformation from counts to final TF
                 # metric ("binary" for True, and "termFrequency" for False)
    sublinear_tf = .., # If True, apply scaling to final TF metric
    norm = None # Set to None
)
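Two of these parameters translate directly into TextIndex attribute values. The helper below is a hypothetical illustration of that mapping, not part of any library; it assumes that `lowercase = False` implies case-sensitive matching, which the comments above state only for the `True` case.

```python
def text_index_attributes(lowercase: bool, binary: bool) -> dict:
    """Map TfidfVectorizer settings to TextIndex attribute values (sketch)."""
    return {
        # lowercase=True means case-insensitive term matching
        "isCaseSensitive": "false" if lowercase else "true",
        # binary=True selects the "binary" metric, otherwise raw counts
        "localTermWeights": "binary" if binary else "termFrequency",
    }


print(text_index_attributes(lowercase=True, binary=False))
# {'isCaseSensitive': 'false', 'localTermWeights': 'termFrequency'}
```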
Interoperability with Scikit-Learn (2/2)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn2pmml import PMMLPipeline
from sklearn2pmml.feature_extraction.text import Splitter
pipeline = PMMLPipeline([
    # PMMLPipeline expects a list of (name, step) tuples
    ("tf-idf", TfidfVectorizer(analyzer = "word", preprocessor = None,
        strip_accents = None, tokenizer = Splitter(), token_pattern = None,
        stop_words = "english", ngram_range = (1, 2), binary = False,
        use_idf = True, norm = None))
])
from sklearn2pmml import sklearn2pmml
sklearn2pmml(pipeline, "pipeline.pmml")
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
