What’s Cooking
Text Classification Competition
Paulo Cezar Lacerda Neto
Instituto de Computação – Universidade Federal Fluminense (UFF)
R. Passo da Pátria, 156 – 24210-240 – Niterói – RJ – Brazil
pclacerda@gmail.com
Abstract. This paper describes my participation in a text classification
competition as part of my activities in a data mining class I attended at
Universidade Federal Fluminense in 2015. The competition was a challenge
to classify the type of cuisine of 9,944 recipes using a training set of 39,774
recipes. I was able to solve the classification problem with a machine learning
classifier based on the Naïve Bayes method. The results of this work showed a
good initial accuracy and some improvement opportunities.
1. Introduction
Text classification, a topic in the areas of natural language processing and data mining,
is a task with many real-world applications, such as spam detection, classification of
patient clinical records, and classification of newspaper articles and scientific papers.
Kaggle, an on-line community of data scientists, promotes public data mining contests.
The opportunity to participate in competitions that involve such an interesting subject is a
great chance for students to put into practice the concepts learned in data mining courses,
so the platform has received a lot of attention from the community.
What’s Cooking [Kaggle 2015], a text classification competition, challenged the
participants to classify recipes into a single type of cuisine, based on a training data set.
The competition rules determine that each participant is allowed to submit a maximum
of 5 entries per day, and the submissions are evaluated on accuracy, that is, the
proportion of recipes classified correctly.
The organization provided three files: the first is a training data set with 39,774 recipes,
each one categorized as one of the 20 available cuisine types; the second is a test data set
containing a list of 9,944 recipes to be classified by the participants; and the third is a
sample submission file in csv format.
The training and test data files are in JSON format (see Table 1); the only difference
between the two files is that the test data does not include the cuisine type, which is the
class the participant needs to predict.
Table 1. Sample training data
{ "id": 24717,
"cuisine": "indian",
"ingredients": [
"onions",
"spinach",
"sweet potatoes"
]},
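As an illustration, the fragment below is a minimal sketch (not part of the repository code; the file names train.json and test.json are assumed from the competition's download page) showing how the files can be loaded in Python:

import json

# Load the competition files (file names assumed)
with open('train.json') as f:
    trainrecipes = json.load(f)   # list of dicts with id, cuisine and ingredients
with open('test.json') as f:
    testrecipes = json.load(f)    # same structure, but without the cuisine field

# The 20 cuisine types present in the training data
classes = sorted({recipe['cuisine'] for recipe in trainrecipes})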
2. Solution Approach
The approach chosen to solve the text classification problem was to use a supervised
machine learning method to classify each recipe according to its ingredients. The first
step was to choose the machine learning algorithm to be used, which is described in the
third section of this document. After selecting the machine learning algorithm, it was
necessary to create a program to implement the method.
Given the idea of using a supervised machine learning method, the processing activities
should be divided into two main steps: training and classification, as shown in Figure 1.
Figure 1. Training and Classification Steps
The training step (see Figure 2) is where the program receives the training data and
builds a vocabulary with all the words that make up the recipes, to serve as the basis
for the bag-of-words representation; then, for each recipe, it extracts the features
from the text to create a feature vector, and finally feeds the machine learning
algorithm with the feature vector and the related class to train it.
Figure 2. Training Step
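To make the vocabulary construction concrete, the sketch below is a simplified reconstruction for illustration (not the repository code; it assumes each ingredient string is split into lower-case words):

def buildVocabulary(trainrecipes):
    # Collects every distinct word that appears in the ingredient lists
    vocab = set()
    for recipe in trainrecipes:
        for ingredient in recipe['ingredients']:
            vocab.update(ingredient.lower().split())
    return sorted(vocab)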
After learning how to classify new recipes based on training data, the algorithm
produces a classifier model to be used in the second step of the process.
Figure 3. Classification Step
The classification step (see Figure 3) is where new recipes, i.e. recipes whose classes are
unknown, are classified based on what the algorithm learned in the first step.
After the recipes are classified, they can be submitted to Kaggle’s web site.
3. Choosing the Machine Learning Method
In order to decide which method to use in this project, some research was done to
investigate the most common machine learning methods used in text classification.
Based on a survey by [Aggarwal and Zhai 2012], the following methods were
considered for this project: Naïve Bayes, Logistic Regression, k-NN, Neural Networks,
SVM and Ensembles.
The principle adopted when deciding which method to use was that the chosen method
should have a track record in the text classification field and also be easy to implement,
so that an initial result and a baseline could be obtained quickly.
During the research on possible machine learning methods, a considerable number of
mentions of Naïve Bayes in text classification and natural language processing were
found. According to [Jurafsky and Martin 2009], Naïve Bayes is commonly employed
as a good baseline method, often yielding results that are sufficiently good for practical
use. Besides that, some good code samples of Naïve Bayes implementations were found
in machine learning textbooks, such as [Harrington 2012].
The Naïve Bayes method was chosen among the possible alternatives as the algorithm
to classify the competition’s recipes, mainly because of its simplicity and because it is
considered a good baseline. However, it is important to note that this method makes the
naïve assumption of independence between the conditional probabilities of the feature
values given a class, which may not work well for cuisine types (classes) where one
ingredient has some dependency on another; so it is important to evaluate other machine
learning methods in order to get the best results in the competition.
4. Implementing the Naïve Bayes Algorithm
After selecting the machine learning algorithm, it was time to build the program that
reads the input files, uses Naïve Bayes (NB) to learn and classify the recipes, and
generates an output file with the classified recipes to be submitted to Kaggle’s web site.
The next formula represents Bayes’ theorem in terms of a class c and a recipe r:

P(c|r) = P(r|c) P(c) / P(r)
The formula above shows how to calculate the probability of a class c given a recipe r,
so the Naïve Bayes classifier’s job is to find the class with the greatest probability
P(c|r), which becomes the class assigned to the recipe r. The following derivation
simplifies the calculations in the Naïve Bayes algorithm (P(r) is the same for every
class, so it can be dropped from the comparison):

c* = argmax_c P(c|r) = argmax_c P(r|c) P(c) / P(r) = argmax_c P(r|c) P(c)
In the last equation, P(c) is the probability of class c and P(r|c) the probability of a
recipe r given a class c. Since Naïve Bayes uses the bag-of-words approach to
represent feature vectors, P(r|c) can be written as P(r|c) = P(w1, w2, w3, ..., wn | c), and
assuming that P(w1|c), P(w2|c), P(w3|c), ..., P(wn|c) are independent, which is the
naïve part of the method, P(r|c) can be calculated as P(w1|c) × P(w2|c) × P(w3|c) × ...
× P(wn|c).
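Since multiplying many small probabilities underflows in floating point, the implementation works in log space (as the code in this section shows); written out, the decision rule actually computed is:

\[
\hat{c} = \operatorname*{argmax}_{c} \left( \log P(c) + \sum_{n=1}^{N} x_n \log P(w_n \mid c) \right)
\]

where \(x_n\) is the n-th entry of the recipe’s feature vector.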
The training step of the NB classifier calculates the probabilities this rule needs: the
prior P(c) for each class and the conditional probability P(wn|c) of each word of the
vocabulary given a class c.
A program in Python was built to implement the Naïve Bayes algorithm and is
available in a GitHub repository (https://github.com/placerda/whatscooking). Python
was chosen because it is a simple yet powerful language for machine learning
tasks and also works well with vectors (NumPy module). The program has two main
functions, trainNB and classifyNB; Table 2 and Table 3 show fragments from both of
them, respectively, with comments to explain their logic. The first function trains the
Naïve Bayes classifier, creating the vectors of probabilities, and the second classifies
the recipes with unknown cuisine type.
Table 2. Naïve Bayes Training Function
def trainNB(trainrecipes, vocabulary, classes):
    ...
    # Laplace smoothing (add 1 to numerator)
    numeratorPwc = np.array([[1.0]*len(vocabulary)]*len(classes))
    # Laplace smoothing (denominator)
    denominatorPwc = np.array([len(classes)]*len(vocabulary))
    ...
    # Builds p(c) and p(w|c) vectors for each class
    for recipe in trainrecipes:
        ...
        # pc
        pc[classIndex] += 1
        # pwc
        recipeFeatVector = createFeatVector(vocabulary, recipe['ingredients'])
        numeratorPwc[classIndex] += recipeFeatVector
        denominatorPwc += recipeFeatVector

    # Calculates pc and pwc
    # pc vector / number of classes
    pc = pc / float(sum(pc))
    # calculates pwc using log to avoid underflow problems
    pwc = np.log(numeratorPwc / denominatorPwc)
    return pc, pwc
Table 3. Naïve Bayes Classifier Function
def classifyNB(pc, pwc, ingredFeatVector):
    # For each class calculates p(c|w), the probability of a class c
    # given a recipe w (represented as a feature vector)
    ...
    for i in range(len(pcw)):
        pcw[i] = sum(pwc[i] * ingredFeatVector) + np.log(pc[i])

    # get maximum p(c|w) value (array index) to define the best class for this recipe
    maxClass = pcw.tolist().index(max(pcw))
    return maxClass
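Putting the two functions together, the end-to-end use looks roughly like this (a usage sketch following the variable names in the fragments above; the actual driver code in the repository may differ):

pc, pwc = trainNB(trainrecipes, vocabulary, classes)
for recipe in testrecipes:
    featVector = createFeatVector(vocabulary, recipe['ingredients'])
    recipe['cuisine'] = classes[classifyNB(pc, pwc, featVector)]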
5. Results
After implementing the Naïve Bayes program, it was time to submit its output to
Kaggle’s web site to see the results, but before doing that it was important to check the
algorithm’s accuracy on the training data. To evaluate the algorithm before submitting
its results, 10-fold cross-validation was used: a program was created to process the
training data set, dividing its 39,774 recipes into 10 folds. For each fold, the
cross-validation method trains the classifier with the data outside the fold and uses the
fold as the test data set to calculate the accuracy of the fold; after doing this for all 10
folds, it calculates the average accuracy.
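A minimal sketch of that validation loop, assuming the trainNB, classifyNB and createFeatVector functions above (the actual validation program may differ):

def crossValidate(trainrecipes, vocabulary, classes, k=10):
    foldSize = len(trainrecipes) // k
    accuracies = []
    for i in range(k):
        # Uses the i-th slice as the test fold and the rest as training data
        test = trainrecipes[i*foldSize:(i+1)*foldSize]
        train = trainrecipes[:i*foldSize] + trainrecipes[(i+1)*foldSize:]
        pc, pwc = trainNB(train, vocabulary, classes)
        hits = 0
        for recipe in test:
            featVector = createFeatVector(vocabulary, recipe['ingredients'])
            if classes[classifyNB(pc, pwc, featVector)] == recipe['cuisine']:
                hits += 1
        accuracies.append(hits / float(len(test)))
    return sum(accuracies) / k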
The 10-fold cross-validation reported an accuracy of 0.57862. After checking it, the
program output with all 9,944 classified recipes was submitted in a csv file to the
competition’s site.
Figure 4. Kaggle’s Leaderboard
When Kaggle receives the results, the web site calculates the score of the submission so
it can be ranked in the competition’s leaderboard, as shown in Figure 4. The score is the
accuracy calculated from the submission and the actual classes of the recipes. The
score obtained in the first submission was 0.57623, which was close to the accuracy
obtained with the 10-fold cross-validation.
6. Conclusion and Next Steps
The results obtained with the first submission were neither at the top of the ranking nor
highly accurate, which was not a surprise, since it was just the first submission
and no data pre-processing was done before training, classification and submission.
The first submission is only the beginning of the process of competing in a challenge
like this, and there is room for improvement.
Considering the improvements that can be made, one of them is related to the naïve
assumption of the machine learning method used: depending on the cuisine (class),
some ingredients may influence others, meaning that the conditional probabilities
P(wn|c) are not truly independent of each other. For example, based on expert
judgment, a Brazilian recipe that contains beans is very likely to also include rice, so
P(wbeans|cBrazilian) and P(wrice|cBrazilian) may have some degree of dependency in
this case; with some expert advice it is possible to manually add weighting rules so that,
during the learning phase, some features receive a greater weight depending on the
cuisine, as sketched below.
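One hypothetical way to encode such rules (the rule table and the weights here are invented purely for illustration) is a lookup of multipliers applied to each feature vector before training:

# Hypothetical expert rules: (cuisine, word) -> weight multiplier
expertWeights = {('brazilian', 'beans'): 2.0, ('brazilian', 'rice'): 2.0}

def applyExpertWeights(featVector, vocabulary, cuisine):
    # Scales the features named by the expert rules for this cuisine
    weighted = featVector.copy()
    for i, word in enumerate(vocabulary):
        weighted[i] *= expertWeights.get((cuisine, word), 1.0)
    return weighted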
Another way to improve the accuracy is to do some data preparation work, including
data normalization and stemming. Some ingredients can be reduced to the same token;
for example, the following two ingredients illustrate this scenario: “50% less sodium
black beans” and “black beans”.
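For instance, a simple normalization pass could lower-case each ingredient, drop digits and punctuation, and stem the remaining words. A sketch using NLTK’s PorterStemmer follows (an assumed choice of library; extra rules, such as a stop-phrase list for marketing text like “50% less sodium”, would still be needed to merge the two examples above completely):

import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def normalizeIngredient(ingredient):
    # Lower-case, strip everything but letters, then stem each word
    words = re.sub(r'[^a-z ]', ' ', ingredient.lower()).split()
    return ' '.join(stemmer.stem(word) for word in words)

# normalizeIngredient('black beans') -> 'black bean'
# normalizeIngredient('50% less sodium black beans') -> 'less sodium black bean'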
Based on the observations made during the research phase, it is worth mentioning other
algorithms that are good candidates for improving accuracy. Experiments by
[Li and Yang 2003] showed that methods like SVM and kNN are good options for text
classification.
References
Kaggle (2015), What’s Cooking? Use recipe ingredients to categorize the cuisine.
Retrieved November 23, 2015, from https://www.kaggle.com/c/whats-cooking.
Charu C. Aggarwal, ChengXiang Zhai (2012), Mining Text Data, Chap. 6, Springer
Publishing Company, Incorporated.
Daniel Jurafsky, James H. Martin (2009), Speech and Language Processing (2nd
Edition), Prentice-Hall, Inc., Upper Saddle River, NJ.
Peter Harrington (2012), Machine Learning in Action, Manning Publications Co.,
Greenwich, CT.
Fan Li, Yiming Yang (2003), A Loss Function Analysis for Classification Methods in
Text Categorization, Carnegie Mellon University, Pittsburgh, PA.
