DevoxxFR 2024 Reproducible Builds with Apache Maven
Tarefa Data Mining - Classificação de Textos
1. What’s Cooking
Text Classification Competition
Paulo Cezar Lacerda Neto
Instituto de Computação – Universidade Federal Fluminense (UFF)
R. Passo da Pátria, 156 – 24210-240 – Niterói – RJ – Brazil
pclacerda@gmail.com
Abstract. This paper describes my participation in a text classification
competition as part of my activities in a data mining class I attended at
Universidade Federal Fluminense in 2015. The competition was a challenge
to classify the type of cuisine of 9.944 recipes using a training set of 39.774
recipes. I was able to solve the classification problem with a machine learning
classifier based on Naïve Bayes method. The results of this work showed a
good initial accuracy and some improvement opportunities.
1. Introduction
Text classification, a topic in the areas of natural language processing and data mining,
is a task that has many real word applications like spam detection, classification of
patient clinical records, classification of newspapers articles and scientific papers.
Kaggle, an on-line community of data scientists, promotes public data mining contests.
The opportunity to participate in competitions that involves such interesting subject is a
great chance to students to put into practice the concepts learned in data mining courses,
so the platform have received a lot of attention from the community.
What’s Cooking [Kaggle 2015], a text classification competition, set a challenge to the
participants to classify recipes into a single type of cuisine, based on a training data set.
The competition rules determines that each participant is allowed to submit a maximum
of 5 entries per day and the submissions are evaluated based on the accuracy, that is the
number of recipes classified correctly.
The organization provided three files: the first is a training data set with 39.744 recipes,
each one categorized by one the 20 cuisine types available, the second is a test data
containing a list of 9.944 recipes to be classified by the participants and the third file is a
sample submission file in csv format.
The training and test data files content are in json format (see Table 1), the difference
between the two files is that test data does not come with the cuisine type, which is the
class the participant needs to classify.
Table 1. Sample training data
{ "id": 24717,
"cuisine": "indian",
"ingredients": [
"onions",
"spinach",
"sweet potatoes"
]},
2. 2. Solution Approach
The chosen approach to solve the text classification problem was the use of a supervised
machine learning method to classify each recipe according to its ingredients. The first
step was to choose the machine learning algorithm to be used, which is described in the
third section of this document. After selecting the machine learning algorithm, it was
necessary to create a program to implement the method.
Given the idea of using a supervised machine learning method, the processing activities
should be divided into two main steps: training and classification, as shown in Figure 1.
Figure 1. Training and Classification Steps
The training step (see Figure 2) is where the program receives the training data and
build a vocabulary with all the words that make up the recipes, to be used as the basis
for the bag-of-words representation approach, then for each recipe it extract the features
from the text, to create a feature vector, and finally feeds the machine learning
algorithm with the feature vector and the related class to train it.
Figure 2. Training Step
3. After learning how to classify new recipes based on training data, the algorithm
produces a classifier model to be used in the second step of the process.
Figure 3. Classification Step
The classification step (see Figure 3) is where new recipes, i.e. recipes which classes are
unknown, are classified based on what the algorithm was able to learn in the first step.
After the recipes are classified they can be submitted to the Kaggle’s web site.
3. Choosing the Machine Learning Method
In order to decide which method to use in this project, it was done some research to
investigate the most common machine learning methods used in text classification
activities. Based on a survey written by [Aggarwal and Zhai 2012], the following
methods were considered to be used in this project: Naïve Bayes, Logistic Regression,
k-NN, Neural Networks, SVM and Ensembles.
The principle adopted when deciding which method to use in this project was that the
chosen method should have a track record in the text classification filed and also be
easy to implement, so that an initial result and a baseline could be quickly obtained.
During the research on the possible machine learning methods, a reasonable number of
mentions of Naïve Bayes usage in text classifications and natural language processing
were found. According [Jurafsky and Martin 209], Naïve Bayes is commonly employed
as a good baseline method, often yielding results that are sufficiently good for practical
use. Besides that, some good code samples of Naïve Bayes implementations were found
in machine learning textbooks, such as [Harrington 2012].
The Naïve Bayes method was chosen among the possible alternatives as the algorithm
to be used to classify the competition’s recipes, mainly because of its simplicity and
also to be considered a good baseline. However, it is important to notice that this
method makes the naïve assumption of independence between the conditionals
probabilities of the features values given a class, what may not work well in cases of
cuisine types (classes) where one ingredient has some dependency on another, so it is
important to evaluate other machine learning methods in order to get the best results in
the competition.
4. 4. Implementing the Naïve Bayes Algorithm
After selecting the machine learning algorithm, it was time to start building the program
to implement the algorithm responsible for reading the input files, using Naïve Bayes
(NB) to learn and classify the files and generating an output file with the classified
recipes to be submitted to Kaggle’s web site.
The next formula represents the NB theorem in terms of a class c and a recipe r:
The formula above shows how to calculate the probability of a class c given a recipe r,
so the Naïve Bayes classifier job is to obtain the class with the greatest probability
P(c|r) that will be the class defined for the recipe r. The following derivations are made
in order to simplify the calculations in the Naïve Bayes algorithm:
In the last equation, P(c) is the probability of class c and P(r|c) the probability of a
recipe r given a class c. Since in Naïve Bayes it is used the bag-of-words approach to
represent feature vectors, P(r|c) can be written as P(r|c) = P(w1, w2, w3, ..., wn | c) and
assuming that the probabilities of P(w1|c), P(w2|c), P(w3|c), ... , P(wN|c) are independent,
the Naïve part of the method, P(r|c) can be calculated as P(w1|c) x P(w2|c) x P(w3|c) x ...
x P(wN|c).
The training step of the NB classifier algorithm does the calculation of the a priori
probabilities: P(c) for each class and the probability of each word of the vocabulary
(P(wn|c)) given a class c.
A program in Python was built in order to implement the Naïve Bayes algorithm and is
available in a GitHub repository (https://github.com/placerda/whatscooking). Python
was chosen because it is a simple although powerful language for machine learning
tasks and also works well with vectors (NumPy module). The program has two main
functions: trainNB and classifyNB, Table 2 and Table 3 show fragments from both of
them, respectively, with comments to explain their logic. The first is the function
responsible for training the Naïve Bayes classifier, it creates a vector with the a priori
probabilities and the second function classifies the recipes with unknown cuisine type.
Table 2. Naïve Bayes Training Function
def
trainNB(trainrecipes,
vocabulary,
classes):
...
#
Laplace
smoothing
(add
1
numerator)
numeratorPwc
=
np.array([[1.0]*len(vocabulary)]*len(classes))
#
Laplace
smoothing
(denominator)
denominatorPwc
=
np.array([len(classes)]*len(vocabulary))
...
5. #
Builds
p(c)
and
p(w|c)
vectors
for
each
class
for
recipe
in
trainrecipes:
...
#pc
pc[classIndex]
+=
1
#pwc
recipeFeatVector
=
createFeatVector(vocabulary,recipe['ingredients'])
numeratorPwc[classIndex]
+=
recipeFeatVector
denominatorPwc
+=
recipeFeatVector
#
Calculates
pc
and
pwc
#pc
vector/number
of
classes
pc
=
pc
/
float(sum(pc))
#
calculates
pwc
using
log
to
avoid
underflow
problem
pwc
=
np.log(numeratorPwc
/
denominatorPwc)
return
pc,
pwc
Table 3. Naïve Bayes Classifier Function
def
classifyNB(pc,
pwc,
ingredFeatVector):
#
For
each
class
calculates
p(c|w),
the
probability
of
a
class
c
#
given
a
recipe
w
(represented
as
a
feature
vector)
...
for
i
in
range(len(pcw)):
pcw[i]
=
sum(pwc[i]
*
ingredFeatVector)
+
np.log(pc[i])
#get
maximum
p(c|w)
value
(array
index)
to
define
the
best
class
for
this
recipe
maxClass
=
pcw.tolist().index(max(pcw))
return
maxClass
5. Results
After implementing the Naïve Bayes Algorithm program, it was time to submit its
output to the Kaggle’s web site to see the results, but before doing that it was important
to check the algorithm accuracy with the training data. To evaluate the algorithm before
submitting its results, it was used a 10-fold cross validation, so in order to do the
validation, it was created a program to process the train data set, dividing its 39.774
recipes in 10 folds. For each fold, the cross validation method trains the classifier with
the data outside the fold and use the fold as the test data set to calculate the accuracy of
the fold, after doing this for the 10 folds it calculates the average accuracy.
The 10-fold cross validation reported an accuracy of 0.57862. After checking the
10-fold cross validation, the program output with all the 9.944 classified recipes was
submitted in a csv file to the competition’s site.
6. Figure 4. Kaggle’s Leaderboard
When Kaggle receives the results, the web site calculates the score of the submission so
it can be ranked in the competition’s leaderboard, as shown in Figure 4. The score is the
accuracy calculated based on the submission and the actual classes of the recipes. The
score obtained in the first submission was 0.57623, what was close to the accuracy
obtained with the 10-fold cross validation.
6. Conclusion and Next Steps
The results obtained with the first submission were not in the top of the ranking nor
showed a high accuracy, what was not a surprise, since it was just the first submission
and no data pre-processing was done before the training, classification and submission,
meaning that the first submission is only the beginning of the process of competing in a
challenge like this and there is room for improvement in order to get better results.
Thinking on the improvements that can be done, one of them is related to the Naïve
assumption of the machine learning method used, because depending on the cuisine
(class) there may be some ingredients that influence other ingredients meaning that the
conditionals probabilities p(wn|c) are not so independent from each other. For example,
based on expert judgment, a Brazilian recipe that contains beans has great chances of
also including rice, so that the p(w(beans)|c(Brazilian)) and p(w(rice)|c(Brazilian)) may have some
dependency degree in this case, then with some expert advice it is possible to manually
add weighting rules so that during the learning phase some features will receive a
greater weight depending on the cuisine.
Another thing that can be done thinking on improving the accuracy is to do some data
preparation work, including data normalization and stemming. There are some
ingredients that can be reduced to the same token, for example, the following two
ingredients are good examples of this scenario: “50% less sodium black beans” and
“black beans”.
Based on the observations made during the research phase, it is worth mention other
algorithms that are good candidates to be used to improve accuracy. Experiments made
by [Li and Yang 2003] showed that some methods like SVM and kNN are good options
to do text classification.
7. References
Kaggle. (2015) What’s Cooking? Use recipe ingredients to categorize the cuisine.
Retrieved November 23, 2015, from https://www.kaggle.com/c/whats-cooking.
Charu C. Aggarwal , Cheng Xiang Zhai (2012), Mining Text Data, Chap. 6, Springer
Publishing Company, Incorporated.
Daniel Jurafsky, James H. Martin (2009), Speech and Language Processing (2nd
Edition), Prentice-Hall, Inc., Upper Saddle River, NJ.
Peter Harrington (2012), Machine Learning in Action. Manning Publications Co.
Greenwich, CT
Fan Li, Yiming Yang (2003), A Loss Function Analysis for Classification Methods in
Text Categorization, Carnegie Mellon Univ, Pittsburgh, PA.