2020-TFG1 Natural Language Processing
1. Document classification using Machine Learning techniques
María Roncal Salcedo
University of Zaragoza, 9th July 2020
María Roncal Salcedo · Document classification using Machine Learning techniques · University of Zaragoza, 9th July 2020
2. Chapter 1: What is Artificial Intelligence?
3. What is Natural Language Processing?
4. Automatic document classification
Solving text classification manually is time-consuming and expensive, which motivates automating it with Machine Learning.
5. Text classification
Definition
In mathematical terms, text categorization is the task of assigning a boolean value to each pair (dj, ci) ∈ D × C, where D = {d1, d2, ..., dn} is a domain of documents and C = {c1, c2, ..., cm} is a set of predefined categories. A value of 1 assigned to (dj, ci) indicates that document dj classifies under category ci, whereas a value of 0 indicates that dj does not classify under ci.
What is the main goal?
Subjective task
6. Art of vector representation
Dealing with text data is problematic because, whereas human beings
communicate with words and sentences, computers only understand
numbers.
Since raw text cannot be fed straight into the model, a mechanism
for representing text is required.
The main idea is thus to represent a word as a point in some
multidimensional space. Vectors for representing words are generally
called embeddings, because the word is embedded in a particular
vector space.
Common strategies: BoW, TF-IDF and word2vec.
7. Bag of words (BOW)
Simple model
This model creates a vocabulary of all the unique words occurring in
all the documents in the training set.
The basic idea of BoW is to take a piece of text and count the
frequency of the words in that text.
Disadvantages:
- BoW does not care about the order of words in the text
- These vectors contain many 0’s
- Treats each word individually, so contextual information is lost
(e.g. "tasty" and "delicious" end up as unrelated dimensions)
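The counting step above can be sketched in plain Python. The toy corpus and the function name are invented for illustration; real code would normally use a library vectorizer:

```python
from collections import Counter

def bag_of_words(documents):
    """Build a vocabulary over all documents; return one count vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the food was tasty", "the food was delicious"]
vocab, vecs = bag_of_words(docs)
print(vocab)  # ['delicious', 'food', 'tasty', 'the', 'was']
print(vecs)   # [[0, 1, 1, 1, 1], [1, 1, 0, 1, 1]]
```

Note how the two vectors differ in exactly one position even though the sentences mean the same thing: this is the loss of contextual information mentioned above, and on a real vocabulary most positions would be 0.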
8. TF-IDF
If a word occurs numerous times in one document but also across many
other documents in the data set, it may simply be a frequent word, not a
significant or meaningful one. TF-IDF was proposed to address this.
The importance of a word increases proportionally with the number of
times it appears in the document, but decreases with how frequently it
appears across the corpus. TF-IDF is one of the most popular
term-weighting schemes today.
The weight wi,j of word i in document j is given by:
w_{i,j} = tf_{i,j} · log10(N / df_i)
where tf_{i,j} is the frequency of word i in document j, df_i is the number of documents containing word i, and N is the total number of documents.
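The formula above can be sketched directly in Python (the three-document toy corpus is invented for illustration; library implementations such as scikit-learn's TfidfVectorizer use a slightly different smoothed variant):

```python
import math
from collections import Counter

def tfidf(documents):
    """w_ij = tf_ij * log10(N / df_i): term frequency x inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    N = len(tokenized)
    # df[w] = number of documents that contain word w
    df = Counter(w for doc in tokenized for w in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log10(N / df[w]) for w in tf})
    return weights

w = tfidf(["cats like milk", "dogs like bones", "cats like cats"])
# "like" appears in every document, so log10(N/df) = 0 and its weight vanishes.
```

Words appearing in every document ("like" here) get weight 0, exactly the behavior the slide motivates: ubiquitous words carry no discriminative information.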
9. Word2Vec
Word2Vec is one of the most popular techniques for learning word
embeddings using a neural network.
The intuition of word2vec
10. Word2Vec architectures
Word2vec can be obtained using two architectures (both involving Neural
Networks): Continuous Bag Of Words (CBOW) and Skip Gram.
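The CBOW direction can be sketched as a single forward pass: average the context embeddings, then score every vocabulary word as the candidate centre word. The toy vocabulary, sizes and random weights below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, d = len(vocab), 8               # vocabulary size, embedding dimension
W_in = rng.normal(size=(V, d))     # input embeddings: one row per word
W_out = rng.normal(size=(d, V))    # output weights: hidden layer -> vocab scores

def cbow_forward(context_ids):
    """CBOW forward pass: average context embeddings, score every vocab word."""
    h = W_in[context_ids].mean(axis=0)   # hidden layer = mean of context vectors
    scores = h @ W_out
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # softmax over the vocabulary

# Context "the cat _ on": a probability distribution over centre-word candidates.
probs = cbow_forward([0, 1, 3])
```

Training would adjust W_in and W_out so that the true centre word gets high probability; after training, the rows of W_in are the word embeddings. Skip-Gram reverses the direction, predicting each context word from the centre word.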
11. Chapter 2 - Different techniques for text classification
1) Multinomial Naïve Bayes
2) Support Vector Machine (SVM)
3) Recurrent Neural Network (LSTM)
12. Multinomial Naïve Bayes for Text Classification
Naive Bayes Classifier
Let the possible classes be the fixed set C = {c1, c2, ..., ck}. Out of
all classes in C, the classifier returns the class cMAP which has the
maximum a posteriori (MAP) probability given the document d:

cMAP = argmax_{c∈C} P(c|d)

Therefore, by applying Bayes' rule:

cMAP = argmax_{c∈C} P(d|c)P(c) / P(d) = argmax_{c∈C} P(d|c)P(c)

Notice that P(d) can be dropped, since it is the same for every class. With
no loss of generality, let's represent the document as a set of features
x = {x1, x2, ..., xn}:

cMAP = argmax_{c∈C} P(x1, x2, ..., xn|c) P(c)
13. Multinomial Naïve Bayes for Text Classification
Naive Bayes Classifier
Unfortunately, this is hard to compute directly: estimating the joint
probability P(x1, x2, ..., xn|c) would require far too many parameters.
So, to reduce the number of parameters, two simplifying assumptions
need to be made:
1 Bag of words assumption: Assume that the position of the words in
the document doesn't matter. Word order is ignored, and therefore a
given word has the same effect on classification whether it is the first
word, the 150th word or the last word in the document.
2 Naive Bayes assumption: Each feature of x is independent of the
others given the class.
Hence,

P(x1, x2, ..., xn|c) = ∏_{i=1}^{n} P(xi|c)
14. Multinomial Naïve Bayes for Text Classification
Both of these assumptions are incorrect because, as is obvious, order
matters in semantic interpretation. Nonetheless, making these simplifying
assumptions keeps the problem tractable. Due to the previous
assumptions, the Naive Bayes classifier cNB can be expressed as:

cNB = argmax_{c∈C} P(c) ∏_{i=1}^{n} P(xi|c)

Since all word positions need to be considered, by walking an index i
through every word position in the document one can compute:

cNB = argmax_{c∈C} P(c) ∏_{i∈positions} P(wi|c)
15. Multinomial Naïve Bayes for Text Classification
Multiplying lots of probabilities can result in floating-point underflow. A
common workaround is to work in log space, which avoids underflow and
increases speed, since sums replace products. The multinomial NB
classifier becomes a linear classifier when doing this. Therefore, we
conclude that:

cNB = argmax_{c∈C} [ log P(c) + Σ_{i∈positions} log P(wi|c) ]
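A minimal log-space implementation might look as follows. The toy training set is invented, and add-one (Laplace) smoothing, which the slides do not cover, is added so that unseen word/class pairs do not produce log(0):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count class priors and per-class word frequencies."""
    vocab = {w for d in docs for w in d.split()}
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    return vocab, prior, word_counts, len(docs)

def classify_nb(doc, vocab, prior, word_counts, n_docs):
    """c_NB = argmax_c [ log P(c) + sum over word positions of log P(w_i|c) ]."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        total = sum(word_counts[c].values())
        score = math.log(prior[c] / n_docs)
        for w in doc.split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

model = train_nb(["great movie", "awful movie", "great fun"],
                 ["pos", "neg", "pos"])
print(classify_nb("great great movie", *model))  # prints "pos"
```

Note that the per-class score is a sum of per-word terms plus a constant, which is exactly the linear form claimed above.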
16. Support Vector Machine
SVM is one of the most powerful supervised Machine Learning algorithms
in classification problems.
17. SVM: Linearly separable case
Let's consider a binary classification task and assume that the training
set is linearly separable. We try to find a function f : Rⁿ → {−1, 1}
that, apart from correctly classifying the patterns in the training data,
also correctly classifies unseen patterns.
18. Infinitely many different hyperplanes
Usually, when the data are linearly separable, there are infinitely many
different hyperplanes that can perform separation. Hence, the question
here is how to find the separating hyperplane that not only separates the
training data, but generalizes as well as possible to new data.
19. Find the optimal hyperplane
20. Maximum margin
It is easy to prove that the distance between the two hyperplanes is 2/||w||.
The wider the margin, the better the generalization performance on new
samples. Therefore, our goal is to maximize the margin m = 2/||w||.
21. Hard margin case
Notice that maximizing the margin is equivalent to minimizing ||w||.
Thus, the maximum margin SVM can be found by solving the following
quadratic problem:

min_{w,b} (1/2)||w||²
subject to: yi(wᵀxi + b) ≥ 1, i = 1, ..., m

Once the optimization problem is solved, we get that the decision function
is a linear combination of dot products between the training points and
the test point:

f(x) = sign( Σ_i αi yi (xiᵀx) + b )
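The decision function can be illustrated directly. The support vectors, multipliers αi and bias below were solved by hand for a two-point toy problem (x1 = (0,0) with y1 = −1, x2 = (2,2) with y2 = +1), not produced by an optimizer:

```python
def svm_decision(x, support_vectors, alphas, labels, b):
    """f(x) = sign( sum_i alpha_i * y_i * <x_i, x> + b ): the decision is a
    linear combination of dot products between support vectors and x."""
    s = b
    for a, y, xi in zip(alphas, labels, support_vectors):
        s += a * y * sum(xi_k * x_k for xi_k, x_k in zip(xi, x))
    return 1 if s >= 0 else -1

# Hand-solved toy problem: both points are support vectors.
svs = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [0.25, 0.25]   # Lagrange multipliers (sum_i alpha_i * y_i = 0 holds)
b = -1.0                # chosen so that y_i (w . x_i + b) = 1 on both points

print(svm_decision((3.0, 3.0), svs, alphas, labels, b))  # prints 1
print(svm_decision((0.0, 1.0), svs, alphas, labels, b))  # prints -1
```

Only support vectors (points on the margin) get nonzero αi, so the sum usually runs over a small subset of the training data.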
22. Nonlinear SVM: Kernel Mapping
To deal with nonlinearly separable data, the same solution techniques as
for the linear case are still used. The idea is to transform the input data
into a space of higher dimensions in which the data can be linearly
separated.
23. Nonlinear SVM
To do this, the "kernel trick" will be used. A kernel is a function
that takes two vectors xi and xj as arguments and returns the value
of the inner product of their images φ(xi) and φ(xj):

K(xi, xj) = φ(xi)ᵀ φ(xj)

The trick is that we never actually need to know the mapping φ.
All we need is a way of computing the kernel values Kij. Therefore,
choosing the kernel is equivalent to choosing φ. Notice that since only
the inner product of the two vectors in the new space is returned, the
dimensionality of the new space does not matter.
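A small check of this identity, using the polynomial kernel K(x, y) = (x·y)², whose explicit map φ in two dimensions is known to be (x1², √2·x1x2, x2²); the sample points are invented:

```python
import math

def poly_kernel(x, y):
    """K(x, y) = (x . y)^2 computed directly, without ever building phi(x)."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product equals the kernel above."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 1.0)
lhs = poly_kernel(x, y)                           # kernel value: 25.0
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # same value via phi
```

Here the kernel costs one dot product in 2 dimensions, whereas the explicit map works in 3; for richer kernels (e.g. RBF) the implicit space is infinite-dimensional, so computing through φ is not even possible.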
24. Kernel types
25. Multiclass: One-versus-all and One-versus-one
26. Neural Network: What is an Artificial Neural Network?
Artificial neurons are the elementary units of an artificial neural network.
An artificial neuron is a digital construct that seeks to simulate the
behavior of a biological neuron in the brain. The artificial neuron receives
one or more inputs, computes their weighted sum, and applies an
activation function f to produce an output.
Mathematically, this is expressed as:

y = f( Σ_{i=1}^{N} wi xi + b )
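The formula maps to a few lines of Python; the input values, weights, and the choice of a sigmoid activation are assumptions made for this sketch:

```python
import math

def neuron(inputs, weights, bias):
    """y = f( sum_i w_i * x_i + b ); here f is a sigmoid (one common choice)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid squashes z into (0, 1)

y = neuron([1.0, 2.0], [0.5, -0.25], 0.0)   # z = 0.5 - 0.5 = 0, sigmoid(0) = 0.5
```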
27. Neural Network
Network architecture (number of layers, number of nodes,
parameters, etc)
Activation function
28. Neural Network
One of the most important processes in a neural network is training,
that is, learning the values of the parameters (weights and
biases) that minimize the cost function.
Theoretically, at the end of training, the NN should be able to infer
the right output even for inputs that weren't provided during training.
To do this, the network compares its initial outputs with the provided
correct answers. A cost function measures the degree to which the
outputs differ from the target values. Finally, the cost function's
results are pushed back through all neurons and connections
(backpropagation) to adjust the weights and biases.
29. Recurrent Neural Network (RNN)
Feedforward network vs RNN
Loops allow information to persist
Chain-like structure
Unrolled RNN
30. Long short-term memory
LSTM is an improved extension of the RNN.
In concept, an LSTM recurrent unit tries to “remember” all the past
knowledge that the network has seen so far and to “forget” irrelevant
data.
LSTM introduces multiple gates.
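One LSTM step can be sketched with NumPy to make the gates concrete. The dimensions and the random parameters are invented for the example; a real network would learn them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: the forget gate f drops old memory, the input gate i
    admits new candidate memory, the output gate o decides what h exposes."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([h_prev, x])      # gates read [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)             # forget gate
    i = sigmoid(Wi @ z + bi)             # input gate
    o = sigmoid(Wo @ z + bo)             # output gate
    c_tilde = np.tanh(Wc @ z + bc)       # candidate cell state
    c = f * c_prev + i * c_tilde         # keep some old memory, add some new
    h = o * np.tanh(c)                   # hidden state passed onward
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4                         # toy input and hidden sizes
params = ([rng.normal(size=(d_h, d_h + d_in)) for _ in range(4)]
          + [np.zeros(d_h) for _ in range(4)])
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
```

The key point is the additive cell-state update c = f·c_prev + i·c̃, which is what lets gradients flow across many time steps better than in a plain RNN.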
31. Chapter 3: Practical Application in Python
Why Python?
The chosen data set is 20newsgroup, split into a training set and a test set
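The kind of scikit-learn pipeline used for such an experiment can be sketched as follows. A tiny invented two-category corpus stands in for 20newsgroup here (the real data set would be loaded with fetch_20newsgroups), and this is an illustration, not the thesis code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented stand-in for the 20newsgroup corpus (two categories).
train_docs = ["the team won the game", "a great goal and a win",
              "the gpu renders the image", "python code for the model"]
train_labels = ["sport", "sport", "comp", "comp"]

# Vectorize with TF-IDF, then classify with multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
pred = model.predict(["the model code runs on the gpu"])
```

The same pipeline shape applies to the full data set: only the fitted vocabulary and counts change, while the TF-IDF representation and the classifier stay the same.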
32. Python code
33. Results of the test
34. Conclusions
Document classification can be automated successfully using ML
techniques
Metric results
Were these the expected results?
Complications when setting up a neural network
glove.6B.300d.txt, preprocessing, GPU, etc.
35. Thank you
María Roncal Salcedo