2020-TFG1 Natural Language Processing
1. Document classification using Machine Learning techniques
María Roncal Salcedo
University of Zaragoza, 9th July 2020
María Roncal Salcedo · Document classification using Machine Learning techniques · University of Zaragoza, 9th July 2020
2. Chapter 1: What is Artificial Intelligence?
3. What is Natural Language Processing?
4. Automatic document classification
Solving text classification manually is time-consuming and expensive, which motivates automating it with Machine Learning.
5. Text classification
Definition
In mathematical terms, text categorization is the task of assigning a boolean value to each pair (dj, ci) ∈ D × C, where D = {d1, d2, ..., dn} is a domain of documents and C = {c1, c2, ..., cm} is a set of predefined categories. A value of 1 assigned to (dj, ci) indicates that document dj classifies under category ci, whereas a value of 0 indicates that dj does not classify under ci.
What is the main goal?
Subjective task
6. Art of vector representation
Dealing with text data is problematic because, whereas human beings
communicate with words and sentences, computers only understand
numbers.
Since raw text cannot be fed straight into the model, a mechanism
for representing text is required.
The main idea is thus to represent a word as a point in some
multidimensional space. Vectors for representing words are generally
called embeddings, because the word is embedded in a particular
vector space.
Common strategies: BoW, TF-IDF and word2vec.
7. Bag of words (BOW)
Simple model
This model creates a vocabulary of all the unique words occurring in
all the documents in the training set.
The basic idea of BoW is to take a piece of text and count the
frequency of the words in that text.
Disadvantages:
- BoW does not care about the order of words in the text
- These vectors contain many 0’s
- Treats each word individually, so contextual information is lost
(e.g. "tasty" and "delicious" end up as unrelated dimensions)
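The counting step above can be sketched in plain Python. The toy corpus and the function name are invented for illustration; real code would normally use a library vectorizer:

```python
from collections import Counter

def bag_of_words(documents):
    """Build a vocabulary over all documents; return one count vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the food was tasty", "the food was delicious"]
vocab, vecs = bag_of_words(docs)
print(vocab)  # ['delicious', 'food', 'tasty', 'the', 'was']
print(vecs)   # [[0, 1, 1, 1, 1], [1, 1, 0, 1, 1]]
```

Note how the two vectors differ in exactly one position even though the sentences mean the same thing: this is the loss of contextual information mentioned above, and on a real vocabulary most positions would be 0.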
8. TF-IDF
If a word occurs numerous times in one document but also across many
other documents in the data set, it may simply be a frequent word, not a
significant or meaningful one. TF-IDF was proposed to address this.
The importance of a word increases proportionally with the number of
times it appears in the document, but decreases with how frequently it
appears across the corpus. TF-IDF is one of the most popular
term-weighting schemes today.
The weight wi,j of word i in document j is given by:
w_{i,j} = tf_{i,j} · log10(N / df_i)
where tf_{i,j} is the frequency of word i in document j, df_i is the number of documents containing word i, and N is the total number of documents.
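The formula above can be sketched directly in Python (the three-document toy corpus is invented for illustration; library implementations such as scikit-learn's TfidfVectorizer use a slightly different smoothed variant):

```python
import math
from collections import Counter

def tfidf(documents):
    """w_ij = tf_ij * log10(N / df_i): term frequency x inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    N = len(tokenized)
    # df[w] = number of documents that contain word w
    df = Counter(w for doc in tokenized for w in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log10(N / df[w]) for w in tf})
    return weights

w = tfidf(["cats like milk", "dogs like bones", "cats like cats"])
# "like" appears in every document, so log10(N/df) = 0 and its weight vanishes.
```

Words appearing in every document ("like" here) get weight 0, exactly the behavior the slide motivates: ubiquitous words carry no discriminative information.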
9. Word2Vec
Word2Vec is one of the most popular techniques for learning word
embeddings using a neural network.
The intuition of word2vec
10. Word2Vec architectures
Word2vec can be obtained using two architectures (both involving Neural
Networks): Continuous Bag Of Words (CBOW) and Skip Gram.
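The CBOW direction can be sketched as a single forward pass: average the context embeddings, then score every vocabulary word as the candidate centre word. The toy vocabulary, sizes and random weights below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, d = len(vocab), 8               # vocabulary size, embedding dimension
W_in = rng.normal(size=(V, d))     # input embeddings: one row per word
W_out = rng.normal(size=(d, V))    # output weights: hidden layer -> vocab scores

def cbow_forward(context_ids):
    """CBOW forward pass: average context embeddings, score every vocab word."""
    h = W_in[context_ids].mean(axis=0)   # hidden layer = mean of context vectors
    scores = h @ W_out
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # softmax over the vocabulary

# Context "the cat _ on": a probability distribution over centre-word candidates.
probs = cbow_forward([0, 1, 3])
```

Training would adjust W_in and W_out so that the true centre word gets high probability; after training, the rows of W_in are the word embeddings. Skip-Gram reverses the direction, predicting each context word from the centre word.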
11. Chapter 2 - Different techniques for text classification
1) Multinomial Naïve Bayes
2) Support Vector Machine (SVM)
3) Recurrent Neural Network (LSTM)
12. Multinomial Naïve Bayes for Text Classification
Naive Bayes Classifier
Let the possible classes be the fixed set C = {c1, c2, ..., ck}. Out of
all classes in C, the classifier returns the class cMAP which has the
maximum a posteriori (MAP) probability given the document d:

cMAP = argmax_{c∈C} P(c|d)

Therefore, by applying Bayes' rule:

cMAP = argmax_{c∈C} P(d|c)P(c) / P(d) = argmax_{c∈C} P(d|c)P(c)

Notice that P(d) can be dropped, since it is the same for every class. With
no loss of generality, let's represent the document as a set of features
x = {x1, x2, ..., xn}:

cMAP = argmax_{c∈C} P(x1, x2, ..., xn|c) P(c)
13. Multinomial Naïve Bayes for Text Classification
Naive Bayes Classifier
Unfortunately, this is hard to compute directly: estimating the joint
probability P(x1, x2, ..., xn|c) would require far too many parameters.
So, to reduce the number of parameters, two simplifying assumptions
need to be made:
1 Bag of words assumption: Assume that the position of the words in
the document doesn't matter. Word order is ignored, and therefore a
given word has the same effect on classification whether it is the first
word, the 150th word or the last word in the document.
2 Naive Bayes assumption: Each feature of x is independent of the
others given the class.
Hence,

P(x1, x2, ..., xn|c) = ∏_{i=1}^{n} P(xi|c)
14. Multinomial Naïve Bayes for Text Classification
Both of these assumptions are incorrect because, as is obvious, order
matters in semantic interpretation. Nonetheless, making these simplifying
assumptions keeps the problem tractable. Due to the previous
assumptions, the Naive Bayes classifier cNB can be expressed as:

cNB = argmax_{c∈C} P(c) ∏_{i=1}^{n} P(xi|c)

Since all word positions need to be considered, by walking an index i
through every word position in the document one can compute:

cNB = argmax_{c∈C} P(c) ∏_{i∈positions} P(wi|c)
15. Multinomial Naïve Bayes for Text Classification
Multiplying lots of probabilities can result in floating-point underflow. A
common workaround is to work in log space, which avoids underflow and
increases speed, since sums replace products. The multinomial NB
classifier becomes a linear classifier when doing this. Therefore, we
conclude that:

cNB = argmax_{c∈C} [ log P(c) + Σ_{i∈positions} log P(wi|c) ]
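A minimal log-space implementation might look as follows. The toy training set is invented, and add-one (Laplace) smoothing, which the slides do not cover, is added so that unseen word/class pairs do not produce log(0):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count class priors and per-class word frequencies."""
    vocab = {w for d in docs for w in d.split()}
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    return vocab, prior, word_counts, len(docs)

def classify_nb(doc, vocab, prior, word_counts, n_docs):
    """c_NB = argmax_c [ log P(c) + sum over word positions of log P(w_i|c) ]."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        total = sum(word_counts[c].values())
        score = math.log(prior[c] / n_docs)
        for w in doc.split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

model = train_nb(["great movie", "awful movie", "great fun"],
                 ["pos", "neg", "pos"])
print(classify_nb("great great movie", *model))  # prints "pos"
```

Note that the per-class score is a sum of per-word terms plus a constant, which is exactly the linear form claimed above.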
16. Support Vector Machine
SVM is one of the most powerful supervised Machine Learning algorithms
in classification problems.
17. SVM: Linearly separable case
Let's consider a binary classification task and assume that the training
set is linearly separable. We try to find a function f : Rⁿ → {−1, 1}
that, apart from correctly classifying the patterns in the training data,
also correctly classifies unseen patterns.
18. Infinitely many different hyperplanes
Usually, when the data are linearly separable, there are infinitely many
different hyperplanes that can perform separation. Hence, the question
here is how to find the separating hyperplane that not only separates the
training data, but generalizes as well as possible to new data.
19. Find the optimal hyperplane
20. Maximum margin
It is easy to prove that the distance between the two hyperplanes is 2/||w||.
The wider the margin, the better the generalization performance on new
samples. Therefore, our goal is to maximize the margin m = 2/||w||.
21. Hard margin case
Notice that maximizing the margin is equivalent to minimizing ||w||.
Thus, the maximum margin SVM can be found by solving the following
quadratic problem:

min_{w,b} (1/2)||w||²
subject to: yi(wᵀxi + b) ≥ 1, i = 1, ..., m

Once the optimization problem is solved, we get that the decision function
is a linear combination of dot products between the training points and
the test point:

f(x) = sign( Σ_i αi yi (xiᵀx) + b )
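The decision function can be illustrated directly. The support vectors, multipliers αi and bias below were solved by hand for a two-point toy problem (x1 = (0,0) with y1 = −1, x2 = (2,2) with y2 = +1), not produced by an optimizer:

```python
def svm_decision(x, support_vectors, alphas, labels, b):
    """f(x) = sign( sum_i alpha_i * y_i * <x_i, x> + b ): the decision is a
    linear combination of dot products between support vectors and x."""
    s = b
    for a, y, xi in zip(alphas, labels, support_vectors):
        s += a * y * sum(xi_k * x_k for xi_k, x_k in zip(xi, x))
    return 1 if s >= 0 else -1

# Hand-solved toy problem: both points are support vectors.
svs = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [0.25, 0.25]   # Lagrange multipliers (sum_i alpha_i * y_i = 0 holds)
b = -1.0                # chosen so that y_i (w . x_i + b) = 1 on both points

print(svm_decision((3.0, 3.0), svs, alphas, labels, b))  # prints 1
print(svm_decision((0.0, 1.0), svs, alphas, labels, b))  # prints -1
```

Only support vectors (points on the margin) get nonzero αi, so the sum usually runs over a small subset of the training data.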
22. Nonlinear SVM: Kernel Mapping
To deal with nonlinearly separable data, the same solution techniques as
for the linear case are still used. The idea is to transform the input data
into a space of higher dimensions in which the data can be linearly
separated.
23. Nonlinear SVM
To do this, the "kernel trick" will be used. A kernel is a function
that takes two vectors xi and xj as arguments and returns the value
of the inner product of their images φ(xi) and φ(xj):

K(xi, xj) = φ(xi)ᵀ φ(xj)

The trick is that we never actually need to know the mapping φ.
All we need is a way of computing the kernel values Kij. Therefore,
choosing the kernel is equivalent to choosing φ. Notice that since only
the inner product of the two vectors in the new space is returned, the
dimensionality of the new space does not matter.
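A small check of this identity, using the polynomial kernel K(x, y) = (x·y)², whose explicit map φ in two dimensions is known to be (x1², √2·x1x2, x2²); the sample points are invented:

```python
import math

def poly_kernel(x, y):
    """K(x, y) = (x . y)^2 computed directly, without ever building phi(x)."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product equals the kernel above."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 1.0)
lhs = poly_kernel(x, y)                           # kernel value: 25.0
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # same value via phi
```

Here the kernel costs one dot product in 2 dimensions, whereas the explicit map works in 3; for richer kernels (e.g. RBF) the implicit space is infinite-dimensional, so computing through φ is not even possible.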
24. Kernel types
25. Multiclass: One-versus-all and One-versus-one
26. Neural Network: What is an Artificial Neural Network?
Artificial neurons are the elementary units of an artificial neural network.
An artificial neuron is a digital construct that seeks to simulate the
behavior of a biological neuron in the brain. The artificial neuron receives
one or more inputs, computes their weighted sum, and applies an
activation function f to produce an output.
Mathematically, this is expressed as:

y = f( Σ_{i=1}^{N} wi xi + b )
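The formula maps to a few lines of Python; the input values, weights, and the choice of a sigmoid activation are assumptions made for this sketch:

```python
import math

def neuron(inputs, weights, bias):
    """y = f( sum_i w_i * x_i + b ); here f is a sigmoid (one common choice)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid squashes z into (0, 1)

y = neuron([1.0, 2.0], [0.5, -0.25], 0.0)   # z = 0.5 - 0.5 = 0, sigmoid(0) = 0.5
```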
27. Neural Network
Network architecture (number of layers, number of nodes,
parameters, etc)
Activation function
28. Neural Network
One of the most important processes in a neural network is training,
that is, learning the values of the parameters (weights and
biases) that minimize the cost function.
Theoretically, at the end of training, the NN should be able to infer
the right output even for inputs that weren't provided during training.
To do this, the network compares its initial outputs with the provided
correct answers. A cost function measures the degree to which the
outputs differ from the target values. Finally, the cost function's
results are pushed back through all neurons and connections
(backpropagation) to adjust the weights and biases.
29. Recurrent Neural Network (RNN)
Feedforward network vs RNN
Loops allow information to persist
Chain-like structure
Unrolled RNN
30. Long short-term memory
LSTM is an improved extension of the RNN.
In concept, an LSTM recurrent unit tries to “remember” all the past
knowledge that the network has seen so far and to “forget” irrelevant
data.
LSTM introduces multiple gates.
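One LSTM step can be sketched with NumPy to make the gates concrete. The dimensions and the random parameters are invented for the example; a real network would learn them:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: the forget gate f drops old memory, the input gate i
    admits new candidate memory, the output gate o decides what h exposes."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([h_prev, x])      # gates read [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)             # forget gate
    i = sigmoid(Wi @ z + bi)             # input gate
    o = sigmoid(Wo @ z + bo)             # output gate
    c_tilde = np.tanh(Wc @ z + bc)       # candidate cell state
    c = f * c_prev + i * c_tilde         # keep some old memory, add some new
    h = o * np.tanh(c)                   # hidden state passed onward
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4                         # toy input and hidden sizes
params = ([rng.normal(size=(d_h, d_h + d_in)) for _ in range(4)]
          + [np.zeros(d_h) for _ in range(4)])
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
```

The key point is the additive cell-state update c = f·c_prev + i·c̃, which is what lets gradients flow across many time steps better than in a plain RNN.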
31. Chapter 3: Practical Application in Python
Why Python?
The chosen data set is 20newsgroup, split into a training set and a test set
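The kind of scikit-learn pipeline used for such an experiment can be sketched as follows. A tiny invented two-category corpus stands in for 20newsgroup here (the real data set would be loaded with fetch_20newsgroups), and this is an illustration, not the thesis code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented stand-in for the 20newsgroup corpus (two categories).
train_docs = ["the team won the game", "a great goal and a win",
              "the gpu renders the image", "python code for the model"]
train_labels = ["sport", "sport", "comp", "comp"]

# Vectorize with TF-IDF, then classify with multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)
pred = model.predict(["the model code runs on the gpu"])
```

The same pipeline shape applies to the full data set: only the fitted vocabulary and counts change, while the TF-IDF representation and the classifier stay the same.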
32. Python code
33. Results of the test
34. Conclusions
Document classification can be automated successfully using ML
techniques
Metric results
Were these the expected results?
Complications when setting up a neural network
glove.6B.300d.txt, preprocessing, GPU, etc.
35. Thank you
María Roncal Salcedo