Document classification using Machine Learning
techniques
María Roncal Salcedo
University of Zaragoza, 9th July 2020
Chapter 1: What is Artificial Intelligence?
What is Natural Language Processing?
Automatic document classification
Classifying texts manually is time-consuming and expensive.
Machine Learning
Text classification
Definition
In mathematical terms, text categorization is the task of assigning a
Boolean value to each pair $(d_j, c_i) \in D \times C$, where $D = \{d_1, d_2, \ldots, d_n\}$
is a domain of documents and $C = \{c_1, c_2, \ldots, c_m\}$ is a set of predefined
categories. A value of 1 assigned to $(d_j, c_i)$ indicates that document $d_j$
is classified under category $c_i$, whereas a value of 0 indicates that $d_j$ is
not classified under $c_i$.
What is the main goal?
Subjective task
Art of vector representation
Dealing with text data is problematic because, whereas human beings
communicate with words and sentences, computers only understand
numbers.
Since raw text cannot be fed straight into the model, a mechanism
for representing text is required.
The main idea is to represent a word as a point in a
multidimensional space. Vectors that represent words are generally
called embeddings, because the word is embedded in a particular
vector space.
Common strategies: BOW, tf-idf and word2vec.
Bag of words (BOW)
Simple model
This model creates a vocabulary of all the unique words occurring in
all the documents in the training set.
The basic idea of BoW is to take a piece of text and count the
frequency of each word in that text.
Disadvantages:
- BoW ignores the order of the words in the text.
- The resulting vectors are sparse: they contain many 0s.
- Each word is treated individually, so contextual information is
lost (e.g. "tasty" and "delicious" are treated as unrelated words).
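The counting step described above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_vocabulary` and `bag_of_words`, the whitespace tokenization and the two example sentences are assumptions, not taken from the slides.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of the unique lowercase tokens in all documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Map a document to a vector of word counts, one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy corpus used only for illustration.
docs = ["the food was tasty", "the food was delicious and the service was good"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

In this toy example "tasty" and "delicious" end up in unrelated dimensions, and with a realistic vocabulary most entries of each vector would be 0.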
TF-IDF
If a word occurs many times in a text document but also in
many other documents of the data set, it is probably just a
frequent word rather than a significant or meaningful one. TF-IDF was
proposed to address this.
The importance of a word increases proportionally with the number of
times it appears in the document, but decreases with the number of
documents in the corpus that contain it. TF-IDF is one of the most popular
term-weighting schemes today.
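The weight $w_{i,j}$ of word $i$ in document $j$ is given by:
$$w_{i,j} = tf_{i,j} \cdot \log_{10}\left(\frac{N}{df_i}\right)$$
where $tf_{i,j}$ is the term frequency of word $i$ in document $j$, $df_i$ is the number of documents containing word $i$, and $N$ is the total number of documents.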
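As a minimal sketch of this weighting, the formula can be evaluated directly with the standard library. The function name `tf_idf`, the use of raw counts as term frequency and the three toy documents are assumptions made for illustration; library implementations often use a different logarithm base or add smoothing.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document using w = tf * log10(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df[word] = number of documents that contain the word at least once
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document (assumed tf variant)
        weights.append({w: tf[w] * math.log10(n_docs / df[w]) for w in tf})
    return weights

# Hypothetical toy corpus used only for illustration.
docs = ["the cat sat", "the dog barked", "the cat purred"]
w = tf_idf(docs)
```

A word that appears in every document ("the") gets weight 0, while a word unique to one document ("sat") gets the highest weight.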
Word2Vec
Word2Vec is one of the most popular techniques for learning word
embeddings using a neural network.
The intuition of word2vec
Word2Vec architectures
Word2vec embeddings can be obtained with two architectures (both based on neural
networks): Continuous Bag of Words (CBOW) and Skip-gram.
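The two architectures differ in what is predicted from what: CBOW predicts a word from its surrounding context, whereas Skip-gram predicts the context words from the center word. The sketch below only builds the training pairs each architecture would see (not the neural network itself); the helper names and the window size are illustrative assumptions.

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of context words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context word) pair per neighbour."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]

def cbow_pairs(tokens, window=2):
    """CBOW: one (context words, center) pair per position."""
    return [(context, center) for center, context in context_windows(tokens, window)]

# Hypothetical example sentence used only for illustration.
tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
```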
Chapter 2 - Different techniques for text classification
1) Multinomial Naïve Bayes
2) Support Vector Machine (SVM)
3) Recurrent Neural Network (LSTM)
Multinomial Naïve Bayes for Text Classification
Naive Bayes Classifier
Let the possible classes be the fixed set $C = \{c_1, c_2, \ldots, c_k\}$. Out of
all classes in $C$, the classifier returns the class $c_{MAP}$ with the
maximum a posteriori (MAP) probability given the document $d$:
$$c_{MAP} = \arg\max_{c \in C} P(c \mid d)$$
Therefore, by applying Bayes' rule:
$$c_{MAP} = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d \mid c)\,P(c)$$
Notice that $P(d)$ can be dropped because it is the same for every class. Without loss of generality, let us
represent the document as a set of features $x = \{x_1, x_2, \ldots, x_n\}$:
$$c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
Multinomial Naïve Bayes for Text Classification
Naive Bayes Classifier
Unfortunately, this is hard to compute directly because of its high
computational complexity. To reduce the number of parameters, two
simplifying assumptions are made:
1. Bag of words assumption: the position of the words in
the document does not matter. Word order is ignored, so a given
word has the same effect on classification whether it is the first
word, the 150th word or the last word in the document.
2. Naive Bayes assumption: the features of $x$ are independent of one
another given the class.
Hence,
$$P(x_1, x_2, \ldots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$$
Multinomial Naïve Bayes for Text Classification
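Both assumptions are, strictly speaking, incorrect: word order is clearly
important in semantic interpretation, and words are not independent of one
another. Nonetheless, these simplifying assumptions make the problem
tractable. Under them, the Naive Bayes class $c_{NB}$ can be expressed as:
$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)$$
Since all word positions need to be considered, by walking an index $i$
through every word position in the document, with $w_i$ the word at position $i$, one can compute:
$$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i \in \text{positions}} P(w_i \mid c)$$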
Multinomial Naïve Bayes for Text Classification
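Multiplying many probabilities can result in floating-point underflow. A
common workaround is to perform the computation in log space, which avoids
underflow and replaces products with cheaper sums. In log space the
multinomial NB classifier becomes a linear one. Therefore, we conclude that:
$$c_{NB} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i \in \text{positions}} \log P(w_i \mid c) \right]$$
where $w_i$ is the word at position $i$ of the document.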
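The log-space decision rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the presentation: the function names, the add-alpha (Laplace) smoothing, the choice to skip words unseen in training and the toy corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(w|c); add-alpha smoothing is an assumption."""
    vocab = {w for d in docs for w in d.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(doc, log_prior, log_lik, vocab):
    """Return argmax_c [log P(c) + sum_i log P(w_i|c)], skipping unknown words."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus used only for illustration.
docs = ["goal match team", "team win goal", "election vote party", "party vote law"]
labels = ["sport", "sport", "politics", "politics"]
model = train_nb(docs, labels)
```

Because each class score is a sum of per-word terms plus a constant, the classifier is linear in the word counts of the document.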
Support Vector Machine
SVM is one of the most powerful supervised Machine Learning algorithms
in classification problems.
SVM: Linearly separable case
Let us consider a binary classification task and assume that the training
set is linearly separable. We look for a function $f : \mathbb{R}^n \to \{-1, 1\}$
that not only classifies the patterns in the training data correctly but
also classifies unseen patterns correctly.
Infinitely many different hyperplanes
Usually, when the data are linearly separable, there are infinitely many
different hyperplanes that can perform separation. Hence, the question
here is how to find the separating hyperplane that not only separates the
training data, but generalizes as well as possible to new data.
Find the optimal hyperplane
Maximum margin
It is easy to prove that the distance between the two hyperplanes (the margin
boundaries $w^T x + b = \pm 1$) is $\frac{2}{\|w\|}$.
The wider the margin, the better the generalization performance on new
samples. Therefore, our goal is to maximize the margin: $m = \frac{2}{\|w\|}$.
Hard margin case
Notice that maximizing the margin is equivalent to minimizing $\|w\|$, or equivalently $\frac{1}{2}\|w\|^2$.
Thus, the maximum-margin SVM can be found by solving the following
quadratic programming problem:
$$\min_{w,b} \; \frac{1}{2}\|w\|^2$$
subject to: $y_i (w^T x_i + b) \geq 1, \quad i = 1, \ldots, m$
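To make the constraints and the margin concrete, the sketch below checks the hard-margin constraints for a candidate $(w, b)$ and computes the margin width $2 / \|w\|$. It does not solve the quadratic problem; the function names, the toy data and the two candidate hyperplanes are hypothetical.

```python
import math

def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_hard_margin(w, b, X, y):
    """Check the constraints y_i (w.x_i + b) >= 1 for every training point."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1
        for xi, yi in zip(X, y)
    )

# Hypothetical 2-D toy data: two points per class.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

wide = ([0.5, 0.5], -1.0)    # feasible, smaller ||w||, hence wider margin
narrow = ([1.0, 1.0], -2.0)  # feasible too, larger ||w||, hence narrower margin
```

Both candidates separate the data and satisfy every constraint, but the one with the smaller norm has the wider margin, which is why the optimization minimizes $\|w\|$.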
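Once the optimization problem is solved, the decision function
is a linear combination of dot products between the training points and
the test point $x$:
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{m} \alpha_i y_i \, x_i^T x + b\right)$$
where the $\alpha_i \geq 0$ are the Lagrange multipliers of the dual problem.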
Nonlinear SVM: Kernel Mapping
To deal with nonlinearly separable data, the same solution techniques as
for the linear case are still used. The idea is to transform the input data
into a space of higher dimensions in which the data can be linearly
separated.
Nonlinear SVM
To do this, the "kernel trick" is used. A kernel is a function
that takes two vectors $x_i$ and $x_j$ as arguments and returns the value
of the inner product of their images $\varphi(x_i)$ and $\varphi(x_j)$:
$$K(x_i, x_j) = \varphi(x_j)^T \varphi(x_i)$$
The trick is that we never need to know the mapping $\varphi$ explicitly.
All we need is a way of computing the kernel values $K_{ij} = K(x_i, x_j)$. Therefore, choosing
the kernel is equivalent to choosing $\varphi$. Notice that, since only the inner
product of the two vectors in the new space is returned, the
dimensionality of the new space does not affect the computation.
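As a small worked example of the kernel trick, consider the homogeneous polynomial kernel of degree 2 in two dimensions, $K(x, z) = (x^T z)^2$, whose explicit feature map is $\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The choice of this particular kernel and the sample vectors are assumptions made for illustration.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed in the original 2-D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)   # same value without ever computing phi
```

Both routes give the same number up to floating-point rounding, but the kernel never builds the higher-dimensional vectors.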
Kernel types
Multiclass: One-versus-all and One-versus-one
Neural Network: What is an Artificial Neural Network?
Artificial neurons are the elementary units of an artificial neural network. An
artificial neuron is a digital construct that seeks to simulate the behavior
of a biological neuron in the brain. The artificial neuron receives one or
more inputs, computes their weighted sum plus a bias, and applies an
activation function $f$ to produce an output.
Mathematically, this is expressed as:
$$y = f\left(\sum_{i=1}^{N} w_i x_i + b\right)$$
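The formula translates almost directly into code. In the sketch below the sigmoid activation and the example weights are assumptions chosen for illustration; the slides do not fix a particular $f$ here.

```python
import math

def sigmoid(z):
    """Logistic activation, one common choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for one artificial neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical inputs and parameters: the weighted sum is 0.5 - 0.5 + 0 = 0.
y = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```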
Neural Network
Network architecture (number of layers, number of nodes,
parameters, etc.)
Activation function
Neural Network
One of the most important processes in a neural network is
training, that is, learning the values of the parameters (weights and
biases) that minimize the cost function.
Ideally, at the end of training, the NN should be able to infer
the right output even for inputs that were not provided during training.
To do this, the network compares its outputs with the provided
correct answers. A cost function measures the degree to which the
outputs differ from the target values. Finally, the resulting errors
are propagated back across all neurons and connections to adjust the
biases and weights.
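The training loop can be illustrated on the smallest possible case: a single linear neuron fitted by gradient descent on a mean squared error cost. The function name, learning rate, number of epochs and toy targets are all assumptions; a real network would repeat the same forward, backward and update steps across many layers.

```python
def train_linear_neuron(data, lr=0.1, epochs=200):
    """Fit y = w*x + b by gradient descent on the mean squared error cost."""
    w, b = 0.0, 0.0
    history = []
    n = len(data)
    for _ in range(epochs):
        # Forward pass: compare current outputs with the target values.
        errors = [(w * x + b) - t for x, t in data]
        history.append(sum(e * e for e in errors) / n)
        # Backward pass: gradients of the cost with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, (x, _) in zip(errors, data))
        grad_b = 2.0 / n * sum(errors)
        # Update step: move the parameters against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

# Hypothetical targets generated by t = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, history = train_linear_neuron(data)
```

The recorded cost falls over the epochs as the weight and bias approach the values that generated the targets.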
Recurrent Neural Network (RNN)
Feedforward network vs RNN
Loops allow information to persist
Chain-like structure
Unrolled RNN
Long short-term memory
LSTM is an improved extension of the RNN.
In concept, an LSTM recurrent unit tries to “remember” all the past
knowledge that the network has seen so far and to “forget” irrelevant
data.
LSTM introduces multiple gates.
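In the standard LSTM formulation the gates are the forget, input and output gates. The sketch below shows one time step of a scalar LSTM cell under that assumption; the function name `lstm_step`, the parameter layout and the example values are hypothetical, and real implementations work with vectors and weight matrices.

```python
import math

def sigmoid(z):
    """Logistic function, used to squash gate values into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to be written
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (output)
    return h, c

# Hypothetical parameters: forget gate saturated open, input gate saturated shut,
# so the cell "remembers" its previous state and ignores the new input.
remember = {"forget": (0.0, 0.0, 100.0), "input": (0.0, 0.0, -100.0),
            "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=remember)
```

Flipping the two biases has the opposite effect: the old cell state is forgotten and replaced by the candidate value.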
Chapter 3: Practical Application in Python
Why Python?
The chosen data set is 20 Newsgroups, split into a training set and a test set
Python code
Results of the test
Conclusions
Document classification can be automated successfully using ML
techniques
Metric results
Were these the expected results?
Complications when setting up a neural network
glove.6B.300d.txt, preprocessing, GPU, etc.
Thank you
María Roncal Salcedo
Mar´ıa Roncal Salcedo Document classification using Machine Learning techniquesUniversity of Zaragoza, 9th July 2020 35

Weitere ähnliche Inhalte

Ähnlich wie 2020-TFG1 Natural Language Processing

Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docx
Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docxHorton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docx
Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docxwellesleyterresa
 
G04124041046
G04124041046G04124041046
G04124041046IOSR-JEN
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for LexicographyLeiden University
 
Co-Clustering For Cross-Domain Text Classification
Co-Clustering For Cross-Domain Text ClassificationCo-Clustering For Cross-Domain Text Classification
Co-Clustering For Cross-Domain Text Classificationpaperpublications3
 
[ENCORE webinar] Artificial Intelligence for mapping skills of the future
[ENCORE webinar] Artificial Intelligence for mapping skills of the future[ENCORE webinar] Artificial Intelligence for mapping skills of the future
[ENCORE webinar] Artificial Intelligence for mapping skills of the futureEADTU
 
Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...
Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...
Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...EADTU
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftSebastian Hellmann
 
Metrics for Evaluating Quality of Embeddings for Ontological Concepts
Metrics for Evaluating Quality of Embeddings for Ontological Concepts Metrics for Evaluating Quality of Embeddings for Ontological Concepts
Metrics for Evaluating Quality of Embeddings for Ontological Concepts Saeedeh Shekarpour
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
text classification_NB.ppt
text classification_NB.ppttext classification_NB.ppt
text classification_NB.pptRithikRaj25
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRitesh Sawant
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answeringAli Kabbadj
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 
Doc format.
Doc format.Doc format.
Doc format.butest
 
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...csandit
 

Ähnlich wie 2020-TFG1 Natural Language Processing (20)

Srikanth CV - BDM
Srikanth CV - BDMSrikanth CV - BDM
Srikanth CV - BDM
 
Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docx
Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docxHorton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docx
Horton+Pruim+Kaplan_MOSAIC-StudentGuide.pdf Nicholas J. .docx
 
G04124041046
G04124041046G04124041046
G04124041046
 
PointNet
PointNetPointNet
PointNet
 
Oop basic concepts
Oop basic conceptsOop basic concepts
Oop basic concepts
 
Text Mining for Lexicography
Text Mining for LexicographyText Mining for Lexicography
Text Mining for Lexicography
 
Co-Clustering For Cross-Domain Text Classification
Co-Clustering For Cross-Domain Text ClassificationCo-Clustering For Cross-Domain Text Classification
Co-Clustering For Cross-Domain Text Classification
 
[ENCORE webinar] Artificial Intelligence for mapping skills of the future
[ENCORE webinar] Artificial Intelligence for mapping skills of the future[ENCORE webinar] Artificial Intelligence for mapping skills of the future
[ENCORE webinar] Artificial Intelligence for mapping skills of the future
 
Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...
Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...
Artificial Intelligence and Human Expertise to Foresee Green, Digital and Ent...
 
NLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draftNLP2RDF Wortschatz and Linguistic LOD draft
NLP2RDF Wortschatz and Linguistic LOD draft
 
Metrics for Evaluating Quality of Embeddings for Ontological Concepts
Metrics for Evaluating Quality of Embeddings for Ontological Concepts Metrics for Evaluating Quality of Embeddings for Ontological Concepts
Metrics for Evaluating Quality of Embeddings for Ontological Concepts
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
Proyecto
ProyectoProyecto
Proyecto
 
text classification_NB.ppt
text classification_NB.ppttext classification_NB.ppt
text classification_NB.ppt
 
Recommendation system using collaborative deep learning
Recommendation system using collaborative deep learningRecommendation system using collaborative deep learning
Recommendation system using collaborative deep learning
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
 
4V - WP3 Progress Report (TIN2013-46238)
4V - WP3 Progress Report (TIN2013-46238)4V - WP3 Progress Report (TIN2013-46238)
4V - WP3 Progress Report (TIN2013-46238)
 
Doc format.
Doc format.Doc format.
Doc format.
 
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
IMPROVING SUPERVISED CLASSIFICATION OF DAILY ACTIVITIES LIVING USING NEW COST...
 

Mehr von Ricardo Lopez-Ruiz

2024-T20-Katherine_Johnson.ppsx
2024-T20-Katherine_Johnson.ppsx2024-T20-Katherine_Johnson.ppsx
2024-T20-Katherine_Johnson.ppsxRicardo Lopez-Ruiz
 
2024-T19-Redes_Neuronales_II.pdf
2024-T19-Redes_Neuronales_II.pdf2024-T19-Redes_Neuronales_II.pdf
2024-T19-Redes_Neuronales_II.pdfRicardo Lopez-Ruiz
 
2024-T18-Disfunciones_Cerebrales.ppsx
2024-T18-Disfunciones_Cerebrales.ppsx2024-T18-Disfunciones_Cerebrales.ppsx
2024-T18-Disfunciones_Cerebrales.ppsxRicardo Lopez-Ruiz
 
2024-T17-Num_Perfect_Defect_Abund.ppsx
2024-T17-Num_Perfect_Defect_Abund.ppsx2024-T17-Num_Perfect_Defect_Abund.ppsx
2024-T17-Num_Perfect_Defect_Abund.ppsxRicardo Lopez-Ruiz
 
2024-T15-Tipos_Numeros_Primos.ppsx
2024-T15-Tipos_Numeros_Primos.ppsx2024-T15-Tipos_Numeros_Primos.ppsx
2024-T15-Tipos_Numeros_Primos.ppsxRicardo Lopez-Ruiz
 
2024-T13-NarcisoMonturiol_IsaacPeral.ppsx
2024-T13-NarcisoMonturiol_IsaacPeral.ppsx2024-T13-NarcisoMonturiol_IsaacPeral.ppsx
2024-T13-NarcisoMonturiol_IsaacPeral.ppsxRicardo Lopez-Ruiz
 
2024-T12-Distribución_Num_Primos.ppsx
2024-T12-Distribución_Num_Primos.ppsx2024-T12-Distribución_Num_Primos.ppsx
2024-T12-Distribución_Num_Primos.ppsxRicardo Lopez-Ruiz
 
2024-T10-El_Número_de_Oro.ppsx
2024-T10-El_Número_de_Oro.ppsx2024-T10-El_Número_de_Oro.ppsx
2024-T10-El_Número_de_Oro.ppsxRicardo Lopez-Ruiz
 
2024-T9-Carl_Friedrich_Gauss.ppsx
2024-T9-Carl_Friedrich_Gauss.ppsx2024-T9-Carl_Friedrich_Gauss.ppsx
2024-T9-Carl_Friedrich_Gauss.ppsxRicardo Lopez-Ruiz
 
2024-T8-Redes_Neuronales_I.ppsx
2024-T8-Redes_Neuronales_I.ppsx2024-T8-Redes_Neuronales_I.ppsx
2024-T8-Redes_Neuronales_I.ppsxRicardo Lopez-Ruiz
 
2024-T6-Paradoja_de_Russell.ppsx
2024-T6-Paradoja_de_Russell.ppsx2024-T6-Paradoja_de_Russell.ppsx
2024-T6-Paradoja_de_Russell.ppsxRicardo Lopez-Ruiz
 
2024-T5-Telescopio_James_Webb.ppsx
2024-T5-Telescopio_James_Webb.ppsx2024-T5-Telescopio_James_Webb.ppsx
2024-T5-Telescopio_James_Webb.ppsxRicardo Lopez-Ruiz
 
2024-T4-Abaco-y-OtrasCalculadoras.ppsx
2024-T4-Abaco-y-OtrasCalculadoras.ppsx2024-T4-Abaco-y-OtrasCalculadoras.ppsx
2024-T4-Abaco-y-OtrasCalculadoras.ppsxRicardo Lopez-Ruiz
 
2024-T2-ProgramaVoyager-Pioneer.ppsx
2024-T2-ProgramaVoyager-Pioneer.ppsx2024-T2-ProgramaVoyager-Pioneer.ppsx
2024-T2-ProgramaVoyager-Pioneer.ppsxRicardo Lopez-Ruiz
 

Mehr von Ricardo Lopez-Ruiz (20)

2024-T20-Katherine_Johnson.ppsx
2024-T20-Katherine_Johnson.ppsx2024-T20-Katherine_Johnson.ppsx
2024-T20-Katherine_Johnson.ppsx
 
2024-T19-Redes_Neuronales_II.pdf
2024-T19-Redes_Neuronales_II.pdf2024-T19-Redes_Neuronales_II.pdf
2024-T19-Redes_Neuronales_II.pdf
 
2024-T18-Disfunciones_Cerebrales.ppsx
2024-T18-Disfunciones_Cerebrales.ppsx2024-T18-Disfunciones_Cerebrales.ppsx
2024-T18-Disfunciones_Cerebrales.ppsx
 
2024-T17-Num_Perfect_Defect_Abund.ppsx
2024-T17-Num_Perfect_Defect_Abund.ppsx2024-T17-Num_Perfect_Defect_Abund.ppsx
2024-T17-Num_Perfect_Defect_Abund.ppsx
 
2024-T16-JuegoDeLaVida.ppsx
2024-T16-JuegoDeLaVida.ppsx2024-T16-JuegoDeLaVida.ppsx
2024-T16-JuegoDeLaVida.ppsx
 
2024-T15-Tipos_Numeros_Primos.ppsx
2024-T15-Tipos_Numeros_Primos.ppsx2024-T15-Tipos_Numeros_Primos.ppsx
2024-T15-Tipos_Numeros_Primos.ppsx
 
2024-T14-Primos_Gemelos.ppsx
2024-T14-Primos_Gemelos.ppsx2024-T14-Primos_Gemelos.ppsx
2024-T14-Primos_Gemelos.ppsx
 
2024-T13-NarcisoMonturiol_IsaacPeral.ppsx
2024-T13-NarcisoMonturiol_IsaacPeral.ppsx2024-T13-NarcisoMonturiol_IsaacPeral.ppsx
2024-T13-NarcisoMonturiol_IsaacPeral.ppsx
 
2024-T12-Distribución_Num_Primos.ppsx
2024-T12-Distribución_Num_Primos.ppsx2024-T12-Distribución_Num_Primos.ppsx
2024-T12-Distribución_Num_Primos.ppsx
 
2024-T11-Sam_Altman.pdf
2024-T11-Sam_Altman.pdf2024-T11-Sam_Altman.pdf
2024-T11-Sam_Altman.pdf
 
2024-T10-El_Número_de_Oro.ppsx
2024-T10-El_Número_de_Oro.ppsx2024-T10-El_Número_de_Oro.ppsx
2024-T10-El_Número_de_Oro.ppsx
 
2024-T9-Carl_Friedrich_Gauss.ppsx
2024-T9-Carl_Friedrich_Gauss.ppsx2024-T9-Carl_Friedrich_Gauss.ppsx
2024-T9-Carl_Friedrich_Gauss.ppsx
 
2024-T8-Redes_Neuronales_I.ppsx
2024-T8-Redes_Neuronales_I.ppsx2024-T8-Redes_Neuronales_I.ppsx
2024-T8-Redes_Neuronales_I.ppsx
 
2024-T7-GeoGebra.pdf
2024-T7-GeoGebra.pdf2024-T7-GeoGebra.pdf
2024-T7-GeoGebra.pdf
 
2024-T6-Paradoja_de_Russell.ppsx
2024-T6-Paradoja_de_Russell.ppsx2024-T6-Paradoja_de_Russell.ppsx
2024-T6-Paradoja_de_Russell.ppsx
 
2024-T5-Telescopio_James_Webb.ppsx
2024-T5-Telescopio_James_Webb.ppsx2024-T5-Telescopio_James_Webb.ppsx
2024-T5-Telescopio_James_Webb.ppsx
 
2024-T4-Abaco-y-OtrasCalculadoras.ppsx
2024-T4-Abaco-y-OtrasCalculadoras.ppsx2024-T4-Abaco-y-OtrasCalculadoras.ppsx
2024-T4-Abaco-y-OtrasCalculadoras.ppsx
 
2024-T3-Redes.ppsx
2024-T3-Redes.ppsx2024-T3-Redes.ppsx
2024-T3-Redes.ppsx
 
2024-T2-ProgramaVoyager-Pioneer.ppsx
2024-T2-ProgramaVoyager-Pioneer.ppsx2024-T2-ProgramaVoyager-Pioneer.ppsx
2024-T2-ProgramaVoyager-Pioneer.ppsx
 
2024-T1-ChatGPT.ppsx
2024-T1-ChatGPT.ppsx2024-T1-ChatGPT.ppsx
2024-T1-ChatGPT.ppsx
 

Kürzlich hochgeladen

VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapRishantSharmaFr
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 

Kürzlich hochgeladen (20)

VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking

2020-TFG1 Natural Language Processing

  • 1. Document classification using Machine Learning techniques. María Roncal Salcedo, University of Zaragoza, 9th July 2020
  • 2. Chapter 1: What is Artificial Intelligence?
  • 3. What is Natural Language Processing?
  • 4. Automatic document classification. Solving text classification manually is time-consuming and expensive. Machine Learning.
  • 5. Text classification. Definition: In mathematical terms, text categorization is the task of assigning a boolean value to each pair $(d_j, c_i) \in D \times C$, where $D = \{d_1, d_2, \ldots, d_n\}$ is a domain of documents and $C = \{c_1, c_2, \ldots, c_m\}$ is a set of predefined categories. A value of 1 assigned to $(d_j, c_i)$ indicates that document $d_j$ is classified under category $c_i$, whereas a value of 0 indicates that $d_j$ is not classified under $c_i$. What is the main goal? Subjective task.
  • 6. Art of vector representation. Dealing with text data is problematic because, whereas human beings communicate with words and sentences, computers only understand numbers. Since raw text cannot be fed straight into a model, a mechanism for representing text is required. The main idea is to represent a word as a point in some multidimensional space. Vectors representing words are generally called embeddings, because the word is embedded in a particular vector space. Common strategies: BOW, tf-idf and word2vec.
  • 7. Bag of words (BOW). Simple model. This model creates a vocabulary of all the unique words occurring in all the documents in the training set. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. Disadvantages:
      - BoW ignores the order of words in the text.
      - The vectors are sparse: they contain many 0's.
      - Each word is treated individually, so contextual information is lost (for example, "tasty" and "delicious" are treated as unrelated words).
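The bag-of-words construction described in slide 7 can be sketched in a few lines of plain Python. This is only an illustration: the whitespace tokenizer, the function names `build_vocabulary` and `bow_vector`, and the three example sentences are assumptions, not taken from the presentation.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of the unique words across all training documents."""
    vocab = set()
    for doc in documents:
        vocab.update(doc.lower().split())  # naive whitespace tokenizer (assumption)
    return sorted(vocab)


def bow_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical toy training set.
train_docs = [
    "the food was tasty",
    "the food was delicious",
    "the service was slow",
]
vocabulary = build_vocabulary(train_docs)
vectors = [bow_vector(doc, vocabulary) for doc in train_docs]

# Shuffling the words of a document leaves its vector unchanged.
shuffled = bow_vector("tasty was food the", vocabulary)
```

The first two sentences differ only in "tasty" versus "delicious", yet those words occupy unrelated positions in the vector, which is the loss of contextual information the slide mentions.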
  • 8. TF-IDF. If a word occurs numerous times in a text document but also in many other documents in our data set, this may simply be because it is a frequent word, not because it is significant or meaningful. TF-IDF was proposed to address this. The importance of a word increases proportionally with the number of times it appears in the document, but decreases with the number of documents in the corpus in which it appears. Tf-idf is one of the most popular term-weighting schemes today. The weight $w_{i,j}$ of word $i$ in document $j$ is given by: $w_{i,j} = tf_{i,j} \cdot \log_{10}\left(\frac{N}{df_i}\right)$, where $tf_{i,j}$ is the number of occurrences of word $i$ in document $j$, $df_i$ is the number of documents containing word $i$, and $N$ is the total number of documents.
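As a minimal sketch of the weighting formula in slide 8, the snippet below computes $tf_{i,j} \cdot \log_{10}(N / df_i)$ directly. The whitespace tokenizer, the function name `tfidf_weights` and the toy corpus are assumptions made for illustration; library implementations typically add smoothing and normalization that are not shown here.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {word: weight} dict per document, with weight = tf * log10(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # df[w] = number of documents that contain word w at least once.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append({w: tf[w] * math.log10(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical toy corpus.
docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tfidf_weights(docs)
```

In this corpus "the" appears in every document, so its weight is zero, while "sat" appears in only one document and receives the highest weight in the first document.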
  • 9. Word2Vec. Word2Vec is one of the most popular techniques for learning word embeddings using a neural network. The intuition of word2vec.
  • 10. Word2Vec architectures. Word2vec embeddings can be obtained using two architectures (both involving neural networks): Continuous Bag Of Words (CBOW) and Skip-Gram.
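The difference between the two architectures is easiest to see in the training examples each one consumes: Skip-Gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. The sketch below only generates those examples (it does not train a network); the function names and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: Skip-Gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context_words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
```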
  • 11. Chapter 2: Different techniques for text classification. 1) Multinomial Naïve Bayes 2) Support Vector Machine (SVM) 3) Recurrent Neural Network (LSTM)
  • 12. Multinomial Naïve Bayes for Text Classification. Naive Bayes Classifier. Let the possible classes be the fixed set $C = \{c_1, c_2, \ldots, c_k\}$. Out of all classes in $C$, the classifier returns the class $c_{MAP}$ that has the maximum a posteriori (MAP) probability given the document $d$: $c_{MAP} = \arg\max_{c \in C} P(c \mid d)$. Therefore, by applying Bayes' rule: $c_{MAP} = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d \mid c)\,P(c)$. Notice that $P(d)$ can be dropped because it is the same for every class. With no loss of generality, let us represent the document as a set of features $x = \{x_1, x_2, \ldots, x_n\}$: $c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$
  • 13. Multinomial Naïve Bayes for Text Classification. Naive Bayes Classifier. Unfortunately, this is hard to compute directly because of its high computational complexity. To reduce the number of parameters, two simplifying assumptions are made. 1. Bag of words assumption: the position of the words in the document does not matter. Word order is ignored, so a given word has the same effect on classification whether it is the first word, the 150th word or the last word in the document. 2. Naive Bayes assumption: the features of $x$ are conditionally independent of one another given the class. Hence, $P(x_1, x_2, \ldots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$
  • 14. Multinomial Naïve Bayes for Text Classification. Both of these assumptions are incorrect: word order clearly matters for semantic interpretation. Nonetheless, making these simplifying assumptions makes the problem tractable. Under the previous assumptions, the Naive Bayes class $c_{NB}$ can be expressed as: $c_{NB} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)$. Since all word positions need to be considered, by walking an index through every word position in the document one can compute: $c_{NB} = \arg\max_{c \in C} P(c) \prod_{i \in \text{positions}} P(w_i \mid c)$, where $w_i$ is the word at position $i$.
  • 15. Multinomial Naïve Bayes for Text Classification. Multiplying many probabilities can result in floating-point underflow. A common workaround is to work in log space, which avoids underflow and is faster because products become sums. In log space the multinomial NB classifier is a linear classifier. Therefore, we conclude that: $c_{NB} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i \in \text{positions}} \log P(w_i \mid c) \right]$
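The log-space decision rule from slides 12 to 15 can be sketched as a small classifier. Several details here are assumptions rather than anything stated in the slides: add-one (Laplace) smoothing of the word likelihoods, the whitespace tokenizer, the rule that words unseen in training are skipped at prediction time, and the toy training data.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Estimate log P(c) and log P(w|c); add-one smoothing is an assumption."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_nb(text, log_prior, log_likelihood, vocab):
    """Return argmax_c [log P(c) + sum over word positions of log P(w_i|c)]."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are skipped
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set.
training = [
    ("goal match team win", "sport"),
    ("team player score goal", "sport"),
    ("election vote party", "politics"),
    ("party government vote law", "politics"),
]
model = train_nb(training)
short_pred = predict_nb("team goal", *model)
long_pred = predict_nb("goal " * 5000, *model)  # 5000 word positions
underflowed = (3 / 19) ** 5000  # the same likelihood as a plain product
```

With 5000 word positions the plain product of probabilities underflows to 0.0 for every class, so the classes could not be compared, whereas the summed log scores still separate them.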
  • 16. Support Vector Machine. SVM is one of the most powerful supervised Machine Learning algorithms for classification problems.
  • 17. SVM: Linearly separable case. Let us consider a binary classification task and assume that the training set is linearly separable. We try to find a function $f : \mathbb{R}^n \to \{-1, 1\}$ that not only correctly classifies the patterns in the training data but also correctly classifies unseen patterns.
  • 18. Infinitely many different hyperplanes. Usually, when the data are linearly separable, there are infinitely many different hyperplanes that can perform the separation. Hence, the question is how to find the separating hyperplane that not only separates the training data but also generalizes as well as possible to new data.
  • 19. Find the optimal hyperplane
  • 20. Maximum margin. It is easy to prove that the distance between the two margin hyperplanes $w^T x + b = \pm 1$ is $\frac{2}{\|w\|}$. The wider the margin, the better the generalization performance on new samples. Therefore, our goal is to maximize the margin: $m = \frac{2}{\|w\|}$.
  • 21. Hard margin case. Notice that maximizing the margin is equivalent to minimizing $\|w\|$. Thus, the maximum margin SVM can be found by solving the following quadratic programming problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + b) \ge 1$, $i = 1, \ldots, m$. Once the optimization problem is solved, the decision function is a linear combination of dot products between the training points and the test point.
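For reference, the decision function mentioned at the end of slide 21 is usually written in the following dual form. The symbols $\alpha_i$ (the Lagrange multipliers of the margin constraints) are not named in the slides and are introduced here as an assumption; they are non-zero only for the support vectors.

```latex
f(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i \, y_i \, x_i^{T} x + b \right),
\qquad \alpha_i \ge 0, \quad i = 1, \ldots, m
```

Because the training points enter only through the dot products $x_i^{T} x$, this form is what makes it possible to replace the dot product with a kernel in the nonlinear case.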
  • 22. Nonlinear SVM: Kernel Mapping. To deal with data that are not linearly separable, the same solution techniques as for the linear case are still used. The idea is to transform the input data into a higher-dimensional space in which the data can be linearly separated.
  • 23. Nonlinear SVM. To do this, the "kernel trick" is used. A kernel is a function that takes two vectors $x_i$ and $x_j$ as arguments and returns the value of the inner product of their images $\phi(x_i)$ and $\phi(x_j)$: $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. The trick is that we never really need to know the mapping $\phi$ at all. All we need is a way of computing the kernel values $K_{ij} = K(x_i, x_j)$. Therefore, choosing the kernel is equivalent to choosing $\phi$. Notice that, since only the inner product of the two vectors in the new space is returned, the dimensionality of the new space is not important.
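A small numeric check can make the kernel trick concrete. The example below uses the degree-2 homogeneous polynomial kernel $K(x, z) = (x^T z)^2$ on 2-D inputs, for which one explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The choice of kernel and the sample points are assumptions for illustration; the slides do not specify which kernels were used.

```python
def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit feature map into 3-D for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = x
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed in the input space without forming phi."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product in the mapped space
via_kernel = poly_kernel(x, z)   # same value, no mapping needed
```

Both routes give the same number (up to floating-point rounding), but the kernel route never constructs the higher-dimensional vectors, which is why the dimensionality of the new space does not matter.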
  • 24. Kernel types
  • 25. Multiclass: One-versus-all and One-versus-one
  • 26. Neural Network: What is an Artificial Neural Network? Artificial neurons are the elementary units of an artificial neural network. An artificial neuron is a digital construct that seeks to simulate the behavior of a biological neuron in the brain. The artificial neuron receives one or more inputs, computes their weighted sum plus a bias, and passes the result through an activation function $f$ to produce an output. Mathematically, this is expressed as: $y = f\left(\sum_{i=1}^{N} w_i x_i + b\right)$
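The neuron equation in slide 26 translates almost line by line into code. In this sketch the sigmoid activation and the example weights are assumptions; the slides do not say which activation function was used.

```python
import math


def sigmoid(z):
    """Logistic activation, one common choice for f (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for a single artificial neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical inputs and weights: 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
y = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
```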
  • 27. Neural Network. Network architecture (number of layers, number of nodes, parameters, etc.). Activation function.
  • 28. Neural Network. One of the most important processes in a neural network is training, that is, learning the values of the parameters (weights and biases) that minimize the cost function. Theoretically, at the end of training, the NN should be able to infer the right output even for inputs that were not provided during training. To do this, the network compares its initial outputs with the provided correct answers. A cost function measures the degree to which the outputs differ from the target values. Finally, the cost is propagated back through all neurons and connections (backpropagation) to adjust the biases and weights.
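The training loop described in slide 28 can be illustrated on the smallest possible network, a single sigmoid neuron, where pushing the cost back reduces to one application of the chain rule. The squared-error cost, the learning rate, the number of epochs and the OR-gate data are all assumptions chosen to keep the sketch short; they are not the setup used in the presentation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights, bias):
    """Output of a single sigmoid neuron."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def cost(samples, weights, bias):
    """Mean squared-error cost 0.5 * (y - target)^2 over the samples."""
    return sum(0.5 * (forward(x, weights, bias) - t) ** 2 for x, t in samples) / len(samples)


def train_neuron(samples, epochs=5000, lr=0.5):
    """Stochastic gradient descent on the weights and bias of one neuron."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = forward(x, weights, bias)
            # Chain rule: d(cost)/d(pre-activation) = (y - target) * y * (1 - y)
            delta = (y - target) * y * (1.0 - y)
            weights = [w - lr * delta * xi for w, xi in zip(weights, x)]
            bias -= lr * delta
    return weights, bias


# Hypothetical training data: the logical OR function.
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
initial_cost = cost(or_gate, [0.0, 0.0], 0.0)
weights, bias = train_neuron(or_gate)
final_cost = cost(or_gate, weights, bias)
predictions = [round(forward(x, weights, bias)) for x, _ in or_gate]
```

In a multi-layer network the same idea applies layer by layer, with each layer's error term computed from the layer after it.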
  • 29. Recurrent Neural Network (RNN). Feedforward network vs RNN. Loops allow information to persist. Chain-like structure. Unrolled RNN.
  • 30. Long short-term memory. LSTM is an improved extension of the RNN. Conceptually, an LSTM recurrent unit tries to "remember" the relevant past knowledge that the network has seen so far and to "forget" irrelevant data. LSTM introduces multiple gates.
  • 31. Chapter 3: Practical Application in Python. Why Python? The chosen data set is 20 Newsgroups: training set and test set.
  • 32. Python code
  • 33. Results of the test
  • 34. Conclusions. Document classification can be automated successfully using ML techniques. Metric results. Were these the expected results? Complications when setting up a neural network: glove.6B.300d.txt, preprocessing, GPU, etc.
  • 35. Thank you. María Roncal Salcedo