Word vector representations and applications
Mostapha Benhenda
AI & Big Data conference, Odessa
mostaphabenhenda@gmail.com
May 23, 2015
Overview
1 Introduction
2 Applications
Machine translation
Sentiment analysis of movie reviews
Vector averaging (Kaggle tutorial)
Convolutional Neural Networks
Other applications of word embeddings
3 Example of word embedding algorithm: GloVe
Build the co-occurrence matrix X
Matrix factorization
4 Presentation of the Kyiv deep learning meetup
5 References
Introduction
We want to compute a ”map of words”, i.e. a representation:
R : Words = {w_1, ..., w_N} → Vectors = {R(w_1), ..., R(w_N)} ⊂ R^d
such that:
w_i ≈ w_j (meaning of words)
is equivalent to:
R(w_i) ≈ R(w_j) (distance of vectors)
These vectors are interesting because:
the computer can ”grasp” the meaning of words, just by looking at
the distance between vectors.
We can feed prediction algorithms (linear regression,...) with these
vectors, and hopefully get good accuracy, because the representation
is ”faithful” to the meaning of words.
For example, if we have the list of words:
cat, dog, mouse, house
We expect the most distant word to be: ?
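A minimal sketch of this check with Gensim, assuming a pre-trained word2vec model is available (the file name below is illustrative; older Gensim versions load models through Word2Vec.load_word2vec_format instead):

```python
# Sketch: find the odd one out among cat, dog, mouse, house.
from gensim.models import KeyedVectors

# Assumed pre-trained vectors, e.g. the Google News word2vec model.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                            binary=True)
print(vectors.doesnt_match(["cat", "dog", "mouse", "house"]))  # expected: house
```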
(this is a 2-dimensional visualization of a 300-dimensional space, using
t-SNE, a visualization technique that preserves clusters of points)
Even better, we can sometimes make additions and subtractions of
vectors, and draw parallelisms. For example,
king + woman - man ≈ ?
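Continuing the sketch above with the same `vectors` object, this analogy is one nearest-neighbour query:

```python
# Sketch: king + woman - man, answered by nearest-neighbour search in vector space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected: [('queen', ...)]
```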
(2-dimensional visualization of a 300-dimensional space, using PCA, a
visualization technique that preserves parallelisms)
Idea behind all the algorithms (Word2vec, GloVe, ...): ”You shall know a word by the company it keeps” (John R. Firth, 1957)
The more often 2 words appear near each other in a text (the training data), the closer their vectors will be.
We hope that 2 words have close meanings if, statistically, they are often near each other, i.e.:
statistics on texts → meaning of words
So we need quite big datasets:
Google News English: 100 Billion words
Wikipedia French or Spanish: 100 Million words
MT 11 French: 200 Million words
MT 11 Spanish: 84 Million words
Application to machine translation
Idea: we compute maps of all English words and of all French words, we ”superpose” the 2 maps, and we should get an English/French translator.
[Mikolov et al. 2013] (from Google) made an English → Spanish translator (among other language pairs).
I tried to reproduce their results for French → English, and French → Spanish.
Using their algorithm Word2vec, they trained their vectors on the MT 11 dataset, and on Google News.
I did the same on MT 11 and on the French and Spanish Wikipedias (trained with the Gensim package in Python), and I used their Google News-trained vectors for English.
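A minimal Gensim training sketch (the corpus file name and the parameters are illustrative assumptions; older Gensim versions name the dimension parameter `size` instead of `vector_size`):

```python
# Sketch: train word2vec vectors on a tokenized corpus with Gensim.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("frwiki_tokenized.txt")  # assumed: one tokenized sentence per line
model = Word2Vec(sentences,
                 vector_size=300,  # dimension d of the word vectors
                 window=10,        # context window size
                 min_count=5,      # ignore very rare words
                 workers=4)
model.wv.save("frwiki_vectors.kv")
```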
Then they took as training set the 5000 most frequent English words and their Google translations into Spanish.
They train a linear transformation W that approximates the English → Spanish translation, i.e. they take the W that minimizes:
Σ_{i=1}^{5000} ‖W(v_i) − G(v_i)‖²
where G(v_i) is (the vector of) the Google translation of v_i.
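In practice W can be fitted by ordinary least squares; a minimal NumPy sketch, assuming `src` and `tgt` are arrays holding the 5000 source-word vectors and the vectors of their translations:

```python
# Sketch: fit W minimizing sum_i ||W(v_i) - G(v_i)||^2 by least squares.
import numpy as np

# src: shape (5000, d_source), tgt: shape (5000, d_target), assumed prepared beforehand.
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)   # solves src @ W ≈ tgt

def translate(v):
    return v @ W    # maps a source-language vector into the target space
```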
They test the accuracy of W on the next 1000 most common words of the list (1-accuracy). They also test the accuracy up to 5 attempts, i.e. they test whether G(v_i) belongs to the 5 nearest neighbors of W(v_i) (5-accuracy).
I did the same for French → English, and French → Spanish.
This is a compute-intensive task; I recommend using Amazon Web Services.
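A sketch of this evaluation, assuming `test_src` holds the vectors of the next 1000 source words, `test_gold` their reference translations, and `tgt_vectors` a Gensim KeyedVectors object for the target language:

```python
# Sketch: k-accuracy = fraction of test words whose reference translation
# appears among the k nearest neighbours of the mapped vector.
def k_accuracy(test_src, test_gold, tgt_vectors, W, k=5):
    hits = 0
    for v, gold in zip(test_src, test_gold):
        neighbours = [w for w, _ in tgt_vectors.similar_by_vector(v @ W, topn=k)]
        hits += gold in neighbours
    return hits / len(test_gold)
```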
Results
Mikolov, English → Spanish:
  Training data | 1-accuracy | 5-accuracy
  Google News   | 50%        | 75%
  MT 11         | 33%        | 51%
Me, French → Spanish:
  Training data | 1-accuracy | 5-accuracy
  Wikipedia     | n/a        | <10%
  MT 11         | 25%        | 37%
French (Wikipedia) → English (Google News), 10-accuracy: < 10%.
My conclusion: Wikipedia = a nightmare!
Sentiment analysis of movie reviews
Goal: we want to determine if a movie review is positive or negative.
Toy problem in machine learning: reviews are ambiguous, emotional,
full of sarcasm,...
Long-term goal: computer can understand emotions.
Commercial applications of sentiment analysis:
marketing: customer satisfaction, brand reputation...
finance: predict market trends,...
Example of review: ”This movie contains everything you’d expect,
but nothing more”.
Vector averaging (Kaggle tutorial)
We work on the IMDB dataset. To predict the sentiment (+ or -) of a
review:
We average the vectors of all the words of the review.
We use these average vectors as input to a supervised learning algorithm (e.g. SVM, random forests, ...).
We get 83% accuracy.
Limitation of the method: the order of words is lost, because addition
is a commutative operation.
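A minimal sketch of this pipeline (not the exact Kaggle tutorial code), assuming `reviews` is a list of tokenized reviews, `labels` their 0/1 sentiments, and `vectors` a Gensim KeyedVectors model:

```python
# Sketch: represent each review by the average of its word vectors, then classify.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def review_vector(tokens, vectors, dim=300):
    words = [w for w in tokens if w in vectors]           # keep only words with a vector
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)   # word order is lost here

X = np.array([review_vector(r, vectors) for r in reviews])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```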
Convolutional Neural Networks
CNNs are biologically inspired neural network models, initially introduced for image processing.
They preserve spatial structure: the input is an n × m matrix (the pixels of the image).
But here, the input is a sentence (with its ”spatial structure” preserved):
Figure: source [Collobert et al. 2011]
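For illustration, a minimal 1D-convolution sentence classifier in Keras; this is a sketch under assumed hyperparameters, not the architecture of [Collobert et al. 2011]:

```python
# Sketch: sentence classification with a 1D CNN over word embeddings.
from tensorflow.keras import layers, models

vocab_size, dim = 50000, 300                   # assumed vocabulary size and embedding dimension
model = models.Sequential([
    layers.Embedding(vocab_size, dim),         # each sentence becomes a (length x dim) matrix
    layers.Conv1D(128, 5, activation="relu"),  # convolution along the word positions
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation="sigmoid"),     # positive / negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```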
Practical problem: the Kaggle (IMDB) dataset can have 2000 words/review, far more than the ~50 words/review in [Collobert et al. 2011]
→ training is too slow!
There are tricks to speed up training (truncating reviews, ...), but they did not work here.
Other applications of word embeddings
Innovative search engine (ThisPlusThat), where we can subtract
queries, for example:
pizza + Japan − Italy → sushi
Recommendation systems for online shops
Example of word embedding algorithm: GloVe
GloVe: the ”Global Vectors” algorithm, developed at Stanford by [Pennington, Socher, Manning, 2014]. There are 2 steps:
1 Build the co-occurrence matrix X from the training text
2 Factorize the matrix X to get vectors
1. Build the co-occurrence matrix X:
For the first step, we apply the principle ”you shall know a word by the
company it keeps”:
The context window C(w) of size 2 of the word w = Ukraine (for example) is the 2 words on each side of it:
The national [flag of] Ukraine [is yellow] and blue.
(in practice, we take a context window size of around 10)
The number of times 2 words i and j lie in the same context window is denoted by X_{i,j}.
The symmetric matrix X = (X_{i,j})_{1≤i,j≤N} is the co-occurrence matrix.
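A minimal sketch of this first step, assuming `sentences` is an iterable of tokenized sentences (GloVe itself also weights each co-occurrence by the distance between the two words, which this sketch omits):

```python
# Sketch: count co-occurrences X[i, j] within a symmetric context window.
from collections import defaultdict

def cooccurrence_counts(sentences, window=10):
    X = defaultdict(float)
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            start = max(0, pos - window)
            for context in tokens[start:pos]:   # words to the left of `word`, within the window
                X[(word, context)] += 1
                X[(context, word)] += 1         # keep the matrix symmetric
    return X
```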
2. Factorize the co-occurrence matrix X:
To extract vectors from a symmetric matrix, we can write:
X_{i,j} = <v_i, v_j>, with v_i ∈ R^d,
where d is an integer fixed by us a priori (a hyperparameter), i.e. we can write:
X = Gram(v_1, ..., v_N)
This formula does not give good empirical performance, but:
for any scalar function f, (f(X_{i,j}))_{1≤i,j≤N} is still a symmetric matrix.
Let's find an f that works!
Let i, j be 2 words, for example i = fruit, j = house. A third word k = apple has a meaning closer to fruit than to house, so X_{i,k}/X_{j,k} is large.
If k = room (closer to ”house” than to ”fruit”), then X_{i,k}/X_{j,k} is small.
If k = sky (far from both ”fruit” and ”house”), then X_{i,k}/X_{j,k} ≈ 1.
→ the ratio of co-occurrences X_{i,k}/X_{j,k} is important to capture the meaning of words. So we should look at f(X_{i,k}/X_{j,k}).
On the other hand, if we want to combine 3 vectors and the scalar product in a ”natural” way, we do not have much choice:
<v_i − v_j, v_k> = f(X_{i,k}/X_{j,k})
We also have:
<v_i, v_k> − <v_j, v_k> = f(X_{i,k}) − f(X_{j,k})
so f must turn ratios into differences: f(X_{i,k}/X_{j,k}) = f(X_{i,k}) − f(X_{j,k}).
So f = log
We cannot factorize the matrix log X explicitly, i.e. we cannot directly compute v_i, v_j such that:
<v_i, v_j> = log X_{i,j}
but we compute an approximation, by minimizing a cost function J:
min_{v_1,...,v_N} J(v_1, ..., v_N) ∼ Σ_{i,j=1}^{N} (<v_i, v_j> − log X_{i,j})²
To do that, we use gradient descent (standard method).
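A toy NumPy sketch of this minimization; the real GloVe objective also has bias terms and a weighting function of X_{i,j} that down-weights rare pairs, which this sketch omits, and it assumes a small dense matrix `X` of co-occurrence counts:

```python
# Sketch: factorize log X by gradient descent on J = sum_{i,j} (<v_i, v_j> - log X_ij)^2.
import numpy as np

def factorize(X, d=50, steps=200, lr=0.01, eps=1e-8):
    N = X.shape[0]
    logX = np.log(X + eps)            # eps avoids log(0) for pairs that never co-occur
    V = 0.1 * np.random.randn(N, d)   # one d-dimensional vector per word
    for _ in range(steps):
        err = V @ V.T - logX          # <v_i, v_j> - log X_ij, for all pairs (i, j)
        V -= lr * (4 * err @ V)       # gradient of J with respect to V (X is symmetric)
    return V
```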
Looks like cooking, but:
good empirical performance (at least similar to Word2vec)
cost function J similar to Word2vec
The GloVe model is analogous to Latent Semantic Analysis (LSA). In LSA:
the co-occurrence matrix is a word-document matrix X (in GloVe: a word-word matrix)
we factorize ∼ log(1 + X_{i,j}) (via SVD, not a Gram matrix)
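For comparison, a minimal LSA-style sketch, assuming `X` is a word-document count matrix as a NumPy array:

```python
# Sketch: LSA-style word vectors from a truncated SVD of log(1 + X).
import numpy as np

def lsa_vectors(X, d=50):
    U, S, Vt = np.linalg.svd(np.log1p(X), full_matrices=False)
    return U[:, :d] * S[:d]   # one row per word: its d-dimensional LSA vector
```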
Presentation of the Kyiv deep learning meetup
Regular meeting about deep learning, neural networks...
Every two weeks (approximately) in Kyiv
Free and open to everyone: experts, ”scientific tourists”, and all
people in between.
attendance: ∼ 30 people/meeting
Regular talks, but also:
”AI-superstars”: Yoshua Bengio (remotely with London and Paris
meetups)
Interactive tutorials: Theano...
Join us!
http://deeplearningkyiv.doorkeeper.jp/
References
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., ...
& Bengio, Y. (2010)
Theano: a CPU and GPU math expression compiler
Proceedings of the Python for scientific computing conference (SciPy), (Vol. 4, p.
3)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P.
(2011)
Natural language processing (almost) from scratch
The Journal of Machine Learning Research, 12, 2493-2537
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013)
Distributed representations of words and phrases and their compositionality
Advances in Neural Information Processing Systems, pp. 3111-3119
Mikolov, T., Le, Q. V., & Sutskever, I. (2013)
Exploiting similarities among languages for machine translation
arXiv preprint arXiv:1309.4168.
References
Le, Q. V., & Mikolov, T. (2014)
Distributed representations of sentences and documents
arXiv preprint arXiv:1405.4053.
Pennington, J., Socher, R., & Manning, C. D. (2014)
Glove: Global vectors for word representation
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)
Řehůřek, R., & Sojka, P. (2010)
Software framework for topic modelling with large corpora
Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks
