Suzan Verberne gave a workshop on using text mining for lexicography. She discussed using word embeddings to help discover and select new lemmas for dictionaries. Word2Dict is a lexicographic tool that uses word embeddings to present words semantically related to the lemma being described. Word embeddings learn dense vector representations of words by predicting words in context using neural networks, improving on the traditional sparse vector space model. Word embeddings can be trained using the Word2Vec algorithm and analyzed using the Gensim Python package to gain linguistic insights and improve natural language processing applications.
Text Mining for Lexicography
1. WORKSHOP: ‘THE FUTURE OF ACADEMIC LEXICOGRAPHY’
TEXT MINING FOR LEXICOGRAPHY
SUZAN VERBERNE 2019
2. ABOUT ME
Master in Natural Language Processing, 2002
PhD in Information Retrieval, 2010
Postdoc at Radboud University, 2009-2017
Assistant Professor at Leiden University
Leiden Institute of Advanced Computer Science
Data Science Research Programme
Research group: Text Mining and Retrieval (TMR)
Suzan Verberne 2019
3. BEFORE I START…
Who has a background in linguistics?
Who has experience with programming in Python?
Who is familiar with the vector space model?
Who is familiar with word embeddings?
Who is familiar with logistic regression?
Who is familiar with artificial neural networks?
Suzan Verberne 2019
4. QUESTIONS
“Can Big Data Analytics solve the current bottleneck in the
continuous updating of dictionaries?”
Text mining: Automatic extraction of knowledge from text
Text = unstructured
Knowledge = structured
Suzan Verberne 2019
5. TEXT MINING FOR LEXICOGRAPHY
“How can we automatically extract structured information from the
constant stream of text data on the web and social media in
particular?”
Structured lexical information:
Discovery and selection of new lemmas
New meanings of existing lemmas
Collocations / multi-word expressions
Suzan Verberne 2019
6. TEXT MINING & LEXICOGRAPHY
Discovery and selection of new lemmas
Sørensen, N. H., & Nimb, S. (2018). Word2Dict–Lemma Selection and
Dictionary Editing Assisted by Word Embeddings. In The XVIII EURALEX
International Congress (p. 146).
Trend analysis: Change in the meaning of existing words
Mitra, S., Mitra, R., Maity, S. K., Riedl, M., Biemann, C., Goyal, P., &
Mukherjee, A. (2015). An automatic approach to identify word sense
changes in text media across timescales. Natural Language Engineering,
21(5), 773-798.
Extraction of collocations/multiword expressions
Sanni Nimb, Henrik Lorentzen & Nicolai Hartvig Sørensen (2019): “Updating
the dictionary: semantic change identification based on change in bigrams
over time”. Presented in Workshop on collocations.
https://elex.link/elex2019/programme/workshop-on-collocations/
Suzan Verberne 2019
8. TASK AND RESEARCH QUESTIONS
Example: Den Danske Ordbog (DDO)
Task: “augmenting the lemmata of an existing dictionary by adding
either completely new or formerly neglected lemmas”
“How do you in a fast and consistent way compare new lemma
candidates to already described lemmas within the same semantic
field in order to ensure the consistency of the definitions?”
Suzan Verberne 2019 Sørensen, N. H., & Nimb, S. (2018)
9. TOOL FOR LEMMA SELECTION
Suzan Verberne 2019 Sørensen, N. H., & Nimb, S. (2018)
Figure 1: A search for ananasjuice (‘pineapple juice’). To the left (the top half of the interface) we see the most similar words according to the contexts in which they appear in a corpus. Frequency counts for each word, and whether or not the word is included in DDO, are also displayed. The frequency counts are color-coded for quicker visual decoding: the darker the color, the higher the frequency. To the right (the bottom half of the interface), definitions of the words already in the dictionary are shown, as well as their editorial status (e.g. “publiceret”, ‘published’) and the similarity score from the model (e.g. 0.75, 0.71; 1.0 equals identical).
10. WORD2DICT
A lexicographic tool based on a word embedding model
Goal: “to present a number of words that are most semantically
related to the lemma that the lexicographer is describing”
Suzan Verberne 2019 Sørensen, N. H., & Nimb, S. (2018)
12. WHERE TO START
Linguistics: Distributional hypothesis
Data science: Vector Space Model (VSM)
Suzan Verberne 2019
13. DISTRIBUTIONAL HYPOTHESIS
Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162
The context of a word defines its meaning
Words that occur in similar contexts tend to be similar
Suzan Verberne 2019
14. VECTOR SPACE MODEL
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for
automatic indexing. Communications of the ACM, 18(11), 613-620.
Documents and queries represented in a vector space
Where the dimensions are the words
Suzan Verberne 2019
15. VECTOR SPACE MODEL
In the vector space model, we can model similarity as closeness
The closer two documents are in the space, the more similar they
are
Suzan Verberne 2019
We can compute the similarity between two points/vectors using a metric for distance or angle
Most used metric: cosine similarity, the cosine of the angle θ between two vectors
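As an illustration, cosine similarity is straightforward to compute. Below is a minimal Python sketch (using numpy; the toy count vectors are invented for illustration):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle θ between a and b: (a · b) / (||a|| ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([1.0, 0.0, 2.0, 0.0])  # toy term-count vectors
doc2 = np.array([1.0, 1.0, 1.0, 0.0])
print(cosine_similarity(doc1, doc2))   # closer to 1.0 = more similar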
16. VECTOR SPACE MODEL
Linguistic issues with the vector space model:
synonymy: multiple ways to refer to the same concept, e.g. bicycle
and bike
polysemy/homonymy: most words have more than one distinct meaning, e.g. bank, bass, chips
Suzan Verberne 2019
17. VECTOR SPACE MODEL
Computational issues with the vector space model:
The vector representations are high-dimensional (easily 10,000
dimensions – one for each term in the collection)
The vector representations are sparse (a given document only contains
a fraction of those 10,000 terms – the other dimensions have a 0 value)
Suzan Verberne 2019
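To see this sparsity concretely, here is a small sketch (assuming scikit-learn; the toy documents are invented):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the bicycle shed", "a bike in the shed", "stock prices fell"]
X = CountVectorizer().fit_transform(docs)  # sparse document-term matrix
print(X.shape)      # (3 documents, one dimension per vocabulary term)
print(X.toarray())  # mostly zeros: each document contains few of the terms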
20. WORD EMBEDDINGS
Word embeddings are dense representations of words
Suzan Verberne 2019
21. WORD EMBEDDINGS
Word embeddings models represent (embed) words in a
continuous vector space
The vector space is relatively low-dimensional (100 – 400
dimensions instead of 10,000s)
Semantically and syntactically similar words are mapped to nearby
points because the representations are learned from word
occurrences in context (Distributional Hypothesis)
Suzan Verberne 2019
25. WHAT IS WORD2VEC?
Word2vec is a particularly computationally-efficient predictive
model for learning word embeddings from raw text
Intuition:
Train a classifier on a binary prediction task (on a text without labels!):
“Is word w likely to show up near the word bicycle?”
We don’t actually care about this prediction task; instead we’ll take the
learned classifier weights as the word embeddings
Suzan Verberne 2019
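Following the formulation in Jurafsky & Martin (cited later in these slides), the classifier’s probability that a context word occurs near the target is the logistic function of the dot product of their vectors. A minimal sketch:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def p_positive(t_vec, c_vec):
    # P(+ | t, c): is context word c likely to show up near target t?
    return sigmoid(np.dot(t_vec, c_vec))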
26. WHERE DOES IT COME FROM?
Neural network language model (NNLM) (Bengio et al., 2003)
Mikolov proposed to learn word vectors using a neural network with a single hidden layer (Mikolov et al., 2013): word2vec
Many neural architectures and models have been proposed for
computing word vectors
GloVe (2014) - Global Vectors for Word Representation
FastText (2017) - Enriching Word Vectors with Subword Information
ELMo (2018) - Deep contextualized word representations
BERT (2019) - Bidirectional Encoder Representations from Transformers
Suzan Verberne 2019
27. WORD2VEC
Starting point: large collection (e.g. 10 million words)
First step: extract the vocabulary (e.g. 10,000 terms)
Goal: to represent each of these 10,000 terms as a dense, lower-dimensional vector (typically 100-400 dimensions)
Idea: to use the contexts of words to learn their meaning
Suzan Verberne 2019
28. TRAINING WORD2VEC
Training task: binary classification of words in the text
1. Treat the target word and a neighboring context word as positive
examples
2. Randomly sample other words in the lexicon to get negative samples
3. Train a classifier to distinguish those two cases
Suzan Verberne 2019
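A minimal sketch of steps 1 and 2 (the window size, the tokenized example text, and the number of negatives per positive are illustrative choices, not the slides’ exact settings):

import random

def training_pairs(tokens, window, vocab, k):
    # positives: (target, context) pairs within the window;
    # negatives: k randomly sampled words per positive
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i == j:
                continue
            yield (target, tokens[j], 1)                 # positive example
            for _ in range(k):
                yield (target, random.choice(vocab), 0)  # random negative

tokens = "lemon a tablespoon of apricot jam a pinch".split()
pairs = list(training_pairs(tokens, window=2, vocab=list(set(tokens)), k=2))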
29. TRAINING WORD2VEC
This example has a target word t (apricot), and 4 context words in
the L = ±2 window, resulting in 4 positive training instances
Negative examples are artificially generated by randomly sampling words from the lexicon
Jurafsky and Martin. Speech and Language Processing (3rd edition, 2019)
30. TRAINING WORD2VEC
The classifier is a neural network with one hidden layer
Logistic functions are used as activation functions in the hidden layer
The regression weights are the embeddings
[Figure: a sparse input vector is mapped via the hidden layer to a dense vector (the embeddings)]
Suzan Verberne 2019
31. TRAINING WORD2VEC
The weights on the nodes in the hidden layer get random
initializations and get updated while the model processes the
collection
The outcome of the classification determines whether we adjust
the current word vector
Gradually, the vectors converge to sensible descriptors
(embeddings) for words
Suzan Verberne 2019
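A sketch of one such update, assuming the skip-gram negative-sampling loss (gradient step for the target vector only; a full implementation also updates the context vectors):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sgns_step(t, c_pos, c_negs, lr=0.025):
    # loss for one positive pair: -log σ(t·c_pos) - Σ log σ(-t·c_neg)
    grad = (sigmoid(np.dot(t, c_pos)) - 1) * c_pos
    for c_neg in c_negs:
        grad += sigmoid(np.dot(t, c_neg)) * c_neg
    # nudge the target vector toward c_pos and away from the negatives
    return t - lr * grad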
32. LANGUAGE MODELLING
The word prediction task is called language modelling
Traditional n-gram model: given the previous n words, predict the next
word
Neural language models can handle much longer histories, and they can
generalize over contexts of similar words
The resulting embeddings are referred to as a language model
It is important that the context classification here is not an aim in itself: it is just an auxiliary task for learning vector representations that are good for other tasks
Suzan Verberne 2019
33. ADVANTAGES OF WORD2VEC
It scales
Train on billion word corpora
In limited time
Possibility of parallel training
Word embeddings pre-trained by one party can be used by others
For entirely different tasks
Incremental training
Train on one piece of data, save results, continue training later on
There is a Python module for it:
Gensim word2vec
Suzan Verberne 2019
35. GENSIM WORD2VEC
Implementation in Python package gensim
import gensim
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
size: the dimensionality of the feature vectors (common: 100, 200 or 320)
window: the maximum distance between the current and predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus to be
included in the model
workers: for parallelization on multicore machines
Suzan Verberne 2019
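A runnable toy example of training and querying such a model (the two-sentence corpus is far too small for meaningful vectors; note that gensim >= 4.0 renamed size to vector_size):

import gensim

sentences = [["the", "lexicographer", "updates", "the", "dictionary"],
             ["the", "editor", "updates", "the", "lexicon"]]
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=1, workers=4)  # vector_size in gensim >= 4.0

print(model.wv.most_similar("dictionary", topn=3))   # nearest neighbours
print(model.wv.similarity("dictionary", "lexicon"))  # cosine similarity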
36. GENSIM WORD2VEC
Sørensen, N. H., & Nimb, S. (2018):
We used the version of the word2vec algorithm implemented in the Gensim Python
package
to train a model based on the Danish corpus used by the lexicographers of DDO
The corpus included at the time of training roughly 920 million running words, mainly newswire, but also material from magazines, transcripts from the Danish Parliament, and some fiction, among other sources, spanning the years 1982 to 2017
We trained the model with 500 features, a window size of five, a minimum occurrence
of five for all types
The corpus included 6.3 million types, five million of which occurred less than five times
The training took roughly 18 hours on a 2017 MacBook Pro
Suzan Verberne 2019
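The reported settings would correspond roughly to a call like the following (ddo_corpus is a placeholder for their corpus, which is not publicly packaged; the number of workers is not reported):

import gensim

model = gensim.models.Word2Vec(ddo_corpus,   # ~920 million running words (placeholder)
                               size=500,     # "500 features"
                               window=5,     # "a window size of five"
                               min_count=5)  # "a minimum occurrence of five"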
38. WHAT CAN YOU DO WITH IT?
Mining knowledge about natural language
Improve NLP applications
Suzan Verberne 2019
39. WHAT CAN YOU DO WITH IT?
Mining knowledge about natural language
Learning semantic and syntactic relations
Suzan Verberne 2019
40. WHAT CAN YOU DO WITH IT?
A is to B as C is to ?
This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
Actually, what the original paper says is: if you subtract the vector for ‘man’ from the one for ‘king’ and add the vector for ‘woman’, the vector closest to the one you end up with turns out to be the one for ‘queen’
More interesting:
France is to Paris as Germany is to …
Suzan Verberne 2019
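With gensim, such analogies can be queried directly (assuming a model trained on enough English text, e.g. pre-trained embeddings):

# A is to B as C is to ?  ->  vector(B) - vector(A) + vector(C)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected with well-trained embeddings: [('queen', ...)]
print(model.wv.most_similar(positive=["Paris", "Germany"], negative=["France"], topn=1))
# expected: [('Berlin', ...)]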
41. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances
in Neural Information Processing Systems, pages 3111–3119, 2013.
42. WHAT CAN YOU DO WITH IT?
A is to B as C is to ?
It also works for syntactic relations:
vector(biggest) – vector(big) + vector(small) = vector(smallest)
Suzan Verberne 2019
43. WHAT CAN YOU DO WITH IT?
Mining knowledge about natural language
Learning semantic and syntactic relations
Selecting out-of-the-list words
Example: which word does not belong in [monkey, lion, dog, truck]
Selectional preferences
Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
Discover new words
Suzan Verberne 2019
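The out-of-the-list task maps directly onto gensim’s API (again assuming a sufficiently trained model):

print(model.wv.doesnt_match(["monkey", "lion", "dog", "truck"]))  # -> 'truck'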
44. Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., ... & Jain, A.
(2019). Unsupervised word embeddings capture latent knowledge from materials
science literature. Nature, 571(7763), 95.
https://github.com/materialsintelligence/mat2vec
45. WHAT CAN YOU DO WITH IT?
Improve NLP applications:
Sentence completion/text prediction/reply suggestion
Bilingual Word Embeddings for Machine Translation with LSTMs
(Near-)Synonym detection (for query expansion)
Concept representation of texts
Example: Twitter sentiment classification
Document similarity
Example: cluster news articles per news event
Suzan Verberne 2019
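A common baseline for such concept representations is to average the word vectors of a text. A minimal sketch (wv is a gensim KeyedVectors object, e.g. model.wv):

import numpy as np

def doc_vector(tokens, wv):
    # average the embeddings of the in-vocabulary words of a document
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0)

# documents about the same news event should end up close in this space;
# compare doc vectors with cosine similarity, just as for word vectors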
46. WORD EMBEDDINGS AS FEATURES
NLP models take word embeddings as low-level
representation of words
Word embeddings as input for convolutional neural
networks in text categorization
Word embeddings as input for recurrent neural networks
in sequence labelling
Since 2018: word embeddings are used as language
models that can be fine-tuned towards any natural
language processing task
Suzan Verberne 2019
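A sketch of how trained embeddings become input features: build a matrix whose rows are the word vectors, which a downstream CNN/RNN classifier can use to initialize its embedding layer (index_to_key is the gensim >= 4.0 name; older versions use index2word):

import numpy as np

words = list(model.wv.index_to_key)                # vocabulary, one row per word
embedding_matrix = np.stack([model.wv[w] for w in words])
word_to_row = {w: i for i, w in enumerate(words)}  # lookup: word -> matrix row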
48. SUMMARY
Text Mining for Lexicography
Discovery and selection of new lemmas
Word2Dict: tool for lemma selection (Sørensen & Nimb 2018)
Word embeddings
Distributional hypothesis
Vector space model
From sparse to dense representations
Neural language modelling
Practical use in the gensim package
Suzan Verberne 2019
49. FURTHER READING
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
https://rare-technologies.com/word2vec-tutorial/
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/
Visualisation of embeddings models: https://projector.tensorflow.org/
Suzan Verberne 2019