WORKSHOP: ‘THE FUTURE OF ACADEMIC LEXICOGRAPHY’
TEXT MINING FOR
LEXICOGRAPHY
SUZAN VERBERNE 2019
ABOUT ME
 Master in Natural Language Processing, 2002
 PhD in Information Retrieval, 2010
 Postdoc at Radboud University, 2009-2017
 Assistant Professor at Leiden University
 Leiden Institute of Advanced Computer Science
 Data Science Research Programme
 Research group: Text Mining and Retrieval (TMR)
BEFORE I START…
 Who has a background in linguistics?
 Who has experience with programming in Python?
 Who is familiar with the vector space model?
 Who is familiar with word embeddings?
 Who is familiar with logistic regression?
 Who is familiar with artificial neural networks?
QUESTIONS
 “Can Big Data Analytics solve the current bottleneck in the continuous updating of dictionaries?”
 Text mining: Automatic extraction of knowledge from text
 Text = unstructured
 Knowledge = structured
TEXT MINING FOR LEXICOGRAPHY
 “How can we automatically extract structured information from the
constant stream of text data on the web and social media in
particular?”
 Structured lexical information:
 Discovery and selection of new lemmas
 New meanings of existing lemmas
 Collocations / multi-word expressions
TEXT MINING & LEXICOGRAPHY
 Discovery and selection of new lemmas
 Sørensen, N. H., & Nimb, S. (2018). Word2Dict–Lemma Selection and
Dictionary Editing Assisted by Word Embeddings. In The XVIII EURALEX
International Congress (p. 146).
 Trend analysis: Change in the meaning of existing words
 Mitra, S., Mitra, R., Maity, S. K., Riedl, M., Biemann, C., Goyal, P., &
Mukherjee, A. (2015). An automatic approach to identify word sense
changes in text media across timescales. Natural Language Engineering,
21(5), 773-798.
 Extraction of collocations/multiword expressions
 Sanni Nimb, Henrik Lorentzen & Nicolai Hartvig Sørensen (2019): “Updating
the dictionary: semantic change identification based on change in bigrams
over time”. Presented at the Workshop on Collocations.
https://elex.link/elex2019/programme/workshop-on-collocations/
DISCOVERY AND SELECTION OF
NEW LEMMAS
TASK AND RESEARCH QUESTIONS
 Example: Den Danske Ordbog (DDO)
 Task: “augmenting the lemmata of an existing dictionary by adding
either completely new or formerly neglected lemmas”
 “How do you in a fast and consistent way compare new lemma
candidates to already described lemmas within the same semantic
field in order to ensure the consistency of the definitions?”
Sørensen, N. H., & Nimb, S. (2018)
TOOL FOR LEMMA SELECTION
Sørensen, N. H., & Nimb, S. (2018)
WORD2DICT
 A lexicographic tool based on a word embedding model
 Goal: “to present a number of words that are most semantically
related to the lemma that the lexicographer is describing”
Sørensen, N. H., & Nimb, S. (2018)
WORD EMBEDDINGS
WHERE TO START
 Linguistics: Distributional hypothesis
 Data science: Vector Space Model (VSM)
DISTRIBUTIONAL HYPOTHESIS
 Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162
 The context of a word defines its meaning
 Words that occur in similar contexts tend to be similar
VECTOR SPACE MODEL
 Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for
automatic indexing. Communications of the ACM, 18(11), 613-620.
 Documents and queries represented in a vector space
 Where the dimensions are the words
VECTOR SPACE MODEL
 In the vector space model, we can model similarity as closeness
 The closer two documents are in the space, the more similar they
are
 We can compute the similarity
between two points/vectors using
a metric for distance or angle
 Most used metric: cosine similarity, the cosine of the angle θ between two vectors
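A minimal sketch (not from the slides; assuming NumPy and toy count vectors) of computing cosine similarity between two document vectors:

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors:
    # dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# two toy document vectors over a 5-term vocabulary (hypothetical counts)
doc1 = np.array([2.0, 0.0, 1.0, 0.0, 3.0])
doc2 = np.array([1.0, 0.0, 0.0, 1.0, 2.0])
print(cosine_similarity(doc1, doc2))  # closer to 1.0 = more similar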
VECTOR SPACE MODEL
Linguistic issues with the vector space model:
 synonymy: multiple ways to refer to the same concept, e.g. bicycle
and bike
 polysemy/homonyms: most words have more than one distinct
meaning, e.g. bank, bass, chips
VECTOR SPACE MODEL
 Computational issues with the vector space model:
 The vector representations are high-dimensional (easily 10,000
dimensions – one for each term in the collection)
 The vector representations are sparse (a given document only contains
a fraction of those 10,000 terms – the other dimensions have a 0 value)
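To make the sparsity concrete, a small sketch (illustrative only, assuming scikit-learn) that builds a term-document count matrix; with a realistic collection the vocabulary easily reaches 10,000+ dimensions and almost all entries are zero:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dictionaries describe the meanings of words"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # documents as rows, one dimension per term
print(X.shape)                       # (3, vocabulary size)
print(X.toarray())                   # mostly zeros: a sparse representation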
WORD EMBEDDINGS
 Word embeddings are dense representations of words
WORD EMBEDDINGS
 Word embedding models represent (embed) words in a
continuous vector space
 The vector space is relatively low-dimensional (100 – 400
dimensions instead of 10,000s)
 Semantically and syntactically similar words are mapped to nearby
points because the representations are learned from word
occurrences in context (Distributional Hypothesis)
(Figure: PCA projection of a 320-dimensional vector space)
WORD2VEC
WHAT IS WORD2VEC?
 Word2vec is a particularly computationally-efficient predictive
model for learning word embeddings from raw text
 Intuition:
 Train a classifier on a binary prediction task (on a text without labels!):
“Is word w likely to show up near the word bicycle?”
 We don’t actually care about this prediction task; instead we’ll take the
learned classifier weights as the word embeddings
WHERE DOES IT COME FROM
 Neural network language model (NNLM) (Bengio et al., 2003)
 Mikolov proposed to learn word vectors using a neural network
with a single hidden layer (Mikolov et al. 2013) → word2vec
 Many neural architectures and models have been proposed for
computing word vectors
 GloVe (2014) - Global Vectors for Word Representation
 FastText (2017) - Enriching Word Vectors with Subword Information
 ELMo (2018) - Deep contextualized word representations
 BERT (2019) - Bidirectional Encoder Representations from Transformers
WORD2VEC
 Starting point: a large collection (e.g. 10 million words)
 First step: extract the vocabulary (e.g. 10,000 terms)
 Goal: to represent each of these 10,000 terms as a dense, lower-
dimensional vector (typically 100-400 dimensions)
 Idea: to use the contexts of words to learn their meaning
TRAINING WORD2VEC
 Training task: binary classification of words in the text
1. Treat the target word and a neighboring context word as positive
examples
2. Randomly sample other words in the lexicon to get negative samples
3. Train a classifier to distinguish those two cases
TRAINING WORD2VEC
 This example has a target word t (apricot), and 4 context words in
the L = ±2 window, resulting in 4 positive training instances
 Negative examples are artificially generated:
Jurafsky and Martin. Speech and Language Processing (3rd edition, 2019)
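A rough sketch of this example generation (illustrative only; the sentence and the sampled “noise” words follow the Jurafsky & Martin example rather than any real implementation):

import random

random.seed(0)
sentence = "lemon a tablespoon of apricot jam a pinch".split()
window = 2          # the L = ±2 context window
k = 2               # negative samples per positive example
lexicon = ["aardvark", "my", "where", "coaxial", "forever", "dear", "if"]

examples = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j == i:
            continue
        # positive example: target word with an observed context word
        examples.append((target, sentence[j], 1))
        # negative examples: target word with randomly sampled words
        for _ in range(k):
            examples.append((target, random.choice(lexicon), 0))

print(examples[:6])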
TRAINING WORD2VEC
 The classifier is a
neural network with
one hidden layer
 Logistic functions
are used as
activation functions
in the hidden layer
 The regression
weights are the
embeddings
(Diagram: sparse input vector → dense vector, the embeddings)
TRAINING WORD2VEC
 The weights on the nodes in the hidden layer get random
initializations and get updated while the model processes the
collection
 The outcome of the classification determines whether we adjust
the current word vector
 Gradually, the vectors converge to sensible descriptors
(embeddings) for words
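A bare-bones sketch of one such update step (illustrative only, assuming a logistic output and NumPy; real implementations add negative sampling, learning-rate decay and further optimisations):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
dim, learning_rate = 100, 0.025

target_vec = rng.normal(scale=0.1, size=dim)    # random initialization
context_vec = rng.normal(scale=0.1, size=dim)

label = 1                                       # 1 = observed context word, 0 = negative sample
prediction = sigmoid(np.dot(target_vec, context_vec))
error = label - prediction                      # the classification outcome drives the adjustment

grad_target = error * context_vec
grad_context = error * target_vec
target_vec += learning_rate * grad_target       # repeated over the whole collection, the vectors
context_vec += learning_rate * grad_context     # gradually converge to useful embeddings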
LANGUAGE MODELLING
 The word prediction task is called language modelling
 Traditional n-gram model: given the previous n-1 words, predict the next word
 Neural language models can handle much longer histories, and they can
generalize over contexts of similar words
 The resulting embeddings are referred to as a language model
 It is important to note that the context classification is not an aim in
itself: it is just an auxiliary task to learn vector representations that are
useful for other tasks
ADVANTAGES OF WORD2VEC
 It scales
 Train on billion word corpora
 In limited time
 Possibility of parallel training
 Word embeddings pre-trained by one party can be used by others
 For entirely different tasks
 Incremental training
 Train on one piece of data, save results, continue training later on
 There is a Python module for it:
 Gensim word2vec
WHAT CAN YOU DO WITH IT?
GENSIM WORD2VEC
 Implementation in Python package gensim
import gensim
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=5, workers=4)
size: the dimensionality of the feature vectors (common: 100, 200 or 320)
window: the maximum distance between the current and predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus to be
included in the model
workers: for parallelization with multicore machines
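For illustration, sentences is an iterable of tokenised sentences (lists of strings); a toy sketch (assuming gensim 3.x, which matches the size= parameter above) of training, saving, and continuing training later:

import gensim

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=1, workers=4)

model.save("toy_word2vec.model")    # save results, continue training later on
model = gensim.models.Word2Vec.load("toy_word2vec.model")
model.train(sentences, total_examples=model.corpus_count, epochs=5)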
GENSIM WORD2VEC
Sørensen, N. H., & Nimb, S. (2018):
 We used the version of the word2vec algorithm implemented in the Gensim Python
package
 to train a model based on the Danish corpus used by the lexicographers of DDO
 The corpus included at the time of the training roughly 920 million running words,
mainly newswire, but also material from magazines, transcripts from the Danish
Parliament, and some fiction, among other sources, spanning the years 1982 to 2017
 We trained the model with 500 features, a window size of five, a minimum occurrence
of five for all types
 The corpus included 6.3 million types, five million of which occurred less than five times
 The training took roughly 18 hours on a 2017 MacBook Pro
DO IT YOURSELF
model.most_similar('apple')
→ [('banana', 0.8571481704711914), ...]
model.doesnt_match("breakfast cereal dinner lunch".split())
→ 'cereal'
model.similarity('woman', 'man')
→ 0.73723527 (cosine similarity)
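To try these calls without training a model yourself, gensim also bundles a downloader for publicly available pre-trained vectors (a sketch; the model name is just one example and the exact outputs will differ):

import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")   # downloads on first use

print(word_vectors.most_similar("apple", topn=3))
print(word_vectors.doesnt_match("breakfast cereal dinner lunch".split()))
print(word_vectors.similarity("woman", "man"))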
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Improve NLP applications
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Learning semantic and syntactic relations
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
 Actually, what the original paper says is: if you subtract the vector
for ‘man’ from the one for ‘king’ and add the vector for ‘woman’,
the vector closest to the one you end up with turns out to be the
one for ‘queen’
 More interesting:
France is to Paris as Germany is to …
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances
in Neural Information Processing Systems, pages 3111–3119, 2013.
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 It also works for syntactic relations:
 vector(biggest) - vector(big) + vector(small) = vector(smallest)
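With gensim, such analogy queries can be posed through most_similar (a sketch; it assumes a model trained on a large general corpus, and the answers depend on that corpus):

# A is to B as C is to ?  ->  vector(B) - vector(A) + vector(C)
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
print(model.most_similar(positive=["Paris", "Germany"], negative=["France"], topn=1))
print(model.most_similar(positive=["biggest", "small"], negative=["big"], topn=1))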
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Learning semantic and syntactic relations
 Selecting out-of-the-list words
 Example: which word does not belong in [monkey, lion, dog, truck]
 Selectional preferences
 Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
 Discover new words
Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., ... & Jain, A.
(2019). Unsupervised word embeddings capture latent knowledge from materials
science literature. Nature, 571(7763), 95.
https://github.com/materialsintelligence/mat2vec
WHAT CAN YOU DO WITH IT?
 Improve NLP applications:
 Sentence completion/text prediction/reply suggestion
 Bilingual Word Embeddings for Machine Translation with LSTMs
 (Near-)Synonym detection (→ query expansion)
 Concept representation of texts
 Example: Twitter sentiment classification
 Document similarity
 Example: cluster news articles per news event
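One simple way to get such a concept representation of a text is to average the embeddings of its words and compare texts with cosine similarity (a sketch under that assumption; it reuses a trained gensim model as above, and averaging is only one of several options):

import numpy as np

def doc_vector(tokens, model):
    # average the embeddings of the words the model knows
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

news1 = "earthquake hits coastal town".split()
news2 = "tremor shakes seaside village".split()
print(cosine(doc_vector(news1, model), doc_vector(news2, model)))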
WORD EMBEDDINGS AS FEATURES
 NLP models take word embeddings as low-level
representation of words
 Word embeddings as input for convolutional neural
networks in text categorization
 Word embeddings as input for recurrent neural networks
in sequence labelling
 Since 2018: word embeddings are used as language
models that can be fine-tuned towards any natural
language processing task
CONCLUSIONS
SUMMARY
Text Mining for Lexicography
 Discovery and selection of new lemmas
 Word2Dict: tool for lemma selection (Sørensen & Nimb 2018)
 Word embeddings
 Distributional hypothesis
 Vector space model
 From sparse to dense representations
 Neural language modelling
 Practical use in the gensim package
FURTHER READING
 T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 https://rare-technologies.com/word2vec-tutorial/
 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-
model/
 http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/
 Visualisation of embeddings models: https://projector.tensorflow.org/
 http://tmr.liacs.nl
Speaker notes
1. Because the data we are interested in is text data, and we want to mine knowledge from that text data.
2. Figure 1: A search for ananasjuice (‘pineapple juice’). To the left (the top half of the interface) we see the most similar words according to the context in which they appear in a corpus. Frequency counts for each word and whether or not the word is included in DDO are also displayed. The frequency counts are color-coded for quicker visual decoding: the darker the color, the higher the frequency. To the right (the bottom half of the interface), definitions of the words already in the dictionary are shown, as well as their editorial status (e.g. “publiceret” (‘published’)) and the similarity score from the model (e.g. “0.75”, “0.71” – 1.0 equals identical).
  3. One-hot encoding