15. Opportunity for subject-based access
• Studies underline end users' interest in topical searches, but:
  • inter-indexer inconsistency
  • cost of manual indexing
• What are the possibilities and limits of using automated methods to provide subject-based access?
16. Unsupervised machine learning
• Often used for exploratory data analysis by clustering documents in very large corpora with unknown content
• "Distant reading" techniques within the Digital Humanities
• Two popular methods:
  • Topic Modeling (TM)
  • Word Embeddings (WE)
17. Case study on unsupervised ML
• Combination of:
  • LDA (Latent Dirichlet Allocation)
  • Word2Vec
• To create automated links to Eurovoc per document
18. Corpus
• 24,787 PDF documents, representing 138.3 GB
• Period 1958-1982, with documents in French, Dutch, German, Italian, Danish, English and Greek
• Descriptive metadata available only for the fonds creator
• Little value from a traditional archival perspective, but as an aggregate it offers the possibility to analyse policy development through time
24. K-parameter
• A small number of topics results in overly generic categories; a high number results in topics that are not sufficiently representative of the corpus
• The right value depends on what you want:
  • to cover the entire corpus by making sure every document is indexed
  • or to discover specific semantics …
25. Finding a balance
• Topic "eec regulation council commission community decision european december amended article" => 0.31336
• Topic "energy nuclear coal projects gas oil community power heat fuel" => 0.03307
29. Topic labeling
• Hulpus et al. (2013) and Allahyari and Kochut (2015) use the graph structure of DBpedia to rank the different label candidates
• But topics may contain different concepts, and the graph structure of DBpedia as a knowledge structure is not terribly coherent …
• Our approach: use pre-trained Word2Vec to spot which terms form semantic clusters and match those with Eurovoc
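The idea of spotting semantic clusters among topic terms can be sketched roughly as follows. This is only an illustration, not the method used in the case study: the slides do not say which clustering procedure was applied, so the single-link grouping with a similarity threshold is an assumption, as are the function names `cosine` and `cluster_terms`, the threshold value, and the two-dimensional `TOY_VECTORS`, which are invented stand-ins for real pre-trained Word2Vec vectors.

```python
import math

# Invented 2-dimensional vectors standing in for pre-trained Word2Vec
# embeddings (real embeddings usually have hundreds of dimensions).
TOY_VECTORS = {
    "coal": (1.0, 0.1),
    "gas": (0.9, 0.2),
    "oil": (0.95, 0.15),
    "community": (0.1, 1.0),
    "projects": (0.5, 0.5),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_terms(terms, vectors, threshold=0.9):
    """Group terms so that each term shares a cluster with every term it is
    at least `threshold` similar to (single-link grouping, an assumption)."""
    clusters = []
    for term in terms:
        linked, rest = [], []
        for cluster in clusters:
            close = any(
                cosine(vectors[term], vectors[other]) >= threshold
                for other in cluster
            )
            (linked if close else rest).append(cluster)
        # Merge every cluster the new term is close to, then add the term.
        merged = [t for cluster in linked for t in cluster] + [term]
        clusters = rest + [merged]
    return clusters


topic_terms = ["coal", "gas", "oil", "community", "projects"]
concepts = cluster_terms(topic_terms, TOY_VECTORS)
print(concepts)
```

With a real model, the lookup table would come from the pre-trained embeddings instead of `TOY_VECTORS`, and the threshold would need tuning against the corpus; terms missing from the model's vocabulary would also have to be handled.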
31. Topics as concepts
• Use of W2V to detect different concepts within one topic by making use of the distance between terms
• For example: "labour, farm, poultry, sheep, pig, land, family, income, holding, purchased"
• Three concepts within one topic:
  • labour, farm, poultry, sheep, pig, land
  • family
  • income, holding, purchased
32. Reconciliation
• To perform the matching with Eurovoc, we are testing two options:
  • Either focus on the most "centroid" term from a concept and see how many match
  • Or use the structure of Eurovoc for decision making (e.g. pick the term on the deepest level or the one with the most non-descriptors attached to it)
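Both options can be illustrated with a small sketch. It rests on several assumptions: the most "centroid" term is interpreted here as the term with the highest mean cosine similarity to the other terms in its concept (a medoid), which the slides do not define; the vectors are invented toy values rather than real Word2Vec output; and the candidate records use placeholder labels, depths and non-descriptor counts that are not taken from Eurovoc. The names `most_central_term` and `pick_candidate` are hypothetical.

```python
import math

# Invented toy vectors for one concept; not real Word2Vec output.
TOY_VECTORS = {
    "income": (1.0, 0.2),
    "holding": (0.8, 0.6),
    "purchased": (0.3, 1.0),
}

# Placeholder candidates; labels, depths and counts are made up.
CANDIDATES = [
    {"label": "candidate A", "depth": 2, "non_descriptors": 5},
    {"label": "candidate B", "depth": 4, "non_descriptors": 1},
]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_central_term(terms, vectors):
    """Return the term with the highest mean similarity to the others."""
    if len(terms) == 1:
        return terms[0]

    def mean_similarity(term):
        others = [t for t in terms if t != term]
        total = sum(cosine(vectors[term], vectors[o]) for o in others)
        return total / len(others)

    return max(terms, key=mean_similarity)


def pick_candidate(candidates, strategy="deepest"):
    """Choose one candidate by hierarchy depth or by non-descriptor count."""
    key = "depth" if strategy == "deepest" else "non_descriptors"
    return max(candidates, key=lambda c: c[key])


central = most_central_term(["income", "holding", "purchased"], TOY_VECTORS)
deepest = pick_candidate(CANDIDATES, strategy="deepest")
richest = pick_candidate(CANDIDATES, strategy="non_descriptors")
print(central, deepest["label"], richest["label"])
```

The two strategies in `pick_candidate` can disagree, as they do for the placeholder records here, which is one reason to test them side by side before settling on a rule.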