Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. This is an "extreme multi-label classification" problem, where the objective is to assign a small subset of the most relevant subjects from an extremely large label set. Data sparsity and model scalability are the major challenges we need to address to solve it automatically. In this paper, we describe an efficient and effective embedding method that embeds terms, subjects and documents into the same semantic space, where similarity can be computed easily. We then propose a novel Non-Parametric Subject Prediction (NPSP) method and show how effectively it predicts even very specialised subjects, which are associated with few documents in the training set and are not predicted by state-of-the-art classifiers.
1. Non-parametric Subject Prediction
Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2
1OCLC Research
2University of Twente
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 1 / 24
2. What is subject indexing?
Subject indexing is the act of describing or classifying a document by
index terms or other symbols in order to indicate what the document
is about, to summarize its content or to increase its findability.
Index terms or classes are normally drawn from established knowledge
organization and classification systems, which easily contain tens or
hundreds of thousands of terms or codes
Manual subject indexing is time-consuming and no longer scalable
3. Automatic subject indexing
The goal of automatic subject indexing is to assign a document a
small subset of relevant subject labels (subjects), taken from tens or
hundreds of thousands of target labels.
Existing approaches, as categorised in the ISKO Encyclopedia of Knowledge
Organization:
Text categorisation
Document clustering
Document classification
6. From a machine learning perspective
Different from binary or multi-class classification problems, it is a
multi-label classification problem where the target labels are neither
independent nor mutually exclusive
It is a form of eXtreme Multi-label Text Classification (XMTC)
problem, where the prediction space normally consists of hundreds of
thousands to millions of labels.
Challenges: data sparsity and scalability
8. Our two-step approach: first embed, then predict
Step 1 Embedding subjects and documents in the same semantic
space
Step 2 Leveraging similarities for subject prediction
Naive similarity-based method
Non-parametric (neighbourhood-based) method
9. Word and document embedding
Words and documents are represented as vectors of numbers that
represent their meaning.
Semantically similar words and documents are mapped to nearby
points
A desirable property: computable similarity
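This computable similarity is simply the cosine of the angle between two vectors. A minimal illustration with toy 3-d vectors (not real embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings": semantically similar words point in similar directions.
cat = np.array([0.9, 0.1, 0.0])
dog = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9])

# Similar words get a higher cosine score than dissimilar ones.
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))
```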
10. Embedding approaches
Global co-occurrence count-based methods (e.g. LSA)
Based on statistics of how often some word co-occurs with its
neighbour words in a large text corpus
Dimensionality reduction methods
Local context predictive methods (e.g. word2vec)
Learning to predict a word from its neighbours or vice versa
More complex and powerful deep learning models for embedding words,
sentences and documents
11. Weaknesses of deep learning
Pre-trained word embeddings may not capture domain-specific
semantics
Medical information retrieval, special collection exploration, etc.
Computationally expensive to train from scratch
Often requiring GPUs
Complex optimisation, difficult to find the optimal hyperparameter
settings
Standard benchmarks and evaluation methods often do not answer
practical needs
So what is a good cost function?
12. Ariadne semantic embedding
Something fast and discriminative?
1
Koopman, R., Wang, S., Englebienne, G.: Fast and Discriminative Semantic
Embedding. In Proceedings of the 13th International Conference on Computational
Semantics - Long Papers. pp. 235-246. ACL, Gothenburg, Sweden (May 2019)
13. Ariadne semantic embedding
Fast random projection to compute term embeddings
Orthogonal projection to increase discrimination
Term weight assignment for better document embeddings
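The random-projection step can be sketched as follows. This is a minimal illustration of projecting term–context co-occurrence counts onto a low-dimensional space with a random sign matrix; the matrix sizes and this exact construction are illustrative assumptions, not the precise Ariadne algorithm (see the cited paper for that):

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_contexts, d = 1000, 4000, 256   # illustrative sizes only

# Term-context co-occurrence counts (dense random data here for brevity;
# a real corpus yields a large sparse matrix).
C = rng.poisson(0.05, size=(n_terms, n_contexts)).astype(float)

# Random +/-1 projection matrix: by the Johnson-Lindenstrauss lemma this
# approximately preserves pairwise distances while reducing dimensionality.
R = rng.choice([-1.0, 1.0], size=(n_contexts, d)) / np.sqrt(d)

V = C @ R                                   # one d-dimensional vector per term
V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # normalise rows
```

Because the projection is a single (sparse) matrix product, it is very fast compared with iterative training of predictive embedding models.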
14. Fast Random Projection
16. Discriminative
We do not use heuristics to filter out stop words
Task-specific, dataset-specific, . . .
Most terms end up being similar to average language vector va
va corresponds closely to stop words
Our solution:
Cancel out the average language vector from the embeddings
(Orthogonal Projection)
Use the original similarity to va to inform the importance of terms (Term
weight assignment)
17. Orthogonal Projection
Vectors are projected onto the hyperplane orthogonal to the average
language vector va: v∗t = vt − (vt · va) va
Removing the stop-wordiness improves the discriminating power
Projected vectors are normalized
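A minimal sketch of this projection step, assuming va has unit length (toy vectors for illustration):

```python
import numpy as np

def orthogonal_project(vt, va):
    """Remove the component of vt along the (unit-length) average language
    vector va, then renormalise: v*_t = vt - (vt . va) va."""
    v_star = vt - np.dot(vt, va) * va
    return v_star / np.linalg.norm(v_star)

va = np.array([1.0, 0.0, 0.0])       # toy unit "average language" vector
vt = np.array([0.8, 0.5, 0.3])       # toy term vector
v_star = orthogonal_project(vt, va)

# The projected vector is orthogonal to va: its "stop-wordiness" is gone.
print(np.dot(v_star, va))
```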
18. Document embedding: term weights
The more similar a term is to the average vector, the less weight it gets:
wt = 1 − cos(vt, va)
No need to remove stop words first
Crucial to get distinctive document embeddings (weighted average)
19. Automatic subject prediction
Naive method based on subject-document similarity
A document is likely to be indexed with subjects that are most related
to it (i.e. with highest cosine similarities)
Non-Parametric (neighbourhood based) method
A document is likely to be indexed similarly to the documents that are
similar to it
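The two methods above can be sketched as follows. The neighbourhood vote shown here (similarity-weighted counting of the neighbours' subjects) is a simplified reading of NPSP, not the exact formulation in the paper; all vectors are assumed L2-normalised, so a dot product equals cosine similarity:

```python
import numpy as np

def naive_predict(doc_vec, subject_vecs, n=5):
    """Naive method: rank all subjects by cosine similarity to the document."""
    scores = subject_vecs @ doc_vec
    return list(np.argsort(-scores)[:n])

def npsp_predict(doc_vec, train_doc_vecs, train_labels, k=3, n=5):
    """Neighbourhood method: find the k most similar training documents,
    then score each subject by the summed similarity of the neighbours
    it indexes. train_labels is one list of subject ids per training doc."""
    sims = train_doc_vecs @ doc_vec
    neighbours = np.argsort(-sims)[:k]
    scores = {}
    for i in neighbours:
        for subj in train_labels[i]:
            scores[subj] = scores.get(subj, 0.0) + sims[i]
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy 2-d example with hypothetical subject names.
train = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = [["surgery"], ["genetics"]]
print(npsp_predict(np.array([0.9, 0.1]), train, labels, k=1))  # -> ['surgery']
```

Note that the neighbourhood method can propose any subject used by a similar document, including rare subjects that a parametric classifier would effectively never output.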
20. Dataset
Metadata of one million MEDLINE articles with abstracts
The training set contains 147,837 unique MeSH subjects
Each article is indexed with, on average, 16 MeSH subjects
222,135 subjects are used to index fewer than 10 articles
95,218 subjects are used to index only one single article
7 subjects are each used to index more than 100K articles
The subject "Humans" alone indexes nearly 58% of the whole corpus
21. Evaluation
We compared our methods against fastText as a baseline
All methods ran on the same server, with 2 Intel Xeon Silver 4109T
8-core processors and 384 GB of memory
The minimum document frequency is 10
The dimensionality of embeddings is 256
10,000 articles for testing
Measure the precision/recall of the top-n predictions per test article,
then take the average2
P@n = (1/n) Σ_{l ∈ r_n(ŷ)} y_l
R@n = (1/‖y‖₀) Σ_{l ∈ r_n(ŷ)} y_l
where r_n(ŷ) is the set of the n highest-ranked labels in the prediction ŷ,
y is the binary ground-truth label vector, and ‖y‖₀ is the number of true labels
2 More metrics can be found in the paper
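For a single document, the two metrics can be computed directly (hypothetical subject ids for illustration):

```python
def precision_recall_at_n(ranked_pred, true_subjects, n):
    """P@n and R@n for one document: ranked_pred is a list of subject ids
    sorted by predicted score, true_subjects is the gold subject set."""
    hits = len(set(ranked_pred[:n]) & set(true_subjects))
    return hits / n, hits / len(true_subjects)

p, r = precision_recall_at_n(["a", "b", "c", "d"], {"b", "d", "e"}, n=3)
# top 3 = {a, b, c}; one hit ("b") -> P@3 = 1/3, R@3 = 1/3
```

Averaging these per-document values over the test set gives the curves reported in the results.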
22. Results
[Figure: average P@n (left) and average R@n (right) of the top-n predictions,
n = 0–100, comparing fastText, Ariadne, and Ariadne + NPSP]
23. A closer look
26. Summary and future work
Automatic subject prediction is a form of eXtreme Multi-label Text
Classification (XMTC) problem
We proposed an embedding-based approach for subject prediction
We need more appropriate evaluation metrics
Domain experts need to get involved
Explore deep learning methods for XMTC
27. Thank you
Non-parametric Subject Prediction
Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2
1OCLC Research
2University of Twente
28. STS scores and train times of different methods
Method               Dev    Test   Dataset used     Train time
Doc2Vec (PV-DBOW)    72.2   64.9   AP-NEWS          Unknown
                     31.0   27.0   S.E. Wikipedia   23m
                     53.7   49.8   MEDLINE          2h3m
FastText             65.3   53.6   Wikipedia        Unknown
                     48.6   38.9   S.E. Wikipedia   51m
                     52.8   41.1   MEDLINE          3h4m
Sent2Vec             78.7   75.5   Twitter          Unknown
                     68.8   62.2   S.E. Wikipedia   18m
                     65.4   55.0   MEDLINE          3h10m
Our method           75.1   64.6   S.E. Wikipedia   17s
                     73.6   58.9   MEDLINE          43s