Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. This is an "extreme multi-label classification" problem, where the objective is to assign a small subset of the most relevant subjects from an extremely large label set. Data sparsity and model scalability are the major challenges we need to address to solve it automatically. In this paper, we describe an efficient and effective embedding method that embeds terms, subjects and documents into the same semantic space, where similarity can be computed easily. We then propose a novel Non-Parametric Subject Prediction (NPSP) method and show how effectively it predicts even very specialised subjects, which are associated with few documents in the training set and are not predicted by state-of-the-art classifiers.
1. Non-parametric Subject Prediction
Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2
1OCLC Research
2University of Twente
Shenghui Wang, Rob Koopman, Gwenn Englebienne ( OCLC Research, University of Twente )Non-parametric Subject Prediction 1 / 24
2. What is subject indexing?
Subject indexing is the act of describing or classifying a document by
index terms or other symbols in order to indicate what the document
is about, to summarize its content or to increase its findability.
Index terms or classes are normally drawn from established knowledge
organization and classification systems, which easily contain tens or
hundreds of thousands of terms or codes
Manual subject indexing is time-consuming and no longer scalable
3. Automatic subject indexing
The goal of automatic subject indexing is to assign a document a
small subset of relevant subject labels (subjects), taken from tens or
hundreds of thousands of target labels.
Existing approaches, as categorised in the ISKO Encyclopedia of Knowledge
Organization:
Text categorisation
Document clustering
Document classification
6. From a machine learning perspective
Different from binary or multi-class classification problems, it is a
multi-label classification problem where the target labels are neither
independent nor mutually exclusive
It is a form of eXtreme Multi-label Text Classification (XMTC)
problem, where the prediction space normally consists of hundreds of
thousands to millions of labels.
Challenges: data sparsity and scalability
8. Our two-step approach: first embed, then predict
Step 1 Embedding subjects and documents in the same semantic
space
Step 2 Leveraging similarities for subject prediction
Naive similarity-based method
Non-parametric (neighbourhood-based) method
9. Word and document embedding
Words and documents are represented as vectors of numbers that
represent their meaning.
Semantically similar words and documents are mapped to nearby
points
A desirable property: computable similarity
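This computable similarity is simply the cosine of the angle between two vectors. A minimal illustration with toy 3-d vectors (not real embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings": semantically similar words point in similar directions.
cat = np.array([0.9, 0.1, 0.0])
dog = np.array([0.8, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9])

# Similar words get a higher cosine score than dissimilar ones.
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))
```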
10. Embedding approaches
Global co-occurrence count-based methods (e.g. LSA)
Based on statistics of how often some word co-occurs with its
neighbour words in a large text corpus
Dimensionality reduction methods
Local context predictive methods (e.g. word2vec)
Learning to predict a word from its neighbours or vice versa
More complex and powerful deep learning models for embedding words,
sentences and documents
11. Weaknesses of deep learning
Pre-trained word embeddings may not capture domain-specific
semantics
Medical information retrieval, special collection exploration, etc.
Computationally expensive to train from scratch
Often requiring GPUs
Complex optimisation, difficult to find the optimal hyperparameter
settings
Standard benchmarks and evaluation methods often do not answer
practical needs
So what is a good cost function?
12. Ariadne semantic embedding
Something fast and discriminative?
1
Koopman, R., Wang, S., Englebienne, G.: Fast and Discriminative Semantic
Embedding. In Proceedings of the 13th International Conference on Computational
Semantics - Long Papers. pp. 235-246. ACL, Gothenburg, Sweden (May 2019)
13. Ariadne semantic embedding
Fast random projection to compute term embeddings
Orthogonal projection to increase discrimination
Term weight assignment for better document embeddings
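The random-projection step can be sketched as follows. This is a minimal illustration of projecting term–context co-occurrence counts onto a low-dimensional space with a random sign matrix; the matrix sizes and this exact construction are illustrative assumptions, not the precise Ariadne algorithm (see the cited paper for that):

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_contexts, d = 1000, 4000, 256   # illustrative sizes only

# Term-context co-occurrence counts (dense random data here for brevity;
# a real corpus yields a large sparse matrix).
C = rng.poisson(0.05, size=(n_terms, n_contexts)).astype(float)

# Random +/-1 projection matrix: by the Johnson-Lindenstrauss lemma this
# approximately preserves pairwise distances while reducing dimensionality.
R = rng.choice([-1.0, 1.0], size=(n_contexts, d)) / np.sqrt(d)

V = C @ R                                   # one d-dimensional vector per term
V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # normalise rows
```

Because the projection is a single (sparse) matrix product, it is very fast compared with iterative training of predictive embedding models.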
14. Fast Random Projection
16. Discriminative
We do not use heuristics to filter out stop words
Task-specific, dataset-specific, . . .
Most terms end up being similar to average language vector va
va corresponds closely to stop words
Our solution:
Cancel out the average language vector from the embeddings
(Orthogonal Projection)
Use the original similarity to va to inform the importance of terms (Term
weight assignment)
17. Orthogonal Projection
Vectors are projected onto the hyperplane orthogonal to the average
language vector va: v∗t = vt − (vt · va) va
Removing the stop-wordiness improves the discriminating power
Projected vectors are normalized
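A minimal sketch of this projection step, assuming va has unit length (toy vectors for illustration):

```python
import numpy as np

def orthogonal_project(vt, va):
    """Remove the component of vt along the (unit-length) average language
    vector va, then renormalise: v*_t = vt - (vt . va) va."""
    v_star = vt - np.dot(vt, va) * va
    return v_star / np.linalg.norm(v_star)

va = np.array([1.0, 0.0, 0.0])       # toy unit "average language" vector
vt = np.array([0.8, 0.5, 0.3])       # toy term vector
v_star = orthogonal_project(vt, va)

# The projected vector is orthogonal to va: its "stop-wordiness" is gone.
print(np.dot(v_star, va))
```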
18. Document embedding: term weights
The more similar a term is to the average vector, the less weight it gets:
wt = 1 − cos(vt, va)
No need to remove stop words first
Crucial to get distinctive document embeddings (weighted average)
19. Automatic subject prediction
Naive method based on subject-document similarity
A document is likely to be indexed with subjects that are most related
to it (i.e. with highest cosine similarities)
Non-Parametric (neighbourhood based) method
A document is likely to be indexed similarly to the documents that are
similar to it
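The two methods above can be sketched as follows. The neighbourhood vote shown here (similarity-weighted counting of the neighbours' subjects) is a simplified reading of NPSP, not the exact formulation in the paper; all vectors are assumed L2-normalised, so a dot product equals cosine similarity:

```python
import numpy as np

def naive_predict(doc_vec, subject_vecs, n=5):
    """Naive method: rank all subjects by cosine similarity to the document."""
    scores = subject_vecs @ doc_vec
    return list(np.argsort(-scores)[:n])

def npsp_predict(doc_vec, train_doc_vecs, train_labels, k=3, n=5):
    """Neighbourhood method: find the k most similar training documents,
    then score each subject by the summed similarity of the neighbours
    it indexes. train_labels is one list of subject ids per training doc."""
    sims = train_doc_vecs @ doc_vec
    neighbours = np.argsort(-sims)[:k]
    scores = {}
    for i in neighbours:
        for subj in train_labels[i]:
            scores[subj] = scores.get(subj, 0.0) + sims[i]
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy 2-d example with hypothetical subject names.
train = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = [["surgery"], ["genetics"]]
print(npsp_predict(np.array([0.9, 0.1]), train, labels, k=1))  # -> ['surgery']
```

Note that the neighbourhood method can propose any subject used by a similar document, including rare subjects that a parametric classifier would effectively never output.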
20. Dataset
Metadata of one million MEDLINE articles with abstracts
The training set contains 147,837 unique MeSH subjects
Each article is indexed with, on average, 16 MeSH subjects
222,135 subjects are used to index fewer than 10 articles
95,218 subjects are used to index only one single article
7 subjects are each used to index more than 100K articles
The subject "Humans" alone indexes nearly 58% of the whole corpus
21. Evaluation
We compared our methods against fastText as a baseline
All methods ran on the same server, with 2 Intel Xeon Silver 4109T
8-core processors and 384 GB of memory
The minimum document frequency is 10
The dimensionality of embeddings is 256
10,000 articles for testing
Measure the precision/recall of the top-n predictions per test article,
then take the average2
P@n = (1/n) Σ_{l ∈ r_n(ŷ)} y_l
R@n = (1/‖y‖₀) Σ_{l ∈ r_n(ŷ)} y_l
where r_n(ŷ) is the set of the n highest-ranked labels in the prediction ŷ,
y is the binary ground-truth label vector, and ‖y‖₀ is the number of true labels
2 More metrics can be found in the paper
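For a single document, the two metrics can be computed directly (hypothetical subject ids for illustration):

```python
def precision_recall_at_n(ranked_pred, true_subjects, n):
    """P@n and R@n for one document: ranked_pred is a list of subject ids
    sorted by predicted score, true_subjects is the gold subject set."""
    hits = len(set(ranked_pred[:n]) & set(true_subjects))
    return hits / n, hits / len(true_subjects)

p, r = precision_recall_at_n(["a", "b", "c", "d"], {"b", "d", "e"}, n=3)
# top 3 = {a, b, c}; one hit ("b") -> P@3 = 1/3, R@3 = 1/3
```

Averaging these per-document values over the test set gives the curves reported in the results.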
22. Results
[Figure: average P@n (left) and average R@n (right) of the top-n predictions,
n = 0–100, comparing fastText, Ariadne, and Ariadne + NPSP]
23. A closer look
26. Summary and future work
Automatic subject prediction is a form of eXtreme Multi-label Text
Classification (XMTC) problem
We proposed an embedding-based approach for subject prediction
We need more appropriate evaluation metrics
Domain experts need to get involved
Explore deep learning methods for XMTC
27. Thank you
Non-parametric Subject Prediction
Shenghui Wang 1 Rob Koopman 1 Gwenn Englebienne 2
1OCLC Research
2University of Twente
28. STS scores and train times of different methods
Method               Dev    Test   Dataset used     Train time
Doc2Vec (PV-DBOW)    72.2   64.9   AP-NEWS          Unknown
                     31.0   27.0   S.E. Wikipedia   23m
                     53.7   49.8   MEDLINE          2h3m
FastText             65.3   53.6   Wikipedia        Unknown
                     48.6   38.9   S.E. Wikipedia   51m
                     52.8   41.1   MEDLINE          3h4m
Sent2Vec             78.7   75.5   Twitter          Unknown
                     68.8   62.2   S.E. Wikipedia   18m
                     65.4   55.0   MEDLINE          3h10m
Our method           75.1   64.6   S.E. Wikipedia   17s
                     73.6   58.9   MEDLINE          43s