2. 0. Paper
• GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
• Authors: Kexin Wang, Nandan Thakur, Nils Reimers, Iryna Gurevych
• Published: 2021.12 (arXiv)
• https://arxiv.org/abs/2112.07577
3. 0. Preliminaries
• Information Retrieval
• The task of finding documents relevant to a query (relevant = able to answer it)
• Open-domain QA: IR + MRC
• Method: select the document with the highest score (similarity) with respect to the query
• Sparse embedding vs Dense embedding
• Sparse works well for keywords/proper nouns; dense works well for synonyms/paraphrases
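The contrast above can be made concrete with a toy example (not from the paper; the two-dimensional "embeddings" below are made up): sparse scoring only rewards exact term overlap, while dense scoring can reward a synonym whose learned vector points in a similar direction.

```python
# Toy sparse vs. dense scoring. The embedding table is a hypothetical
# stand-in for a learned encoder, chosen so that synonyms are close.

def sparse_score(query_tokens, doc_tokens):
    """Bag-of-words overlap: only exact term matches contribute."""
    return sum(1 for t in query_tokens if t in doc_tokens)

# Hypothetical 2-d embeddings where "car" and "automobile" align.
EMB = {
    "car": [1.0, 0.0],
    "automobile": [0.95, 0.1],
    "banana": [0.0, 1.0],
}

def dense_score(query_tokens, doc_tokens):
    """Sum of dot products between query and document token vectors."""
    return sum(
        sum(a * b for a, b in zip(EMB[q], EMB[d]))
        for q in query_tokens
        for d in doc_tokens
    )

print(sparse_score(["car"], ["automobile"]))  # 0: no exact keyword overlap
print(dense_score(["car"], ["automobile"]))   # 0.95: synonym still scores high
```

The sparse scorer misses the paraphrase entirely, which is exactly the gap dense retrieval is meant to close.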
4. 0. Preliminaries
- Retriever (bi-encoder): fast search via Maximum Inner Product Search, but weaker performance
- Reranker (cross-encoder): strong performance, but very slow
- Pipeline: Retriever -> Reranker -> Reader
6. 1. Introduction
• Recently, information retrieval methods based on dense vector spaces have become popular to
address the limitations of sparse vectors.
• Dense retrieval methods require large amounts of training data to work well.
• Dense retrieval methods are extremely sensitive to domain shifts.
• Models trained on MS MARCO perform rather poorly on questions about COVID-19 scientific
literature.
• Models did not learn how to represent this topic well in a vector space.
• We present Generative Pseudo Labeling (GPL), an unsupervised domain adaptation for dense
retrieval models.
7. 2. Method
• For a given target corpus, we generate three queries for each passage using a T5 encoder-decoder
model.
• For each of the generated queries, we use an existing retrieval system to retrieve 50 negative
passages.
• For each (query, positive, negative) – tuple we compute the margin score using cross-encoder.
• Train the bi-encoder with margin score.
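The four steps above can be sketched as a data-generation loop. The functions below are toy stand-ins for the real components (docT5query, the negative-mining dense retrievers, and the cross-encoder teacher), not the paper's actual models; only the control flow mirrors GPL.

```python
# Sketch of GPL training-data generation with stub models.
import random

def generate_queries(passage, n=3):
    # Stand-in for docT5query: sample n synthetic queries per passage.
    return [f"query {i} about: {passage[:20]}" for i in range(n)]

def mine_negatives(query, corpus, k=50):
    # Stand-in for a dense retriever: GPL retrieves k similar passages
    # as hard negatives for each generated query.
    return random.sample(corpus, min(k, len(corpus)))

def cross_encoder_score(query, passage):
    # Stand-in for the cross-encoder teacher's relevance score.
    return len(set(query.split()) & set(passage.split()))

def build_gpl_training_data(corpus, queries_per_passage=3):
    examples = []
    for pos in corpus:
        for q in generate_queries(pos, queries_per_passage):
            for neg in mine_negatives(q, corpus):
                # Teacher margin = CE(q, pos) - CE(q, neg); the student
                # bi-encoder is later trained to reproduce this margin.
                margin = cross_encoder_score(q, pos) - cross_encoder_score(q, neg)
                examples.append((q, pos, neg, margin))
    return examples

corpus = ["passage about covid vaccines", "passage about dense retrieval"]
data = build_gpl_training_data(corpus)
print(len(data))  # 12: 2 passages x 3 queries x 2 mined negatives
```

In the real pipeline, each component is a trained neural model and `k` is 50; the structure of the emitted (query, positive, negative, margin) tuples is the same.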
8. 2. Method
• Multiple Negatives Ranking loss considers only the coarse relationship between queries and
passages, i.e. the matching passage is considered relevant while all other passages are
considered irrelevant.
• However, the query generator might generate queries that are not answerable by the passage.
Further, other passages might actually be relevant as well for a given query.
• MarginMSE loss uses a powerful cross-encoder to soft-label (query, passage) pairs. It then teaches
the dense retriever to mimic the score margin between the positive and negative query-passage
pairs.
In GPL,
- Bad query -> low positive score from the cross-encoder -> the pair is kept distant
- False negative -> high negative score from the cross-encoder -> the pair is kept similar
MarginMSE Loss
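A miniature worked example of MarginMSE (numbers are made up): the student is trained to reproduce the teacher's score *margin* rather than hard 1/0 labels, which is what makes bad queries and false negatives harmless.

```python
# MarginMSE in miniature: squared error between the student's and the
# teacher's (positive - negative) score margins.

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return (student_margin - teacher_margin) ** 2

# Bad generated query: the teacher gives the "positive" a low score, so
# the target margin is small and the student is not forced to pull the
# pair together.
print(margin_mse(student_pos=0.2, student_neg=0.1,
                 teacher_pos=0.3, teacher_neg=0.2))  # ~0

# False negative: the teacher scores the "negative" almost as high as
# the positive, so keeping them similar incurs almost no penalty.
print(margin_mse(student_pos=0.9, student_neg=0.8,
                 teacher_pos=0.9, teacher_neg=0.85))
```

With a hard-labeled loss like MNRL, both cases would push the model in the wrong direction; with soft margins, the penalty shrinks toward zero.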
9. 3. Experiments
• Query generator: docT5query
• Negative miner(Retriever): msmarco-distilbert-base-v3, msmarco-MiniLM-L-6-v3
• Mine 50 negatives with each retriever, then uniformly sample from the pool
• Cross encoder: msmarco-MiniLM-L-6-v2
• Student: MS MARCO DistilBERT + Mean pooling + Dot product
• 140k training steps, batch size 32 (no need for a large batch size!)
Experimental Setup
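The student bi-encoder above scores with mean pooling plus dot product; a minimal pure-Python version of that scoring head (the vectors are illustrative, not real model outputs):

```python
# Mean pooling over token embeddings, then dot-product similarity,
# as used by the MS MARCO DistilBERT student.

def mean_pool(token_embeddings):
    """Average token vectors into one fixed-size text embedding."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]

def dot(u, v):
    """Unnormalized dot-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))

query_vec = mean_pool([[1.0, 0.0], [0.0, 1.0]])  # -> [0.5, 0.5]
doc_vec = mean_pool([[1.0, 1.0], [1.0, 1.0]])    # -> [1.0, 1.0]
print(dot(query_vec, doc_vec))                   # 1.0
```

Using an unnormalized dot product (rather than cosine) lets the model express the absolute score magnitudes that MarginMSE distillation needs.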
10. 3. Experiments
• Six domain-specific text retrieval tasks from the BeIR benchmark
• Evaluation is done using nDCG@10
• Goal: rank more relevant documents higher!
Evaluation
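The nDCG@10 metric used above can be computed in a few lines: gains are discounted by rank and normalized against the ideal ordering, so a perfect ranking scores 1.0.

```python
# nDCG@10: discounted cumulative gain at cutoff k, normalized by the
# DCG of the ideal (relevance-sorted) ranking.
import math

def dcg_at_k(relevances, k=10):
    """relevances[i] is the graded relevance of the doc ranked i-th."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0, 0]))  # 1.0: relevant docs ranked first
print(ndcg_at_k([0, 2, 3, 0]))  # < 1.0: relevant docs pushed down
```

The log-rank discount is what makes the metric care most about the top of the ranking, matching the "rank more relevant documents higher" goal.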
• Zero-Shot
• MS MARCO: DistilBERT dense retriever trained with MarginMSE
• BM25: lexical matching from Elasticsearch
• Pre-Training based Domain Adaptation
• SimCSE: encode the same sentence with different dropout masks + MNRL loss
• ICT: sample one sentence from the passage as the pseudo query
• TSDAE: denoising autoencoder
• Generation-based Domain Adaptation
• QGen: generated queries + Multiple Negatives Ranking loss
Baselines
12. 5. Analysis
• GPL begins to saturate after around 100K steps.
• With TSDAE pre-training, performance improves consistently.
Influence of Training Steps
Influence of Corpus Size
• We find that with more than 10K passages, GPL can already outperform the zero-shot baseline
13. 5. Analysis
• Generating 3 queries per passage appears to be optimal; generating more queries per passage
does not yield further improvements.
Robustness against Query Generation
Sensitivity to Starting Checkpoints
• We also evaluate directly fine-tuning a DistilBERT model using QGen
14. 6. Conclusion
• In this work we propose GPL, a novel unsupervised domain adaptation method
for dense retrieval models.
• Pseudo-labeling overcomes two important shortcomings of previous methods.
• Not all generated queries are of high quality
• Training with mined hard negatives can be noisy
• We observe GPL performs well on all the datasets and significantly outperforms
other approaches.
• As a limitation, GPL requires a relatively complex training setup; future work
can focus on simplifying this training pipeline.
Editor's Notes
The bi-encoder trained on MS MARCO performs worse than BM25.
Even with a cross-encoder reranker, the BM25 retriever beats the MS MARCO retriever.
Among pre-training-based domain adaptation methods, TSDAE is the best.
In all other comparisons, GPL is the best.
Training DistilBERT with TSDAE first and then GPL is even better.
Reranking is better still.