This document discusses cross-lingual information retrieval (CLIR). It presents approaches for translating queries from other languages into the document language, including the use of online machine translation systems and the development of a statistical machine translation system. It describes experiments on reranking translation hypotheses to select the one most effective for retrieval and on adapting the reranking model to new languages. Results show the reranking approach improves over the SMT baseline and, for some languages, over the online translation systems. The document also explores document translation and query expansion techniques.
Information Retrieval Task
Definition
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
source: trec.nist.gov
IR evaluation
An IR system returns a ranked list of documents, scored by degree of relevance.
Users are interested in the top k documents.
Development requires:
A set of documents
A set of training/test queries
A metric: P@10, the percentage of relevant documents among the 10 highest-ranked retrieved ones (a minimal sketch follows)
How are documents judged relevant or irrelevant? Through an assessment process.
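To make the metric concrete, here is a minimal sketch of P@10 in Python; the list-of-IDs data layout is an assumption for illustration, not tied to any particular IR system.

def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    # Fraction of relevant documents among the top-k retrieved ones.
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k

# Example: 6 of the top 10 documents are relevant -> P@10 = 0.6 (60.00)
ranked = ["d3", "d7", "d1", "d9", "d2", "d8", "d5", "d4", "d6", "d0"]
relevant = {"d3", "d1", "d2", "d8", "d5", "d6"}
print(precision_at_k(ranked, relevant))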
Data & tools
The CLEF eHealth 2015 IR task document collection (corpus)
For searching, queries from the CLEF eHealth IR tasks 2013–2015, 166 queries in total
Queries were created by medical experts in 2013 and 2014
In 2015, queries were created to simulate the way laypeople write queries
Randomly split into 100 queries for training and 66 for testing
Relevance assessment is done by medical experts
Sample query: CLEF 2013
<topic>
  <id>qtest4</id>
  <title>nausea and vomiting and hematemesis</title>
  <desc>What are nausea, vomiting and hematemesis</desc>
  <narr>What is the connection with nausea, vomiting and hematemesis</narr>
  <profile>A 64-year-old emigrant who is not sure what nausea, vomiting and hematemesis mean in his discharge summary</profile>
</topic>
Sample queries: CLEF 2015
<topic>
  <id>clef2015.test.9</id>
  <title>red itchy eyes</title>
</topic>
<topic>
  <id>clef2015.test.16</id>
  <title>red patchy bruising over legs</title>
</topic>
<topic>
  <id>clef2015.test.44</id>
  <title>nail getting dark</title>
</topic>
Monolingual experiment
Indexing and searching are done with Terrier, an open-source IR platform (http://terrier.org)
A set of tuning experiments was run
P@10: 47.10 (training set) and 50.30 (test set)
Cross-lingual IR problem
Definition
Cross-lingual information retrieval (CLIR) allows a user to pose a query in their native language and retrieve documents written in a different language.
Czech query
Query: nevolnost a zvracení a hematemeze ('nausea and vomiting and hematemesis')
Cross-lingual IR approaches: query translation
[Pipeline diagram: the user poses a query in Czech; the MT system translates it into an EN query; the indexer builds an index of the English documents; the top-K retrieval system matches the translated query against the index and returns a ranked list of documents.]
This reduces the CLIR task to a monolingual task.
Cross-lingual data
The 166 English queries were translated by native-speaker medical experts into Czech, French, German, Hungarian, Polish, Spanish, and Swedish
The task is thus reduced to monolingual IR, with the same relevance data
Query translation experiment
Queries in all languages are translated into the collection language using public online MT systems:
Google Translate
Bing Translator
System   Czech   French   German   Hungarian   Polish   Spanish   Swedish
Mono     50.30   50.30    50.30    50.30       50.30    50.30     50.30
Google   51.06   49.85    49.55    42.42       43.33    50.61     38.48
Bing     47.88   48.79    46.67    38.79       40.91    50.61     44.70
Baseline CLIR system
Queries are translated into English using SMT systems developed by colleagues at UFAL
The systems are trained to translate search queries (medical domain)
Each returns a list of alternative translations (an n-best list)
System     Czech   French   German   Hungarian   Polish   Spanish   Swedish
Mono       50.30   50.30    50.30    50.30       50.30    50.30     50.30
Baseline   45.76   47.88    42.58    40.76       36.82    44.09     36.67
Google     51.06   49.85    49.55    42.42       43.33    50.61     38.48
Bing       47.88   48.79    46.67    38.79       40.91    50.61     44.70
Reranking approach
Motivation
The single best translation returned by the SMT system is selected by translation quality, not with respect to CLIR performance.
[Figure: histograms of the ranks (1-20) of the translation hypotheses with the highest P@10 for each training query, shown for Czech, French, and German.]
Reranking approach
The reranker is trained to select the translation that is best for CLIR performance
P@10 serves as the objective function (predict the translation that gives the highest P@10)
[Pipeline diagram: the source query (e.g. 'nevolnost a zvracení a hematemeze') is translated by the MT system into an English n-best list; the reranker selects the EN query; the top-K retrieval system runs it against the index of the English documents and returns a ranked list of documents.]
Feature set
SMT scores: translation model, language model, and reordering model scores
Rank features: the SMT rank and a Boolean feature (1 for the best rank, 0 otherwise)
Features based on blind relevance feedback
IDF from the collection (inverse document frequency)
Translation pool
Retrieval status value
Features based on external resources (UMLS, Wikipedia); UMLS, the Unified Medical Language System, is a large, multi-purpose, multilingual thesaurus containing millions of biomedical and health-related concepts
Two of the simpler features are sketched below.
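As an illustration, here is a minimal sketch of two of the simpler features, collection IDF and translation-pool term frequency; the whitespace tokenization and the exact feature definitions are assumptions, not the authors' implementation.

import math
from collections import Counter

def idf(term, doc_freq, num_docs):
    # Inverse document frequency of a term in the collection;
    # doc_freq maps a term to the number of documents containing it.
    return math.log(num_docs / (1 + doc_freq.get(term, 0)))

def pool_tf(hypothesis, translation_pool):
    # Average frequency of the hypothesis terms in the translation pool
    # (the bag of words collected from the n-best translations of a query).
    counts = Counter(translation_pool)
    terms = hypothesis.split()
    return sum(counts[t] for t in terms) / len(terms)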
Training
100 queries for training, with a 15-best list of hypotheses for each query.
Two approaches to training:
Language-specific: one model per language
Language-independent: one model for all languages
Leave-one-out cross-validation
Reranker testing
Generate a vector of feature values for each hypothesis of each query
The trained regression model predicts the hypothesis that gives the highest P@10
Run retrieval with that hypothesis as the query string (see the sketch below)
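To make the train/predict loop concrete, here is a minimal sketch of a regression-based reranker using scikit-learn; the linear model and the data layout are assumptions for illustration, not the actual system.

import numpy as np
from sklearn.linear_model import LinearRegression

def train_reranker(X_train, y_train):
    # X_train[i]: feature vector of one translation hypothesis
    # (SMT model scores, IDF, translation-pool statistics, ...);
    # y_train[i]: P@10 obtained when that hypothesis is used as the query.
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model

def select_hypothesis(model, hypotheses, X):
    # Pick the hypothesis with the highest predicted P@10.
    predicted = model.predict(X)  # one predicted P@10 per hypothesis
    return hypotheses[int(np.argmax(predicted))]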
Results - test set
Results of the final evaluation (P@10) on the test set queries:
System     Czech   French   German
Mono       50.30   50.30    50.30
Baseline   45.61   47.73    42.42
Reranker   50.15   51.06    45.30
Google     50.91   49.70    49.39
Bing       47.88   48.64    46.52
Compared to the baseline, the reranker improved 9 queries in Czech, 15 in German, and 8 in French, and degraded 2 in Czech, 4 in German, and 3 in French.
System comparisons
[Table: examples of translations of training queries, including reference (ref), oracle (ora), baseline (base), and reranker-selected (best) translations; the scores in parentheses are query P@10 scores.]
Queries in new languages
New SMT systems (Spanish, Hungarian, Polish, and Swedish) were recently developed, also within the Khresmoi project.
Human experts translated the original English queries into these languages under the KConnect project.
We want to develop a CLIR system for these languages.
Adapting reranker
To adapt the reranker, two sources of data were used to create the training set:
Merged data from the existing languages (Czech, French, and German)
Data from each new language (Spanish, Hungarian, Polish, and Swedish)
The data is used to create language-independent models (a minimal sketch follows)
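A minimal sketch of merging per-language training data for a language-independent model, assuming per-language feature matrices and P@10 targets laid out as in the reranker sketch above.

import numpy as np

def merge_training_data(per_language_data):
    # Concatenate (X, y) pairs from several source languages into one
    # training set for a language-independent reranker.
    X = np.vstack([X_lang for X_lang, _ in per_language_data])
    y = np.concatenate([y_lang for _, y_lang in per_language_data])
    return X, y

# e.g. merge Czech, French, and German data, then train as before:
# X, y = merge_training_data([(X_cs, y_cs), (X_fr, y_fr), (X_de, y_de)])
# model = train_reranker(X, y)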
Language-independent model performance
Final evaluation results (P@10) of the language-independent models on the test set:
System     Spanish   Hungarian   Polish   Swedish
Mono       50.30     50.30       50.30    47.10
Baseline   44.09     40.76       36.82    36.67
Reranker   46.36     43.18       36.67    38.79
Document translation
In recent years, SMT systems have improved significantly
All research on document translation (DT) is quite old
We reinvestigate the research question of whether query translation (QT) is really better than DT
Document translation
Queries are posed by users in their own language
The English collection is translated into Czech, French, and German
A separate index is created for each language
Retrieval is performed using the original query and the corresponding index
[Pipeline diagram: the MT system translates the English documents into Czech; the indexer builds a Czech index; the user poses a query in Czech; the top-K retrieval system returns a ranked list of documents.]
Morphological processing
Both queries and documents are processed as follows:
Translate into Czech, French, and German
Stemming using the Snowball stemmer (http://snowball.tartarus.org/); see the sketch below
Lemmatizing using TreeTagger for French and German (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) and MorphoDiTa for Czech (http://ufal.mff.cuni.cz/morphodita)
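As an illustration of the stemming step, here is a minimal sketch using NLTK's wrapper around the Snowball stemmer; using NLTK here is an assumption (the slides name only the Snowball stemmer itself), and the lemmatizers would be called through their own TreeTagger and MorphoDiTa interfaces.

from nltk.stem.snowball import SnowballStemmer

def stem_text(text, language):
    # Stem every whitespace-separated token of a query or document.
    stemmer = SnowballStemmer(language)  # e.g. "english", "french", "german"
    return " ".join(stemmer.stem(token) for token in text.split())

print(stem_text("red patchy bruising over legs", "english"))
# -> red patchi bruis over leg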
Results - Document Translation
Results of the final evaluation (P@10) on the test set queries:
System     Czech   French   German
Mono       50.30   50.30    50.30
Baseline   45.61   47.73    42.42
DT         37.42   41.67    36.21
DT Stem    41.67   42.73    36.67
DT Lem     39.39   41.06    33.18
Query expansion
Users sometimes fail to create a query that represents their information need
Query expansion is the process of adding terms to the query (also called query reformulation)
Our approach is based on a machine learning model
Query expansion
Algorithm
Get the 20-best list of translations for each query
Create a translation pool as a bag of words from these translations
Use the best translation as the original query
A model predicts the term that will give the highest P@10 when added to the original query
Features: IDF, TF in the pool, and the similarity between the term and the query (word embeddings)
Expand the query with one term from the translation pool
Run the retrieval in the baseline setting using the expanded queries
The translation pool was limited for some queries, so it is expanded with terms from Wikipedia articles
(A sketch of the term-selection step follows.)
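A minimal sketch of the term-selection step, reusing the regression idea from the reranker sketch; the candidate filtering and the feature function are assumptions for illustration, not the actual system.

import numpy as np

def expand_query(query, translation_pool, term_model, term_features):
    # Add the pool term whose predicted P@10 is highest.
    # term_features(term, query) -> feature vector (IDF, pool TF,
    # embedding similarity to the query, ...); term_model is a trained
    # regression model as in the reranker sketch.
    candidates = [t for t in set(translation_pool) if t not in query.split()]
    X = np.array([term_features(t, query) for t in candidates])
    best = candidates[int(np.argmax(term_model.predict(X)))]
    return query + " " + best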
Results - test set
Results of the final evaluation (P@10) on the test set queries:
System     Czech   French   German
Mono       50.30   50.30    50.30
Baseline   45.61   47.73    42.42
QE         42.12   46.21    37.88
Query expansion (QE) improved on average 10 queries over the baseline system; only about 60% of the results are covered by relevance assessments, so we are waiting for the assessment to be completed.
Query expansion examples
Mono: white patchiness in mouth (P@10: 10.00)
Base: white coating mouth (P@10: 10.00)
Expanded: white coating mouth oral cavity (P@10: 70.00)
Mono: SOB (P@10: 50.00)
Base: dyspnoea (P@10: 60.00)
Expanded: dyspnoea rash breathing dyspnea (P@10: 70.00)
Conclusion and future work
Monolingual IR system evaluation and assessment
Cross-lingual IR approaches:
Query translation
Document translation and morphological analysis
Query expansion based on the translation pool and Wikipedia
A reranking model to predict, for each query, which translation hypothesis gives the highest P@10
A contribution to the CLIR community by releasing a dataset with high coverage of document/query pairs
Our publications
Shadi Saleh and Pavel Pecina. CUNI at the ShARe/CLEF eHealth Evaluation Lab 2014. In: Working Notes of CLEF 2014 - Conference and Labs of the Evaluation Forum, Sheffield, UK, 2014.
Shadi Saleh, Feraena Bibyna, and Pavel Pecina. CUNI at the CLEF eHealth 2015 Task 2. In: Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, CEUR-WS, Toulouse, France, 2015.
Shadi Saleh and Pavel Pecina. Adapting SMT Query Translation Reranker to New Languages in Cross-Lingual Information Retrieval. In: Medical Information Retrieval (MedIR) Workshop, Association for Computational Linguistics, Pisa, Italy, 2016.
Shadi Saleh and Pavel Pecina. Reranking Hypotheses of Machine-Translated Queries for Cross-Lingual Information Retrieval. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 7th International Conference of the CLEF Association, CLEF 2016, Evora, Portugal, 2016.
Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval: Team CUNI. In: CLEF 2016 Working Notes, CEUR-WS, Evora, Portugal, 2016.
Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval: Team CUNI. In: CLEF 2017 Working Notes, CEUR-WS, Dublin, Ireland, 2017.