1/37
Cross-lingual information retrieval
Shadi Saleh
Institute of Formal and Applied Linguistics
Charles University
saleh@ufal.mff.cuni.cz
27 Nov. 2017
2/37
Information Retrieval Task
Definition
Information retrieval (IR) is finding material (usually doc-
uments) of an unstructured nature (usually text) that satis-
fies an information need from within large collections (usually
stored on computers).
source: trec.nist.gov
3/37
Information Retrieval Task
4/37
Information Retrieval Task
The heat-map study (the "golden triangle") was conducted by Enquiro,
Eyetools, and Didit with search engine users.
5/37
Monolingual IR system structure
6/37
IR evaluation
The IR system returns a ranked list of documents (scored by degree
of relevance)
Users are interested in the top k documents
Development:
Set of documents
Set of training/test queries
Metric: P@10, the percentage of relevant documents among the
top 10 retrieved
How are documents judged relevant/irrelevant? The assessment
process
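As a minimal sketch, P@10 for a single query can be computed from the ranked list and a set of judged-relevant document IDs (the function name and toy IDs below are illustrative, not from the slides):

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

# Example: 3 of the top 10 results are judged relevant -> P@10 = 0.3
ranking = [f"d{i}" for i in range(1, 21)]
relevant = {"d2", "d5", "d9", "d40"}
print(precision_at_k(ranking, relevant))  # 0.3
```

The slides report P@10 as a percentage (e.g. 50.30), i.e. this fraction multiplied by 100.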
7/37
Data & tools
CLEF eHealth 2015 IR task document collection (corpus)
For searching, queries from CLEF eHealth IR tasks
2013–2015, 166 queries in total
Queries were created in 2013 and 2014 by medical experts
In 2015, queries were created to simulate the way laypeople
write queries
Randomly split into 100 queries for training, 66 for test
Relevance assessment is done by medical experts
8/37
Sample query: CLEF 2013
<topic>
<id>qtest4</id>
<title>nausea and vomiting and hematemesis</title>
<desc>What are nausea, vomiting and hematemesis</desc>
<narr>What is the connection with nausea, vomiting and hematemesis</narr>
<profile>A 64-year old emigrant who is not sure what nausea, vomiting and hematemesis mean in his discharge summary</profile>
</topic>
9/37
Sample queries: CLEF 2015
<topic>
<id>clef2015.test.9</id>
<title>red itchy eyes</title>
</topic>
<topic>
<id>clef2015.test.16</id>
<title>red patchy bruising over legs</title>
</topic>
<topic>
<id>clef2015.test.44</id>
<title>nail getting dark</title>
</topic>
10/37
Assessment process
11/37
Monolingual experiment
Indexing and searching are done using Terrier (an open-source
IR system)1
A set of tuning experiments
P@10: 47.10 (training set) and 50.30 (test set)
1
http://terrier.org
12/37
Cross-lingual IR problem
Definition
Cross-lingual information retrieval (CLIR) allows a user to
pose a query in their native language and retrieve documents
in a different language.
Czech query
Query: nevolnost a zvracení a hematemeze
13/37
Cross-lingual IR approaches: query translation
[Pipeline diagram: the user poses a query in Czech; an MT system
translates it into an English query; the retrieval system runs it
against the index of English documents and returns a ranked list.]
Query translation reduces the CLIR task to a monolingual task
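The query-translation pipeline can be sketched as two pluggable steps; `translate` and `search` here are hypothetical stand-ins for the MT system and the monolingual retrieval system:

```python
def clir_by_query_translation(query_native, translate, search):
    """Query-translation CLIR: translate the native-language query into
    the collection language, then run ordinary monolingual retrieval."""
    query_en = translate(query_native)   # MT system, e.g. CS -> EN
    return search(query_en)              # ranked list from the EN index

# Toy stand-ins for the MT system and the retrieval system
toy_mt = {"nevolnost a zvracení a hematemeze":
          "nausea and vomiting and hematemesis"}.get
toy_index = {"nausea and vomiting and hematemesis": ["doc12", "doc7"]}
print(clir_by_query_translation("nevolnost a zvracení a hematemeze",
                                toy_mt, toy_index.get))  # ['doc12', 'doc7']
```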
14/37
Cross-lingual data
The 166 English queries were translated by native-speaker medical
experts into Czech, French, German, Hungarian, Polish,
Spanish, and Swedish
The task is reduced to monolingual IR: the same relevance data
apply
15/37
Query translation experiment
Queries in all languages are translated into the collection
language using public online MT systems:
Google Translate
Bing Translator
Sys Czech French German Hungarian Polish Spanish Swedish
Mono 50.30 50.30 50.30 50.30 50.30 50.30 50.30
Google 51.06 49.85 49.55 42.42 43.33 50.61 38.48
Bing 47.88 48.79 46.67 38.79 40.91 50.61 44.70
16/37
Baseline CLIR system
Queries are translated into English using SMT systems developed
by colleagues at UFAL
Trained to translate search queries (medical domain)
Return a list of alternative translations (N-best list)
Sys Czech French German Hungarian Polish Spanish Swedish
Mono 50.30 50.30 50.30 50.30 50.30 50.30 50.30
Baseline 45.76 47.88 42.58 40.76 36.82 44.09 36.67
Google 51.06 49.85 49.55 42.42 43.33 50.61 38.48
Bing 47.88 48.79 46.67 38.79 40.91 50.61 44.70
17/37
Reranking approach
Motivation
The single best translation returned by the SMT system
is not selected w.r.t. CLIR performance.
[Figure: histograms of the ranks (1-20) of the translation hypotheses
with the highest P@10 for each training query, for Czech, French, and
German.]
18/37
Reranking approach
Trained to select the best translation for CLIR performance
P@10 as an objective function (predict the translation that
gives the highest P@10)
[Pipeline diagram: the Czech query (e.g. "nevolnost a zvracení a
hematemeze") is translated by the MT system into an N-best list of
English hypotheses; the reranker selects one EN query, which the
retrieval system runs against the index of English documents to
return a ranked list.]
19/37
Feature set
SMT scores: Translation model, language model and
reordering models
Rank features: SMT rank and a Boolean feature (1 for best
rank, 0 otherwise)
Features based on blind relevance feedback
IDF from the collection (inverse document frequency)
Translation pool
Retrieval status value
Features that are based on external resources (UMLS1,
Wikipedia)
1
The Unified Medical Language System: a large, multi-purpose,
multilingual thesaurus that contains millions of biomedical and
health-related concepts
20/37
Training
100 queries for training, 15-best-list hypotheses for each query.
Two approaches for training:
Language-Specific: Model for each language
Language-Independent: One model for all languages
Leave-One-Out cross validation
21/37
Reranker testing
Generate vectors of feature values for each query
The trained regression model predicts the hypothesis that
gives the highest P@10
Run retrieval using that hypothesis as the query string
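The selection step above can be sketched as follows; the linear "regression model" and the toy feature values are illustrative assumptions, not the actual trained model:

```python
def rerank(hypotheses, feature_fn, weights):
    """Return the translation hypothesis whose feature vector gets the
    highest predicted P@10 under a (toy) linear regression model."""
    def predict(hyp):
        return sum(w * f for w, f in zip(weights, feature_fn(hyp)))
    return max(hypotheses, key=predict)

# Toy features per hypothesis: (SMT model score, is-top-ranked flag)
hyps = [("nausea vomiting blood", (-2.0, 1.0)),
        ("nausea and vomiting and hematemesis", (-2.5, 0.0))]
best = rerank([h for h, _ in hyps],
              feature_fn=dict(hyps).get,
              weights=(0.1, -0.5))
print(best)  # the hypothesis with the highest predicted score
```

With these made-up weights the reranker overrides the SMT system's top-ranked hypothesis, which is exactly the behavior the approach is after.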
22/37
Results - test set
Results of the final evaluation on the test set queries
Czech French German
system P@10 P@10 P@10
Mono 50.30 50.30 50.30
Baseline 45.61 47.73 42.42
Reranker 50.15 51.06 45.30
Google 50.91 49.70 49.39
Bing 47.88 48.64 46.52
Improvements: 9 queries in Czech, 15 queries in German, and
8 queries in French
Degradations: 2 cases for Czech, 4 cases for German, and 3
cases for French
23/37
System comparisons
Examples of translations of training queries including reference (ref), oracle
(ora), baseline (base), and best (best) translations (system Reranker). The
scores in parentheses refer to query P@10 scores.
24/37
Adapting reranker to new languages
25/37
Queries in new languages
New SMT systems (Spanish, Hungarian, Polish, and Swedish)
were recently developed, also within Khresmoi
Human experts translated the original English queries into these
languages under the KConnect project
We want to develop a CLIR system for these languages
26/37
Adapting reranker
To adapt the reranker, two sources of data are used to create the
training set:
Merged data from the existing languages (Czech, French, and
German)
Data from each new language (Spanish, Hungarian, Polish,
and Swedish)
These data are used to create language-independent models
27/37
Language-independent model performance
Final evaluation results of language-independent models on the test set
Spanish Hungarian Polish Swedish
system P@10 P@10 P@10 P@10
Mono 50.30 50.30 50.30 47.10
Baseline 44.09 40.76 36.82 36.67
Reranker 46.36 43.18 36.67 38.79
28/37
Document translation
SMT systems have improved significantly in recent years
All research on document translation (DT) is quite old!
We reinvestigate the research question of whether query
translation (QT) is really better than DT
29/37
Document translation
Queries are posed by users in their language
Translate the English collection into: Czech, French and
German
Create separate index for each language
Perform the retrieval using original query and the relevant
index
[Pipeline diagram: the MT system translates the English documents into
Czech; the indexer builds a Czech index from them; the user's Czech
query is run directly against that index to return a ranked list.]
30/37
Morphological processing
Both queries and documents are processed as follows:
Translate into Czech, French and German
Stemming using the Snowball stemmer1
Lemmatizing using TreeTagger for French and German2 and
MorphoDiTa for Czech3
1
http://snowball.tartarus.org/
2
http://www.cis.uni-muenchen.de/˜schmid/tools/TreeTagger
3
http://ufal.mff.cuni.cz/morphodita
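To illustrate why stemming helps matching in morphologically rich languages, here is a crude suffix-stripping toy (this is not the Snowball algorithm; the suffix list is invented for the example):

```python
def toy_stem(word, suffixes=("ování", "ení", "ými", "ách", "y", "a", "u")):
    """Strip the longest matching suffix, keeping a minimal stem.
    Real systems use Snowball stemmers or lemmatizers (MorphoDiTa)."""
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 2:
            return word[:-len(s)]
    return word

print(toy_stem("zvracení"))  # 'zvrac' -> matches other forms of the word
print(toy_stem("nosy"))      # 'nos'
```

Collapsing inflected forms onto a shared stem lets a query term match documents that use a different case or number of the same word.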
31/37
Results - Document Translation
Results of the final evaluation on the test set queries
Czech French German
system P@10 P@10 P@10
Mono 50.30 50.30 50.30
Baseline 45.61 47.73 42.42
DT 37.42 41.67 36.21
DT Stem 41.67 42.73 36.67
DT Lem 39.39 41.06 33.18
32/37
Query expansion
Users sometimes fail to create a query that represents their
information need
Query expansion is the process of adding terms to the query
(also called query reformulation)
Our approach is based on a machine learning model
33/37
Query expansion
Algorithm
Get the 20-best-list translations for each query
Create a translation pool as a bag of words from these
translations
Use the best translation as the original query
The model predicts the term which will give the highest P@10
when it is added to the original query
Features: IDF, TF (pool), similarity between term and query
(word embeddings)
Expand the query with one term from the translation pool
Run the retrieval using our baseline setting with the
expanded queries
The translation pool was limited for some queries, so we expand it
with terms from Wikipedia articles
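The expansion step can be sketched as picking the pool term with the highest predicted gain; the scoring function here is a hypothetical stand-in for the trained model:

```python
def expand_query(query, pool, score):
    """Append the single pool term with the highest predicted score
    (in the real system: the predicted P@10 after adding the term)."""
    candidates = [t for t in pool if t not in query.split()]
    best_term = max(candidates, key=score)
    return f"{query} {best_term}"

# Toy pool with made-up scores standing in for model predictions
pool = {"oral": 0.4, "cavity": 0.9, "white": 0.1}
print(expand_query("white coating mouth", pool, pool.get))
# -> "white coating mouth cavity"
```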
34/37
Results - test set
Results of the final evaluation on the test set queries
Czech French German
system P@10 P@10 P@10
Mono 50.30 50.30 50.30
Baseline 45.61 47.73 42.42
QE 42.12 46.21 37.88
35/37
Query expansion (QE) improved on average 10 queries over the
baseline system; assessment coverage is only 60% so far, so we are
waiting for the assessment to be completed.
35/37
Query expansion examples
Mono: white patchiness in mouth P@10: 10.00
Base: white coating mouth P@10: 10.00
Expanded: white coating mouth oral cavity P@10: 70.00
Mono: SOB P@10: 50.00
Base: dyspnoea P@10: 60.00
Expanded: dyspnoea rash breathing dyspnea P@10: 70.00
36/37
Conclusion and future work
Monolingual IR system evaluation and assessment
Cross-lingual IR approaches:
Query translation
Document translation and morphological analysis
Query expansion based on translation pool and Wikipedia
Reranking model to predict, for each query, which translation
hypothesis gives the highest P@10
Contribution to the CLIR community by releasing a dataset with
high coverage (document/query pairs)
37/37
Our publications
Shadi Saleh and Pavel Pecina. CUNI at the ShARe/CLEF eHealth Evaluation
Lab 2014. In Working Notes of CLEF 2014 - Conference and Labs of the
Evaluation Forum, Sheffield, UK, 2014
Shadi Saleh, Feraena Bibyna, and Pavel Pecina. CUNI at the CLEF eHealth 2015
Task 2. In Working Notes of CLEF 2015 - Conference and Labs of the
Evaluation Forum, CEUR-WS, Toulouse, France, 2015
Shadi Saleh and Pavel Pecina. Adapting SMT Query Translation Reranker to
New Languages in Cross-Lingual Information Retrieval. In Medical Information
Retrieval (MedIR) Workshop, Association for Computational Linguistics, Pisa,
Italy, 2016
Shadi Saleh and Pavel Pecina. Reranking Hypotheses of Machine-Translated
Queries for Cross-Lingual Information Retrieval. In Experimental IR Meets
Multilinguality, Multimodality, and Interaction: 7th International Conference of
the CLEF Association, CLEF 2016, Evora, Portugal, 2016
Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval:
Team CUNI. In CLEF 2016 Working Notes, CEUR-WS, Evora, Portugal, 2016
Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval:
Team CUNI. In CLEF 2017 Working Notes, CEUR-WS, Dublin, Ireland, 2017

Weitere ähnliche Inhalte

Was ist angesagt?

4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia dataminingKrish_ver2
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrievalNanthini Dominique
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxShivaVemula2
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social mediaJeremiah Fadugba
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrievalKU Leuven
 
Multimedia Database
Multimedia Database Multimedia Database
Multimedia Database Avnish Patel
 
Information retrieval (introduction)
Information  retrieval (introduction) Information  retrieval (introduction)
Information retrieval (introduction) Primya Tamil
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval ModelsNisha Arankandath
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text MiningMinha Hwang
 

Was ist angesagt? (20)

Text mining
Text miningText mining
Text mining
 
Multimedia Mining
Multimedia Mining Multimedia Mining
Multimedia Mining
 
Text mining
Text miningText mining
Text mining
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia datamining
 
Vector space model of information retrieval
Vector space model of information retrievalVector space model of information retrieval
Vector space model of information retrieval
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social media
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
 
Multimedia Database
Multimedia Database Multimedia Database
Multimedia Database
 
Information retrieval (introduction)
Information  retrieval (introduction) Information  retrieval (introduction)
Information retrieval (introduction)
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval Models
 
Inverted index
Inverted indexInverted index
Inverted index
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Web search vs ir
Web search vs irWeb search vs ir
Web search vs ir
 

Ähnlich wie Cross-lingual information retrieval approaches explored

Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents
Report on the CLEF-IP 2012 Experiments: Search of Topically Organized PatentsReport on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents
Report on the CLEF-IP 2012 Experiments: Search of Topically Organized PatentsMike Salampasis
 
Mt summit2015 jdu_v2
Mt summit2015 jdu_v2Mt summit2015 jdu_v2
Mt summit2015 jdu_v2Jinhua Du
 
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...Nawanan Theera-Ampornpunt
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNAVER Engineering
 
Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Aliaksandr Birukou
 
Multi-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-LatvianMulti-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-LatvianMatīss ‎‎‎‎‎‎‎  
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptbutest
 
Ethnograph 11 Jul07
Ethnograph 11 Jul07Ethnograph 11 Jul07
Ethnograph 11 Jul07Clara Kwan
 
ntcir14centre-overview
ntcir14centre-overviewntcir14centre-overview
ntcir14centre-overviewTetsuya Sakai
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)Konstantinos Zagoris
 
Semantic-based Process Analysis
Semantic-based Process AnalysisSemantic-based Process Analysis
Semantic-based Process AnalysisMauro Dragoni
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia VoulibasiISSEL
 
Course-Adaptive Content Recommender for Course Authoring
Course-Adaptive Content Recommender for Course AuthoringCourse-Adaptive Content Recommender for Course Authoring
Course-Adaptive Content Recommender for Course AuthoringPeter Brusilovsky
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeDr. Haxel Consult
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010ivan provalov
 
Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...
Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...
Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...Association for Computational Linguistics
 
Answer extraction and passage retrieval for
Answer extraction and passage retrieval forAnswer extraction and passage retrieval for
Answer extraction and passage retrieval forWaheeb Ahmed
 

Ähnlich wie Cross-lingual information retrieval approaches explored (20)

Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents
Report on the CLEF-IP 2012 Experiments: Search of Topically Organized PatentsReport on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents
Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents
 
Mt summit2015 jdu_v2
Mt summit2015 jdu_v2Mt summit2015 jdu_v2
Mt summit2015 jdu_v2
 
Symbexecsearch
SymbexecsearchSymbexecsearch
Symbexecsearch
 
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Naver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltcNaver learning to rank question answer pairs using hrde-ltc
Naver learning to rank question answer pairs using hrde-ltc
 
Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...Creating a dataset of peer review in computer science conferences published b...
Creating a dataset of peer review in computer science conferences published b...
 
Multi-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-LatvianMulti-system machine translation using online APIs for English-Latvian
Multi-system machine translation using online APIs for English-Latvian
 
kantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.pptkantorNSF-NIJ-ISI-03-06-04.ppt
kantorNSF-NIJ-ISI-03-06-04.ppt
 
Ethnograph 11 Jul07
Ethnograph 11 Jul07Ethnograph 11 Jul07
Ethnograph 11 Jul07
 
ntcir14centre-overview
ntcir14centre-overviewntcir14centre-overview
ntcir14centre-overview
 
My experiment
My experimentMy experiment
My experiment
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
 
Semantic-based Process Analysis
Semantic-based Process AnalysisSemantic-based Process Analysis
Semantic-based Process Analysis
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
Course-Adaptive Content Recommender for Course Authoring
Course-Adaptive Content Recommender for Course AuthoringCourse-Adaptive Content Recommender for Course Authoring
Course-Adaptive Content Recommender for Course Authoring
 
II-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent OfficeII-SDV 2017: Towards Semantic Search at the European Patent Office
II-SDV 2017: Towards Semantic Search at the European Patent Office
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
 
Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...
Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...
Daniel Preotiuc-Pietro - 2015 - An analysis of the user occupational class th...
 
Answer extraction and passage retrieval for
Answer extraction and passage retrieval forAnswer extraction and passage retrieval for
Answer extraction and passage retrieval for
 

Kürzlich hochgeladen

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 

Kürzlich hochgeladen (20)

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 

Cross-lingual information retrieval approaches explored

  • 1. Cross-lingual information retrieval
    Shadi Saleh
    Institute of Formal and Applied Linguistics, Charles University
    saleh@ufal.mff.cuni.cz
    27 Nov. 2017
  • 2. Information Retrieval Task
    Definition: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
    Source: trec.nist.gov
  • 5. Information Retrieval Task
    The heat-map study (the "golden triangle") was conducted by Enquiro, Eyetools, and Didit with search-engine users.
  • 7. IR evaluation
    An IR system returns a ranked list of documents, scored by degree of relevance.
    Users are interested in the top k documents.
    Development setup: a set of documents and a set of training/test queries.
    Metric: P@10, the percentage of relevant documents among the top 10 retrieved ones.
    How to judge documents relevant/irrelevant? The assessment process.
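The P@10 metric used throughout these slides can be sketched in a few lines; a minimal illustration, where the ranked list and relevance judgments are hypothetical inputs:

```python
def precision_at_10(ranked_doc_ids, relevant_ids, k=10):
    """P@k: the fraction of the top-k retrieved documents judged relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

# Hypothetical run: 10 retrieved documents, 4 of them judged relevant.
ranked = [f"d{i}" for i in range(1, 11)]
relevant = {"d1", "d3", "d5", "d7"}
print(precision_at_10(ranked, relevant))  # 0.4
```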
  • 8. Data & tools
    CLEF eHealth 2015 IR task document collection (corpus).
    For searching, queries from the CLEF eHealth IR tasks 2013–2015, 166 queries in total.
    Queries were created in 2013 and 2014 by medical experts.
    In 2015, queries were created to simulate the way laypeople write queries.
    Randomly split into 100 queries for training and 66 for testing.
    Relevance assessment is done by medical experts.
  • 9. Sample query: CLEF 2013
    <topic>
      <id>qtest4</id>
      <title>nausea and vomiting and hematemesis</title>
      <desc>What are nausea, vomiting and hematemesis</desc>
      <narr>What is the connection with nausea, vomiting and hematemesis</narr>
      <profile>A 64-year old emigrant who is not sure what nausea, vomiting and hematemesis mean in his discharge summary</profile>
    </topic>
  • 10. Sample queries: CLEF 2015
    <topic>
      <id>clef2015.test.9</id>
      <title>red itchy eyes</title>
    </topic>
    <topic>
      <id>clef2015.test.16</id>
      <title>red patchy bruising over legs</title>
    </topic>
    <topic>
      <id>clef2015.test.44</id>
      <title>nail getting dark</title>
    </topic>
  • 14. Monolingual experiment
    Indexing and searching are done using Terrier, an open-source IR system (http://terrier.org).
    A set of tuning experiments.
    P@10: 47.10 (training set) and 50.30 (test set).
  • 15. Cross-lingual IR problem
    Definition: Cross-lingual information retrieval (CLIR) allows a user to pose a query in their native language and retrieve documents in a different language.
    Czech query example: nevolnost a zvracení a hematemeze? (nausea and vomiting and hematemesis?)
  • 16. Cross-lingual IR approaches: query translation
    Pipeline: the user poses a query (CS); the MT system produces an EN query; the indexer builds an index over the EN documents; the retrieval system returns a ranked list of the top-k documents.
    This reduces the CLIR task to a monolingual task.
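The query-translation reduction boils down to two steps; a minimal sketch, where `translate` and `retrieve` are hypothetical stand-ins for the MT system and the indexed (e.g. Terrier) retrieval system:

```python
def clir_by_query_translation(query_cs, translate, retrieve, k=10):
    """Query-translation CLIR: translate the source-language (e.g. Czech)
    query into the collection language (English), then run ordinary
    monolingual retrieval over the English index."""
    query_en = translate(query_cs)   # MT system: CS query -> EN query
    ranked = retrieve(query_en)      # monolingual retrieval over EN index
    return ranked[:k]                # top-k ranked documents for the user
```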
  • 17. Cross-lingual data
    The 166 English queries were translated by native medical experts into Czech, French, German, Hungarian, Polish, Spanish and Swedish.
    The task is reduced to monolingual IR: the same relevance data apply.
  • 18. Query translation experiment
    Translate the queries in all languages into the collection language using public online MT systems: Google Translate and Bing Translator.

    Sys     Czech  French  German  Hungarian  Polish  Spanish  Swedish
    Mono    50.30  50.30   50.30   50.30      50.30   50.30    50.30
    Google  51.06  49.85   49.55   42.42      43.33   50.61    38.48
    Bing    47.88  48.79   46.67   38.79      40.91   50.61    44.70
  • 19. Baseline CLIR system
    Translate the queries into English using SMT systems developed by colleagues at UFAL.
    Trained to translate search queries (medical domain).
    Returns a list of alternative translations (an n-best list).

    Sys       Czech  French  German  Hungarian  Polish  Spanish  Swedish
    Mono      50.30  50.30   50.30   50.30      50.30   50.30    50.30
    Baseline  45.76  47.88   42.58   40.76      36.82   44.09    36.67
    Google    51.06  49.85   49.55   42.42      43.33   50.61    38.48
    Bing      47.88  48.79   46.67   38.79      40.91   50.61    44.70
  • 20. Reranking approach
    Motivation: the single best translation returned by the SMT system is not selected w.r.t. CLIR performance.
    [Figure: histograms of ranks (1–20) of the translation hypotheses with the highest P@10 for each training query, for Czech, French and German]
  • 21. Reranking approach
    Trained to select the best translation for CLIR performance.
    P@10 as an objective function: predict the translation that gives the highest P@10.
    Pipeline: the query (e.g. "nevolnost a zvracení a hematemeze") goes through the MT system to produce an n-best list (EN); the reranker picks the EN query; the retrieval system returns a ranked list of the top-k documents from the EN index.
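The selection step can be sketched as follows; a sketch only, where `model_predict` stands in for the trained regression model and the feature vectors for the SMT/rank/IDF features listed on the next slide:

```python
def rerank_nbest(hypotheses, feature_vectors, model_predict):
    """Pick the translation hypothesis whose predicted P@10 is highest.

    hypotheses      -- n-best translations from the SMT system
    feature_vectors -- one feature vector per hypothesis
    model_predict   -- regression model mapping features -> predicted P@10
    """
    scored = zip(hypotheses, feature_vectors)
    best, _ = max(scored, key=lambda pair: model_predict(pair[1]))
    return best
```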
  • 22. Feature set
    SMT scores: translation model, language model and reordering models.
    Rank features: SMT rank and a Boolean feature (1 for best rank, 0 otherwise).
    Features based on blind relevance feedback.
    IDF (inverse document frequency) from the collection.
    Translation pool.
    Retrieval status value.
    Features based on external resources (UMLS, Wikipedia).
    (UMLS, the Unified Medical Language System, is a large, multi-purpose, multilingual thesaurus that contains millions of biomedical and health-related concepts.)
  • 23. Training
    100 queries for training, with 15-best-list hypotheses for each query.
    Two approaches to training:
    Language-specific: one model for each language.
    Language-independent: one model for all languages.
    Leave-one-out cross-validation.
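Leave-one-out cross-validation holds out each training query exactly once; a minimal sketch of the split generation:

```python
def leave_one_out(queries):
    """Yield (train, held_out) splits: each query is held out exactly
    once while the model is trained on all remaining queries."""
    for i, held_out in enumerate(queries):
        train = queries[:i] + queries[i + 1:]
        yield train, held_out
```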
  • 24. Reranker testing
    Generate vectors of feature values for each query.
    The trained regression model predicts the hypothesis that gives the highest P@10.
    Run retrieval using that hypothesis as the query string.
  • 25. Results - test set
    Results of the final evaluation on the test set queries (P@10):

    system    Czech  French  German
    Mono      50.30  50.30   50.30
    Baseline  45.61  47.73   42.42
    Reranker  50.15  51.06   45.30
    Google    50.91  49.70   49.39
    Bing      47.88  48.64   46.52

    Improvements: 9 queries in Czech, 15 queries in German, and 8 queries in French.
    Degradations: 2 cases for Czech, 4 cases for German, and 3 cases for French.
  • 26. System comparisons
    Examples of translations of training queries, including reference (ref), oracle (ora), baseline (base) and best (best) translations (system Reranker). The scores in parentheses are query P@10 scores.
  • 27. Adapting the reranker to new languages
  • 28. Queries in new languages
    New SMT systems (Spanish, Hungarian, Polish and Swedish) were developed recently, also within Khresmoi.
    Human experts translated the original English queries into these languages under the KConnect project.
    We want to develop a CLIR system for these languages.
  • 29. Adapting the reranker
    To adapt the reranker, two sources of data were used to create the training set:
    Merged data from the existing languages (Czech, French and German).
    Data from each new language (Spanish, Hungarian, Polish and Swedish).
    The data are used to create language-independent models.
  • 30. Language-independent model performance
    Final evaluation results of the language-independent models on the test set (P@10):

    system    Spanish  Hungarian  Polish  Swedish
    Mono      50.30    50.30      50.30   47.10
    Baseline  44.09    40.76      36.82   36.67
    Reranker  46.36    43.18      36.67   38.79
  • 31. Document translation
    In recent years SMT systems have improved significantly.
    All research on document translation (DT) is quite old!
    We reinvestigate the research question whether query translation (QT) is really better than DT.
  • 32. Document translation
    Queries are posed by users in their own language.
    Translate the English collection into Czech, French and German.
    Create a separate index for each language.
    Perform the retrieval using the original query and the corresponding index.
  • 33. Morphological processing
    Both queries and documents are processed as follows:
    Translate into Czech, French and German.
    Stemming using the Snowball stemmer (http://snowball.tartarus.org/).
    Lemmatizing using TreeTagger for French and German (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) and MorphoDiTa for Czech (http://ufal.mff.cuni.cz/morphodita).
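As a rough illustration of what stemming does, here is a naive suffix stripper; this is emphatically not the Snowball algorithm (which applies much richer, language-specific rule sets), just a toy showing the idea of conflating inflected forms:

```python
def naive_stem(token, suffixes=("ation", "ing", "ness", "es", "ed", "s")):
    """Crude English suffix stripper illustrating the idea of stemming.
    Strips the first matching suffix when at least 3 characters remain."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token
```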
  • 34. Results - Document Translation
    Results of the final evaluation on the test set queries (P@10):

    system    Czech  French  German
    Mono      50.30  50.30   50.30
    Baseline  45.61  47.73   42.42
    DT        37.42  41.67   36.21
    DT Stem   41.67  42.73   36.67
    DT Lem    39.39  41.06   33.18
  • 35. Query expansion
    Users sometimes fail to create a query that represents their information need.
    Query expansion is the process of adding terms to the query (also called query reformulation).
    Our approach is based on a machine learning model.
  • 36. Query expansion algorithm
    Get the 20-best-list translations for each query.
    Create a translation pool as a bag of words from these translations.
    Use the best translation as the original query.
    The model predicts the term that gives the highest P@10 when added to the original query.
    Features: IDF, TF (pool), similarity between term and query (word embeddings).
    Expand the query with one term from the translation pool.
    Run the retrieval with the expanded queries using our baseline setting.
    The translation pool was limited for some queries; for these, the pool is expanded with terms from Wikipedia articles.
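The pool-based expansion step can be sketched as follows; a simplification in which candidate terms are scored by pool TF × IDF only (the actual model also uses word-embedding similarity), and where the `idf` table is a hypothetical precomputed input:

```python
from collections import Counter

def expand_query(best_translation, nbest_translations, idf, n_terms=1):
    """Add the highest-scoring pool term(s) to the best translation.

    The pool is a bag of words over the n-best translations; candidate
    terms not already in the query are ranked by TF(pool) * IDF."""
    pool = Counter(t for hyp in nbest_translations for t in hyp.split())
    query_terms = set(best_translation.split())
    scores = {t: tf * idf.get(t, 0.0)
              for t, tf in pool.items() if t not in query_terms}
    expansion = sorted(scores, key=scores.get, reverse=True)[:n_terms]
    return " ".join([best_translation] + expansion)
```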
  • 37. Results - test set
    Results of the final evaluation on the test set queries (P@10):

    system    Czech  French  German
    Mono      50.30  50.30   50.30
    Baseline  45.61  47.73   42.42
    QE        42.12  46.21   37.88
  • 38. Query expansion (QE) improved on average 10 queries over the baseline system; relevance assessment coverage is only 60%, so we wait for the assessment to be completed.
  • 39. Query expansion examples
    Mono: white patchiness in mouth (P@10: 10.00)
    Base: white coating mouth (P@10: 10.00)
    Expanded: white coating mouth oral cavity (P@10: 70.00)

    Mono: SOB (P@10: 50.00)
    Base: dyspnoea (P@10: 60.00)
    Expanded: dyspnoea rash breathing dyspnea (P@10: 70.00)
  • 40. Conclusion and future work
    Monolingual IR system evaluation and assessment.
    Cross-lingual IR approaches:
    Query translation.
    Document translation and morphological analysis.
    Query expansion based on the translation pool and Wikipedia.
    A reranking model to predict, for each query, which translation hypothesis gives the highest P@10.
    Contribution to the CLIR community by releasing a dataset with high coverage (document/query pairs).
  • 41. Our publications
    Shadi Saleh and Pavel Pecina. CUNI at the ShARe/CLEF eHealth Evaluation Lab 2014. In Working Notes of CLEF 2014 - Conference and Labs of the Evaluation Forum, Sheffield, UK, 2014.
    Shadi Saleh, Feraena Bibyna, and Pavel Pecina. CUNI at the CLEF eHealth 2015 Task 2. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, CEUR-WS, Toulouse, France, 2015.
    Shadi Saleh and Pavel Pecina. Adapting SMT Query Translation Reranker to New Languages in Cross-Lingual Information Retrieval. In Medical Information Retrieval (MedIR) Workshop, Association for Computational Linguistics, Pisa, Italy, 2016.
    Shadi Saleh and Pavel Pecina. Reranking Hypotheses of Machine-Translated Queries for Cross-Lingual Information Retrieval. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 7th International Conference of the CLEF Association, CLEF 2016, Evora, Portugal, 2016.
    Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval: Team CUNI. In CLEF 2016 Working Notes, CEUR-WS, Evora, Portugal, 2016.
    Shadi Saleh and Pavel Pecina. Task 3: Patient-Centred Information Retrieval: Team CUNI. In CLEF 2017 Working Notes, CEUR-WS, Dublin, Ireland, 2017.