1. April 7, 2006
Natural Language Processing/Language
Technology for the Web
Cross-Language Information
Retrieval (CLIR)
Gouranga Charan Jena
Computer Science & Engg., KIIT University.
Guide Name: Dr. Siddharth Swarup Rautaray
2. Cross Language Information Retrieval
(CLIR)
Definition:
“A subfield of information retrieval dealing with retrieving
information written in a language different from the
language of the user's query.”
E.g., Using Odia/Hindi queries to retrieve English
documents
Also called multi-lingual, cross-lingual, or trans-lingual
IR.
3. Why CLIR?
E.g., On the web, we have:
Documents in different languages
Multilingual documents
Images with captions in different languages
A single query should retrieve all such resources.
4. Approaches to CLIR
                             Knowledge-based            Corpus-based
Query Translation            Dictionary/Thesaurus-      Pseudo-Relevance
                             based                      Feedback (PRF)
Document Translation         MT (rule-based)            MT (EBMT/StatMT)
Intermediate Representation  UNL (AgroExplorer)         Latent Semantic
                                                        Indexing
Query translation is the most efficient and most commonly used;
document translation is infeasible for large collections.
Most effective approaches are hybrid: a combination of knowledge-based
and corpus-based methods.
9. Problem with counting co-occurrences:
data sparsity
freq(Marathi Shallow Parsing CRFs)
freq(Marathi Shallow Structuring CRFs)
freq(Marathi Shallow Analyzing CRFs)
… are all zero.
How do we choose between parsing,
structuring, and analyzing?
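One common answer (used later in the TREC8 results as "PMI-based disambiguation") is to score each candidate translation by its pointwise mutual information with the other query terms. A minimal sketch, with invented counts (all statistics here are illustrative, not from a real corpus):

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """PMI(x, y) = log[ P(x, y) / (P(x) P(y)) ]; -inf for unseen pairs."""
    if pair_count == 0:
        return float("-inf")
    return math.log((pair_count / total) /
                    ((count_x / total) * (count_y / total)))

# Hypothetical corpus statistics (word and co-occurrence window counts).
total_windows = 1_000_000
unigram = {"parsing": 900, "structuring": 400, "analyzing": 700, "shallow": 1200}
cooc = {("shallow", "parsing"): 150,
        ("shallow", "structuring"): 2,
        ("shallow", "analyzing"): 5}

candidates = ["parsing", "structuring", "analyzing"]
context = "shallow"  # another (translated) term from the same query

# Pick the candidate that co-occurs most strongly with the context term.
best = max(candidates,
           key=lambda c: pmi(cooc.get((context, c), 0),
                             unigram[context], unigram[c], total_windows))
print(best)  # -> parsing
```

Even when the full trigram count is zero, pairwise co-occurrence with individual context terms is usually dense enough to discriminate.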
17. Results on TREC8 (disks 4 and 5)
English topics (401-450) manually translated to Hindi
Assumption: relevance judgments for English topics
hold for the translated queries
Results (all TF-IDF):
Technique                   MAP
Monolingual                 23
All-translations            16
PMI-based disambiguation    20.5
Manual filtering            21.5
19. (User) Relevance Feedback (mono-lingual)
1. Retrieve documents using the user's query
2. The user marks relevant documents
3. Choose the top N terms from these documents
   (IDF is one option for scoring the top terms)
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents
20. Pseudo-Relevance Feedback (PRF)
(mono-lingual)
1. Retrieve documents using the user’s query
2. Assume that the top M documents retrieved
are relevant
3. Choose the top N terms from these M
documents
4. Add these N terms to the user’s query to
form a new query
5. Use this new query to retrieve a new set of
documents
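The five PRF steps above can be sketched on a toy collection (document texts, the scoring scheme, and all names here are illustrative):

```python
import math
from collections import Counter

docs = {
    "d1": "hindi retrieval cross language retrieval evaluation",
    "d2": "hindi query translation dictionary",
    "d3": "cooking recipes spices",
}

def top_terms(relevant_ids, all_docs, n):
    """Score terms in the assumed-relevant docs by TF * IDF, return top n."""
    df = Counter()
    for text in all_docs.values():
        df.update(set(text.split()))
    tf = Counter()
    for d in relevant_ids:
        tf.update(all_docs[d].split())
    score = {w: tf[w] * math.log(len(all_docs) / df[w]) for w in tf}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:n]]

query = ["hindi"]
top_m = ["d1", "d2"]                 # pretend these ranked highest (step 2)
expansion = top_terms(top_m, docs, n=3)          # step 3
new_query = query + [w for w in expansion if w not in query]  # step 4
print(new_query)                     # step 5 would re-retrieve with this
```

The expanded query pulls in terms like "retrieval" that the original one-word query never mentioned.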
21. PRF for CLIR
Corpus-based Query Translation
Uses a parallel corpus of documents:
Hindi collection H    English collection E
        H1        <->        E1
        H2        <->        E2
         .                    .
        Hm        <->        Em
22. PRF for CLIR
1. Retrieve documents in H using the user’s query
2. Assume that the top M documents retrieved are
relevant
3. Select the M documents in E that are aligned to
the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection
(which is in the same language as E)
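The cross-lingual PRF steps above can be sketched with a tiny aligned corpus (documents and names are invented for illustration):

```python
from collections import Counter

# Aligned parallel corpus: H[i] and E[i] are translations of each other.
H = ["hindi sentence about elections", "hindi sentence about cricket"]
E = ["election results government vote", "cricket match score team"]

def translate_query_prf(ranked_h_ids, E, n):
    """Steps 3-5: take the E-side documents aligned to the top-ranked
    H documents and return the n most frequent E-side terms as the
    'translated' query."""
    terms = Counter()
    for i in ranked_h_ids:
        terms.update(E[i].split())
    return [w for w, _ in terms.most_common(n)]

# Pretend retrieval against H ranked document 0 highest (steps 1-2).
translated = translate_query_prf([0], E, n=2)
print(translated)  # -> ['election', 'results']
```

Note that no dictionary is used anywhere: the alignment itself carries the translation knowledge.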
24. Ranking with Relevance Models
Relevance model (or query model) Theta_R: a distribution that encodes
the information need
P(w | Theta_R): probability of word occurrence in a relevant document
P(w | D): probability of word occurrence in the candidate document
Ranking function (relative entropy, or KL divergence):

KL(Theta_R || D) = Sum_w P(w | Theta_R) . log [ P(w | Theta_R) / P(w | D) ]
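Ranking by KL divergence between a relevance model and per-document language models can be sketched as follows (the distributions are toy values; smoothing and estimation details are omitted):

```python
import math

def kl(p_rel, p_doc):
    """KL(R || D) = sum_w P(w|R) * log( P(w|R) / P(w|D) ); lower is better,
    so documents closest to the relevance model rank first."""
    return sum(p * math.log(p / p_doc[w]) for w, p in p_rel.items() if p > 0)

# Toy relevance model and two candidate document models.
p_rel = {"election": 0.5, "vote": 0.3, "result": 0.2}
docs = {
    "d1": {"election": 0.4, "vote": 0.3, "result": 0.3},
    "d2": {"election": 0.1, "vote": 0.1, "result": 0.8},
}

ranking = sorted(docs, key=lambda d: kl(p_rel, docs[d]))
print(ranking)  # -> ['d1', 'd2']
```

In practice P(w|D) must be smoothed so it is never zero for a word with nonzero mass in the relevance model.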
26. Estimating Cross-Lingual Relevance Models

P(w, h_1 h_2 ... h_m) =
    Sum over {M_H, M_E} in M of
        P({M_H, M_E}) . P(w | M_E) . Product_{i=1..m} P(h_i | M_H)

P(w | M_X) = lambda . freq(w, X) / Sum_v freq(v, X)  +  (1 - lambda) . P(w)
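The smoothed model estimate P(w | M_X), a linear interpolation of the document's maximum-likelihood estimate with a background distribution P(w), can be sketched as (the lambda value and background probabilities are illustrative):

```python
from collections import Counter

def p_w_given_m(w, doc_tokens, p_background, lam=0.8):
    """P(w|M_X) = lambda * freq(w,X)/|X| + (1 - lambda) * P(w)."""
    counts = Counter(doc_tokens)
    mle = counts[w] / len(doc_tokens)
    return lam * mle + (1 - lam) * p_background.get(w, 1e-6)

doc = "election results election vote".split()
p_bg = {"election": 0.01, "vote": 0.02, "cricket": 0.005}

print(p_w_given_m("election", doc, p_bg))  # ~ 0.8*0.5  + 0.2*0.01  = 0.402
print(p_w_given_m("cricket", doc, p_bg))   # ~ 0.8*0.0  + 0.2*0.005 = 0.001
```

The background term keeps P(w | M_X) nonzero for words absent from the document, which the KL ranking above requires.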
27. CLIR Evaluation – TREC
(Text REtrieval Conference)
TREC CLIR track (2001 and 2002)
Retrieval of Arabic language newswire documents from
topics in English
383,872 Arabic documents (896 MB) with SGML markup
50 topics
Use of provided resources (stemmers, bilingual
dictionaries, MT systems, parallel corpora) is
encouraged to minimize variability
http://trec.nist.gov/
28. CLIR Evaluation – CLEF
(Cross Language Evaluation Forum)
Major CLIR evaluation forum
Tracks include
Multilingual retrieval on news collections
Topics provided in many languages, including Hindi
Multiple language Question Answering
ImageCLEF
Cross Language Speech Retrieval
WebCLEF
http://www.clef-campaign.org/
29. Summary
CLIR techniques
Query Translation-based
Document Translation-based
Intermediate Representation-based
Query translation using dictionaries, followed by
disambiguation, is a simple and effective technique
for CLIR
PRF uses a parallel corpus for query translation
Parallel corpora can also be used to estimate cross-
lingual relevance models
CLEF and TREC: important CLIR evaluation
conferences
30. References (1)
1. Phrasal Translation and Query Expansion Techniques for Cross-
language Information Retrieval, Lisa Ballesteros and W. Bruce
Croft, Research and Development in Information Retrieval, 1995.
2. Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros
and W. Bruce Croft, Research and Development in Information
Retrieval, 1998.
3. A Maximum Coherence Model for Dictionary-Based Cross-
Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y.
Chai, ACM SIGIR, 2005.
4. A Comparative Study of Knowledge-Based Approaches for Cross-
Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr,
Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-
3897, University of Maryland, 1998.
31. References (2)
5. Translingual Information Retrieval: A Comparative Evaluation,
Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D.
Brown, Yibing Geng, and Danny Lee, International Joint
Conference on Artificial Intelligence, 1997.
6. A Multistage Search Strategy for Cross Lingual Information
Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak
Bhattacharyya, Symposium on Indian Morphology, Phonology
and Language Engineering, IIT Kharagpur, February, 2005.
7. Relevance-Based Language Models, Victor Lavrenko, and W.
Bruce Croft, Research and Development in Information
Retrieval, 2001.
8. Cross-Lingual Relevance Models, V. Lavrenko, M. Choquette,
and W. Croft, ACM-SIGIR, 2002.