1. April 7, 2006
Natural Language Processing/Language
Technology for the Web
Cross-Language Information
Retrieval (CLIR)
Gouranga Charan Jena
Computer Science & Engg., KIIT University.
Guide Name: Dr. Siddharth Swarup Rautaray
2. Cross Language Information Retrieval
(CLIR)
Definition:
“A subfield of information retrieval dealing with retrieving
information written in a language different from the
language of the user's query.”
E.g., Using Odia/Hindi queries to retrieve English
documents
Also called multi-lingual, cross-lingual, or trans-lingual
IR.
3. Why CLIR?
E.g., On the web, we have:
Documents in different languages
Multilingual documents
Images with captions in different languages
A single query should retrieve all such resources.
4. Approaches to CLIR
                             Knowledge-based            Corpus-based
Query Translation            Dictionary/Thesaurus-      Pseudo-Relevance
                             based                      Feedback (PRF)
Document Translation         MT (rule-based)            MT (EBMT/StatMT)
Intermediate Representation  UNL (AgroExplorer)         Latent Semantic
                                                        Indexing
Query translation is the most efficient and most commonly used;
document translation is infeasible for large collections.
Most effective approaches are hybrid: a combination of knowledge-based
and corpus-based methods.
9. Problem with counting co-occurrences:
data sparsity
freq(Marathi Shallow Parsing CRFs)
freq(Marathi Shallow Structuring CRFs)
freq(Marathi Shallow Analyzing CRFs)
… are all zero.
How do we choose between parsing,
structuring, and analyzing?
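One common answer (used later in the TREC8 results as "PMI-based disambiguation") is to score each candidate translation by its pointwise mutual information with the other query terms. A minimal sketch, with invented counts (all statistics here are illustrative, not from a real corpus):

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """PMI(x, y) = log[ P(x, y) / (P(x) P(y)) ]; -inf for unseen pairs."""
    if pair_count == 0:
        return float("-inf")
    return math.log((pair_count / total) /
                    ((count_x / total) * (count_y / total)))

# Hypothetical corpus statistics (word and co-occurrence window counts).
total_windows = 1_000_000
unigram = {"parsing": 900, "structuring": 400, "analyzing": 700, "shallow": 1200}
cooc = {("shallow", "parsing"): 150,
        ("shallow", "structuring"): 2,
        ("shallow", "analyzing"): 5}

candidates = ["parsing", "structuring", "analyzing"]
context = "shallow"  # another (translated) term from the same query

# Pick the candidate that co-occurs most strongly with the context term.
best = max(candidates,
           key=lambda c: pmi(cooc.get((context, c), 0),
                             unigram[context], unigram[c], total_windows))
print(best)  # -> parsing
```

Even when the full trigram count is zero, pairwise co-occurrence with individual context terms is usually dense enough to discriminate.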
17. Results on TREC8 (disks 4 and 5)
English topics (401-450) manually translated to Hindi
Assumption: relevance judgments for English topics
hold for the translated queries
Results (all TF-IDF):
Technique                   MAP
Monolingual                 23
All-translations            16
PMI-based disambiguation    20.5
Manual filtering            21.5
19. (User) Relevance Feedback (mono-lingual)
1. Retrieve documents using the user's query
2. The user marks relevant documents
3. Choose the top N terms from these documents
   (IDF is one option for scoring the top terms)
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents
20. Pseudo-Relevance Feedback (PRF)
(mono-lingual)
1. Retrieve documents using the user’s query
2. Assume that the top M documents retrieved
are relevant
3. Choose the top N terms from these M
documents
4. Add these N terms to the user’s query to
form a new query
5. Use this new query to retrieve a new set of
documents
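The five PRF steps above can be sketched on a toy collection (document texts, the scoring scheme, and all names here are illustrative):

```python
import math
from collections import Counter

docs = {
    "d1": "hindi retrieval cross language retrieval evaluation",
    "d2": "hindi query translation dictionary",
    "d3": "cooking recipes spices",
}

def top_terms(relevant_ids, all_docs, n):
    """Score terms in the assumed-relevant docs by TF * IDF, return top n."""
    df = Counter()
    for text in all_docs.values():
        df.update(set(text.split()))
    tf = Counter()
    for d in relevant_ids:
        tf.update(all_docs[d].split())
    score = {w: tf[w] * math.log(len(all_docs) / df[w]) for w in tf}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:n]]

query = ["hindi"]
top_m = ["d1", "d2"]                 # pretend these ranked highest (step 2)
expansion = top_terms(top_m, docs, n=3)          # step 3
new_query = query + [w for w in expansion if w not in query]  # step 4
print(new_query)                     # step 5 would re-retrieve with this
```

The expanded query pulls in terms like "retrieval" that the original one-word query never mentioned.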
21. PRF for CLIR
Corpus-based Query Translation
Uses a parallel corpus of documents:
Hindi collection H    English collection E
        H1        <->        E1
        H2        <->        E2
         .                    .
        Hm        <->        Em
22. PRF for CLIR
1. Retrieve documents in H using the user’s query
2. Assume that the top M documents retrieved are
relevant
3. Select the M documents in E that are aligned to
the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection
(which is in the same language as E)
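The cross-lingual PRF steps above can be sketched with a tiny aligned corpus (documents and names are invented for illustration):

```python
from collections import Counter

# Aligned parallel corpus: H[i] and E[i] are translations of each other.
H = ["hindi sentence about elections", "hindi sentence about cricket"]
E = ["election results government vote", "cricket match score team"]

def translate_query_prf(ranked_h_ids, E, n):
    """Steps 3-5: take the E-side documents aligned to the top-ranked
    H documents and return the n most frequent E-side terms as the
    'translated' query."""
    terms = Counter()
    for i in ranked_h_ids:
        terms.update(E[i].split())
    return [w for w, _ in terms.most_common(n)]

# Pretend retrieval against H ranked document 0 highest (steps 1-2).
translated = translate_query_prf([0], E, n=2)
print(translated)  # -> ['election', 'results']
```

Note that no dictionary is used anywhere: the alignment itself carries the translation knowledge.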
24. Ranking with Relevance Models
Relevance model (or query model) Theta_R: a distribution that encodes
the information need
P(w | Theta_R): probability of word occurrence in a relevant document
P(w | D): probability of word occurrence in the candidate document
Ranking function (relative entropy, or KL divergence):

KL(Theta_R || D) = Sum_w P(w | Theta_R) . log [ P(w | Theta_R) / P(w | D) ]
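Ranking by KL divergence between a relevance model and per-document language models can be sketched as follows (the distributions are toy values; smoothing and estimation details are omitted):

```python
import math

def kl(p_rel, p_doc):
    """KL(R || D) = sum_w P(w|R) * log( P(w|R) / P(w|D) ); lower is better,
    so documents closest to the relevance model rank first."""
    return sum(p * math.log(p / p_doc[w]) for w, p in p_rel.items() if p > 0)

# Toy relevance model and two candidate document models.
p_rel = {"election": 0.5, "vote": 0.3, "result": 0.2}
docs = {
    "d1": {"election": 0.4, "vote": 0.3, "result": 0.3},
    "d2": {"election": 0.1, "vote": 0.1, "result": 0.8},
}

ranking = sorted(docs, key=lambda d: kl(p_rel, docs[d]))
print(ranking)  # -> ['d1', 'd2']
```

In practice P(w|D) must be smoothed so it is never zero for a word with nonzero mass in the relevance model.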
26. Estimating Cross-Lingual Relevance Models

P(w, h_1 h_2 ... h_m) =
    Sum over {M_H, M_E} in M of
        P({M_H, M_E}) . P(w | M_E) . Product_{i=1..m} P(h_i | M_H)

P(w | M_X) = lambda . freq(w, X) / Sum_v freq(v, X)  +  (1 - lambda) . P(w)
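The smoothed model estimate P(w | M_X), a linear interpolation of the document's maximum-likelihood estimate with a background distribution P(w), can be sketched as (the lambda value and background probabilities are illustrative):

```python
from collections import Counter

def p_w_given_m(w, doc_tokens, p_background, lam=0.8):
    """P(w|M_X) = lambda * freq(w,X)/|X| + (1 - lambda) * P(w)."""
    counts = Counter(doc_tokens)
    mle = counts[w] / len(doc_tokens)
    return lam * mle + (1 - lam) * p_background.get(w, 1e-6)

doc = "election results election vote".split()
p_bg = {"election": 0.01, "vote": 0.02, "cricket": 0.005}

print(p_w_given_m("election", doc, p_bg))  # ~ 0.8*0.5  + 0.2*0.01  = 0.402
print(p_w_given_m("cricket", doc, p_bg))   # ~ 0.8*0.0  + 0.2*0.005 = 0.001
```

The background term keeps P(w | M_X) nonzero for words absent from the document, which the KL ranking above requires.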
27. CLIR Evaluation – TREC
(Text REtrieval Conference)
TREC CLIR track (2001 and 2002)
Retrieval of Arabic language newswire documents from
topics in English
383,872 Arabic documents (896 MB) with SGML markup
50 topics
Use of provided resources (stemmers, bilingual
dictionaries, MT systems, parallel corpora) is
encouraged to minimize variability
http://trec.nist.gov/
28. CLIR Evaluation – CLEF
(Cross Language Evaluation Forum)
Major CLIR evaluation forum
Tracks include
Multilingual retrieval on news collections
Topics provided in many languages, including Hindi
Multiple language Question Answering
ImageCLEF
Cross Language Speech Retrieval
WebCLEF
http://www.clef-campaign.org/
29. Summary
CLIR techniques
Query Translation-based
Document Translation-based
Intermediate Representation-based
Query translation using dictionaries, followed by
disambiguation, is a simple and effective technique
for CLIR
PRF uses a parallel corpus for query translation
Parallel corpora can also be used to estimate cross-
lingual relevance models
CLEF and TREC: important CLIR evaluation
conferences
30. References (1)
1. Phrasal Translation and Query Expansion Techniques for Cross-
language Information Retrieval, Lisa Ballesteros and W. Bruce
Croft, Research and Development in Information Retrieval, 1995.
2. Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros
and W. Bruce Croft, Research and Development in Information
Retrieval, 1998.
3. A Maximum Coherence Model for Dictionary-Based Cross-
Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y.
Chai, ACM SIGIR, 2005.
4. A Comparative Study of Knowledge-Based Approaches for Cross-
Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr,
Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-
3897, University of Maryland, 1998.
31. References (2)
5. Translingual Information Retrieval: A Comparative Evaluation,
Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D.
Brown, Yibing Geng, and Danny Lee, International Joint
Conference on Artificial Intelligence, 1997.
6. A Multistage Search Strategy for Cross Lingual Information
Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak
Bhattacharyya, Symposium on Indian Morphology, Phonology
and Language Engineering, IIT Kharagpur, February, 2005.
7. Relevance-Based Language Models, Victor Lavrenko, and W.
Bruce Croft, Research and Development in Information
Retrieval, 2001.
8. Cross-Lingual Relevance Models, V. Lavrenko, M. Choquette,
and W. Croft, ACM-SIGIR, 2002.