April 7, 2006
Natural Language Processing/Language
Technology for the Web
Cross-Language Information
Retrieval (CLIR)
Gouranga Charan Jena
Computer Science & Engg., KIIT University.
Guide Name: Dr. Siddharth Swarup Rautaray
Cross Language Information Retrieval
(CLIR)
Definition:
“A subfield of information retrieval dealing with retrieving
information written in a language different from the
language of the user's query.”
E.g., Using Odia/Hindi queries to retrieve English
documents
Also called multi-lingual, cross-lingual, or trans-lingual
IR.
Why CLIR?
E.g., On the web, we have:
• Documents in different languages
• Multilingual documents
• Images with captions in different languages
A single query should retrieve all such resources.
Approaches to CLIR
Approach                      Knowledge-based              Corpus-based
Query Translation             Dictionary/Thesaurus-based   Pseudo-Relevance Feedback (PRF)
Document Translation          MT (rule-based)              MT (EBMT/StatMT)
Intermediate Representation   UNL (AgroExplorer)           Latent Semantic Indexing
• Query Translation: most efficient; commonly used
• Document Translation: infeasible for large collections
Most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.
Dictionary-based Query Translation
Hindi query → Hindi-English dictionaries → English query ("Ireland peace talks") → Collection search
The dictionary lookup step also involves:
• phrase identification
• words to be transliterated
The problem with dictionary-based CLIR: ambiguity
Candidate translations returned by the dictionary (one line per source term):
• cosmic, outer-space
• incident, event, occurrence
• lessen, subside, decrease, lower, diminish, ebb, decline, reduce
• lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve
• money, riches, wealth, appositive, property
• Ireland
• peace, calm, tranquility, silence, quietude
• conversation, talk, negotiation, tale
… filtering/disambiguation is required after query translation.
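To make the size of the ambiguity problem concrete, here is a minimal sketch that uses three of the candidate sets listed above. It assumes that an "all-translations" query simply keeps every candidate of every source term; the variable names are illustrative and not from the slides.

```python
from math import prod

# Candidate translation sets for three source terms, copied from the slide.
translation_sets = [
    ["cosmic", "outer-space"],
    ["incident", "event", "occurrence"],
    ["lessen", "subside", "decrease", "lower",
     "diminish", "ebb", "decline", "reduce"],
]

# "All-translations" query (assumed meaning): keep every candidate.
all_translations_query = [w for s in translation_sets for w in s]

# Number of one-translation-per-term readings a disambiguator must choose among.
n_readings = prod(len(s) for s in translation_sets)

print(len(all_translations_query), n_readings)
```

Even for three source terms the query grows to 13 words and there are 48 possible readings, which is why a filtering step is needed.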
Disambiguation using
co-occurrence statistics
Hypothesis: correct translations of query terms will
co-occur and incorrect translations will tend not
to co-occur
Problem with counting co-occurrences:
data sparsity
freq(Marathi Shallow Parsing CRFs)
freq(Marathi Shallow Structuring CRFs)
freq(Marathi Shallow Analyzing CRFs)
… are all zero.
How do we choose between parsing,
structuring, and analyzing?
Pair-wise co-occurrence
Candidate translations:
• cosmic, outer-space
• incident, event, occurrence
• lessen, subside, decrease, lower, diminish, ebb, decline, reduce
freq(cosmic incident) → 70800
freq(cosmic event) → 269000
freq(cosmic lessen) → 7130
freq(cosmic subside) → 3120
freq(outer-space incident) → 26100
freq(outer-space event) → 104000
freq(outer-space lessen) → 2600
freq(outer-space subside) → 980
Shallow Parsing, Structuring or Analyzing?
shallow parsing → 166000
shallow structuring → 180000
shallow analyzing → 1230000
CRFs parsing → 540
CRFs structuring → 125
CRFs analyzing → 765
Marathi parsing → 17100
Marathi structuring → 511
Marathi analyzing → 12200
“shallow parsing” → 40700
“shallow structuring” → 11
“shallow analyzing” → 2
collocation?
But,
analyzing → 74100000
parsing → 40400000
structuring → 17400000
shallow → 33300000
Ranking senses using co-occurrence statistics
• Use co-occurrence scores to calculate similarity between two words: sim(x, y)
  • Point-wise mutual information (PMI)
  • Dice coefficient
  • PMI-IR
$$\text{PMI-IR}(x, y) = \log \frac{\text{hits}(x \text{ AND } y)}{\text{hits}(x) \times \text{hits}(y)}$$
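As a rough illustration, the PMI-IR score can be computed directly from hit counts. The sketch below uses the counts for "shallow" and its three candidate neighbours quoted on the earlier slide. The Dice function follows the standard textbook definition, which the slides name but do not spell out, and returning negative infinity for a zero joint count is an assumption made here.

```python
import math


def pmi_ir(hits_xy: int, hits_x: int, hits_y: int) -> float:
    """PMI-IR(x, y) = log( hits(x AND y) / (hits(x) * hits(y)) )."""
    if hits_xy == 0:
        # Assumption: a pair that never co-occurs gets the lowest possible score.
        return float("-inf")
    return math.log(hits_xy / (hits_x * hits_y))


def dice(hits_xy: int, hits_x: int, hits_y: int) -> float:
    """Standard Dice coefficient: 2 * hits(x AND y) / (hits(x) + hits(y))."""
    return 2.0 * hits_xy / (hits_x + hits_y)


# Hit counts copied from the slides.
hits_shallow = 33_300_000
unigram = {"parsing": 40_400_000, "structuring": 17_400_000, "analyzing": 74_100_000}
joint_with_shallow = {"parsing": 166_000, "structuring": 180_000, "analyzing": 1_230_000}

scores = {
    w: pmi_ir(joint_with_shallow[w], hits_shallow, unigram[w]) for w in unigram
}
for word, value in scores.items():
    print(f"PMI-IR(shallow, {word}) = {value:.3f}")
```

Dividing by the unigram counts removes some of the bias towards words that are simply frequent, which a raw co-occurrence count does not.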
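Disambiguation algorithm
User's query:
$$q^s = \{q^s_1, q^s_2, \ldots, q^s_m\}$$
For each $q^s_i$, the set of translations:
$$S_i = \{w^t_{i,j}\}$$
1. $$\text{sim}(w^t_{i,j}, S_{i'}) = \sum_{\forall w^t_{i',l} \in S_{i'}} \text{sim}(w^t_{i,j}, w^t_{i',l})$$
2. $$\text{score}(w^t_{i,j}) = \sum_{\forall i' \neq i} \text{sim}(w^t_{i,j}, S_{i'})$$
3. $$q^t_i = \arg\max_{w^t_{i,j}} \text{score}(w^t_{i,j})$$
Translated query:
$$q^t = \{q^t_1, q^t_2, \ldots, q^t_m\}$$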
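The three steps can be sketched in a few lines of Python. This is only an illustration: the function name `disambiguate`, the pluggable `sim` argument, and the similarity values in `TOY_SIM` are made up for the example, and a real system would use PMI-IR or Dice scores from corpus or web counts.

```python
from typing import Callable, List


def disambiguate(
    translation_sets: List[List[str]],
    sim: Callable[[str, str], float],
) -> List[str]:
    """Pick one translation per source term (steps 1-3 of the algorithm)."""
    chosen = []
    for i, candidates in enumerate(translation_sets):

        def score(w: str) -> float:
            # Steps 1 and 2: sum sim(w, w') over every candidate w'
            # of every *other* source term.
            return sum(
                sim(w, other)
                for k, others in enumerate(translation_sets)
                if k != i
                for other in others
            )

        # Step 3: keep the highest-scoring candidate for term i.
        chosen.append(max(candidates, key=score))
    return chosen


# Hypothetical similarity values, for illustration only.
TOY_SIM = {
    frozenset(("cosmic", "event")): 3.0,
    frozenset(("cosmic", "incident")): 1.0,
    frozenset(("outer-space", "event")): 2.0,
    frozenset(("outer-space", "incident")): 0.5,
}


def toy_sim(x: str, y: str) -> float:
    return TOY_SIM.get(frozenset((x, y)), 0.0)


translated = disambiguate(
    [["cosmic", "outer-space"], ["incident", "event"]], toy_sim
)
print(translated)
```

With a single source term every candidate scores zero, so this sketch falls back to the first dictionary entry.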
Example
cosmic outer-space
incident event lessen subside decrease lower
diminish ebb decline reduce
score(cosmic)= PMI-IR(cosmic, incident) +
PMI-IR(cosmic, event) +
PMI-IR(cosmic, lessen) +
PMI-IR(cosmic, subside) …
Disambiguation algorithm: sample outputs
Ireland peace talks
cosmic events
net money (?)
Results on TREC8 (disks 4 and 5)
• English topics (401-450) manually translated to Hindi
• Assumption: relevance judgments for English topics hold for the translated queries
• Results (all TF-IDF):
Technique                   MAP (%)
Monolingual                 23
All-translations            16
PMI-based disambiguation    20.5
Manual filtering            21.5
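One way to read these numbers is as a fraction of the monolingual baseline. The short calculation below uses only the MAP values from the table; expressing them relative to the monolingual run is a presentation choice made here, not something the slides report.

```python
# MAP values copied from the results table.
map_scores = {
    "Monolingual": 23.0,
    "All-translations": 16.0,
    "PMI-based disambiguation": 20.5,
    "Manual filtering": 21.5,
}

baseline = map_scores["Monolingual"]

# Effectiveness of each run as a percentage of the monolingual run.
relative = {
    name: round(100 * score / baseline, 1) for name, score in map_scores.items()
}

for name, pct in relative.items():
    print(f"{name}: {pct}% of monolingual MAP")
```

On this reading, PMI-based disambiguation recovers most of the gap between the all-translations query and manual filtering.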
Pseudo-Relevance Feedback for CLIR
(User) Relevance Feedback (mono-lingual)
1. Retrieve documents using the user’s query
2. The user marks relevant documents
3. Choose the top N terms from these documents
   • Top terms: IDF is one option for scoring
4. Add these N terms to the user’s query to form a new query
5. Use this new query to retrieve a new set of documents
Pseudo-Relevance Feedback (PRF)
(mono-lingual)
1. Retrieve documents using the user’s query
2. Assume that the top M documents retrieved
are relevant
3. Choose the top N terms from these M
documents
4. Add these N terms to the user’s query to
form a new query
5. Use this new query to retrieve a new set of
documents
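The five PRF steps can be sketched over a toy collection as follows. The ranker is a deliberately simple TF-IDF overlap score standing in for a real retrieval engine; the toy documents, the function names, and the choice of TF × IDF for scoring expansion terms are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def idf(term: str, docs: List[List[str]]) -> float:
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df))


def retrieve(query: List[str], docs: List[List[str]]) -> List[int]:
    """Rank document indices by a simple TF-IDF overlap score."""
    def score(d: List[str]) -> float:
        tf = Counter(d)
        return sum(tf[t] * idf(t, docs) for t in query)
    return sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)


def prf_expand(query: List[str], docs: List[List[str]],
               top_m: int = 2, top_n: int = 2) -> List[str]:
    ranked = retrieve(query, docs)                    # step 1
    feedback = [docs[i] for i in ranked[:top_m]]      # step 2: assume relevant
    tf = Counter(t for d in feedback for t in d if t not in query)
    # Step 3: top N terms by TF (in feedback docs) * IDF, ties broken by term.
    top_terms = sorted(tf, key=lambda t: (-tf[t] * idf(t, docs), t))[:top_n]
    return query + top_terms                          # step 4


# Toy collection, for illustration only.
docs = [
    ["ireland", "peace", "talks", "belfast"],
    ["peace", "talks", "belfast", "agreement"],
    ["cosmic", "events", "telescope"],
    ["stock", "market", "money"],
]
expanded = prf_expand(["peace", "talks"], docs)
reranked = retrieve(expanded, docs)                   # step 5
print(expanded, reranked)
```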
PRF for CLIR
Corpus-based Query Translation
• Uses a parallel corpus of aligned documents:
Hindi collection H    English collection E
H1  ↔  E1
H2  ↔  E2
...
Hm  ↔  Em
PRF for CLIR
1. Retrieve documents in H using the user’s query
2. Assume that the top M documents retrieved are
relevant
3. Select the M documents in E that are aligned to
the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection
(which is in the same language as E)
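A minimal sketch of steps 1 to 5, assuming the parallel corpus is held as two index-aligned lists. The placeholder tokens such as `h_peace` stand in for Hindi words, the overlap ranker stands in for a real retrieval engine, and scoring terms by raw frequency is an assumption; none of these names come from the slides.

```python
from collections import Counter
from typing import List


def overlap_rank(query: List[str], docs: List[List[str]]) -> List[int]:
    """Stand-in ranker: order documents by how often they contain query terms."""
    return sorted(
        range(len(docs)),
        key=lambda i: sum(docs[i].count(t) for t in query),
        reverse=True,
    )


def translate_query_prf(query_h: List[str], coll_h: List[List[str]],
                        coll_e: List[List[str]],
                        top_m: int = 2, top_n: int = 3) -> List[str]:
    ranked = overlap_rank(query_h, coll_h)        # step 1: retrieve in H
    top_ids = ranked[:top_m]                      # step 2: assume relevant
    aligned = [coll_e[i] for i in top_ids]        # step 3: aligned docs in E
    tf = Counter(t for d in aligned for t in d)
    # Step 4: top N terms by frequency (assumption), ties broken alphabetically.
    return sorted(tf, key=lambda t: (-tf[t], t))[:top_n]   # step 5


# Toy aligned corpus; coll_h[i] is the translation of coll_e[i].
coll_h = [["h_peace", "h_talks", "h_ireland"],
          ["h_peace", "h_talks"],
          ["h_money", "h_market"]]
coll_e = [["ireland", "peace", "talks"],
          ["peace", "talks", "agreement"],
          ["money", "market"]]

query_e = translate_query_prf(["h_peace", "h_talks"], coll_h, coll_e)
print(query_e)
```

The returned terms would then be issued against the target collection (step 6), which needs no dictionary at all.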
Cross-Lingual Relevance Models
- Estimate relevance models using a parallel corpus
Ranking with Relevance Models
• Relevance model or query model $\Theta_R$ (a distribution that encodes the information need)
• Probability of word occurrence in a relevant document: $P(w \mid \Theta_R)$
• Probability of word occurrence in the candidate document: $P(w \mid D)$
• Ranking function (relative entropy or KL divergence):
$$KL(D \,\|\, R) = \sum_w P(w \mid D) \log \frac{P(w \mid D)}{P(w \mid \Theta_R)}$$
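The ranking function can be sketched as follows, with distributions held as plain dictionaries. The toy probabilities and document names are invented, and the code assumes the relevance model is smoothed so that every word with non-zero document probability also has non-zero relevance probability.

```python
import math
from typing import Dict


def kl_divergence(p_doc: Dict[str, float], p_rel: Dict[str, float]) -> float:
    """KL(D || R) = sum_w P(w|D) * log( P(w|D) / P(w|Theta_R) )."""
    return sum(
        p * math.log(p / p_rel[w]) for w, p in p_doc.items() if p > 0
    )


# Toy relevance model and two candidate document models (illustrative values).
p_rel = {"peace": 0.5, "talks": 0.3, "money": 0.2}
doc_models = {
    "doc_a": {"peace": 0.5, "talks": 0.3, "money": 0.2},
    "doc_b": {"peace": 0.1, "talks": 0.1, "money": 0.8},
}

# A smaller divergence means the document is closer to the relevance model,
# so documents are ranked in ascending order of KL.
ranking = sorted(doc_models, key=lambda d: kl_divergence(doc_models[d], p_rel))
print(ranking)
```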
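Estimating Mono-Lingual Relevance Models
$$P(w \mid \Theta_R) \approx P(w \mid Q) = P(w \mid h_1 h_2 \ldots h_m) = \frac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)}$$
$$P(w, h_1 h_2 \ldots h_m) = \sum_{M \in \mathcal{M}} P(M) \left( P(w \mid M) \prod_{i=1}^{m} P(h_i \mid M) \right)$$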
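A small sketch of this estimate, under several assumptions that the slide does not fix: every document in the collection is treated as one model M, the prior P(M) is uniform, P(·|M) is a document model linearly interpolated with collection frequencies (λ = 0.5), and all query terms occur somewhere in the collection.

```python
from collections import Counter
from math import prod
from typing import Dict, List


def relevance_model(query: List[str], docs: List[List[str]],
                    lam: float = 0.5) -> Dict[str, float]:
    coll = Counter(t for d in docs for t in d)
    total = sum(coll.values())

    def p(term: str, tf: Counter, doc_len: int) -> float:
        # Smoothed P(term | M): document estimate mixed with collection estimate.
        return lam * tf[term] / doc_len + (1 - lam) * coll[term] / total

    prior = 1.0 / len(docs)                      # uniform P(M) (assumption)
    joint = {w: 0.0 for w in coll}
    for d in docs:
        tf, n = Counter(d), len(d)
        q_lik = prod(p(h, tf, n) for h in query)     # prod_i P(h_i | M)
        for w in joint:
            joint[w] += prior * p(w, tf, n) * q_lik  # P(M) P(w|M) prod_i P(h_i|M)
    norm = sum(joint.values())                   # equals P(h_1 ... h_m)
    return {w: v / norm for w, v in joint.items()}


docs = [["peace", "talks", "ireland"], ["money", "market", "stock"]]
rm = relevance_model(["peace"], docs)
print(sorted(rm.items(), key=lambda kv: -kv[1]))
```

Words that share documents with the query terms ("ireland" here) end up with more probability mass than words from unrelated documents.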
Estimating Cross-Lingual Relevance Models
$$P(w, h_1 h_2 \ldots h_m) = \sum_{\{M_H, M_E\} \in \mathcal{M}} P(\{M_H, M_E\}) \left( P(w \mid M_E) \prod_{i=1}^{m} P(h_i \mid M_H) \right)$$
$$P(w \mid M_X) = \lambda \left( \frac{freq_{w,X}}{\sum_v freq_{v,X}} \right) + (1 - \lambda) P(w)$$
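The cross-lingual estimate can be sketched in the same way, with each aligned document pair acting as one model pair {M_H, M_E}. The uniform prior over pairs, λ = 0.5, the use of collection frequencies for the background P(w), and the placeholder Hindi tokens are all assumptions for illustration.

```python
from collections import Counter
from math import prod
from typing import Dict, List, Tuple


def background(docs: List[List[str]]) -> Dict[str, float]:
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def smoothed(term: str, doc: List[str], bg: Dict[str, float], lam: float) -> float:
    """P(w | M_X) = lam * freq(w, X) / sum_v freq(v, X) + (1 - lam) * P(w)."""
    return lam * doc.count(term) / len(doc) + (1 - lam) * bg.get(term, 0.0)


def cross_lingual_relevance_model(
    query_h: List[str], pairs: List[Tuple[List[str], List[str]]], lam: float = 0.5
) -> Dict[str, float]:
    bg_h = background([h for h, _ in pairs])
    bg_e = background([e for _, e in pairs])
    prior = 1.0 / len(pairs)                  # uniform P({M_H, M_E}) (assumption)
    joint = dict.fromkeys(bg_e, 0.0)
    for doc_h, doc_e in pairs:
        # Query likelihood is computed on the Hindi side of the pair ...
        q_lik = prod(smoothed(h, doc_h, bg_h, lam) for h in query_h)
        # ... and word probabilities on the English side.
        for w in joint:
            joint[w] += prior * smoothed(w, doc_e, bg_e, lam) * q_lik
    norm = sum(joint.values())
    return {w: v / norm for w, v in joint.items()}


pairs = [
    (["h_peace", "h_talks"], ["peace", "talks"]),
    (["h_money", "h_market"], ["money", "market"]),
]
rm = cross_lingual_relevance_model(["h_peace"], pairs)
print(sorted(rm.items(), key=lambda kv: -kv[1]))
```

The result is a distribution over English words estimated from a Hindi query, which can then be plugged into the KL ranking function without translating the query explicitly.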
CLIR Evaluation – TREC
(Text REtrieval Conference)
• TREC CLIR track (2001 and 2002)
  • Retrieval of Arabic language newswire documents from topics in English
  • 383,872 Arabic documents (896 MB) with SGML markup
  • 50 topics
• Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability
http://trec.nist.gov/
CLIR Evaluation – CLEF
(Cross Language Evaluation Forum)
• Major CLIR evaluation forum
• Tracks include
  • Multilingual retrieval on news collections
    • topics will be provided in many languages including Hindi
  • Multiple language Question Answering
  • ImageCLEF
  • Cross Language Speech Retrieval
  • WebCLEF
http://www.clef-campaign.org/
Summary
• CLIR techniques
  • Query Translation-based
  • Document Translation-based
  • Intermediate Representation-based
• Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR
• PRF uses a parallel corpus for query translation
• Parallel corpora can also be used to estimate cross-lingual relevance models
• CLEF and TREC: important CLIR evaluation conferences
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Julien PLU
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Tobias Wunner
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologiesProf. Wim Van Criekinge
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsBenjamin Habegger
 
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...Iman Mirrezaei
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languageshs0041
 
Matching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sourcesMatching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sourcesIJwest
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Vsevolod Dyomkin
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Vsevolod Dyomkin
 
Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)
Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)
Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)Riccardo Albertoni
 
RDF2Rule PRESENTATION
RDF2Rule PRESENTATIONRDF2Rule PRESENTATION
RDF2Rule PRESENTATIONEfrah Shakir
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionFlorian Leitner
 

Was ist angesagt? (18)

Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
Knowledge extraction in Web media: at the frontier of NLP, Machine Learning a...
 
Text Similarity
Text SimilarityText Similarity
Text Similarity
 
Profile of NPOESS HDF5 Files
Profile of NPOESS HDF5 FilesProfile of NPOESS HDF5 Files
Profile of NPOESS HDF5 Files
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Ap...
 
SAC 2019 ester giallonardo
SAC 2019 ester giallonardoSAC 2019 ester giallonardo
SAC 2019 ester giallonardo
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languages
 
Matching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sourcesMatching and merging anonymous terms from web sources
Matching and merging anonymous terms from web sources
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)
Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)
Environmental Thesauri Under the Lens of Reusability (EGOVIS 2014)
 
RDF2Rule PRESENTATION
RDF2Rule PRESENTATIONRDF2Rule PRESENTATION
RDF2Rule PRESENTATION
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 

Ähnlich wie 07 04-06

Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indianeSAT Publishing House
 
A Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information RetrievalA Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information Retrievaldannyijwest
 
Using Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR SystemsUsing Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR SystemsMauro Dragoni
 
2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinalDeborah McGuinness
 
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...Daniel Valcarce
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
 
Multilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemMultilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemAriel Hess
 
A decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri languageA decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri languageacijjournal
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...cseij
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibEl Habib NFAOUI
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Stuart Chalk
 
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVALA SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVALIJCI JOURNAL
 
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...Kim Daniels
 
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKS
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKSTUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKS
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKSkevig
 
Tuning Dari Speech Classification Employing Deep Neural Networks
Tuning Dari Speech Classification Employing Deep Neural NetworksTuning Dari Speech Classification Employing Deep Neural Networks
Tuning Dari Speech Classification Employing Deep Neural Networkskevig
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsIJCSEA Journal
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsIJCSEA Journal
 

Ähnlich wie 07 04-06 (20)

C8 akumaran
C8 akumaranC8 akumaran
C8 akumaran
 
Cross language information retrieval in indian
Cross language information retrieval in indianCross language information retrieval in indian
Cross language information retrieval in indian
 
A SURVEY ON VARIOUS CLIR TECHNIQUES
A SURVEY ON VARIOUS CLIR TECHNIQUESA SURVEY ON VARIOUS CLIR TECHNIQUES
A SURVEY ON VARIOUS CLIR TECHNIQUES
 
A Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information RetrievalA Review on the Cross and Multilingual Information Retrieval
A Review on the Cross and Multilingual Information Retrieval
 
Using Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR SystemsUsing Semantic and Domain-based Information in CLIR Systems
Using Semantic and Domain-based Information in CLIR Systems
 
2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal2011linked science4mccuskermcguinnessfinal
2011linked science4mccuskermcguinnessfinal
 
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
Exploring Statistical Language Models for Recommender Systems [RecSys '15 DS ...
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 
Multilingualism in Information Retrieval System
Multilingualism in Information Retrieval SystemMultilingualism in Information Retrieval System
Multilingualism in Information Retrieval System
 
A decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri languageA decision tree based word sense disambiguation system in manipuri language
A decision tree based word sense disambiguation system in manipuri language
 
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
 
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVALA SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
A SURVEY ON CROSS LANGUAGE INFORMATION RETRIEVAL
 
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
 
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKS
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKSTUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKS
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKS
 
Tuning Dari Speech Classification Employing Deep Neural Networks
Tuning Dari Speech Classification Employing Deep Neural NetworksTuning Dari Speech Classification Employing Deep Neural Networks
Tuning Dari Speech Classification Employing Deep Neural Networks
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic Documents
 
An Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic DocumentsAn Evaluation and Overview of Indices Based on Arabic Documents
An Evaluation and Overview of Indices Based on Arabic Documents
 

Kürzlich hochgeladen

IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadaditya806802
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Industrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESIndustrial Safety Unit-I SAFETY TERMINOLOGIES
Industrial Safety Unit-I SAFETY TERMINOLOGIESNarmatha D
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Steel Structures - Building technology.pptx
Steel Structures - Building technology.pptxSteel Structures - Building technology.pptx
Steel Structures - Building technology.pptxNikhil Raut
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 

Kürzlich hochgeladen (20)

IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
home automation using Arduino by Aditya Prasad

07 04-06

  • 1. April 7, 2006 Natural Language Processing/Language Technology for the Web Cross-Language Information Retrieval (CLIR) Gouranga Charan Jena Computer Science & Engg., KIIT University. Guide Name: Dr. Siddharth Swarup Rautaray
  • 2. Cross Language Information Retrieval (CLIR) Definition: “A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.” E.g., using Odia/Hindi queries to retrieve English documents. Also called multi-lingual, cross-lingual, or trans-lingual IR.
  • 3. Why CLIR? E.g., on the web, we have: documents in different languages; multilingual documents; images with captions in different languages. A single query should retrieve all such resources.
  • 4. Approaches to CLIR
      Query Translation (most efficient; commonly used): knowledge-based: dictionary/thesaurus-based; corpus-based: Pseudo-Relevance Feedback (PRF)
      Document Translation (infeasible for large collections): knowledge-based: MT (rule-based); corpus-based: MT (EBMT/StatMT)
      Intermediate Representation: knowledge-based: UNL (AgroExplorer); corpus-based: Latent Semantic Indexing
      Most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.
  • 6. The problem with dictionary-based CLIR: ambiguity. Candidate translations per query term: {cosmic, outer-space}; {incident, event, occurrence}; {lessen, subside, decrease, lower, diminish, ebb, decline, reduce}; {lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve}; {money, riches, wealth, appositive, property}; {Ireland}; {peace, calm, tranquility, silence, quietude}; {conversation, talk, negotiation, tale}
  • 7. … filtering/disambiguation is required after query translation.
  • 8. Disambiguation using co-occurrence statistics Hypothesis: correct translations of query terms will co-occur and incorrect translations will tend not to co-occur
  • 9. Problem with counting co-occurrences: data sparsity. freq(Marathi Shallow Parsing CRFs), freq(Marathi Shallow Structuring CRFs), freq(Marathi Shallow Analyzing CRFs), … are all zero. How do we choose between parsing, structuring, and analyzing?
  • 10. Pair-wise co-occurrence. Candidate translations: {cosmic, outer-space}; {incident, event, occurrence}; {lessen, subside, decrease, lower, diminish, ebb, decline, reduce}
      freq(cosmic incident): 70800
      freq(cosmic event): 269000
      freq(cosmic lessen): 7130
      freq(cosmic subside): 3120
      freq(outer-space incident): 26100
      freq(outer-space event): 104000
      freq(outer-space lessen): 2600
      freq(outer-space subside): 980
  • 11. Shallow Parsing, Structuring or Analyzing?
      shallow parsing: 166000; shallow structuring: 180000; shallow analyzing: 1230000
      CRFs parsing: 540; CRFs structuring: 125; CRFs analyzing: 765
      Marathi parsing: 17100; Marathi structuring: 511; Marathi analyzing: 12200
      “shallow parsing”: 40700; “shallow structuring”: 11; “shallow analyzing”: 2 (collocation?)
      But: analyzing: 74100000; parsing: 40400000; structuring: 17400000; shallow: 33300000
  • 12. Ranking senses using co-occurrence statistics. Use co-occurrence scores to calculate the similarity between two words, sim(x, y). Options: point-wise mutual information (PMI); Dice coefficient; PMI-IR:
      \[ \text{PMI-IR}(x, y) = \log \frac{\text{hits}(x \ \text{AND} \ y)}{\text{hits}(x) \times \text{hits}(y)} \]
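The PMI-IR score can be evaluated directly from hit counts. The following is a minimal sketch using the counts quoted on slide 11; it assumes the unquoted counts (e.g. shallow parsing) are AND-style hits and the quoted ones are exact-phrase hits. The names `pmi_ir`, `HITS`, `AND_HITS` and `PHRASE_HITS` are illustrative, not from the slides.

```python
import math

def pmi_ir(hits_xy, hits_x, hits_y):
    """PMI-IR(x, y) = log(hits(x AND y) / (hits(x) * hits(y)))."""
    return math.log(hits_xy / (hits_x * hits_y))

# Single-term hit counts from slide 11.
HITS = {"shallow": 33_300_000, "parsing": 40_400_000,
        "structuring": 17_400_000, "analyzing": 74_100_000}

# Joint counts with "shallow": unquoted (assumed AND-style) and quoted phrase.
AND_HITS = {"parsing": 166_000, "structuring": 180_000, "analyzing": 1_230_000}
PHRASE_HITS = {"parsing": 40_700, "structuring": 11, "analyzing": 2}

def rank(joint):
    """Order the candidate words by decreasing PMI-IR with 'shallow'."""
    scores = {w: pmi_ir(joint[w], HITS["shallow"], HITS[w]) for w in joint}
    return sorted(scores, key=scores.get, reverse=True)

ranked_and = rank(AND_HITS)
ranked_phrase = rank(PHRASE_HITS)
print(ranked_and)
print(ranked_phrase)
```

With these numbers the unquoted counts still favour "analyzing" after normalisation, while the phrase counts favour "parsing", which is one way to read the "collocation?" remark on slide 11.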
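  • 14. Here $w^t_{i,j}$ is the $j$-th candidate translation of query term $i$, and $S_{i'}$ is the set of candidate translations of query term $i'$.
      1. \[ sim(w^t_{i,j}, S_{i'}) = \sum_{\forall w^t_{i',l} \in S_{i'}} sim(w^t_{i,j}, w^t_{i',l}) \]
      2. \[ score(w^t_{i,j}) = \sum_{\forall i' \neq i} sim(w^t_{i,j}, S_{i'}) \]
      3. \[ q^t_i = \arg\max_{w^t_{i,j}} score(w^t_{i,j}) \]
      Translated query: \[ q^t = \{q^t_1, q^t_2, \ldots, q^t_m\} \]
  • 15. Example. Candidate translations: {cosmic, outer-space}; {incident, event}; {lessen, subside, decrease, lower, diminish, ebb, decline, reduce}. score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …
  • 16. Disambiguation algorithm: sample outputs: Ireland peace talks; cosmic events; net money (?)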
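The three steps on slide 14 (sum similarities against every other term's candidate set, score each candidate, keep the best one per term) can be sketched as below. This is an illustration under stated assumptions: the raw pair-wise counts from slide 10 stand in for sim(x, y), where the slides use PMI-IR, and only the first two candidate sets are included because those are the only pairs with published counts. `CO_OCCURRENCE`, `count_sim` and `disambiguate` are hypothetical names.

```python
# Pair-wise counts from slide 10, used as a stand-in similarity (assumption).
CO_OCCURRENCE = {
    ("cosmic", "incident"): 70800, ("cosmic", "event"): 269000,
    ("outer-space", "incident"): 26100, ("outer-space", "event"): 104000,
}

def count_sim(x, y):
    """Symmetric lookup of the co-occurrence count; 0 if the pair is unknown."""
    return CO_OCCURRENCE.get((x, y)) or CO_OCCURRENCE.get((y, x), 0)

def disambiguate(translation_sets, sim):
    """Pick one translation per query term by maximising the summed similarity
    to the candidate translations of all *other* query terms."""
    chosen = []
    for i, candidates in enumerate(translation_sets):
        def score(w):
            # Steps 1 and 2: sum sim(w, w') over every other set S_i'.
            return sum(sim(w, other)
                       for k, others in enumerate(translation_sets) if k != i
                       for other in others)
        # Step 3: argmax over the candidates of term i.
        chosen.append(max(candidates, key=score))
    return chosen

translated = disambiguate([["cosmic", "outer-space"], ["incident", "event"]],
                          count_sim)
print(translated)
```

On this toy input the selection agrees with the "cosmic events" sample output on slide 16, though that agreement depends on the stand-in similarity used here.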
  • 17. Results on TREC8 (disks 4 and 5). English topics (401-450) were manually translated to Hindi. Assumption: relevance judgments for English topics hold for the translated queries. Results (all TF-IDF):
      Technique                    MAP
      Monolingual                  23
      All-translations             16
      PMI-based disambiguation     20.5
      Manual filtering             21.5
  • 19. (User) Relevance Feedback (mono-lingual) 1. Retrieve documents using the user’s query 2. The user marks relevant documents 3. Choose the top N terms from these documents (IDF is one option for scoring the top terms) 4. Add these N terms to the user’s query to form a new query 5. Use this new query to retrieve a new set of documents
  • 20. Pseudo-Relevance Feedback (PRF) (mono-lingual) 1. Retrieve documents using the user’s query 2. Assume that the top M documents retrieved are relevant 3. Choose the top N terms from these M documents 4. Add these N terms to the user’s query to form a new query 5. Use this new query to retrieve a new set of documents
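The five PRF steps can be sketched on a toy collection. This is a minimal illustration, not the setup evaluated in the slides: the retrieval function simply counts query-term occurrences, feedback terms are scored by raw frequency in the top M documents, and the documents, `retrieve` and `expand_query` are all made up for the example.

```python
from collections import Counter

DOCS = [d.split() for d in [
    "ireland peace talks belfast agreement",
    "peace talks belfast ceasefire",
    "belfast agreement signed",
    "cosmic event telescope",
]]

def retrieve(query, docs):
    """Return indices of matching documents, best first (toy overlap scorer)."""
    scores = [sum(doc.count(t) for t in query) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
    return [i for i in ranked if scores[i] > 0]

def expand_query(query, docs, m, n):
    """Steps 1-4: treat the top m results as relevant and add their top n terms."""
    feedback = retrieve(query, docs)[:m]
    counts = Counter(t for i in feedback for t in docs[i] if t not in query)
    # Sort by descending count, then alphabetically so ties are deterministic.
    top_terms = sorted(counts, key=lambda t: (-counts[t], t))[:n]
    return list(query) + top_terms

query = ["peace", "talks"]
expanded = expand_query(query, DOCS, m=2, n=1)
before = retrieve(query, DOCS)
after = retrieve(expanded, DOCS)  # step 5: search again with the new query
print(expanded, before, after)
```

In this example the expansion term lets the second retrieval reach a document that shares no term with the original query.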
  • 21. PRF for CLIR: corpus-based query translation. Uses a parallel corpus of aligned documents: H1 ↔ E1, H2 ↔ E2, …, Hm ↔ Em, where H is the Hindi collection and E is the English collection.
  • 22. PRF for CLIR 1. Retrieve documents in H using the user’s query 2. Assume that the top M documents retrieved are relevant 3. Select the M documents in E that are aligned to the top M retrieved documents 4. Choose the top N terms from these documents 5. These N terms are the translated query 6. Use this query to retrieve from the target collection (which is in the same language as E)
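The cross-lingual variant differs from mono-lingual PRF in one step: the feedback terms are drawn from the English documents aligned to the top Hindi results, and those terms become the translated query. The sketch below assumes a tiny hypothetical parallel corpus with transliterated Hindi tokens and a toy overlap scorer. None of the data or names come from the slides.

```python
from collections import Counter

# Hypothetical aligned (Hindi, English) document pairs, already tokenised.
PARALLEL = [
    (["ayarland", "shanti", "varta"], ["ireland", "peace", "talks"]),
    (["shanti", "varta", "samjhauta"], ["peace", "talks", "agreement"]),
    (["antariksh", "ghatna"], ["cosmic", "event"]),
]

def translate_query_prf(query_h, parallel, m, n):
    """Translate a source-language query via PRF over a parallel corpus."""
    # Steps 1-2: rank the H side by query-term overlap and keep the top m.
    scores = [sum(h.count(t) for t in query_h) for h, _ in parallel]
    ranked = sorted(range(len(parallel)), key=lambda i: -scores[i])
    top = [i for i in ranked if scores[i] > 0][:m]
    # Step 3: switch to the aligned documents on the E side.
    counts = Counter(t for i in top for t in parallel[i][1])
    # Steps 4-5: the n most frequent E-side terms form the translated query.
    return sorted(counts, key=lambda t: (-counts[t], t))[:n]

translated_query = translate_query_prf(["shanti", "varta"], PARALLEL, m=2, n=2)
print(translated_query)
```

The returned terms would then be issued against the target collection (step 6), which is not modelled here.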
  • 23. Cross-Lingual Relevance Models - Estimate relevance models using a parallel corpus
  • 24. Ranking with Relevance Models. Relevance model or query model $\Theta_R$ (the distribution encodes the information need). Probability of word occurrence in a relevant document: $P(w \mid \Theta_R)$. Probability of word occurrence in the candidate document: $P(w \mid D)$. Ranking function (relative entropy or KL divergence):
      \[ KL(D \,\|\, R) = \sum_{w} P(w \mid D) \, \log \frac{P(w \mid D)}{P(w \mid \Theta_R)} \]
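A small sketch of ranking by this divergence follows. It assumes that documents with a smaller KL divergence from the relevance model rank higher, and that P(w|D) is estimated with additive smoothing so no probability is zero. The vocabulary, the relevance model values and the two documents are invented for illustration.

```python
import math
from collections import Counter

VOCAB = ["peace", "talks", "ireland", "cosmic", "event"]

# Hypothetical relevance model P(w | Theta_R) over the vocabulary.
REL_MODEL = {"peace": 0.4, "talks": 0.3, "ireland": 0.2,
             "cosmic": 0.05, "event": 0.05}

def doc_model(tokens, vocab, mu=0.5):
    """P(w|D) with additive smoothing (an assumption, not from the slides)."""
    counts = Counter(tokens)
    total = len(tokens) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}

def kl_divergence(p_doc, p_rel):
    """KL(D || R) = sum_w P(w|D) * log(P(w|D) / P(w|Theta_R))."""
    return sum(p * math.log(p / p_rel[w]) for w, p in p_doc.items())

DOCS = {"A": "peace talks ireland peace".split(),
        "B": "cosmic event cosmic".split()}

divergence = {name: kl_divergence(doc_model(toks, VOCAB), REL_MODEL)
              for name, toks in DOCS.items()}
ranking = sorted(divergence, key=divergence.get)  # smallest divergence first
print(ranking)
```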
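  • 25. Estimating Mono-Lingual Relevance Models
      \[ P(w \mid \Theta_R) \approx P(w \mid Q) = P(w \mid h_1 h_2 \ldots h_m) = \frac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)} \]
      \[ P(w, h_1 h_2 \ldots h_m) = \sum_{M \in \mathcal{M}} P(M) \left( P(w \mid M) \prod_{i=1}^{m} P(h_i \mid M) \right) \]
  • 26. Estimating Cross-Lingual Relevance Models
      \[ P(w, h_1 h_2 \ldots h_m) = \sum_{\{M_H, M_E\} \in \mathcal{M}} P(\{M_H, M_E\}) \left( P(w \mid M_E) \prod_{i=1}^{m} P(h_i \mid M_H) \right) \]
      \[ P(w \mid M_X) = \lambda \left( \frac{freq_{w,X}}{\sum_{v} freq_{v,X}} \right) + (1 - \lambda) \, P(w) \]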
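The cross-lingual estimate can be traced on a toy parallel corpus. The sketch below makes several assumptions not stated on the slides: a uniform prior over document pairs, the background probability P(w) taken as relative frequency in the corresponding side of the corpus, and λ = 0.8. The transliterated Hindi tokens and all function names are hypothetical.

```python
from collections import Counter
from math import prod

PAIRS = [
    (["ayarland", "shanti", "varta"], ["ireland", "peace", "talks"]),
    (["shanti", "varta", "samjhauta"], ["peace", "talks", "agreement"]),
    (["antariksh", "ghatna"], ["cosmic", "event"]),
]

def background(docs):
    """P(w): relative frequency of w across one side of the corpus (assumed)."""
    tokens = [t for d in docs for t in d]
    return {w: c / len(tokens) for w, c in Counter(tokens).items()}

def smoothed(word, doc, bg, lam):
    """P(w|M_X) = lam * freq(w,X) / sum_v freq(v,X) + (1 - lam) * P(w)."""
    return lam * doc.count(word) / len(doc) + (1 - lam) * bg.get(word, 0.0)

def cross_lingual_relevance_model(query_h, pairs, lam=0.8):
    bg_h = background([h for h, _ in pairs])
    bg_e = background([e for _, e in pairs])
    prior = 1 / len(pairs)  # uniform P({M_H, M_E}), an assumption
    joint = {
        w: sum(prior * smoothed(w, e, bg_e, lam)
               * prod(smoothed(h_i, h, bg_h, lam) for h_i in query_h)
               for h, e in pairs)
        for w in bg_e
    }
    # Dividing by the sum over w equals dividing by P(h_1 ... h_m).
    z = sum(joint.values())
    return {w: p / z for w, p in joint.items()}

model = cross_lingual_relevance_model(["shanti", "varta"], PAIRS)
print(sorted(model, key=model.get, reverse=True)[:2])
```

English words from pairs whose Hindi side matches the query receive most of the probability mass, which is the behaviour the estimate is meant to capture. The sketch assumes every query term occurs somewhere in the Hindi side, otherwise the normaliser would be zero.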
  • 27. CLIR Evaluation: TREC (Text REtrieval Conference). TREC CLIR track (2001 and 2002): retrieval of Arabic-language newswire documents from topics in English; 383,872 Arabic documents (896 MB) with SGML markup; 50 topics; use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability. http://trec.nist.gov/
  • 28. CLIR Evaluation: CLEF (Cross Language Evaluation Forum). Major CLIR evaluation forum. Tracks include: multilingual retrieval on news collections (topics will be provided in many languages, including Hindi); multiple-language Question Answering; ImageCLEF; Cross Language Speech Retrieval; WebCLEF. http://www.clef-campaign.org/
  • 29. Summary. CLIR techniques: query translation-based, document translation-based, intermediate representation-based. Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR. PRF uses a parallel corpus for query translation. Parallel corpora can also be used to estimate cross-lingual relevance models. CLEF and TREC: important CLIR evaluation conferences.
  • 30. References (1)
      1. Phrasal Translation and Query Expansion Techniques for Cross-language Information Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1995.
      2. Resolving Ambiguity for Cross-Language Retrieval, Lisa Ballesteros and W. Bruce Croft, Research and Development in Information Retrieval, 1998.
      3. A Maximum Coherence Model for Dictionary-Based Cross-Language Information Retrieval, Yi Liu, Rong Jin, and Joyce Y. Chai, ACM SIGIR, 2005.
      4. A Comparative Study of Knowledge-Based Approaches for Cross-Language Information Retrieval, Douglas W. Oard, Bonnie J. Dorr, Paul G. Hackett, and Maria Katsova, Technical Report CS-TR-3897, University of Maryland, 1998.
  • 31. References (2)
      5. Translingual Information Retrieval: A Comparative Evaluation, Jaime G. Carbonell, Yiming Yang, Robert E. Frederking, Ralf D. Brown, Yibing Geng, and Danny Lee, International Joint Conference on Artificial Intelligence, 1997.
      6. A Multistage Search Strategy for Cross Lingual Information Retrieval, Satish Kagathara, Manish Deodalkar, and Pushpak Bhattacharyya, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February 2005.
      7. Relevance-Based Language Models, Victor Lavrenko and W. Bruce Croft, Research and Development in Information Retrieval, 2001.
      8. Cross-Lingual Relevance Models, V. Lavrenko, M. Choquette, and W. Croft, ACM SIGIR, 2002.