Solution for workshop
AINL FRUCT:
Artificial Intelligence and Natural
Language Conference
10-12 NOVEMBER 2016
SAINT-PETERSBURG
http://ainlconf.ru/
Paraphrase Detection using
Semantic Similarity Algorithms
Dmitry Kravchenko
Ben-Gurion University of the Negev
Tasks description
Input:
2 files with list of pairs of sentences in Russian in XML format:
a) training set
b) test set
Output:
Task 1:
The algorithm should classify each pair into one of three classes: Non-paraphrase, Near-paraphrase, Precise-paraphrase
Task 2:
The algorithm should classify each pair into one of two classes: Non-paraphrase, Paraphrase
Algorithm Data-Flow
[Diagram; components as extracted]
● Input
● Substitution of acronyms using an online dictionary: wiktionary.org
● Translation engines: Google, Yandex, Microsoft
● Similarity toolkits: SEMILAR Toolkit, DKPro Similarity, Python difflib, NLTK WordNet, Swoogle, BLEU algorithms
● Gradient Boosting Classifier
● Output
Classification algorithm
● Gradient Boosting Classifier
● Task 1:
– Feature vector containing 77 features:
● 18 features: 6 scores of the SEMILAR toolkit * 3 translation engines
● 39 features: 13 scores of the DKPro Similarity toolkit * 3 translation engines
● 3 features: 1 Python difflib similarity score * 3 translation engines
● 6 features: 2 sentence similarity scores (Yuhua Li, David McLean, et al.) * 3 translation engines
● 3 features: 1 score of the Swoogle comparator * 3 translation engines
● 8 features: 8 BLEU scores on the source sentences (in Russian)
Classification algorithm
● Task 2:
– Feature vector containing 69 features:
● 18 features: 6 scores of the SEMILAR toolkit * 3 translation engines
● 39 features: 13 scores of the DKPro Similarity toolkit * 3 translation engines
● 3 features: 1 Python difflib similarity score * 3 translation engines
● 6 features: 2 sentence similarity scores (Yuhua Li, David McLean, et al.) * 3 translation engines
● 3 features: 1 score of the Swoogle comparator * 3 translation engines
● (without BLEU scores)
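The feature counts for both tasks can be checked with a short sketch. The feature names and their ordering below are hypothetical and chosen only for illustration; the slides give the counts per toolkit but not how the vector is laid out.

```python
# Hypothetical feature-name layout; names and order are illustrative, not the author's.
ENGINES = ["google", "yandex", "microsoft"]

# Number of scores each toolkit yields per translated sentence pair (from the slides).
SCORES_PER_TOOLKIT = {
    "semilar": 6,
    "dkpro": 13,
    "difflib": 1,
    "nltk_wordnet": 2,
    "swoogle": 1,
}
BLEU_SCORES_ON_SOURCE = 8  # computed on the Russian originals, Task 1 only


def feature_names(include_bleu):
    """Return one name per feature, in a fixed order."""
    names = [
        f"{toolkit}_{i}_{engine}"
        for toolkit, count in SCORES_PER_TOOLKIT.items()
        for i in range(count)
        for engine in ENGINES
    ]
    if include_bleu:
        names += [f"bleu_ru_{i}" for i in range(BLEU_SCORES_ON_SOURCE)]
    return names


task1_features = feature_names(include_bleu=True)
task2_features = feature_names(include_bleu=False)
print(len(task1_features), len(task2_features))  # 77 69
```

The translation-based scores contribute (6 + 13 + 1 + 2 + 1) * 3 = 69 features; Task 1 adds the 8 BLEU scores for a total of 77.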
6 scores of the SEMILAR toolkit
● greedyComparerWNLin
● optimumComparerLSATasa
● dependencyComparerWnLeskTanim
● cmComparer
● bleuComparer
● lsaComparer
greedyComparerWNLin
This score comes from a sentence-to-sentence similarity method that greedily aligns words between the two given sentences. The word-to-word similarity used for the alignment is the WordNet-based measure proposed by Lin in 1998 in the article “An information-theoretic definition of similarity”.
Please refer to:
A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics
http://www.aclweb.org/website/old_anthology/W/W12/W12-20.pdf#page=175
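A minimal sketch of greedy word alignment follows. It is not SEMILAR's implementation: the word-to-word score is a character-overlap stand-in from Python's difflib (an assumption made so the example runs without WordNet), and averaging the two directions is one common way to make the score symmetric.

```python
from difflib import SequenceMatcher


def word_similarity(a, b):
    """Stand-in word-to-word score in [0, 1] (character overlap, NOT WordNet Lin)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def greedy_sentence_similarity(s1, s2):
    """Match every word to its best counterpart, averaged over both directions."""
    w1, w2 = s1.split(), s2.split()
    if not w1 or not w2:
        return 0.0

    def directed(src, dst):
        # Greedy step: each source word independently takes its best match.
        return sum(max(word_similarity(a, b) for b in dst) for a in src) / len(src)

    return (directed(w1, w2) + directed(w2, w1)) / 2


print(greedy_sentence_similarity("the cat sat", "the cat sat"))  # 1.0
```

Because each word picks its best match independently, several words may align to the same counterpart; the optimal variant removes that freedom.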
optimumComparerLSATasa
Similar to greedyComparerWNLin, but the words are aligned optimally (as in the job assignment problem), and the word-to-word similarity method is based on an LSA model built from the TASA corpus.
Article name: Latent Semantic Analysis Models on Wikipedia and TASA
http://deeptutor2.memphis.edu/Semilar-Web/public/downloads/LSA-Models-LREC014/LSAModelsOnWikipediaAndTASADanEtAl-LREC014.pdf
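The difference from greedy alignment can be sketched as a one-to-one assignment. The example below searches all assignments by brute force, which is only practical for very short sentences; a real system would use a polynomial-time assignment algorithm. The character-overlap word score and the normalisation by average sentence length are assumptions for illustration, not the LSA model or the formula SEMILAR uses.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_similarity(a, b):
    """Stand-in word-to-word score in [0, 1] (character overlap, NOT LSA)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def optimal_sentence_similarity(s1, s2):
    """Best one-to-one word alignment, found by brute force (short inputs only)."""
    short, long_ = sorted((s1.split(), s2.split()), key=len)
    if not short:
        return 0.0
    # Try every way of pairing each word of the shorter sentence
    # with a distinct word of the longer one.
    best = max(
        sum(word_similarity(a, b) for a, b in zip(short, candidate))
        for candidate in permutations(long_, len(short))
    )
    # Normalise by the average sentence length so unmatched words lower the score.
    return 2 * best / (len(short) + len(long_))


print(optimal_sentence_similarity("cat sat", "sat cat"))  # 1.0
```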
dependencyComparerWnLeskTanim
Please see:
● https://www.aaai.org/ocs/index.php/FLAIRS/2009/paper/viewFile/55/298
The word-to-word similarity method used is the WordNet-based Lesk measure in its "Tanim" variant (the suffix appears to refer to the Tanimoto coefficient rather than to an author).
cmComparer
Method proposed by Corley and Mihalcea.
(article name is: SEMILAR: The Semantic Similarity Toolkit)
lsaComparer
LSA-based word representations are summed for each sentence, and the similarity is calculated from the resulting sentence vectors.
● The resultant-vector method is described in the article: NeRoSim: A System for Measuring and Interpreting Semantic Textual Similarity
http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval030.pdf
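The resultant-vector idea can be sketched in a few lines. The 3-dimensional vectors below are invented for illustration (real LSA vectors have hundreds of dimensions and come from a trained model), and cosine is assumed as the vector similarity.

```python
import math

# Toy 3-dimensional "LSA" vectors, invented purely for illustration.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sat": [0.1, 0.9, 0.1],
    "ran": [0.1, 0.8, 0.2],
    "stock": [0.0, 0.1, 0.9],
}


def sentence_vector(sentence):
    """Sum the vectors of all known words; unknown words contribute nothing."""
    total = [0.0, 0.0, 0.0]
    for word in sentence.lower().split():
        for i, value in enumerate(TOY_VECTORS.get(word, [0.0, 0.0, 0.0])):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def lsa_similarity(s1, s2):
    """Similarity of two sentences via their summed word vectors."""
    return cosine(sentence_vector(s1), sentence_vector(s2))


print(lsa_similarity("cat sat", "dog ran") > lsa_similarity("cat sat", "stock"))  # True
```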
Word-to-word Similarity score
Article: NeRoSim: A System for Measuring and Interpreting Semantic Textual Similarity
13 scores of the DKPro Similarity toolkit
● CosineSimilarity
● ExactStringMatchComparator
● GreedyStringTiling 2-gram
● GreedyStringTiling 4-gram
● JaroSecondStringComparator
● JaroWinklerSecondStringComparator
● normalized LevenshteinComparator
● LongestCommonSubsequenceNormComparator
● SubstringMatchComparator
● WordNGramContainmentMeasure
● WordNGramJaccardMeasure 2-gram
● WordNGramJaccardMeasure 3-gram
● WordNGramJaccardMeasure 4-gram
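To make two of these string measures concrete, here is a plain-Python sketch of word n-gram Jaccard similarity and a normalized Levenshtein similarity. DKPro Similarity itself is a Java library; the function names, tokenisation (lowercased whitespace split) and normalisation below are assumptions and may differ from DKPro's.

```python
def word_ngrams(sentence, n):
    """Set of word n-grams of a lowercased, whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(s1, s2, n=2):
    """|A ∩ B| / |A ∪ B| over the two sentences' word n-gram sets."""
    a, b = word_ngrams(s1, n), word_ngrams(s2, n)
    return len(a & b) / len(a | b) if a | b else 0.0


def levenshtein(s, t):
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(s, t):
    """1 minus the edit distance divided by the longer string's length."""
    longest = max(len(s), len(t))
    return 1.0 - levenshtein(s, t) / longest if longest else 1.0


print(ngram_jaccard("the cat sat down", "the cat sat up", n=2))  # 0.5
print(levenshtein("kitten", "sitting"))  # 3
```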
The four remaining toolkits
● Python difflib comparator
● NLTK WordNet: sentence similarity scores (Yuhua Li, David McLean, et al.)
● Swoogle comparator
● BLEU scores (computed on the Russian sentences, so no English translation is needed): bleu def 1-gram, bleu def 2-gram, bleu def 3-gram, bleu def 4-gram, bleu lin 1-gram, bleu lin 2-gram, bleu lin 3-gram, bleu lin 4-gram
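Two of these scores are easy to illustrate with the standard library. The sketch below uses `difflib.SequenceMatcher.ratio()` as the difflib score (the slides do not say which difflib call was used, so this is an assumption) and computes the clipped n-gram precision that sits at the core of BLEU. It does not reproduce the "def" and "lin" BLEU variants named above, whose exact definitions the slides do not give.

```python
from collections import Counter
from difflib import SequenceMatcher


def difflib_similarity(s1, s2):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, s1, s2).ratio()


def modified_ngram_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = candidate.split()
    ref = reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clipping: an n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


print(difflib_similarity("abc", "abc"))  # 1.0
print(modified_ngram_precision("a b c", "a b d", 2))  # 0.5
```

Both functions operate on raw strings, so they can be applied to the Russian source sentences directly.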
Results on Test Set
Task number            Accuracy   F1 macro   Place
First Task Standard    0.5695     0.5437     4 out of 11
Second Task Standard   0.7153     0.7853     6 out of 10
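For reference, the two reported metrics can be computed as follows. This is a generic sketch of accuracy and macro-averaged F1 on toy labels, not the workshop's official scorer, whose exact averaging convention is not described in the slides.

```python
def accuracy(y_true, y_pred):
    """Fraction of pairs whose predicted class equals the gold class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: 0 = Non-paraphrase, 1 = Paraphrase.
gold = [0, 0, 1, 1]
predicted = [0, 1, 1, 1]
print(accuracy(gold, predicted))  # 0.75
```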
What impact did each toolkit have?
[Bar chart: Accuracy and F1 macro (in %, axis from 66 to 82) for each toolkit: SEMILAR, DKPro Similarity, Swoogle, NLTK WordNet, Python difflib. Data labels as extracted: 80.13, 79.52, 78.94, 78.76, 75.92 and 77.02, 75.78, 75.03, 75.02, 71.36.]
5-fold cross-validation results on the training set, second task
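A 5-fold cross-validation split can be sketched as below. The slides do not say whether the folds were shuffled or stratified, so this plain contiguous split is only an assumption; the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(12, k=5):
    print(len(train), test)
```

Each sample appears in exactly one test fold; the classifier is trained on the other four folds and the per-fold scores are averaged.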
Which Translation Engine is Better?
[Chart] 5-fold cross-validation results on the training set, second task
Symbols for the toolkits on the X axis:
1: SEMILAR, 2: DKPro Similarity, 3: Python difflib, 4: NLTK WordNet, 5: Swoogle, 6: all 5 toolkits together
Conclusion
Using this algorithm, we can detect semantic similarity not only for Russian but for any other language for which translation into English is available via translation engines.
Thank you!
