This document describes a paraphrase detection algorithm that uses semantic similarity scores from various NLP toolkits and machine translation engines. It evaluates pairs of sentences and classifies them as non-paraphrases, near-paraphrases, or precise paraphrases (Task 1), or simply as paraphrases or non-paraphrases (Task 2). The algorithm builds feature vectors from similarity scores produced by tools such as SEMILAR, DKPro Similarity, NLTK WordNet, Swoogle, and BLEU, and feeds them into a gradient boosting classifier. Evaluation on the test data showed an accuracy of 0.5695 for Task 1 (4th of 11 submissions) and 0.7153 for Task 2 (6th of 10 submissions).
3. Tasks description
Input:
Two files, each containing a list of Russian sentence pairs in XML format:
a) training set
b) test set
Output:
Task 1:
The algorithm should classify each pair into one of three classes: Non-paraphrase, Near-paraphrase, Precise-paraphrase
Task 2:
The algorithm should classify each pair into one of two classes: Non-paraphrase, Paraphrase
4. Algorithm Data-Flow
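[Diagram: data flow from Input to Output. Components shown:]
● Substitution of acronyms using an online dictionary: wiktionary.org
● Translation engines: Google, Yandex, Microsoft
● Similarity toolkits: SEMILAR Toolkit, DKPro Similarity, Python difflib, NLTK WordNet, Swoogle, BLEU algorithms
● Gradient Boosting Classifier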
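A minimal sketch of how these components could be chained. The ordering (acronym expansion first, then translation, then scoring) is an assumption read off the diagram, and the acronym table, `translators`, and `scorers` are hypothetical stand-ins for the wiktionary.org lookup, the translation engines, and the toolkits.

```python
import re

# Hypothetical acronym table; the slides only say that expansions come
# from wiktionary.org and do not describe the lookup itself.
ACRONYMS = {"ЦБ": "Центральный банк", "МВД": "Министерство внутренних дел"}

def expand_acronyms(sentence, acronyms=ACRONYMS):
    """Replace every whole-word acronym found in the table with its expansion."""
    def repl(match):
        word = match.group(0)
        return acronyms.get(word, word)
    return re.sub(r"\w+", repl, sentence)

def build_features(s1, s2, translators, scorers):
    """Expand acronyms, translate with each engine, then score each translated pair.

    `translators` are callables str -> str and `scorers` are callables
    (str, str) -> float; both are placeholders for the real services.
    """
    s1, s2 = expand_acronyms(s1), expand_acronyms(s2)
    features = []
    for translate in translators:
        t1, t2 = translate(s1), translate(s2)
        features.extend(score(t1, t2) for score in scorers)
    return features
```

With three translators and one scorer, `build_features` returns three values, one per engine, which matches the "scores × 3 translation engines" layout of the feature vector.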
5. Classification algorithm
● Gradient Boosting classifier
● Task 1:
– Feature vector containing 77 features:
● 18 features: 6 SEMILAR toolkit scores × 3 translation engines
● 39 features: 13 DKPro Similarity toolkit scores × 3 translation engines
● 3 features: 1 Python difflib similarity score × 3 translation engines
● 6 features: 2 sentence similarity scores (Yuhua Li, David McLean, et al.) × 3 translation engines
● 3 features: 1 Swoogle comparator score × 3 translation engines
● 8 features: BLEU scores on the source sentences (in Russian)
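The feature counts above can be checked with a short sketch. The group names and the column order are illustrative assumptions; only the counts come from the slide.

```python
# (group name, scores per sentence pair, number of translation engines).
# BLEU is computed on the Russian source, so it is not multiplied by 3.
FEATURE_GROUPS = [
    ("SEMILAR", 6, 3),
    ("DKPro Similarity", 13, 3),
    ("Python difflib", 1, 3),
    ("Sentence similarity (Li, McLean, et al.)", 2, 3),
    ("Swoogle", 1, 3),
    ("BLEU (Russian source)", 8, 1),
]

def feature_layout(groups=FEATURE_GROUPS):
    """Return ({group: (start, stop)} column ranges, total vector length)."""
    layout, offset = {}, 0
    for name, n_scores, n_engines in groups:
        width = n_scores * n_engines
        layout[name] = (offset, offset + width)
        offset += width
    return layout, offset

layout, total = feature_layout()
print(total)  # 77
```

Keeping the column ranges per group also makes it easy to train on a single toolkit's columns when comparing toolkits.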
8. greedyComparerWNLin
This score comes from a sentence-to-sentence similarity method that greedily aligns words between the two given sentences. The word-to-word similarity used for the alignment is the WordNet-based measure proposed by Lin in 1998 in the article "An information-theoretic definition of similarity".
Please refer to:
A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics
http://www.aclweb.org/website/old_anthology/W/W12/W12-20.pdf#page=175
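A minimal sketch of greedy word alignment, assuming one common formulation: each word takes its best match in the other sentence, and the two directions are averaged. SEMILAR's exact normalization is not given in the slides, and `toy_word_sim` is a placeholder for Lin's WordNet measure.

```python
def greedy_similarity(words_a, words_b, word_sim):
    """Average, over both directions, of each word's best match in the other sentence."""
    def directed(src, dst):
        if not src or not dst:
            return 0.0
        # Greedy step: every source word independently picks its best target word.
        return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)
    return (directed(words_a, words_b) + directed(words_b, words_a)) / 2

def toy_word_sim(w, v):
    """Placeholder word-to-word similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if w == v else 0.0
```

Because each word chooses independently, several words may align to the same target word; this is what distinguishes the greedy method from an optimal one-to-one alignment.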
9. optimumComparerLSATasa
Similar to greedyComparerWNLin, but the words are aligned optimally (as in the job assignment problem) and the word-to-word similarity method is based on LSA.
Article name: Latent Semantic Analysis Models on Wikipedia and TASA
http://deeptutor2.memphis.edu/Semilar-Web/public/downloads/LSA-Models-LREC014/LSAModelsOnWikipediaAndTASADanEtAl-LREC014.pdf
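The optimal alignment can be illustrated with a brute-force sketch that tries every one-to-one assignment. This is only practical for very short sentences (a real implementation would use an assignment solver such as the Hungarian algorithm), and normalizing by the longer sentence length is an assumption, not SEMILAR's documented formula.

```python
from itertools import permutations

def optimal_similarity(words_a, words_b, word_sim):
    """Best one-to-one alignment score, normalized by the longer sentence length.

    `word_sim` is assumed to be symmetric. Runtime grows factorially, so this
    is for illustration on short inputs only.
    """
    short, long_ = sorted((words_a, words_b), key=len)
    if not short:
        return 0.0
    best = max(
        sum(word_sim(w, v) for w, v in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best / len(long_)
```

Unlike the greedy method, no target word can be used twice, so a word may be forced onto its second-best match when that raises the total score.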
12. lsaComparer
LSA-based word representations are summed for each sentence, and the similarity is calculated from the resultant vectors.
● The resultant-vector method is described in the article: NeRoSim: A System for Measuring and Interpreting Semantic Textual Similarity
http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval030.pdf
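A minimal sketch of the resultant-vector idea. The toy 2-dimensional vectors stand in for a real LSA model, and the use of cosine similarity to compare the summed vectors is an assumption; the slide only says the similarity is calculated from the resultant representation.

```python
import math

# Toy vectors standing in for a real LSA model (hypothetical values).
TOY_VECTORS = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "car": [0.0, 1.0]}

def resultant_vector(words, vectors):
    """Sum the vectors of all words that have an entry in `vectors`."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for w in words:
        for i, x in enumerate(vectors.get(w, ())):
            total[i] += x
    return total

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lsa_similarity(words_a, words_b, vectors=TOY_VECTORS):
    return cosine(resultant_vector(words_a, vectors), resultant_vector(words_b, vectors))
```

Words missing from the model contribute nothing to the sum, so a sentence made up only of unknown words yields a zero vector and a similarity of 0.0.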
15. The four remaining toolkits
● Python difflib comparator
● NLTK WordNet: sentence similarity scores (Yuhua Li, David McLean, et al.)
● Swoogle comparator
● BLEU scores (computed on the Russian text, so no English translation is needed): bleu def 1-gram, bleu def 2-gram, bleu def 3-gram, bleu def 4-gram, bleu lin 1-gram, bleu lin 2-gram, bleu lin 3-gram, bleu lin 4-gram
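Two of these scores can be sketched with the standard library alone. The slides do not say which difflib function was used, so `SequenceMatcher.ratio()` is an assumption; the "def" and "lin" BLEU variants are not defined in the slides either, so the second function shows only the clipped n-gram precision at the core of BLEU, not either variant.

```python
import difflib
from collections import Counter

def difflib_score(s1, s2):
    """Character-level similarity ratio in [0, 1] from difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, s1, s2).ratio()

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against `reference` (token lists)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total
```

Calling `ngram_precision` with n = 1 to 4 on the tokenized Russian sentences gives four values per pair, in line with the 1-gram to 4-gram scores listed above.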
16. Results on Test Set
Task                  Accuracy  F1 macro  Place
First Task Standard   0.5695    0.5437    4 out of 11
Second Task Standard  0.7153    0.7853    6 out of 10
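For reference, the two metrics in the table can be computed as follows. This sketch uses the textbook definitions of accuracy and macro-averaged F1; the shared task's exact scoring script and label encoding are not given in the slides, so the 0/1 labels in the usage note are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of pairs whose predicted class equals the gold class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with gold labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 0]`, accuracy is 0.75, and macro F1 averages the per-class values 0.8 (class 0) and 2/3 (class 1).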
17. What Impact Did Each Toolkit Have?
[Figure: bar chart of Accuracy and F1 macro (in %, y-axis from 66.00 to 82.00) for each toolkit: SEMILAR, DKPro Similarity, Swoogle, NLTK WordNet, Python difflib. Bar labels in extraction order: 80.13, 79.52, 78.94, 78.76, 75.92, 77.02, 75.78, 75.03, 75.02, 71.36.]
5-fold cross-validation results on the training set, Second Task
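A minimal sketch of how a 5-fold split can be produced. Whether the original experiments shuffled or stratified the training pairs is not stated, so this version simply uses contiguous folds.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

A classifier is trained on each `train` index set and scored on the matching `test` set, and the five scores are averaged to give one cross-validation figure per configuration.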
18. Which Translation Engine is Better?
[Figure: 5-fold cross-validation results on the training set, Second Task, for each translation engine.]
X-axis symbols for the toolkits:
1: SEMILAR, 2: DKPro Similarity, 3: Python difflib, 4: NLTK WordNet, 5: Swoogle, 6: all 5 toolkits together
19. Conclusion
With this algorithm, semantic similarity can be detected not only for Russian but also for any other language for which translation is available through the translation engines.