Slides from a lightning talk about the Perl module Text::Perfide::BookPairs, presented at the I International Per-fide Workshops, University of Minho, 2011.
2. What we get
André Santos andrefs@cpan.org
Identifying similar text documents
3. Duplicated versions
4. Duplicated versions
5. Candidate pairs
6. Candidate pairs
7. Candidate pairs
8. What this is really about
similarity
11. It’s all LIEs!
Language Independent Element (LIE)
Terms which are usually kept untouched during
translation.
Year references (e.g. “1977”)
Proper names (e.g. “Sherlock Holmes”)
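The two kinds of LIE listed above can be approximated with simple patterns. The following is a minimal Python sketch, not the Text::Perfide::BookPairs implementation: the function name `extract_lies` and both regular expressions are assumptions made for illustration.

```python
import re

# Assumption: a year reference is a standalone 4-digit number from 1000 to 2099.
YEAR_RE = re.compile(r"\b(?:1\d{3}|20\d{2})\b")
# Assumption: a proper name is a run of two or more capitalised ASCII words.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_lies(text):
    """Return the set of candidate LIEs (years and multi-word names) found in text."""
    return set(YEAR_RE.findall(text)) | set(NAME_RE.findall(text))


english = extract_lies("In 1977, Sherlock Holmes returned to London.")
portuguese = extract_lies("Em 1977, Sherlock Holmes voltou a Londres.")
print(english == portuguese)
```

Both sentences yield the same two elements, "1977" and "Sherlock Holmes", even though the surrounding words differ. A pattern this naive also picks up a capitalised sentence-initial word when it precedes a name, and it ignores accented letters.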
12. Measuring similarity
$$\mathrm{similarity}(A, B) = \frac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|}$$
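This ratio is the Jaccard index of the two LIE sets. A minimal Python sketch follows; the function name and the handling of two empty sets are assumptions, not something the slides specify.

```python
def similarity(lies_a, lies_b):
    """Jaccard index of two LIE sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(lies_a), set(lies_b)
    union = a | b
    if not union:
        # Assumption: two documents without any LIEs score 0.0.
        return 0.0
    return len(a & b) / len(union)


doc_a = {"1977", "1881", "Sherlock Holmes"}
doc_b = {"1977", "Sherlock Holmes", "Baker Street"}
# 2 shared elements out of 4 distinct ones
print(similarity(doc_a, doc_b))  # 0.5
```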
13. Measuring similarity
14. pairbooks
Similarity values
< 0.2 Documents are not related
> 0.4 Documents are candidate pairs
> 0.9 Documents are near duplicates
1.0 Documents are duplicates
Languages
High similarity, same language: (Near) duplicates
High similarity, different language: Candidate pairs
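The two rules above could be combined roughly as follows. This is a hedged Python sketch, not pairbooks' actual logic: the function names are hypothetical, the slide gives no label for scores between 0.2 and 0.4 (returned here as `None`), and it does not define "high similarity", so the `high` cut-off of 0.4 is an assumption.

```python
def score_band(score):
    """Map a similarity score to the label listed on the slide, or None."""
    if score == 1.0:
        return "duplicates"
    if score > 0.9:
        return "near duplicates"
    if score > 0.4:
        return "candidate pairs"
    if score < 0.2:
        return "not related"
    return None  # the slide assigns no label to scores from 0.2 to 0.4


def interpret(score, lang_a, lang_b, high=0.4):
    """Apply the language rule to scores above the assumed `high` cut-off."""
    if score <= high:
        return score_band(score)
    return "(near) duplicates" if lang_a == lang_b else "candidate pairs"


print(interpret(0.95, "en", "pt"))  # candidate pairs
print(interpret(0.95, "en", "en"))  # (near) duplicates
```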
16. Perfect LIEs do not exist
Year references
Can be confused with page numbers
Headers/footers can contain them
(publishing year, copyright, ...)
Proper names
Are sometimes translated (e.g. “São Tomé”, “Judas Tomé”, etc.)
Some languages use different scripts
(e.g. Russian)
Some languages have declensions
...
17. How to improve LIEs (future work)
accept a list of equivalent words
accept a list of stop words
...
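One way these two proposals might work is as a normalisation pass over each LIE set before the similarity is computed. The Python sketch below is purely illustrative: `normalize_lies`, its parameters, and the sample equivalence and stop-word entries are all hypothetical and do not describe an existing feature of the module.

```python
def normalize_lies(lies, equivalents=None, stop_words=None):
    """Map each LIE to a canonical form and drop the ones listed as stop words."""
    equivalents = equivalents or {}
    stop_words = set(stop_words or ())
    normalized = set()
    for lie in lies:
        # Hypothetical equivalence list: translated name -> canonical name.
        canonical = equivalents.get(lie, lie)
        if canonical not in stop_words:
            normalized.add(canonical)
    return normalized


portuguese_lies = {"São Tomé", "1977", "Copyright"}
cleaned = normalize_lies(
    portuguese_lies,
    equivalents={"São Tomé": "Saint Thomas"},  # illustrative entry
    stop_words={"Copyright"},                  # illustrative entry
)
print(sorted(cleaned))  # ['1977', 'Saint Thomas']
```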
18. Give me one of those!
CPAN
http://search.cpan.org/perldoc?pairbooks
Developer version
requires Linux, Perl
Incomplete documentation