Slides from the ALLEA Kwan symposium, Amsterdam 2011-12-14, about technical possibilities of detecting plagiarism - Comparative analysis of detection tools.
ALLEA KWAN symposium Amsterdam 2011-12-14
1. flickr, cc by-nc jobadge, 2011
Technical possibilities of detecting plagiarism -
Comparative analysis of detection tools
Katrin Köhler (B.SC.)
Plagiarism - legal, moral and educational aspects, Amsterdam, 2011-12-14
Slides based on Debora Weber-Wulff, edited by Katrin Köhler
2. About me
• Research assistant of Prof. Dr. Weber-Wulff
since 2007
• Software test in 2008 and 2010
• Master's thesis about “Cryptographic Watermarking
for Texts”
3. Contents
• Plagiarism Detection Test 2010
• Doctoral thesis of Karl-Theodor zu Guttenberg
• Discovering plagiarism
4. Teachers and administrations
want a simple solution
Photo: Flickr cc-by-nc-sa: xtrarant, 2008
Art Installation: Jamie Pawlus, Indianapolis, Indiana, 2003
6. Plagiarism detection software
• Can be extremely expensive!
• Teachers want to have all papers
marked original or plagiarism before
they start reading them.
• Students are afraid of wrongly being
labeled plagiarists.
• Only a teacher can decide if it is indeed
plagiarism! Software cannot be used to solve
social problems.
• Prof. Dr. Weber-Wulff has tested plagiarism
detection software 4.5 times: 2004, 2007, 2008,
2010 and zu Guttenberg’s thesis
7. Test process 2010
• 9 months of work with 2 persons
• 42 test cases in English, German
and Japanese
• Different types of plagiarism,
a few originals
• Market survey
• Access to the systems
• 48 systems found, 26 could be
completely evaluated
8. Evaluation metric: Effectiveness
• Plagiarism or not:
What was found?
• Total
• Without the first 10 tests
(Google accident)
• English cases
• Japanese cases as an additional
challenge
Flickr, cc-by, arthit, 2005
➡ No winner,
results form a continuum between 55% and 64%
9. Evaluation metric: Usability
• Design, language consistency, navigation,
labelling, print quality of the reports, fits in
university processes
• Support by email:
Speed, good answers
• Top: PlagScan, followed by
PlagiarismFinder, Ephorus,
PlagAware and TurnItIn
Flickr, cc-by, Quapan, 2008
10. Evaluation metric: Professionalism
• Street address with town, telephone
number, name of a person
• Domain registration in own name
Flickr, cc-by-sa, sludgegulper, 2008
• No parallel offers of term papers or
pornography or advertising for such services
• German-speaking availability by telephone
during German working hours
• No installation of viruses
➡ PlagiarismFinder, followed by PlagAware,
Strike Plagiarism, TurnItIn, Docoloc,
PlagScan, Blackboard
11. Problems: Effectiveness
• Nothing found from books - not
even if they are in Google
Books!
• We had one 100% plagiarism
from Google books register at
less than 25%
• Translations are not found
12. Problems: Effectiveness
• Umlauts cause problems, although less so than
in earlier tests
• Redacted texts are found less often
• Many systems very
difficult to use
• Not all companies
trustworthy
• Some keep copies - and
award themselves
rights to use the text!
13. Problems: Usability
• Language mix
• Workflow problems
• The reports are generally not useful
14. Problems: Professionalism
• No info, no names
• The address listed is a parking lot
• Support questions are not answered, nobody
picks up the telephone
• Offer term papers or
pornography in parallel,
all rights given
to the company
15. How to rank?
• No system was best in all of the metrics
• We set up a ranking for each of the five criteria
(three effectiveness, one usability, one
professionalism)
• Calculated the average ranking
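The ranking step above can be sketched in a few lines. This is only an illustration under stated assumptions: the system names and scores are hypothetical, higher scores are taken to be better, and ties are broken by input order, since the slides do not say how ties were handled.

```python
from statistics import mean

def average_ranks(scores):
    """Rank every system per criterion (1 = best), then average the ranks.

    scores maps a system name to a list of scores, one per criterion.
    Assumption: higher scores are better; ties keep input order.
    """
    systems = list(scores)
    n_criteria = len(next(iter(scores.values())))
    ranks = {s: [] for s in systems}
    for c in range(n_criteria):
        ordered = sorted(systems, key=lambda s: scores[s][c], reverse=True)
        for position, s in enumerate(ordered, start=1):
            ranks[s].append(position)
    return {s: mean(r) for s, r in ranks.items()}

# Hypothetical data: three effectiveness scores, one usability score,
# one professionalism score per system.
example_scores = {
    "A": [0.6, 0.5, 0.7, 3, 2],
    "B": [0.5, 0.6, 0.6, 5, 4],
    "C": [0.4, 0.4, 0.5, 4, 5],
}
avg = average_ranks(example_scores)
best = min(avg, key=avg.get)  # lowest average rank wins
```

Averaging ranks rather than raw scores keeps a criterion with a large numeric range from dominating the others, which may be why a rank-based summary was chosen here.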
16. Results: Useful
• There were no systems in
this category - only humans
are able to reach this level of
effectiveness.
Flickr, cc-by-nc, dianejp, 2009
18. Partially useful: PlagAware
• German System
• Good documentation
• Average effectiveness: 61%
• But: each file must be submitted by itself (5
clicks!), this does not fit with the workflow
• Looks for plagiarism in online texts
20. Partially useful: turnitin
• Best results for material that is stored in their
database
• Translation problems
• Umlaut problems
• Returns Wikipedia copies with ads for porn
• The source URLs reported are often no longer
valid
• Just adds up the percent values for the
“originality” report
• Only system to deal with Japanese properly
30. Partially useful: PlagScan
• Newcomer from Germany
• One purchases “PlagPoints”
• Useful: Subaccounts for teachers
• First place in usability
• Three kinds of report, none of which is a
side-by-side report
• Only 60% in effectiveness
33. Partially useful: Urkund
• Swedish system
• Second in overall effectiveness
• 13th in usability and professionalism
• Language problems
• Complex navigation
• Catastrophic layout
• Unusable reports
• Cryptic error messages
• Test cases from 2008 were still stored
36. Barely useful systems
• They find something, but miss a lot
• They are not really easy to use
• They have professionalism problems
• Docoloc, Copyscape, Blackboard/SafeAssign,
PlagiarismFinder, Plagiarisma, Compilatio,
StrikePlagiarism, The Plagiarism Checker
38. checkforplagiarism.net
• In 2007 it was called
iPlagiarismcheck.com
• Was a plagiarism of
turnitin, but they said:
These are the sources!
• Charges 15 €
for 5 tests, students
are the target group
• turnitin set up a
honeypot
40. Viper
• Is installed on a PC
• In the terms of use: You give us
irrevocable rights to use your text
as we see fit
• Also runs a paper mill
• Complicated reports
• Only 24% effectiveness -
better to flip a coin!
• Advertises in the UK by power
cleaning the sidewalks
42. GuttenPlag
Collaborative documentation of plagiarism
43. The Extent of
the Plagiarism
• 135 sources
• 94% of pages
• 63% of lines
44. Test Results
• 38 of the (at the time of the test) 131 known
sources were found by at least one of the
systems
• Many of these sources are not (or no longer) online
• Of all the known sources, the following were found:
System        Found   Share
iThenticate   30      23%
PlagScan      19      15%
Urkund        16      12%
PlagAware      7       5%
Ephorus        6       5%
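The percentages on this slide can be reproduced as each system's share of the 131 sources known at the time of the test. A minimal sketch, assuming the shares were rounded to whole percent (the function and variable names are illustrative only):

```python
KNOWN_SOURCES = 131  # sources known at the time of the test (from the slide)

# Number of sources found per system, as listed on the slide.
found = {
    "iThenticate": 30,
    "PlagScan": 19,
    "Urkund": 16,
    "PlagAware": 7,
    "Ephorus": 6,
}

def share_found(count, total=KNOWN_SOURCES):
    """Share of known sources found, rounded to whole percent."""
    return round(100 * count / total)

shares = {name: share_found(n) for name, n in found.items()}
# shares == {"iThenticate": 23, "PlagScan": 15, "Urkund": 12,
#            "PlagAware": 5, "Ephorus": 5}
```

Even the best system in this list located fewer than a quarter of the known sources.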
45. We tested these systems on
zu Guttenberg’s thesis
• The usability for such large
works was extremely poor
• The numbers appear to be
random
• Many sources throw a 404
“file not found” error with
iThenticate
• Nothing from books (or the
Bundestag) was found
46. The major problem is:
• They don’t find plagiarism! Just (marginally
changed) copies of text - even when properly referenced!
Flickr, cc-by-nc, Leeks, 2006
47. So let’s have a look ourselves....
• But doesn’t the thesis have to be available
digitally?
• And the thesis is so long?
• And the Internet
is extremely
large?
Flickr, cc-by-nc-nd, t_buchtele, 2009
48. Suspicion
• Upon careful reading you find it nicely written,
but .....
• The style is too polished, the vocabulary not that
of your students.
• There is some
strange formatting
• Interesting spelling
errors
• Lurching breaks in style
Flickr, cc-by, redcctshirt, 2009
49. Searching with Google & Co
• Phrase in "..."
• 3-5 nouns
• The typo
• Check the second page
of hits
• Set a time limit
Flickr, cc-by-nc-nd, Athena1970, 2008
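The first two search tips can be sketched as small query builders. This is a hedged illustration, not a tool from the talk: the helper names are hypothetical, and the base URL is an assumption about the search engine's public query form.

```python
from urllib.parse import urlencode

def phrase_query(phrase):
    """Wrap a suspicious phrase in double quotes for an exact-match search."""
    return '"' + phrase.strip() + '"'

def keyword_query(nouns):
    """Join three to five characteristic nouns into one query."""
    if not 3 <= len(nouns) <= 5:
        raise ValueError("use three to five nouns")
    return " ".join(nouns)

def search_url(query, base="https://www.google.com/search"):
    """Build a search URL; the base address is an assumption."""
    return base + "?" + urlencode({"q": query})

# Example: an exact phrase copied from the suspicious text.
url = search_url(phrase_query("lurching breaks in style"))
```

A misspelled word from the text works as an exact phrase in the same way, because the typo is likely to survive in the copied source.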