DUPE: Depository of Universal Plagiarism Examples
In 2003 I created the CodeMatch program, which very quickly became
a de facto standard in software IP litigation. I also created a test
bench of purposely plagiarized code that could be used to independently
and objectively compare the results produced by different plagiarism
detection programs. Some in the academic community claimed that
my tests were biased toward the algorithms used by CodeMatch, and that
this bias explained why CodeMatch fared so well compared to the other programs.
However, these same critics, despite my requests, never produced
their own set of standard tests.
Although I believe that the standard tests I have used are not
biased, it occurred to me that there could be a better way to eliminate
even unintentional bias. The solution would be to take the source
code for certain open-source programs and announce a new open-source
project that would involve purposely plagiarizing the code. Programmers
from around the world would be invited, perhaps in a competition,
to change the source code while retaining its functionality. The
original programs and the plagiarized versions submitted by others
would be stored in a database known as the Depository of Universal
Plagiarism Examples, or DUPE. Plagiarism detection programs would
then be run on DUPE and comparisons of the results could be made
to determine which programs best detected copying. Also, important
statistics about plagiarized code could be determined, as well as
patterns identified in order to improve the plagiarism detection
programs.
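
As a rough illustration of how such a comparison might be scored, the sketch below runs a detector over labeled pairs of files and reports precision and recall. Everything in it is an assumption made for illustration: the Pair record, the evaluate function, the 0.5 threshold, and the toy jaccard_detector are hypothetical, and the toy detector is a simple token-overlap measure, not the algorithm used by CodeMatch or any other real program.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Pair:
    """One hypothetical DUPE entry: an original, a candidate, and a label."""
    original: str
    candidate: str
    plagiarized: bool  # ground truth: was the candidate derived from the original?


def tokenize(source: str) -> set:
    # Split into identifiers, numbers, and single punctuation characters.
    return set(re.findall(r"[A-Za-z_]\w*|\d+|\S", source))


def jaccard_detector(a: str, b: str) -> float:
    """Toy detector (an assumption): token-set overlap in the range 0..1."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def evaluate(detector: Callable[[str, str], float],
             pairs: List[Pair],
             threshold: float = 0.5) -> Dict[str, float]:
    """Score a detector against labeled pairs using precision and recall."""
    tp = fp = fn = 0
    for pair in pairs:
        flagged = detector(pair.original, pair.candidate) >= threshold
        if flagged and pair.plagiarized:
            tp += 1
        elif flagged and not pair.plagiarized:
            fp += 1
        elif not flagged and pair.plagiarized:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


if __name__ == "__main__":
    sample = [
        Pair("int total = a + b; return total;",
             "int sum = a + b; return sum;", True),
        Pair("int total = a + b; return total;",
             "print('hello')", False),
    ]
    print(evaluate(jaccard_detector, sample))
```

Because every detector would be scored by the same function against the same labeled pairs, none of them is favored by the choice of test cases, which is the property the shared database is meant to provide.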
SAFE Corporation has begun looking into creating this database.
However, we would like to work with partners in
academia and industry. We believe that there are several key issues
that need to be resolved in creating DUPE. These are:
- Choosing appropriate open-source projects.
- Creating a minimum definition of software plagiarism.
- Creating the database.
- Determining policies including who can access it, how it will
be used, and who will maintain it.
- Determining how to run the tests, how to generate the results,
and how to distribute the results.
Please contact me if you're interested in working on this important
and groundbreaking project.