Can Whitespace Patterns Provide Clues to Plagiarism?
Over the years I've run into expert witnesses and attorneys who
have told me about software copyright infringement cases where the
only clues that copying occurred were patterns of spaces and tabs
("whitespace"). The idea is that if a truly ambitious
thief wanted to cover his tracks, he would modify the stolen code
so much that there was no longer a visible trace of copying. However,
the clever software sleuth could find patterns of whitespace that
the thief had missed; although virtually nothing remained, the invisible
tabs and spaces could produce a conviction.
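
To make the idea concrete, here is a minimal Python sketch of what comparing whitespace patterns could look like. It is an illustration only: it assumes a file's "pattern" is the sequence of space and tab runs on each non-blank line, and it is not the algorithm used in any paper or in CodeSuite. The names whitespace_pattern and pattern_similarity are hypothetical.

```python
import re
from difflib import SequenceMatcher


def whitespace_pattern(source: str) -> list[str]:
    """Encode each non-blank line as its runs of spaces (S) and tabs (T).

    Assumption for illustration: all visible characters are ignored and
    only the whitespace runs are kept, joined by '|'.
    """
    patterns = []
    for line in source.splitlines():
        if not line.strip():
            continue  # skip blank lines
        runs = re.findall(r"[ \t]+", line)
        encoded = (run.replace(" ", "S").replace("\t", "T") for run in runs)
        patterns.append("|".join(encoded))
    return patterns


def pattern_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two whitespace patterns."""
    matcher = SequenceMatcher(
        None, whitespace_pattern(a), whitespace_pattern(b), autojunk=False
    )
    return matcher.ratio()


# Two snippets with every identifier renamed but the spacing untouched.
original = "int total = 0;\n\tfor (i = 0; i < n; i++)\n\t\ttotal += a[i];\n"
renamed = "int sum = 0;\n\tfor (k = 0; k < len; k++)\n\t\tsum += v[k];\n"

print(whitespace_pattern(original))
print(pattern_similarity(original, renamed))  # 1.0
```

Renaming every identifier leaves the score at 1.0, which is the kind of invisible trace the theory relies on. Any threshold for calling two files "similar" would be a further assumption.
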
This always sounded intriguing, but I wondered whether anyone had
ever tested this theory. We could find no articles or papers on
the subject, except for one inconclusive
paper, and I dreaded to think that some programmer had been convicted
based on an untested theory. I decided to have my consulting company,
Zeidman Consulting,
do some carefully controlled research. If the results turned out
well, SAFE Corporation would add whitespace pattern algorithms to
CodeSuite to further enhance its ability to detect copying.
Our results were published in a paper entitled "Measuring Whitespace
Patterns as an Indication of Plagiarism," which was recently presented
at the ADFSL
Conference on Digital Forensics, Security and Law. The paper's final
paragraph summarizes them:
"This whitespace pattern matching method can be used to focus
a search for evidence of similarity or copying, but this method
cannot stand by itself."
What we discovered is that even very different files often
have similar whitespace patterns. At Zeidman Consulting we've used
whitespace patterns to confirm copying that was already detected
through the use of CodeMatch to find correlated programming elements.
In those cases, the whitespace patterns offered further confidence
in our findings and in some cases showed which program had been
developed first. For a copy of the paper, email us at info@SAFE-corp.com.
Our next research project is to look at sequences of whitespace
within files. Maybe there we'll find some clues to copying. But
for now our results show that whitespace patterns without any other
evidence should not be used to determine that copying occurred.