SAFE Corporation Logo
Software Analysis and Forensic Engineering
Newsletter Registration
Free Software DownloadPurchase CodeSuite License

CodeMatch Algorithms

The Algorithms

CodeMatch uses several algorithms to determine similarity between two source code files. These algorithms are described below. When multiple files are compared, file pairs are given a correlation score between 0 and 100. The file pairs are then ranked by CodeMatch score so that you can examine the most similar files.

Statement Matching

The CodeMatch Statement Matching algorithm compares each functional line of source code from both files, excluding comment lines. Also, all whitespace removed. For source lines to be considered matches, they must contain at least one non-keyword such as a variable name or function name. Otherwise, lines containing basic operations would be reported as matching. This algorithm yields a correlation score representing the number of matching instruction lines in the two files.

Comment/String Matching

The CodeMatch Comment/String Matching algorithm compares each line of comments and each string from both files. Sequences of whitespace are converted to single spaces so that the syntax structure of the line is preserved. The entire comment or string is compared, regardless of whether there are keywords in the comment or string. This algorithm yields a correlation score representing the number of matching comments and strings in the two files.

Instruction Sequence Matching

The CodeMatch Instruction Sequence Matching algorithm compares the first instruction of every source line in the pair of files, ignoring blank lines and comment lines. This algorithm finds sequences of code that appear to perform the same functions despite changed comments and identifier names. The algorithm finds the longest common sequence within both files. This algorithm yields a correlation score representing the number of source lines in the longest matching instruction sequence in the two files. Even if variable names are altered in one file, CodeMatch will report similarities in the files. The following shows an example of two identical instruction sequences in C:

// File 1
if (x == 5)
    // Loop on j here
    for (j = 0; j < Index; j++)
        printf("x = %i", j);
    break;    // Here's the break

// File 2
if (xyz < 2)
    for (jjj = 0; jjj < i; jjj++)
        printf("Hello world\n");

Identifier Matching

For each file pair, the CodeMatch Identifier Matching algorithm counts the number of matching identifiers that are not programming language keywords. In order to determine whether an identifier is a programming language keyword, comparison is done with a list of programming language keywords. Only non-keywords are compared in order to find matching function names, variable names, and other identifiers. This algorithm also examines each non-keyword identifier in the source code of one file and finds all identifiers that match a sequence within one or more non-keyword identifiers in the other file. For example, if identifier abc is found in one file and the identifiers aabc, abc1111111, and abcxxxyz are found in anther file, then there is a partial match for the identifier abc. Partial matching finds common function names, variable names, and other identifiers where someone has attempted to disguise the identifiers in a fairly basic way. This algorithm yields a correlation score representing the number of matching and partially matching identifiers in the two files.

Correlation Score

Finally, a single correlation score is given for the similarity of the file pairs. If a file pair has a higher score, it implies that these files are more similar and may be plagiarized from each other or from a common third file. There are many kinds of plagiarism and many ways of fooling plagiarism detection programs. For this reason, CodeMatch produces an HTML output report with a list of file pairs ordered by their total match scores. The user can click on a file pair hyperlink to bring up a detailed HTML report showing exact matches that occurred between the selected file pairs. In this way, experts are directed to suspicious similarities and allowed to make their own judgments. CodeMatch is not a tool for precisely pinpointing plagiarized code, but rather a tool to assist an expert in finding plagiarized code. CodeMatch reduces the effort needed by the expert by allowing him to narrow his focus from hundreds of thousands of lines of code in hundreds of files to dozens of lines of code in dozens of files.

Terms & Conditions | Contact Us | Careers | Privacy Policy | Search the Site
Free Newsletter Registration