How can I compare two or more Word files against each other and return their similarity out of 100?

I am building a system that is meant to reduce the plagiarism rate within a university by comparing student works. To start, a user (a teacher) uploads student works to the database as MS Word files, which vary in page count. The teacher then selects one file, which is registered with upload ID 1 (assigned automatically by the system), and selects all the other files. Next, the teacher clicks a "Check Work" button to start the operation, and the works that are found to be similar are returned with their percentage of similarity and the student names. If the percentage is >= 50, that is a plagiarism case.


  • XiveXive Egypt
    edited May 2014

    A quick, simple algorithm that comes to mind for comparing two MS Word files is to do the following with each of the two files:

    1. Combine all pages into a single string (it now contains the sentences as plain text).

    2. Replace special characters such as ! ? . , # % and line breaks (CR/LF) with a space, then collapse repeated white-space into single spaces. The spaces between words must survive, because step 4 splits on them.

    3. Force lowercase on all the letters.

    4. You now have one very long string of words separated by single spaces.
      Split that string on the spaces to get an array of the words used in that string (i.e. in the MS Word paper).

    5. You have two options for comparing the two arrays of words, depending on how sophisticated you want the system to be:

    A. Run through each word and check whether it occurs in the other array; more word matches mean a higher probability of a plagiarism case. That method isn't very accurate, though, because word-by-word matches are not bulletproof evidence that the person cheated.

    B. Check whether runs of 3 or more consecutive words occur in the same sequence in the other array; again, more matches mean a higher probability of a plagiarism case.
    This is a more accurate method in my opinion, as a run of three consecutive matching words is evidence that the two papers contain identical sentences. You can increase the number of consecutive words if you don't feel comfortable with three: the longer the run, the stronger the evidence. It shouldn't exceed 7 words, however, as a run that long might not catch any cases, and in any case it shouldn't exceed the maximum number of words a sentence can have.
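
    As a rough illustration, here is how steps 1 to 4 and both options from step 5 might look in Python. This is only a sketch under a few assumptions of mine: the files are .docx (the older binary .doc format would need a different reader), and the function names docx_to_text, tokenize, word_overlap, word_runs and shared_runs are made up for this example.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Step 1: join the text of every paragraph into one string (.docx only)."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return " ".join(paragraphs)


def tokenize(text):
    """Steps 2 to 4: lowercase, turn punctuation into spaces, split into words."""
    cleaned = re.sub(r"[^\w\s]|_", " ", text.lower())
    return cleaned.split()  # split() also swallows line breaks and repeated spaces


def word_overlap(words_a, words_b):
    """Option A: fraction of the distinct words in A that also occur in B."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a) if set_a else 0.0


def word_runs(words, n=3):
    """Option B helper: the set of all runs of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_runs(words_a, words_b, n=3):
    """Option B: runs of n consecutive words that appear in both papers."""
    return word_runs(words_a, n) & word_runs(words_b, n)
```

    Storing the runs in sets keeps the lookup cheap even for long papers; the trade-off is that a run repeated several times in one paper only counts once.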

    The percentage should be the number of word combinations that matched successfully divided by the total number of those combinations, multiplied by 100.
    I would strongly recommend using regular expressions, as they can be significantly faster than comparing strings one word at a time.
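
    To make the percentage concrete, here is a small Python sketch. It rests on my own reading of "total number of combinations", namely the distinct runs of n consecutive words in the selected paper, and it assumes the plain text has already been extracted from the Word files. The names tokenize, word_runs, similarity_percent and check_work are hypothetical, and the default threshold of 50 comes from the question.

```python
import re


def tokenize(text):
    """Lowercase, replace punctuation with spaces, split into words."""
    return re.sub(r"[^\w\s]|_", " ", text.lower()).split()


def word_runs(words, n):
    """Set of all runs of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity_percent(text_a, text_b, n=3):
    """Share of A's distinct n-word runs that also occur in B, from 0 to 100."""
    runs_a = word_runs(tokenize(text_a), n)
    if not runs_a:
        return 0.0  # A is shorter than n words, nothing to compare
    runs_b = word_runs(tokenize(text_b), n)
    return 100.0 * len(runs_a & runs_b) / len(runs_a)


def check_work(selected_text, other_texts, threshold=50.0, n=3):
    """Compare the selected work with every other work.

    other_texts maps a student name to that student's text. Returns
    (name, percent) pairs at or above the threshold, highest first.
    """
    results = []
    for name, text in other_texts.items():
        percent = similarity_percent(selected_text, text, n)
        if percent >= threshold:
            results.append((name, round(percent, 1)))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    selected = "The quick brown fox jumps over the lazy dog."
    others = {
        "student_b": "the quick brown fox jumps over a sleeping cat",
        "student_c": "Completely unrelated essay about databases.",
    }
    print(check_work(selected, others))  # [('student_b', 57.1)]
```

    Note that this measure is not symmetric: it divides by the runs of the selected paper only, so comparing A against B can give a different percentage than comparing B against A. If that matters, you could divide by the runs of both papers combined instead.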

    Good luck.
