This tool implements a web application that lets you check a document and/or a document's metadata against a Fedora repository to find possible duplicates.
The similarity algorithm is a Java port of the original C code from "Plagiarism Detection in arXiv", Daria Sorokina, Johannes Gehrke, Simeon Warner, Paul Ginsparg [ICDM'06, http://arxiv.org/abs/cs/0702012]. The sources include portions based on code written by and copyright Daria Sorokina, 2005.
Fedora 2.2.1, Java 1.5 and Tomcat 5 are needed to run the application. Newer versions should work, too.
Copy the docsim.war into Tomcat's webapps directory. If your Fedora configuration differs from
then you have to adjust the settings in webapps/docsim/WEB-INF/classes/docsim.properties.
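Before changing anything, it can help to see which settings the deployed file currently contains. The following is a minimal sketch, not part of docsim itself: the class name ShowSettings and its methods are hypothetical, and the default path assumes the program is started from the Tomcat base directory. It uses only the standard java.util.Properties API and makes no assumption about which keys the file defines.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper, not part of docsim: lists the settings in a properties file.
public class ShowSettings {

    // Reads key/value pairs in the standard Java properties format.
    static Properties load(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    // Returns one "key = value" line per setting, sorted by key.
    static String describe(Properties props) {
        StringBuilder sb = new StringBuilder();
        for (String key : new TreeSet<String>(props.stringPropertyNames())) {
            sb.append(key).append(" = ").append(props.getProperty(key)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: run from the Tomcat base directory unless a path is given.
        String path = args.length > 0
                ? args[0]
                : "webapps/docsim/WEB-INF/classes/docsim.properties";
        try (Reader in = new FileReader(path)) {
            System.out.print(describe(load(in)));
        }
    }
}
```

Tomcat usually has to reload the web application (or be restarted) before edited properties take effect.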
Start your web browser and open the entry page of the application by calling
Now you see a form where you can enter document metadata (title, authors) and/or a document file name together with the document's MIME type. To run the check with a document file and MIME type, you first have to create the index (press the "rebuild index" button).
The check with title or authors only finds exact matches.
The check with file + MIME type finds all documents that overlap with the given document. In docsim.properties (see above) you can specify a threshold for the overlap percentage. The default is 75 %.
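How such a threshold could be applied is sketched below. This is an illustration under stated assumptions, not the actual docsim code: the property key "overlap.threshold", the class ThresholdSketch and the sample document identifiers are all hypothetical, and whether docsim treats a value exactly equal to the threshold as a match is also an assumption. Only the default of 75 % comes from the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Illustrative sketch only; names and key are assumptions, not docsim's own.
public class ThresholdSketch {

    // Hypothetical key; check docsim.properties for the real name.
    static final String THRESHOLD_KEY = "overlap.threshold";
    // Default stated in this README.
    static final double DEFAULT_THRESHOLD = 75.0;

    // Returns the configured threshold in percent, or the default if unset.
    static double readThreshold(Properties props) {
        String raw = props.getProperty(THRESHOLD_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_THRESHOLD;
        }
        return Double.parseDouble(raw.trim());
    }

    // Keeps the documents whose overlap percentage reaches the threshold.
    // Assumption: a value equal to the threshold counts as a match.
    static List<String> aboveThreshold(Map<String, Double> overlapById, double threshold) {
        List<String> hits = new ArrayList<String>();
        for (Map.Entry<String, Double> entry : overlapById.entrySet()) {
            if (entry.getValue() >= threshold) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double threshold = readThreshold(new Properties()); // nothing set, so the default applies

        // Made-up overlap percentages for three hypothetical repository objects.
        Map<String, Double> overlaps = new LinkedHashMap<String, Double>();
        overlaps.put("demo:1", 92.5);
        overlaps.put("demo:2", 40.0);
        overlaps.put("demo:3", 75.0);

        System.out.println(aboveThreshold(overlaps, threshold));
    }
}
```

Lowering the threshold reports more candidate duplicates at the cost of more false positives; raising it does the opposite.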
To run the check, press the "run check" button. The result is displayed in human-readable form.