Duplication Check

This tool is a web application that checks a document and/or a document's metadata against a Fedora repository to find possible duplicates.

The similarity algorithm is a Java port of the original C code from "Plagiarism Detection in arXiv", Daria Sorokina, Johannes Gehrke, Simeon Warner, Paul Ginsparg [ICDM'06, http://arxiv.org/abs/cs/0702012]. The port includes portions based on code written by and copyright Daria Sorokina, 2005.


Fedora 2.2.1, Java 1.5 and Tomcat 5 are required to run the application. Newer versions should work, too.

Basic Installation

Copy the docsim.war into Tomcat's webapps directory. If your Fedora configuration differs from the defaults

URL: http://localhost:8082/fedora
account: fedoraAdmin/fedoraAdmin

then you have to adjust the settings in webapps/docsim/WEB-INF/classes/docsim.properties.
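
As a rough illustration of how such settings could be read, here is a minimal Java sketch. The property keys (fedora.url, fedora.user, fedora.password) and the class name DocsimSettings are hypothetical and chosen only for this example; check the docsim.properties shipped in the WAR for the real key names. Only the fallback values (URL and account) are taken from this README.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch only: the key names below are hypothetical, not the real docsim keys.
final class DocsimSettings {
    static final String URL_KEY = "fedora.url";            // hypothetical key
    static final String USER_KEY = "fedora.user";          // hypothetical key
    static final String PASSWORD_KEY = "fedora.password";  // hypothetical key

    // Parses properties-file text (key=value lines) into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail.
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Each getter falls back to the default named in this README.
    static String fedoraUrl(Properties props) {
        return props.getProperty(URL_KEY, "http://localhost:8082/fedora");
    }

    static String fedoraUser(Properties props) {
        return props.getProperty(USER_KEY, "fedoraAdmin");
    }

    static String fedoraPassword(Properties props) {
        return props.getProperty(PASSWORD_KEY, "fedoraAdmin");
    }

    public static void main(String[] args) {
        // Example override of the URL only; user and password keep the defaults.
        Properties props = parse("fedora.url=http://repo.example.org:8080/fedora\n");
        System.out.println(fedoraUrl(props));
        System.out.println(fedoraUser(props));
    }
}
```

In this sketch, any value missing from the file falls back to the default account and URL listed above, so only the settings that differ need to be written down.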


Start your web browser and open the entry page of the application at

http://<tomcat url>/docsim/

Now you see a form where you can enter document metadata (title, authors) and/or a document file name together with the document's MIME type. To run the check with a document file and MIME type, you first have to create the index (press the "rebuild index" button).

The check with title or authors finds exact matches only.

The check with file and MIME type finds all documents that overlap with the given document to some degree. In docsim.properties (see above) you can specify a threshold for the percentage of overlap. The default is 75 %.
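
To illustrate what a percentage threshold means in practice, here is a simplified Java sketch. It is not the ported algorithm from the paper cited above; it is a stand-in that compares sets of three-word sequences. The class name OverlapSketch, the sequence length of 3 and the inclusive comparison (>=) are assumptions made for this example; only the 75 % default comes from this README.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for illustration; NOT the algorithm used by docsim.
final class OverlapSketch {
    static final int SHINGLE_SIZE = 3;             // assumption: 3-word sequences
    static final double DEFAULT_THRESHOLD = 75.0;  // default named in this README

    // Splits text into lower-cased words and collects every run of
    // SHINGLE_SIZE consecutive words as one string.
    static Set<String> shingles(String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + SHINGLE_SIZE <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < SHINGLE_SIZE; j++) {
                if (j > 0) {
                    sb.append(' ');
                }
                sb.append(words[i + j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Percentage of the query's word sequences that also occur in the candidate.
    static double overlapPercent(String query, String candidate) {
        Set<String> querySet = shingles(query);
        if (querySet.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<String>(querySet);
        shared.retainAll(shingles(candidate));
        return 100.0 * shared.size() / querySet.size();
    }

    // Assumption: a candidate is reported when the overlap reaches the threshold.
    static boolean isPossibleDuplicate(String query, String candidate, double thresholdPercent) {
        return overlapPercent(query, candidate) >= thresholdPercent;
    }

    public static void main(String[] args) {
        String query = "the quick brown fox jumps over";
        String candidate = "the quick brown fox jumps high";
        System.out.println(overlapPercent(query, candidate));
        System.out.println(isPossibleDuplicate(query, candidate, DEFAULT_THRESHOLD));
    }
}
```

With a lower threshold, more loosely related documents would be reported; with a higher one, only near-copies would be.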

To run the check, press the "run check" button. The result is displayed as human-readable output.


For comments, enhancements or bug reports, please send an e-mail to andre.schenk@fiz-karlsruhe.de.