
Signature extraction for overlap detection in documents

Published: 01 January 2002
    Abstract

    Easy access to the Web has led to increased potential for students cheating on assignments by plagiarising others' work. By the same token, Web-based tools offer the potential for instructors to check submitted assignments for signs of plagiarism. Overlap-detection tools are easy to use and accurate in plagiarism detection, so they can be an excellent deterrent to plagiarism. Documents can overlap for other reasons, too: old documents are superseded, and authors summarize previous work identically in several papers. Overlap-detection tools can pinpoint interconnections in a corpus of documents and could be used in search engines.

    We describe a web-accessible text registry based on signature extraction. We extract a small but diagnostic signature from each registered text for permanent storage and comparison against other stored signatures. This comparison allows us to estimate the amount of overlap between pairs of documents, although the total time required is linear in the total size of the documents. We compare our algorithm with several alternatives and present both efficiency and accuracy results.
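
    The abstract does not spell out how signatures are built, so the following is only a minimal sketch of the general idea (digest fixed-size word chunks, keep a small subset, compare the subsets), not the authors' algorithm. The function names, the chunk size, the culling rule (`value % keep_every == 0`), the SHA-256 digest, and the overlap formula are all assumptions chosen for illustration.

```python
import hashlib
import re


def signature(text, chunk_words=5, keep_every=4):
    """Return a small set of chunk digests standing in for the full text.

    chunk_words and keep_every are illustrative parameters, not values
    taken from the paper.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    digests = set()
    # One pass over the words, so the work is linear in the text size.
    for i in range(len(words) - chunk_words + 1):
        chunk = " ".join(words[i:i + chunk_words])
        raw = hashlib.sha256(chunk.encode("utf-8")).digest()
        value = int.from_bytes(raw[:8], "big")
        # Culling (assumed rule): keep a deterministic fraction of chunks
        # so the stored signature stays much smaller than the document.
        if value % keep_every == 0:
            digests.add(value)
    return digests


def estimated_overlap(sig_a, sig_b):
    """Fraction of the smaller signature that also occurs in the other."""
    if not sig_a or not sig_b:
        return 0.0
    return len(sig_a & sig_b) / min(len(sig_a), len(sig_b))
```

    Because the culling rule depends only on the chunk's digest, the same passage keeps the same chunks in every document that contains it, which is what lets stored signatures be compared without revisiting the original texts.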

    Cited By

    • (2020) Evaluation of Fingerprint Selection Algorithms for Local Text Reuse Detection. Applied Computer Systems 25:1 (11-18). DOI: 10.2478/acss-2020-0002. Online publication date: 5-Jun-2020
    • (2009) An evolutionary neural network approach to intrinsic plagiarism detection. Proceedings of the 20th Irish conference on Artificial intelligence and cognitive science (33-40). DOI: 10.5555/1939047.1939055. Online publication date: 19-Aug-2009

    Published In

    Australian Computer Science Communications  Volume 24, Issue 1
    January-February 2002
    320 pages

    Publisher

    IEEE Computer Society Press

    Washington, DC, United States

    Author Tag

    1. plagiarism document overlap culling digest

    Qualifiers

    • Article
