DOI: 10.5555/382006.383197
Article

dSCAM: finding document copies across multiple databases

Published: 01 December 1996
  • Abstract

    The advent of the Internet has made the illegal dissemination of copyrighted material easy. An important problem is how to automatically detect when a “new” digital document is “suspiciously close” to existing ones. The SCAM project at Stanford University has addressed this problem when there is a single registered-document database. However, in practice, text documents may appear in many autonomous databases, and one would like to discover copies without having to exhaustively search in all databases. Our approach, dSCAM, is a distributed version of SCAM that keeps succinct metainformation about the contents of the available document databases. Given a suspicious document S, dSCAM uses its information to prune all databases that cannot contain any document that is close enough to S, and hence the search can focus on the remaining sites. We also study how to query the remaining databases so as to minimize different querying costs. We empirically study the pruning and searching schemes, using a collection of 50 databases and two sets of test documents.
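The pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not say what the metainformation contains or how closeness is measured, so everything here is an assumption made for illustration: each database is summarized by the set of words it contains, closeness is the fraction of the suspicious document's distinct words that a document shares, and the names `build_summary`, `overlap_upper_bound`, and `candidate_databases` are hypothetical. Because no single document can share more words with S than the whole database vocabulary does, a database whose vocabulary overlap falls below the threshold can be skipped safely.

```python
from typing import Dict, List, Set


def words(text: str) -> Set[str]:
    """Distinct lowercased, whitespace-separated words of a text."""
    return set(text.lower().split())


def build_summary(documents: List[str]) -> Set[str]:
    """Assumed metainformation: the union of all words in one database."""
    summary: Set[str] = set()
    for doc in documents:
        summary |= words(doc)
    return summary


def overlap_upper_bound(suspicious: str, summary: Set[str]) -> float:
    """Upper bound on the fraction of S's distinct words that any single
    document in the summarized database could share with S."""
    s_words = words(suspicious)
    if not s_words:
        return 0.0
    return len(s_words & summary) / len(s_words)


def candidate_databases(
    suspicious: str, summaries: Dict[str, Set[str]], threshold: float
) -> List[str]:
    """Keep only databases that could hold a document close enough to S."""
    return [
        name
        for name, summary in summaries.items()
        if overlap_upper_bound(suspicious, summary) >= threshold
    ]


# Illustrative data only; these databases and documents are made up.
databases = {
    "db_a": ["the quick brown fox jumps", "lazy dogs sleep all day"],
    "db_b": ["stock prices rose sharply today"],
}
summaries = {name: build_summary(docs) for name, docs in databases.items()}
remaining = candidate_databases("the quick brown fox sleeps", summaries, 0.5)
print(remaining)
```

Only the databases left in `remaining` would then be queried in full. A summary this simple is conservative: it never prunes a database that holds a close document under the assumed measure, but it may keep databases that hold none.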

    Published In

    DIS '96: Proceedings of the fourth international conference on Parallel and distributed information systems
    December 1996
    295 pages
    ISBN:081867475X
    Chairman: Wei Sun

    Publisher

    IEEE Computer Society

    United States

    Author Tags

    1. Internet
    2. SCAM project
    3. autonomous database
    4. copyright
    5. dSCAM
    6. database pruning
    7. database querying
    8. digital document
    9. document copy finding
    10. illegal copyrighted material dissemination
    11. multiple databases
    12. querying costs
    13. searching schemes
    14. succinct metainformation
    15. test documents

    Qualifiers

    • Article

    Cited By

    • (2011) Hypergeometric language models for republished article finding. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 485-494. DOI: 10.1145/2009916.2009983. Online publication date: 24-Jul-2011
    • (2011) Large-scale copy detection. Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 1205-1208. DOI: 10.1145/1989323.1989454. Online publication date: 12-Jun-2011
    • (2009) Do not crawl in the DUST. ACM Transactions on the Web (TWEB) 3:1, pp. 1-31. DOI: 10.1145/1462148.1462151. Online publication date: 17-Jan-2009
    • (2007) Do not crawl in the dust. Proceedings of the 16th international conference on World Wide Web, pp. 111-120. DOI: 10.1145/1242572.1242588. Online publication date: 8-May-2007
    • (2003) Plagiarism detection of text using knowledge-based techniques. Design and application of hybrid intelligent systems, pp. 973-982. DOI: 10.5555/998038.998145. Online publication date: 1-Jan-2003
    • (2003) Analysis of source identified text corpora. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pp. 383-390. DOI: 10.3115/1075096.1075145. Online publication date: 7-Jul-2003
