Identifying and Filtering Near-Duplicate Documents

Author:

Andrei Z. Broder

COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching

Pages 1 - 10

Published: 21 June 2000

Abstract

The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size "sketch" for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document.
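For orientation, the resemblance referred to here is usually stated as the Jaccard coefficient of the two documents' shingle sets. The abstract itself does not spell out the formula, so the notation below (S(D, w) for the set of contiguous w-token subsequences of document D, and r_w for the resemblance) is an assumption borrowed from the standard formulation:

```latex
\[
  r_w(A, B) \;=\; \frac{\lvert S(A, w) \cap S(B, w) \rvert}{\lvert S(A, w) \cup S(B, w) \rvert},
  \qquad 0 \le r_w(A, B) \le 1 .
\]
```

A value of 1 means the two documents have identical shingle sets, and a value of 0 means they share no shingle.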

However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a "sample" of less than 50 bytes per document.

The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest.
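The two aspects described above can be illustrated with a minimal Python sketch. It is not the implementation from the talk: the function names, the shingle width `w=4`, the 64 samples, the threshold of 0.5, and the use of seeded BLAKE2b hashes as a stand-in for random permutations are all assumptions made for illustration. It also does not reproduce the sub-50-byte "sample" mentioned in the abstract; it only shows the basic idea of estimating the relative size of a set intersection from per-document samples computed independently.

```python
import hashlib

def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def _hash(shingle, seed):
    """64-bit seeded hash of a shingle (assumed stand-in for a random permutation)."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(text, num_samples=64, w=4):
    """Fixed-size sketch: for each seed, the minimum hash over the shingle set.

    Each document is processed on its own; no other document is needed.
    """
    s = shingles(text, w)
    return [min(_hash(sh, seed) for sh in s) for seed in range(num_samples)]

def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of sample positions on which the two sketches agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

def exact_resemblance(text_a, text_b, w=4):
    """Exact ratio |intersection| / |union| of the shingle sets, for comparison."""
    sa, sb = shingles(text_a, w), shingles(text_b, w)
    return len(sa & sb) / len(sa | sb)

def is_near_duplicate(sketch_a, sketch_b, threshold=0.5):
    """Threshold test: is the estimated resemblance at or above the threshold?"""
    return estimate_resemblance(sketch_a, sketch_b) >= threshold
```

For an ideal random permutation, the probability that two sets share the same minimum element equals the ratio of their intersection to their union, so the fraction of agreeing positions estimates the resemblance. With 64 samples of 8 bytes each, this sketch occupies 512 bytes per document, which is the "few hundred bytes" regime rather than the smaller sample the talk describes.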

The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.



Published In

COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching

June 2000, 421 pages

ISBN: 3540676333

Publisher: Springer-Verlag, Berlin, Heidelberg
