Finding replicated Web collections

Authors: Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data

Pages 355 - 366

https://doi.org/10.1145/342009.335429

Published: 16 May 2000

Abstract

Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.

Index Terms

Finding replicated Web collections
1. Information systems

Published In

SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data

May 2000

604 pages

ISBN:1581132174

DOI:10.1145/342009

Chairmen: Maggie Dunham (Southern Methodist Univ.), Jeffrey F. Naughton (Univ. of Wisconsin-Madison), Weidong Chen (Southern Methodist Univ.), Nick Koudas (AT&T Labs)

ACM SIGMOD Record Volume 29, Issue 2
June 2000
609 pages
ISSN:0163-5808
DOI:10.1145/335191
Editors: Weidong Chen (Southern Methodist Univ., Dallas, TX), Jeffrey Naughton (Univ. of Wisconsin-Madison, Madison), Philip A. Bernstein (Microsoft)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS00

Sponsor:

SIGMOD

SIGMOD/PODS00: ACM International Conference on Management of Data and Symposium on Principles of Database Systems

May 15 - 18, 2000

Dallas, Texas, USA

Acceptance Rates

SIGMOD '00 Paper Acceptance Rate 42 of 248 submissions, 17%;

Overall Acceptance Rate 695 of 3,542 submissions, 20%
