DOI: 10.1145/342009.335429

Finding replicated Web collections

Published: 16 May 2000

Abstract

Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
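
The abstract does not spell out the detection method, so the following is only a minimal sketch of one common baseline, not the paper's algorithm: fingerprint each page's normalized text and group URLs that share a fingerprint. The function names (`normalize`, `fingerprint`, `group_replicas`), the choice of SHA-1, and the toy URLs are all assumptions made for illustration. This catches only exact textual replicas; near-duplicates and hyperlinked collections need more than this.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def normalize(html: str) -> str:
    """Strip tags, lowercase, and collapse whitespace (a crude assumption-level cleanup)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.lower().split())


def fingerprint(html: str) -> str:
    """Hash the normalized text; identical text yields an identical fingerprint."""
    return hashlib.sha1(normalize(html).encode("utf-8")).hexdigest()


def group_replicas(pages: Dict[str, str]) -> List[List[str]]:
    """Return groups of URLs (size >= 2) whose pages share a fingerprint."""
    buckets = defaultdict(list)
    for url, html in pages.items():
        buckets[fingerprint(html)].append(url)
    return [sorted(urls) for urls in buckets.values() if len(urls) > 1]


# Hypothetical toy input; the URLs and page bodies are made up for illustration.
pages = {
    "http://a.example/faq.html": "<html><body><h1>Java FAQ</h1> Q1 ...</body></html>",
    "http://b.example/mirror/faq.html": "<html><body><h1>Java  FAQ</h1>\nQ1 ...</body></html>",
    "http://c.example/other.html": "<html><body>Linux manual</body></html>",
}
replica_groups = group_replicas(pages)
print(replica_groups)
```

Grouping by a hash bucket avoids comparing every pair of pages, which matters at the scale the abstract mentions (tens of millions of pages), though a real system at that scale would also need to keep the buckets out of main memory.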

Published In

SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data
May 2000
604 pages
ISBN: 1581132174
DOI: 10.1145/342009

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS00
Acceptance Rates

SIGMOD '00 Paper Acceptance Rate 42 of 248 submissions, 17%;
Overall Acceptance Rate 695 of 3,542 submissions, 20%

Cited By

  • (2019) Precise Detection of Content Reuse in the Web. ACM SIGCOMM Computer Communication Review 49:2 (9-24). DOI: 10.1145/3336937.3336940. Online publication date: 21-May-2019
  • (2018) Framework for proficient proof of identity of duplicate and near-duplicate images and image distances using high-disguisable image fragment. 2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC) (102-104). DOI: 10.1109/PDGC.2018.8745792. Online publication date: Dec-2018
  • (2018) Graph Pattern Matching Preserving Label-Repetition Constraints. Model and Data Engineering (268-281). DOI: 10.1007/978-3-030-00856-7_17. Online publication date: 13-Sep-2018
  • (2017) Search by Screenshots for Universal Article Clipping in Mobile Apps. ACM Transactions on Information Systems 35:4 (1-29). DOI: 10.1145/3091107. Online publication date: 23-Jun-2017
  • (2016) UniClip: Leveraging Web Search for Universal Clipping of Articles on Mobile. Data Science and Engineering 1:2 (101-113). DOI: 10.1007/s41019-016-0012-2. Online publication date: 18-Jul-2016
  • (2015) Efficient subgraph join based on connectivity similarity. World Wide Web 18:4 (871-887). DOI: 10.1007/s11280-014-0286-0. Online publication date: 1-Jul-2015
  • (2014) Content sharing in information storage and retrieval system using tree representation of documents. 2014 Conference on IT in Business, Industry and Government (CSIBIG) (1-4). DOI: 10.1109/CSIBIG.2014.7056941. Online publication date: Mar-2014
  • (2011) Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems 36:3 (1-41). DOI: 10.1145/2000824.2000825. Online publication date: 26-Aug-2011
  • (2010) Based on semantic web similarity. 2010 3rd International Conference on Computer Science and Information Technology (327-330). DOI: 10.1109/ICCSIT.2010.5564990. Online publication date: Jul-2010
  • (2010) Analysis of Duplicated Web Pages Identification Methods in Search Engine. 2010 2nd International Workshop on Database Technology and Applications (1-5). DOI: 10.1109/DBTA.2010.5659105. Online publication date: Nov-2010
