
Structural and visual comparisons for web page archiving

Published: 04 September 2012

Abstract

In this paper, we propose a Web page archiving system that combines state-of-the-art comparison methods based on the source code of Web pages with computer vision techniques. To detect whether successive versions of a Web page are similar, our system relies on: (1) a combination of structural and visual comparison methods embedded in a statistical discriminative model; (2) a visual similarity measure designed for Web pages that improves change detection; and (3) a supervised feature selection method adapted to Web archiving. We train a Support Vector Machine model on vectors of similarity scores between successive versions of pages. The trained model then determines whether a pair of versions, described by its vector of similarity scores, is similar. Experiments on real archives validate our approach.
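The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear kernel, a plain subgradient-descent trainer, and a hypothetical two-dimensional score vector (one structural score, one visual score). The training data, the function names `train_linear_svm` and `is_similar`, and all hyperparameters are invented for illustration.

```python
# Toy sketch: a linear SVM over similarity-score vectors.
# All data, names, and hyperparameters here are hypothetical.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss.

    X: list of similarity-score vectors, one per pair of page versions.
    y: labels, +1 (versions similar) or -1 (versions dissimilar).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            # The L2 penalty shrinks the weights at every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # The hinge loss contributes only when the margin is violated.
            if label * score < 1.0:
                w = [wj + lr * label * xj for wj, xj in zip(w, x)]
                b += lr * label
    return w, b


def is_similar(w, b, x):
    """Return True when the trained model labels the pair as similar."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0


# Hypothetical training set: [structural_score, visual_score], each in [0, 1].
X_train = [
    [0.95, 0.90], [0.90, 0.85], [0.88, 0.93],  # pairs judged similar
    [0.30, 0.40], [0.20, 0.50], [0.45, 0.35],  # pairs judged dissimilar
]
y_train = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X_train, y_train)
print(is_similar(w, b, [0.97, 0.96]))  # a pair with high scores on both axes
print(is_similar(w, b, [0.10, 0.20]))  # a pair with low scores on both axes
```

On this toy data the first pair should be labeled similar and the second dissimilar. A real system would use many more score dimensions, and the feature selection step mentioned in the abstract would decide which of them to keep.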

Published In

DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
September 2012
256 pages
ISBN:9781450311168
DOI:10.1145/2361354

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. change detection algorithms
  2. digital preservation
  3. pattern recognition
  4. support vector machines
  5. web archiving

Qualifiers

  • Research-article

Conference

DocEng '12
DocEng '12: ACM Symposium on Document Engineering
September 4 - 7, 2012
Paris, France

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Cited By

  • (2022) Webpage retrieval based on query by example for think tank construction. Information Processing and Management: an International Journal 59:1. DOI: 10.1016/j.ipm.2021.102767. Online publication date: 1-Jan-2022
  • (2017) Learning a Distance Metric from Relative Comparisons between Quadruplets of Images. International Journal of Computer Vision 121:1 (65-94). DOI: 10.1007/s11263-016-0923-4. Online publication date: 1-Jan-2017
  • (2015) Scalable decision support for digital preservation: an assessment. OCLC Systems & Services: International digital library perspectives 31:1 (11-34). DOI: 10.1108/OCLC-06-2014-0026. Online publication date: 9-Feb-2015
  • (2014) On cloud deployment of digital preservation environments. Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (443-444). DOI: 10.5555/2740769.2740859. Online publication date: 8-Sep-2014
  • (2014) On Cloud deployment of digital preservation environments. IEEE/ACM Joint Conference on Digital Libraries (443-444). DOI: 10.1109/JCDL.2014.6970216. Online publication date: Sep-2014
  • (2014) Scalable decision support for digital preservation. OCLC Systems & Services: International digital library perspectives 30:4 (249-284). DOI: 10.1108/OCLC-06-2014-0025. Online publication date: 10-Nov-2014
