research-article

Validation of an Improved Vision-Based Web Page Parsing Pipeline

Authors:

Michael Cormier,

Shangshang Zheng

ACM Transactions on the Web, Volume 18, Issue 3

Article No.: 36, Pages 1 - 23

https://doi.org/10.1145/3580519

Published: 21 August 2024

Abstract

In this article, we present a novel approach to the quantitative evaluation of a model for parsing web pages as visual images. The model is intended to help users with assistive needs, such as cognitive or visual deficits, by enabling decluttering or zooming and by supporting more effective screen reader output. This segmentation-classification pipeline is tested in stages. We first discuss the validation of the segmentation algorithm, showing that our approach produces automated segmentations that are very similar to those produced by real users who use a drawing interface to designate edges and regions. We also examine the properties of these ground truth segmentations produced under different conditions. We then describe our Hidden Markov tree approach for classification and present results that serve to provide important validation for this model. The analysis is set against effective choices for dataset and pruning options, measured with respect to manual ground truth labelling of regions. In all, we offer a detailed quantitative validation (focused on complex news pages) of a fully pipelined approach for interpreting web pages as visual images, an approach that enables important advances for users with assistive needs.


Published In

ACM Transactions on the Web Volume 18, Issue 3

August 2024

254 pages

EISSN:1559-114X

DOI:10.1145/3613679

Editor:
Ryen White
Microsoft Research, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2024

Online AM: 21 January 2023

Accepted: 06 December 2022

Revised: 28 October 2022

Received: 11 January 2021
