research-article

Validation of an Improved Vision-Based Web Page Parsing Pipeline

Published: 21 August 2024

Abstract

In this article, we present a novel approach to the quantitative evaluation of a model for parsing web pages as visual images, intended to benefit users with assistive needs (cognitive or visual deficits) by enabling decluttering or zooming and by supporting more effective screen reader output. This segmentation-classification pipeline is tested in stages. We first discuss the validation of the segmentation algorithm, showing that our approach produces automated segmentations that are very similar to those produced by real users who designate edges and regions with a drawing interface. We also examine the properties of these ground truth segmentations produced under different conditions. We then describe our Hidden Markov tree approach for classification and present results that serve to validate this model. The analysis is set against effective choices for dataset and pruning options, measured with respect to manual ground truth labelling of regions. In all, we offer a detailed quantitative validation (focused on complex news pages) of a fully pipelined approach for interpreting web pages as visual images, an approach which enables important advances for users with assistive needs.
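
To make the classification stage more concrete, the following is a minimal sketch of how labels might be decoded on a hidden Markov tree. It is not the authors' model: the max-product decoding, the function name `map_labels`, the two-label setup, and all probabilities are assumptions chosen purely for illustration. Each node stands for a region of the segmentation tree, `lik` holds an assumed per-region likelihood for each label, and `trans` holds an assumed probability of a child's label given its parent's label.

```python
from typing import Dict, List


def map_labels(
    children: Dict[str, List[str]],
    root: str,
    prior: List[float],
    trans: List[List[float]],
    lik: Dict[str, List[float]],
) -> Dict[str, int]:
    """Most probable joint labelling of a rooted tree (max-product decoding).

    children: node -> list of child nodes (leaves may be absent)
    prior:    prior[s] is the probability that the root has label s
    trans:    trans[sp][sc] is P(child label sc | parent label sp)
    lik:      lik[n][s] is the likelihood of node n's features under label s
    All inputs here are hypothetical; nothing is taken from the article.
    """
    k = len(prior)
    best: Dict[str, List[float]] = {}   # best[n][s]: best subtree score if n has label s
    choice: Dict[str, List[int]] = {}   # choice[c][sp]: best label of c if its parent has sp

    def upward(n: str) -> None:
        score = list(lik[n])
        for c in children.get(n, []):
            upward(c)
            choice[c] = []
            for sp in range(k):
                cand = [trans[sp][sc] * best[c][sc] for sc in range(k)]
                j = max(range(k), key=cand.__getitem__)
                choice[c].append(j)
                score[sp] *= cand[j]
        best[n] = score

    upward(root)

    # Pick the root label, then follow the stored choices down the tree.
    root_scores = [prior[s] * best[root][s] for s in range(k)]
    labels = {root: max(range(k), key=root_scores.__getitem__)}
    stack = [root]
    while stack:
        n = stack.pop()
        for c in children.get(n, []):
            labels[c] = choice[c][labels[n]]
            stack.append(c)
    return labels


# Toy example: region "b" weakly prefers label 1 on its own evidence,
# but the tree structure pulls it towards its parent's label.
tree = {"page": ["a", "b"]}
labels = map_labels(
    children=tree,
    root="page",
    prior=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    lik={"page": [0.5, 0.5], "a": [0.9, 0.1], "b": [0.4, 0.6]},
)
print(labels)
```

In this toy case the strong evidence at region `a` and the assumed tendency of children to share their parent's label outweigh the weak local evidence at region `b`. That is the kind of contextual effect a tree-structured model can capture and a per-region classifier cannot.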

Published In

ACM Transactions on the Web  Volume 18, Issue 3
August 2024
254 pages
EISSN:1559-114X
DOI:10.1145/3613679

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2024
Online AM: 21 January 2023
Accepted: 06 December 2022
Revised: 28 October 2022
Received: 11 January 2021
Published in TWEB Volume 18, Issue 3

Author Tags

  1. Web page segmentation
  2. web page region classification
  3. Bayesian approach
  4. assistive technology for the visually impaired and persons with cognitive deficit

Qualifiers

  • Research-article
