
Validation of an improved vision-based web page parsing pipeline

Online AM: 21 January 2023

    Abstract

    In this paper, we present a novel approach to the quantitative evaluation of a model for parsing web pages as visual images. The model is intended to benefit users with assistive needs, such as cognitive or visual deficits, by enabling decluttering or zooming and by supporting more effective screen reader output. This segmentation-classification pipeline is tested in stages. We first discuss the validation of the segmentation algorithm, showing that our approach produces automated segmentations that are very similar to those produced by real users who designate edges and regions through a drawing interface. We also examine the properties of these ground truth segmentations produced under different conditions. We then describe our hidden Markov tree approach for classification and present results which provide important validation for this model. The analysis is set against effective choices for dataset and pruning options, measured with respect to manual ground truth labelling of regions. In all, we offer a detailed quantitative validation (focused on complex news pages) of a fully pipelined approach for interpreting web pages as visual images, an approach which enables important advances for users with assistive needs.

    Information

    Published In

    ACM Transactions on the Web Just Accepted
    ISSN:1559-1131
    EISSN:1559-114X
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Online AM: 21 January 2023
    Accepted: 06 December 2022
    Revised: 28 October 2022
    Received: 11 January 2021

    Author Tags

    1. web page segmentation
    2. web page region classification
    3. Bayesian approach
    4. assistive technology for the visually impaired and persons with cognitive deficit

    Qualifiers

    • Research-article
    • Refereed

    Bibliometrics

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 214
    • Downloads (Last 12 months): 114
    • Downloads (Last 6 weeks): 18

    Reflects downloads up to 09 Aug 2024