DOI: 10.1145/2034691.2034721
research-article

A versatile model for web page representation, information extraction and content re-packaging

Published: 19 September 2011

Abstract

On today's Web, designers invest considerable effort in creating visually rich websites that feature a multitude of interactive elements. In contrast, most web information extraction (WIE) algorithms are still based on attributed tree methods, which struggle to deal with this complexity. In this paper, we introduce a versatile model for representing web documents. The model is based on gestalt theory principles and aims to capture the most important aspects of a page in a formally exact way. It (i) represents and unifies access to visual layout, content, and functional aspects; and (ii) is implemented with semantic web techniques that can be leveraged for, e.g., automatic reasoning. Considering the visual appearance of a web page, we view it as a collection of gestalt figures, built from gestalt primitives, each representing a specific design pattern such as a navigation menu or a news article. Based on this model, we introduce our WIE methodology, a re-engineering process involving design patterns, statistical distributions, and text content properties. The complete framework consists of the UOM model, which formalizes the mentioned components, and the MANM layer, which provides hints about structure and serialization and thus the foundations for document re-packaging. Finally, we discuss how we have applied and evaluated our model in the area of web accessibility.
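
As a rough illustration of what "viewing a page as a collection of gestalt figures" could mean in practice, the following minimal sketch groups rendered boxes by the gestalt principle of proximity and then attaches a toy design-pattern label to each group. It is not the UOM model, the MANM layer, or the authors' implementation: the `Box` type, the `role` tag, the `max_gap` threshold, and the labeling rule are all hypothetical names and assumptions chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rendered page element combining layout, content, and function.

    All fields are illustrative assumptions, not taken from the paper.
    """
    x: float
    y: float
    width: float
    height: float
    text: str = ""
    role: str = "text"  # hypothetical functional tag, e.g. "link" or "text"


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical whitespace between two boxes (0.0 if they overlap vertically)."""
    top, bottom = sorted((a, b), key=lambda box: box.y)
    return max(0.0, bottom.y - (top.y + top.height))


def group_by_proximity(boxes, max_gap=10.0):
    """Group boxes top to bottom; a gap larger than max_gap starts a new group."""
    groups = []
    for box in sorted(boxes, key=lambda b: b.y):
        if groups and vertical_gap(groups[-1][-1], box) <= max_gap:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


def label_figure(group):
    """Toy pattern rule: a group made mostly of links is called a navigation menu."""
    links = sum(1 for box in group if box.role == "link")
    return "navigation menu" if links * 2 > len(group) else "article"


if __name__ == "__main__":
    page = [
        Box(0, 0, 100, 15, "Home", "link"),
        Box(0, 20, 100, 15, "News", "link"),
        Box(0, 40, 100, 15, "Contact", "link"),
        Box(0, 200, 400, 100, "Headline and lead paragraph"),
        Box(0, 305, 400, 50, "Body text"),
    ]
    for figure in group_by_proximity(page):
        print(label_figure(figure), [box.text for box in figure])
```

A real system would consider horizontal layout, visual similarity, and further gestalt principles as well; the sketch uses only vertical proximity to keep the idea visible.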

Published In

DocEng '11: Proceedings of the 11th ACM symposium on Document engineering
September 2011
296 pages
ISBN: 9781450308632
DOI: 10.1145/2034691
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. gestalt theory
2. web adaptation
3. web information extraction
4. web page model
5. web page understanding

Conference

DocEng '11: ACM Symposium on Document Engineering
September 19-22, 2011
Mountain View, California, USA

Acceptance Rates

Overall acceptance rate: 194 of 564 submissions (34%)
