A versatile model for web page representation, information extraction and content re-packaging

Authors:

Bernhard Krüpl-Sypien,

Ruslan R. Fayzrakhmanov,

Wolfgang Holzinger,

Mathias Panzenböck,

Robert Baumgartner

DocEng '11: Proceedings of the 11th ACM symposium on Document engineering

Pages 129 - 138

https://doi.org/10.1145/2034691.2034721

Published: 19 September 2011

Abstract

On today's Web, designers go to great lengths to create visually rich websites that boast a multitude of interactive elements. In contrast, most web information extraction (WIE) algorithms are still based on attributed tree methods, which struggle to deal with this complexity. In this paper, we introduce a versatile model for representing web documents. The model is based on gestalt theory principles and tries to capture the most important aspects of a page in a formally exact way. It (i) represents and unifies access to visual layout, content and functional aspects, and (ii) is implemented with semantic web techniques that can be leveraged for, e.g., automatic reasoning. Considering the visual appearance of a web page, we view it as a collection of gestalt figures, built from gestalt primitives, each representing a specific design pattern such as a navigation menu or a news article. Based on this model, we introduce our WIE methodology, a re-engineering process involving design patterns, statistical distributions and text content properties. The complete framework consists of the UOM model, which formalizes the mentioned components, and the MANM layer, which provides hints on structure and serialization and thereby lays the foundations for document re-packaging. Finally, we discuss how we have applied and evaluated our model in the area of web accessibility.

Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
2. Information systems
  1. Information retrieval

Published In

DocEng '11: Proceedings of the 11th ACM symposium on Document engineering

September 2011

296 pages

ISBN:9781450308632

DOI:10.1145/2034691

Conference Chair: Matthew Hardy, Adobe Systems, Inc., USA

Program Chair: Frank Wm. Tompa, University of Waterloo, Canada

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

In-Cooperation

SIGDOC: ACM Special Interest Group for Design of Communications

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

DocEng '11

Sponsor:

SIGWEB

DocEng '11: ACM Symposium on Document Engineering

September 19 - 22, 2011

Mountain View, California, USA

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%
