research-article

Lossless Separation of Web Pages into Layout Code and Data

Authors: Benny Kimelfeld, Sharon Shoham

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1805 - 1814

https://doi.org/10.1145/2939672.2939858

Published: 13 August 2016

Abstract

A modern web page is often served by running layout code on data, producing an HTML document that enhances the data with front and back matter and with layout and style operations. In this paper, we consider the opposite task: separating a given web page into a data component and a layout program. This separation has several important applications: the page encoding may be significantly more compact (reducing web traffic), the data representation is normalized across web designs (facilitating wrapping, retrieval, and extraction), and repetitions are diminished (expediting site updates and redesign).
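To make the forward direction concrete, the following is a minimal sketch of layout code running on data to produce a page. The `data` dictionary, the `layout` function, and the markup are all hypothetical illustrations and are not taken from the paper; the separation task starts from `page` alone and tries to recover something like `data` and `layout`.

```python
import json
from html import escape

# Hypothetical data component: only the values that vary between records.
data = {
    "title": "Pioneers",
    "people": [["Ada", 1815], ["Alan", 1912], ["Grace", 1906]],
}


def layout(d: dict) -> str:
    """Hypothetical layout code: adds front/back matter and per-record markup."""
    rows = "".join(
        f"<li><b>{escape(name)}</b> <i>{year}</i></li>"
        for name, year in d["people"]
    )
    return f"<html><body><h1>{escape(d['title'])}</h1><ul>{rows}</ul></body></html>"


# The served page repeats the <li>/<b>/<i> markup once per record,
# whereas the layout code states that markup only once.
page = layout(data)
print(len(page), len(json.dumps(data)))
```

The printed lengths compare the served page with the serialized data alone; they do not account for the size of the layout code itself.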

We present a framework for defining the separation task, and devise an algorithm for synthesizing layout code from a web page while distilling its data in a lossless manner. The main idea is to synthesize layout code hierarchically for parts of the page, and use a combined program-data representation cost to decide whether to align intermediate programs. When intermediate programs are aligned, they are transformed into a single program, possibly with loops and conditionals. At the same time, differences between the aligned programs are captured by the data component such that executing the layout code on the data results in the original page.
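The idea of aligning intermediate programs under a combined program-data cost can be illustrated with a toy sketch. It assumes records are already tokenized into equal-length lists and uses a simple character-count cost; the names `align`, `execute`, `cost`, and `separate` are hypothetical, and this is not the authors' algorithm, which also handles hierarchy and conditionals.

```python
from typing import List, Optional, Tuple

HOLE = None  # marks a template position whose text comes from the data

Template = List[Optional[str]]
Row = List[str]


def align(records: List[List[str]]) -> Tuple[Template, List[Row]]:
    """Merge equal-length token lists into one template plus per-record data."""
    width = len(records[0])
    if any(len(r) != width for r in records):
        raise ValueError("this toy aligner only handles equal-length records")
    template: Template = []
    for column in zip(*records):
        # Tokens shared by every record become layout; the rest become holes.
        template.append(column[0] if len(set(column)) == 1 else HOLE)
    data = [[tok for tok, slot in zip(r, template) if slot is HOLE] for r in records]
    return template, data


def execute(template: Template, data: List[Row]) -> List[List[str]]:
    """Run the layout loop over the data rows, refilling each hole."""
    out = []
    for row in data:
        values = iter(row)
        out.append([next(values) if slot is HOLE else slot for slot in template])
    return out


def cost(tokens: List[str]) -> int:
    """Toy representation cost: total characters, plus one per token."""
    return sum(len(t) + 1 for t in tokens)


def separate(records: List[List[str]]):
    """Align only if the combined program-data cost beats keeping records verbatim."""
    template, data = align(records)
    program_cost = sum(1 if s is HOLE else len(s) + 1 for s in template)
    aligned = program_cost + sum(cost(row) for row in data)
    verbatim = sum(cost(r) for r in records)
    return (template, data) if aligned < verbatim else None


records = [
    ["<li>", "<b>", "Ada", "</b>", "1815", "</li>"],
    ["<li>", "<b>", "Alan", "</b>", "1912", "</li>"],
]
result = separate(records)
assert result is not None
template, data = result
assert execute(template, data) == records  # lossless round trip
```

In this sketch the shared tags end up in the template and only the names and years end up in the data, and the final assertion checks that executing the template on the data reproduces the input exactly. When records share nothing, `separate` returns `None`, since aligning them would not lower the combined cost.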

We have implemented our approach and conducted a thorough experimental study of its effectiveness. Our experiments show that our approach achieves state-of-the-art (or better) performance in both size compression and record extraction.

Index Terms

Lossless Separation of Web Pages into Layout Code and Data
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Data compression
  2. World Wide Web
    1. Web mining
      1. Data extraction and integration
2. Software and its engineering
  1. Software notations and tools
    1. Context specific languages
      1. Programming by example

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2016

2176 pages

ISBN: 9781450342322

DOI: 10.1145/2939672

General Chairs: Balaji Krishnapuram (IBM), Mohak Shah (Bosch)

Program Chairs: Alex Smola (Amazon), Charu Aggarwal (IBM), Dou Shen (Baidu), Rajeev Rastogi (Amazon)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Israeli Science Foundation
European Union's Seventh Framework Programme
Israel Ministry of Science and Technology

Conference

KDD '16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 13 - 17, 2016

San Francisco, California, USA

Acceptance Rates

KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
