DOI: 10.1145/2939672.2939858
research-article

Lossless Separation of Web Pages into Layout Code and Data

Published: 13 August 2016

Abstract

A modern web page is often served by running layout code on data, producing an HTML document that enhances the data with front and back matter and with layout and style operations. In this paper, we consider the opposite task: separating a given web page into a data component and a layout program. This separation has several important applications: page encoding may become significantly more compact (reducing web traffic), data representation is normalized across web designs (facilitating wrapping, retrieval, and extraction), and repetitions are reduced (expediting site updates and redesign).
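To make the separation concrete, here is a minimal sketch of what a data component and a layout program could look like for a tiny page. The `data` dictionary, the `layout` function, and the sample markup are all hypothetical illustrations, not the representation used in the paper.

```python
# Hypothetical data component: only the values that vary across records.
data = {"title": "Books", "items": [["Dune", "9.99"], ["Emma", "4.50"]]}

def layout(d):
    """Hypothetical layout program: constant markup plus one loop over records."""
    rows = "".join(f"<li><b>{name}</b> ${price}</li>" for name, price in d["items"])
    return f"<html><h1>{d['title']}</h1><ul>{rows}</ul></html>"

# The page as it would be served.
page = ("<html><h1>Books</h1><ul>"
        "<li><b>Dune</b> $9.99</li>"
        "<li><b>Emma</b> $4.50</li>"
        "</ul></html>")

# Lossless: running the layout code on the data reproduces the page exactly.
assert layout(data) == page
```

In this toy, the repeated `<li>` markup appears once in the layout code, and a redesign would only touch `layout`, leaving `data` unchanged.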
We present a framework for defining the separation task, and devise an algorithm for synthesizing layout code from a web page while distilling its data in a lossless manner. The main idea is to synthesize layout code hierarchically for parts of the page, and use a combined program-data representation cost to decide whether to align intermediate programs. When intermediate programs are aligned, they are transformed into a single program, possibly with loops and conditionals. At the same time, differences between the aligned programs are captured by the data component such that executing the layout code on the data results in the original page.
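The cost-based alignment decision can be illustrated with a deliberately simplified sketch. It assumes flat token lists instead of hierarchical programs, substitutes Python's `difflib.SequenceMatcher` for the paper's alignment procedure, and uses a toy character-count cost; the function names `align`, `run`, `cost`, and `should_align` are hypothetical. It produces templates with holes only, without the loops and conditionals the paper describes.

```python
from difflib import SequenceMatcher

def align(a, b):
    """Merge two token lists into a template with None holes plus per-input data."""
    template, data_a, data_b = [], [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])   # shared tokens stay in the layout code
        else:
            template.append(None)       # a hole, filled from the data component
            data_a.append(a[i1:i2])
            data_b.append(b[j1:j2])
    return template, data_a, data_b

def run(template, data):
    """Execute the template on one data list, reproducing the original tokens."""
    fills = iter(data)
    out = []
    for tok in template:
        out.extend(next(fills) if tok is None else [tok])
    return out

def cost(tokens):
    """Toy representation cost (an assumption): characters, one unit per hole."""
    return sum(1 if t is None else len(t) for t in tokens)

def should_align(a, b):
    """Align only if the combined program-plus-data cost beats keeping both apart."""
    template, da, db = align(a, b)
    merged = cost(template) + sum(cost(x) for x in da + db)
    return merged < cost(a) + cost(b), template, da, db

a = ["<li>", "<b>", "Dune", "</b>", "</li>"]
b = ["<li>", "<b>", "Emma", "</b>", "</li>"]
ok, template, da, db = should_align(a, b)
assert ok and run(template, da) == a and run(template, db) == b
```

Two inputs with no shared tokens would collapse into a single hole, so the merged cost would exceed the separate cost and the sketch would decline to align them.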
We have implemented our approach and conducted a thorough experimental study of its effectiveness. Our experiments show that it matches or exceeds state-of-the-art performance in both size compression and record extraction.

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data extraction
  2. data mining
  3. json
  4. lossless
  5. program synthesis
  6. separation
  7. tree alignment
  8. wrapper induction

Qualifiers

  • Research-article

Funding Sources

  • Israeli Science Foundation
  • European Union's Seventh Framework Programme
  • Israel Ministry of Science and Technology

Conference

KDD '16

Acceptance Rates

KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions (6%)
Overall Acceptance Rate: 1,133 of 8,635 submissions (13%)

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 0
Reflects downloads up to 10 Oct 2024

Cited By

  • (2022) Web Record Extraction with Invariants. Proceedings of the VLDB Endowment 16:4 (959-972). DOI: 10.14778/3574245.3574276. Online publication date: 1-Dec-2022.
  • (2021) Transformation from Web Pages to Optimal Normal Form Database Schema Using a Conceptual Schema Approach. 2021 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST) (112-116). DOI: 10.1109/ICEAST52143.2021.9426306. Online publication date: 1-Apr-2021.
  • (2020) FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (1092-1102). DOI: 10.1145/3394486.3403153. Online publication date: 23-Aug-2020.
  • (2020) DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence 50:2 (271-295). DOI: 10.1007/s10489-019-01499-0. Online publication date: 1-Feb-2020.
  • (2018) Web Page Template and Data Separation for Better Maintainability. Web Information Systems Engineering – WISE 2018 (439-449). DOI: 10.1007/978-3-030-02922-7_30. Online publication date: 20-Oct-2018.
  • (2017) A System for Website Data Management in a Website Building System. Proceedings of the 18th International Conference on Computer Systems and Technologies (211-218). DOI: 10.1145/3134302.3134313. Online publication date: 23-Jun-2017.
  • (2017) Synthesis of Forgiving Data Extractors. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (385-394). DOI: 10.1145/3018661.3018740. Online publication date: 2-Feb-2017.
