DOI: 10.1145/3486622.3493938
research-article

Multi-Task Neural Sequence Labeling for Zero-Shot Cross-Language Boilerplate Removal

Published: 13 April 2022

Abstract

Although web pages are rich in resources, their main content is usually intertwined with advertisements, banners, navigation bars, footer copyright notices, and other template elements that are rarely of interest to users. In this paper, we study the problem of extracting the main content of a web page and removing the irrelevant information. The common solution is to classify each web page component as either boilerplate (noise) or main content. State-of-the-art approaches such as BoilerNet use neural sequence labeling to achieve an impressive score on the CleanEval EN dataset. However, BoilerNet uses only the top 50 HTML tags as input features, which does not fully exploit the available tag information. In addition, representing text content with the 1,000 most frequent words cannot effectively support a real-world environment in which web pages appear in multiple languages. We propose a multi-task learning framework based on two auxiliary tasks: depth prediction and position prediction. We explore HTML tag embedding for tag path representation learning. Further, we employ multilingual Bidirectional Encoder Representations from Transformers (BERT) for text content representation, so that the model can handle web pages regardless of their language. The experiments show, first, that HTML tag embedding and the multi-task learning framework achieve much higher scores than BoilerNet on the CleanEval EN dataset. Second, the pre-trained text block representation based on multilingual BERT degrades performance on the EN test sets; however, zero-shot experiments on three languages (Chinese, Japanese, and Thai) yield performance consistent with five-fold cross-validation on the respective language, which suggests that cross-lingual support in a single model is possible.
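
To make the inputs and auxiliary targets concrete, the following minimal sketch shows one way to turn a page into a sequence of text blocks, each carrying a tag path, a depth, and a position. It is an illustration, not the authors' pipeline: the abstract does not define how depth and position are measured, so the sketch assumes depth is the number of open ancestor tags and position is the block's index normalized to [0, 1]. The names `BlockExtractor` and `extract_blocks` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is code or styling, not page content.
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collects text blocks together with the tag path leading to each one."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # one dict per non-empty text block

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not any(t in SKIP_TAGS for t in self.stack):
            self.blocks.append({
                "text": text,
                "tag_path": list(self.stack),
                # Assumed definition: number of open ancestor tags.
                "depth": len(self.stack),
            })


def extract_blocks(html):
    """Return the page's text blocks in document order with a position label."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    n = len(parser.blocks)
    for i, block in enumerate(parser.blocks):
        # Assumed definition: index of the block normalized to [0, 1].
        block["position"] = i / (n - 1) if n > 1 else 0.0
    return parser.blocks


if __name__ == "__main__":
    page = ("<html><body><div id='nav'><a href='#'>Home</a></div>"
            "<p>Main <b>story</b> text.</p></body></html>")
    for block in extract_blocks(page):
        print(block)
```

In a setup like the one the abstract describes, the `tag_path` of each block would be fed to a tag embedding, the `text` to a multilingual encoder, and `depth` and `position` would serve as targets for the two auxiliary tasks alongside the main boilerplate/content label.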


Published In

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 2021
698 pages

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. boilerplate removal
  2. cross-lingual model
  3. multi-task learning
  4. tag embedding
  5. zero-shot learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
December 14 - 17, 2021
Melbourne, VIC, Australia
