Multi-Task Neural Sequence Labeling for Zero-Shot Cross-Language Boilerplate Removal

Authors: Yu-Hao Wu, Chia-Hui Chang

WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Pages 326 - 334

https://doi.org/10.1145/3486622.3493938

Published: 13 April 2022

Abstract

Although web pages are rich in resources, their content is usually interleaved with advertisements, banners, navigation bars, footer copyright notices, and other template elements that are often of no interest to users. In this paper, we study the problem of extracting the main content and removing irrelevant information from web pages. The common solution is to classify each web component as either boilerplate (noise) or main content. State-of-the-art approaches such as BoilerNet use neural sequence labeling to achieve an impressive score on the CleanEval EN dataset. However, BoilerNet uses only the top 50 HTML tags as input features, which does not fully exploit tag information. In addition, representing text content with only the 1,000 most frequent words cannot effectively support a real-world environment in which web pages appear in multiple languages. We propose a multi-task learning framework based on two auxiliary tasks: depth prediction and position prediction. We explore HTML tag embedding for tag path representation learning. Further, we employ multilingual Bidirectional Encoder Representations from Transformers (BERT) for text content representation so that the model can handle web pages in any language. Experiments show that HTML tag embedding and the multi-task learning framework achieve much higher scores than BoilerNet on the CleanEval EN dataset. The pre-trained text block representation based on multilingual BERT degrades performance on the EN test sets; however, zero-shot experiments on three languages (Chinese, Japanese, and Thai) achieve performance consistent with five-fold cross-validation on the respective language, which indicates that a single model could provide cross-lingual support.
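As a rough illustration of the multi-task setup described in the abstract, the sketch below derives per-block targets for the main task and the two auxiliary tasks, then combines the three losses into one objective. The abstract does not define how depth and position are computed or how the losses are weighted, so everything here is an assumption made for illustration: the `Block` class, the definition of depth as the number of tags on the tag path, the definition of position as the block's relative index in the page, and the weights `w_depth` and `w_position`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """One text block of a page, listed in document order (hypothetical type)."""
    tag_path: str    # e.g. "html/body/div/p"
    text: str
    is_content: int  # main-task label: 1 = main content, 0 = boilerplate


def auxiliary_targets(blocks: List[Block]) -> List[Dict[str, float]]:
    """Build main and auxiliary targets for every block of one page.

    Assumed definitions (not taken from the paper):
      depth    = number of tags on the block's tag path
      position = relative index of the block in the page, in [0, 1]
    """
    n = len(blocks)
    targets = []
    for i, block in enumerate(blocks):
        depth = len([tag for tag in block.tag_path.split("/") if tag])
        position = i / (n - 1) if n > 1 else 0.0
        targets.append(
            {"label": block.is_content, "depth": depth, "position": position}
        )
    return targets


def multitask_loss(main_loss: float, depth_loss: float, position_loss: float,
                   w_depth: float = 0.5, w_position: float = 0.5) -> float:
    """Weighted sum of the main loss and the two auxiliary losses.

    The weights are illustrative defaults, not values from the paper.
    """
    return main_loss + w_depth * depth_loss + w_position * position_loss


# A toy three-block page: navigation link, article paragraph, footer.
page = [
    Block("html/body/nav/ul/li/a", "Home", 0),
    Block("html/body/div/article/p", "Main story text", 1),
    Block("html/body/footer/p", "Copyright notice", 0),
]
targets = auxiliary_targets(page)
```

In a real model the three loss values would come from separate output heads that share one sequence encoder, so the auxiliary tasks shape the shared representation while only the content/boilerplate head is used at inference time.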

