research-article

Boilerplate Removal using a Neural Sequence Labeling Model

Authors: Jurek Leonhardt, Megha Khosla

WWW '20: Companion Proceedings of the Web Conference 2020

Pages 226 - 229

https://doi.org/10.1145/3366424.3383547

Published: 20 April 2020

Abstract

The extraction of main content from web pages is an important task for numerous applications, ranging from usability features, such as reader views for news articles in web browsers, to information retrieval and natural language processing. Existing approaches fall short because they rely on large numbers of hand-crafted features for classification. The resulting models are tailored to a specific distribution of web pages, e.g., from a certain time frame, and generalize poorly beyond it. We propose a neural sequence labeling model that does not rely on any hand-crafted features and takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension that highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model.
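
To make the input format concrete, the sketch below shows one way a page could be turned into a sequence of items made up only of HTML tags and words, which a sequence labeler could then mark as content or boilerplate. It is an illustration under stated assumptions, not the authors' preprocessing: the names `LeafSequenceParser`, `html_to_sequence`, and `VOID_TAGS` are hypothetical, and treating each text node as one sequence element is an assumption the abstract does not confirm.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, for illustration only).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LeafSequenceParser(HTMLParser):
    """Collects one (tag_path, words) pair per non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.leaves = []     # resulting sequence of (tag_path, words)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.split()
        in_code = "script" in self.open_tags or "style" in self.open_tags
        if words and not in_code:
            self.leaves.append((tuple(self.open_tags), words))


def html_to_sequence(html):
    """Returns the page as a list of (tag_path, words) pairs; no other features are used."""
    parser = LeafSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.leaves


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Title here</h1><p>Main text.</p></article></body></html>"
)
for tag_path, words in html_to_sequence(page):
    print("/".join(tag_path), words)
```

Each element carries only the enclosing tags and the words of one text node. A labeling model would consume such a sequence and emit one content/boilerplate label per element; the model itself is outside the scope of this sketch.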



Index Terms

Boilerplate Removal using a Neural Sequence Labeling Model
1. Theory of computation
  1. Theory and algorithms for application domains

Index terms have been assigned to the content through auto-classification.



Published In


WWW '20: Companion Proceedings of the Web Conference 2020

April 2020

854 pages

ISBN:9781450370240

DOI:10.1145/3366424

Editors:
Amal El Fallah Seghrouchni (Sorbonne University, France),
Gita Sukthankar (University of Central Florida, United States),
Tie-Yan Liu (Microsoft Research Asia, China),
Maarten van Steen (University of Twente, Netherlands)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference


WWW '20: The Web Conference 2020

April 20 - 24, 2020

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


