DOI: 10.1145/3366424.3383547
research-article

Boilerplate Removal using a Neural Sequence Labeling Model

Published: 20 April 2020

Abstract

The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing. Existing approaches fall short because they rely on large numbers of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g., from a certain time frame, but that generalize poorly. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model.
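
The abstract states only that the model takes the HTML tags and words of a page as input. As a rough illustration of what such an input sequence could look like, the sketch below flattens a page into an ordered list of text blocks, each paired with its chain of enclosing tags. This is an assumption-laden sketch, not the authors' preprocessing: the names `LeafSequenceParser` and `page_to_sequence` are hypothetical, the tag-path representation is one possible choice among many, and the labeling model itself (which would assign a content or boilerplate label to each element) is omitted.

```python
from html.parser import HTMLParser

# Elements that have no closing tag, so they never enter the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is never shown to the reader.
SKIPPED_TAGS = {"script", "style"}


class LeafSequenceParser(HTMLParser):
    """Collects (tag path, words) pairs in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, outermost first
        self.leaves = []  # one entry per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.split()
        if words and not SKIPPED_TAGS.intersection(self.stack):
            self.leaves.append((tuple(self.stack), words))


def page_to_sequence(html):
    """Return the page as a list of (tag path, words) tuples."""
    parser = LeafSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.leaves


if __name__ == "__main__":
    sample = ("<html><body><nav><a href='/'>Home</a></nav>"
              "<div><p>Main <b>content</b> here.</p></div></body></html>")
    for path, words in page_to_sequence(sample):
        print("/".join(path), words)
```

A sequence labeler would then consume this list element by element, so that the decision for one text block can depend on its neighbours rather than on hand-crafted per-block features.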




    Published In

    cover image ACM Conferences
    WWW '20: Companion Proceedings of the Web Conference 2020
    April 2020
    854 pages
    ISBN:9781450370240
    DOI:10.1145/3366424


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '20
    WWW '20: The Web Conference 2020
    April 20 - 24, 2020
    Taipei, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


