OpenPSS: An Open Page Stream Segmentation Benchmark

Heusden, Ruben van; Kamps, Jaap; Marx, Maarten

doi:10.1007/978-3-031-72437-4_24

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15177)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

In recent years, an increasing number of companies and institutions have begun digitizing their physical records to promote digital access and searchability of their collections. For cost-efficiency, documents are often scanned consecutively, resulting in large PDF files that each contain many documents. Although cost-effective, this practice can harm searchability when these concatenated documents are used to build a search engine. The task of Page Stream Segmentation is concerned with recovering the original document boundaries through the analysis of the text and/or images of these PDF files. Currently, many approaches to this problem use machine learning techniques that require significant amounts of training data. However, due to the sometimes sensitive nature of the data, few large datasets exist, and there is a lack of agreed-upon metrics to measure system performance.

In an effort to resolve these issues and provide a comprehensive overview of the state of the field, we constructed the OpenPSS benchmark, consisting of two large public datasets and a comprehensive study of various types of approaches, evaluated using multiple evaluation metrics. The datasets originated from several Dutch government institutions, cover a heterogeneous set of topics, and total roughly 141 thousand pages from around 32 thousand documents.

The experimental results show that ensemble methods using both the text and image representations of pages are superior to uni-modal methods, and that image-based neural methods are not as robust as text models when evaluated on out-of-distribution data.

Notes

1. https://github.com/tesseract-ocr/tesseract
2. https://github.com/wietsedv/bertje
3. https://anonymous.4open.science/r/OpenPSSbenchmarkTPDL-D851/
4. Kirillov et al. call the unweighted F1 the recognition quality RQ, and the weighted F1, which equals $RQ\times SQ$, the Panoptic Quality PQ.
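
The relation in note 4 can be made concrete with a minimal sketch. It follows the general Panoptic Quality definition of Kirillov et al., in which a predicted and a gold segment match when their intersection over union (IoU) exceeds 0.5. How the benchmark itself encodes page streams is not stated on this page, so the binary "first page of a document" encoding and the function names `segments` and `panoptic_scores` are assumptions made for illustration only.

```python
from typing import List, Set, Tuple


def segments(first_page_flags: List[int]) -> List[Set[int]]:
    """Turn a binary stream (1 = page starts a new document) into sets of page indices.

    Assumed encoding, for illustration: the first page always opens a document.
    """
    docs: List[Set[int]] = []
    for page, flag in enumerate(first_page_flags):
        if flag == 1 or not docs:
            docs.append(set())
        docs[-1].add(page)
    return docs


def panoptic_scores(gold_flags: List[int], pred_flags: List[int]) -> Tuple[float, float, float]:
    """Return (RQ, SQ, PQ) for one page stream."""
    gold, pred = segments(gold_flags), segments(pred_flags)

    # A gold and a predicted document are a true positive when IoU > 0.5.
    # Because both sides partition the stream, each document matches at most once.
    ious = []
    for g in gold:
        for p in pred:
            iou = len(g & p) / len(g | p)
            if iou > 0.5:
                ious.append(iou)

    tp = len(ious)
    fp = len(pred) - tp  # predicted documents without a match
    fn = len(gold) - tp  # gold documents without a match

    denom = tp + 0.5 * fp + 0.5 * fn
    rq = tp / denom if denom else 0.0   # unweighted F1 over documents
    sq = sum(ious) / tp if tp else 0.0  # mean IoU of matched documents
    return rq, sq, rq * sq              # PQ = RQ x SQ (weighted F1)


if __name__ == "__main__":
    gold = [1, 0, 0, 1, 0, 1]  # documents: pages {0,1,2}, {3,4}, {5}
    pred = [1, 0, 1, 1, 0, 1]  # documents: pages {0,1}, {2}, {3,4}, {5}
    rq, sq, pq = panoptic_scores(gold, pred)
    print(f"RQ={rq:.3f} SQ={sq:.3f} PQ={pq:.3f}")
```

In this toy stream, the spurious split of the first document counts as one false positive, which lowers RQ, and it also lowers SQ because the matched fragment covers only two of the three gold pages.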

References

Agam, G., Argamon, S., Frieder, O., Grossman, D., Lewis, D.: The Complex Document Image Processing (CDIP) Test Collection Project. Illinois Institute of Technology, Chicago (2006)
Barrow, J., Jain, R., Morariu, V., Manjunatha, V., Oard, D., Resnik, P.: A joint model for document segmentation and segment labeling. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 313–322. Association for Computational Linguistics, New York, USA (2020). https://aclanthology.org/2020.acl-main.29
Beeferman, D., Berger, A., Lafferty, J.: Statistical models for text segmentation. Mach. Learn. 34(1), 177–210 (1999)
Braz, F.A., da Silva, N.C., Lima, J.A.S.: Leveraging effectiveness and efficiency in page stream deep segmentation. Eng. Appl. Artif. Intell. 105, 104394 (2021)
Choi, F.Y.Y.: Advances in domain independent linear text segmentation. In: Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (2000). https://aclanthology.org/A00-2004
Daher, H., Belaïd, A.: Document flow segmentation for business applications. In: Proceedings of the Document Recognition and Retrieval (DRR) XXI, vol. 9021, p. 90210G. International Society for Optics and Photonics (2014)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. Association for Computational Linguistics, Minneapolis, MN, USA (2019). https://doi.org/10.18653/v1/n19-1423
Guha, A., Alahmadi, A., Samanta, D., Khan, M.Z., Alahmadi, A.H.: A multi-modal approach to digital document stream segmentation for title insurance domain. IEEE Access 10, 11341–11353 (2022)
Hamdi, A., Coustaty, M., Joseph, A., d’Andecy, V.P., Doucet, A., Ogier, J.M.: Feature selection for document flow segmentation. In: Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 245–250 (2018)
Hernault, H., Bollegala, D., Ishizuka, M.: A sequential model for discourse segmentation. In: Gelbukh, A. (ed.) CICLing 2010. LNCS, vol. 6008, pp. 315–326. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12116-6_26
van Heusden, R., Kamps, J., Marx, M.: WooIR: a new open page stream segmentation dataset. In: Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, pp. 24–33 (2022)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9404–9413 (2019)
Koshorek, O., Cohen, A., Mor, N., Rotman, M., Berant, J.: Text segmentation as a supervised learning task. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 469–473. Association for Computational Linguistics, New Orleans, Louisiana (2018). https://aclanthology.org/N18-2075
Kuncheva, L.I.: A theoretical study on six classifier fusion strategies. IEEE Trans. Pattern Anal. Mach. Intell. 24(2), 281–286 (2002)
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the International Conference on Machine Learning, pp. 1188–1196. PMLR (2014)
Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665–666. Association for Computing Machinery, New York, NY, USA (2006). https://doi.org/10.1145/1148170.1148307
Lukasik, M., Dadachev, B., Papineni, K., Simões, G.: Text segmentation by cross segment attention. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4707–4716. Association for Computational Linguistics (2020). https://aclanthology.org/2020.emnlp-main.380
Meilender, T., Belaïd, A.: Segmentation of continuous document flow by a modified backward-forward algorithm. In: Proceedings of the Document Recognition and Retrieval (DRR) XVI, vol. 7247, p. 724705. International Society for Optics and Photonics (2009)
Ramachandram, D., Taylor, G.W.: Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Process. Mag. 34(6), 96–108 (2017)
Reynar, J.C.: An automatic method of finding topic boundaries. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 331–333. Association for Computational Linguistics, Las Cruces, New Mexico, USA (1994). https://aclanthology.org/P94-1050
Sauvola, J., Pietikäinen, M.: Adaptive document image binarization. Pattern Recogn. 33(2), 225–236 (2000). https://doi.org/10.1016/S0031-3203(99)00055-2
Sharkey, A.J.: Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, pp. 1–30. Springer (1999). https://doi.org/10.1007/978-1-4471-0793-4
Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105–6114. PMLR (2019)
Wang, Y., Li, S., Yang, J.: Toward fast and accurate neural discourse segmentation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 962–967. Association for Computational Linguistics, Brussels, Belgium (2018). https://aclanthology.org/D18-1116
Wiedemann, G., Heyer, G.: Multi-modal page stream segmentation with convolutional neural networks. Lang. Resour. Eval. 55(1), 127–150 (2021)
Zhu, G., Doermann, D.: Automatic document logo detection. In: Proceedings of the Ninth International Conference on Document Analysis and Recognition, vol. 2, pp. 864–868. IEEE (2007)
Zhu, G., Zheng, Y., Doermann, D., Jaeger, S.: Multi-scale structural saliency for signature detection. In: Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. IEEE (2007)

Author information

Authors and Affiliations

Information Retrieval Lab, University of Amsterdam, Amsterdam, The Netherlands
Ruben van Heusden & Maarten Marx
Faculty of Humanities, University of Amsterdam, Amsterdam, The Netherlands
Jaap Kamps

Authors

Ruben van Heusden
Jaap Kamps
Maarten Marx

Corresponding author

Correspondence to Ruben van Heusden.

Editor information

Editors and Affiliations

University of Salford, Salford, UK
Apostolos Antonacopoulos
University of Waikato, Hamilton, New Zealand
Annika Hinze
Sorbonne University (CNRS), Paris, France
Benjamin Piwowarski
University of La Rochelle (L3i Laboratory), La Rochelle, France
Mickaël Coustaty
University of Padova, Padua, Italy
Giorgio Maria Di Nunzio
University of Hamburg, Hamburg, Germany
Francesco Gelati
University of Waikato, Hamilton, New Zealand
Nicholas Vanderschantz

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

About this paper

Cite this paper

Heusden, R.v., Kamps, J., Marx, M. (2024). OpenPSS: An Open Page Stream Segmentation Benchmark. In: Antonacopoulos, A., et al. Linking Theory and Practice of Digital Libraries. TPDL 2024. Lecture Notes in Computer Science, vol 15177. Springer, Cham. https://doi.org/10.1007/978-3-031-72437-4_24

DOI: https://doi.org/10.1007/978-3-031-72437-4_24
Published: 26 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72436-7
Online ISBN: 978-3-031-72437-4
eBook Packages: Computer Science, Computer Science (R0)
