Experimenting with Unsupervised Multilingual Event Detection in Historical Newspapers

  • Conference paper
From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries (ICADL 2022)

Abstract

Research in event detection can facilitate access to digitized collections and thereby help keep historical knowledge from fading. In this paper, we propose a method for annotating multilingual historical documents for event detection in an unsupervised manner by leveraging entities and semantic notions of event types. We automatically annotate the documents by relying on dependency parse trees and automatic semantic mapping to event-based frames, with a focus on the multilingual transfer between frames and candidate events. The annotated documents are afterward verified by native speakers who are Digital Humanities researchers. We also report experimental results of event detection in historical newspapers with a state-of-the-art model. We demonstrate that our approach allows for easy language adaptation by presenting two case studies: knowledge extracted from German newspapers from 1911 to 1933 regarding events surrounding International Women’s Day, and from French newspapers between 1900 and 1944 related to the abolition of guillotine executions in France. Our preliminary findings show that this type of approach could alleviate the need for manual annotation while providing a practical course of action toward unsupervised event detection from multilingual digitized historical documents.

Notes

  1. https://www.newseye.eu/

  2. https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf

  3. An example of the Execution frame can be viewed at the Framenet2 website.

  4. We chose movements, conflictual events, and membership in organizations.

  5. We used spaCy 3.1+ [23] with the model xx_ent_wiki_sm (https://spacy.io/models/xx).

  6. We are aware that BERT was trained to represent true sentences rather than pseudo-sentences. However, we consider that BERT might generate an embedding that represents the context in which all the event triggers are frequently used.

  7. As we use multilingual BERT, even when the training data is in English, the model should be able to predict events in other languages in a zero-shot manner.

  8. Translation: In order to bring about this desired state, we send our greetings to our sisters all over the world and call on them to demonstrate together with us against the continuation of the war on International Women’s Day.

  9. This threshold was chosen experimentally after we verified the dismissed articles.
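
The pseudo-sentence idea in Note 6 can be illustrated with a minimal sketch. It is not the authors' implementation: the trigger lists, the event type names, the threshold value, and all function names are hypothetical, and a bag of character trigrams stands in for the BERT embedding so that the example runs without any model download. Unlike multilingual BERT, this stand-in has no cross-lingual ability.

```python
import math
from collections import Counter

# Hypothetical trigger lists per event type; the names and words are
# illustrative and are not taken from the paper.
EVENT_TRIGGERS = {
    "Execution": ["execute", "execution", "guillotine", "behead"],
    "Demonstrate": ["demonstrate", "demonstration", "protest", "march"],
}


def pseudo_sentence(triggers):
    """Join the triggers of one event type into a single pseudo-sentence."""
    return " ".join(triggers)


def toy_embed(text, n=3):
    """Stand-in for a BERT embedding (assumption): character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_event_type(candidate, threshold=0.2):
    """Return (event_type, score), or (None, score) below the threshold.

    The threshold value is hypothetical.
    """
    scores = {
        event_type: cosine(toy_embed(candidate), toy_embed(pseudo_sentence(triggers)))
        for event_type, triggers in EVENT_TRIGGERS.items()
    }
    event_type, score = max(scores.items(), key=lambda item: item[1])
    return (event_type, score) if score >= threshold else (None, score)


print(best_event_type("executed"))
print(best_event_type("weather"))
```

In this sketch, each event type is represented by one vector computed from all of its triggers at once, and a candidate word is assigned to the closest event type only if the similarity clears the threshold. Swapping `toy_embed` for a real sentence encoder would leave the rest of the logic unchanged.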

References

  1. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, vol. 1, pp. 86–90 (1998)

  2. Bedi, H., Patil, S., Hingmire, S., Palshikar, G.: Event timeline generation from history textbooks. In: Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pp. 69–77 (2017)

  3. Boros, E., et al.: Alleviating digitization errors in named entity recognition for historical documents. In: Proceedings of the 24th Conference on Computational Natural Language Learning, pp. 431–441. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.conll-1.35

  4. Boros, E., et al.: Robust named entity recognition and linking on historical multilingual documents. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020)

  5. Boros, E., Moreno, J.G., Doucet, A.: Event detection with entity markers. In: Hiemstra, D., Moens, M.-F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds.) ECIR 2021. LNCS, vol. 12657, pp. 233–240. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72240-1_20

  6. Boschee, E., Natarajan, P., Weischedel, R.: Automatic extraction of events from open source text for predictive forecasting. In: Subrahmanian, V. (ed.) Handbook of Computational Approaches to Counterterrorism, pp. 51–67. Springer, New York (2013). https://doi.org/10.1007/978-1-4614-5311-6_3

  7. Boschetti, F., et al.: Computational analysis of historical documents: an application to Italian war bulletins in World War I and II. In: Workshop on Language resources and technologies for processing and linking historical documents and archives (LRT4HDA 2014), pp. 70–75. ELRA (2014)

  8. Bronstein, O., Dagan, I., Li, Q., Ji, H., Frank, A.: Seed-based event trigger labeling: how far can event descriptions get us? In: ACL, vol. 2, pp. 372–376 (2015)

  9. Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1, pp. 167–176 (2015)

  10. Cybulska, A., Vossen, P.: Historical event extraction from text. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 39–43 (2011)

  11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  12. Ehrmann, M., Romanello, M., Bircher, S., Clematide, S.: Introducing the CLEF 2020 HIPE shared task: named entity recognition and linking on historical newspapers. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS, vol. 12036, pp. 524–532. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_68

  13. Ehrmann, M., Romanello, M., Doucet, A., Clematide, S.: Introducing the HIPE 2022 shared task: named entity recognition and linking in multilingual historical documents. In: Hagen, M., et al. (eds.) ECIR 2022. LNCS, vol. 13186, pp. 347–354. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99739-7_44

  14. Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Overview of CLEF HIPE 2020: named entity recognition and linking on historical newspapers. In: Arampatzis, A., et al. (eds.) CLEF 2020. LNCS, vol. 12260, pp. 288–310. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58219-7_21

  15. Ehrmann, M., Romanello, M., Najem-Meyer, S., Doucet, A., Clematide, S.: Overview of HIPE-2022: named entity recognition and linking in multilingual historical documents. In: Barrón-Cedeño, A., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2022. LNCS, vol. 13390. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-13643-6_26

  16. Fellbaum, C.: WordNet. In: Poli, R., Healy, M., Kameas, A. (eds.) Theory and Applications of Ontology: Computer Applications, pp. 231–243. Springer, Dordrecht (2010). https://doi.org/10.1007/978-90-481-8847-5_10

  17. Feng, X., Huang, L., Tang, D., Ji, H., Qin, B., Liu, T.: A language-independent neural network for event detection. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (vol. 2: Short Papers), vol. 2, pp. 66–71 (2016)

  18. Filatova, E., Hatzivassiloglou, V.: Event-based extractive summarization (2004)

  19. Grishman, R., Sundheim, B.: Message understanding conference-6: a brief history. In: COLING 1996, pp. 466–471 (1996)

  20. Hamdi, A., Jean-Caurant, A., Sidere, N., Coustaty, M., Doucet, A.: An analysis of the performance of named entity recognition over OCRed documents. In: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 333–334. IEEE, Illinois, USA (2019)

  21. Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., Zhu, Q.: Using cross-entity inference to improve event extraction. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-vol. 1, pp. 1127–1136. Association for Computational Linguistics (2011)

  22. Hong, Y., Zhou, W., Zhang, J., Zhou, G., Zhu, Q.: Self-regulation: employing a generative adversarial network to improve event detection. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 515–526 (2018)

  23. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength natural language processing in python (2020). https://doi.org/10.5281/zenodo.1212303

  24. Huang, R., Riloff, E.: Peeling back the layers: detecting event role fillers in secondary contexts. In: ACL 2011, pp. 1137–1147 (2011)

  25. Ide, N., Woolner, D.: Exploiting semantic web technologies for intelligent access to historical documents. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04). European Language Resources Association (ELRA), Lisbon, Portugal (2004). https://www.lrec-conf.org/proceedings/lrec2004/pdf/248.pdf

  26. Jean-Caurant, A., Doucet, A.: Accessing and investigating large collections of historical newspapers with the NewsEye platform. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pp. 531–532 (2020)

  27. Li, Q., Ji, H., Huang, L.: Joint event extraction via structured prediction with global features. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 73–82. Association for Computational Linguistics, Sofia, Bulgaria (2013). https://www.aclweb.org/anthology/P13-1008

  28. Li, W., Cheng, D., He, L., Wang, Y., Jin, X.: Joint event extraction based on hierarchical event schemas from FrameNet. IEEE Access 7, 25001–25015 (2019)

  29. Liu, J., Chen, Y., Liu, K., Bi, W., Liu, X.: Event extraction as machine reading comprehension. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1641–1651 (2020)

  30. Liu, M., Li, W., Wu, M., Lu, Q.: Extractive summarization based on event term clustering. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pp. 185–188 (2007)

  31. Liu, S., et al.: Leveraging FrameNet to improve automatic event detection (2016)

  32. Liu, S., Chen, Y., Liu, K., Zhao, J.: Exploiting argument information to improve event detection via supervised attention mechanisms. In: 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 1789–1798. Vancouver, Canada (2017)

  33. Liu, S., et al.: Exploiting argument information to improve event detection via supervised attention mechanisms (2017)

  34. Miller, D., Boisen, S., Schwartz, R., Stone, R., Weischedel, R.: Named entity extraction from noisy input: speech and OCR. In: Proceedings of the sixth conference on Applied natural language processing, pp. 316–324. Association for Computational Linguistics, Seattle, Washington, USA (2000)

  35. Mutuvi, S., Doucet, A., Odeo, M., Jatowt, A.: Evaluating the impact of OCR errors on topic modeling. In: Dobreva, M., Hinze, A., Žumer, M. (eds.) ICADL 2018. LNCS, vol. 11279, pp. 3–14. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04257-8_1

  36. Nguyen, T.H., Cho, K., Grishman, R.: Joint event extraction via recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 300–309 (2016)

  37. Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (vol. 2: Short Papers), pp. 365–371. Association for Computational Linguistics, Beijing, China (2015). https://doi.org/10.3115/v1/P15-2060

  38. Oberbichler, S., et al.: Integrated interdisciplinary workflows for research on historical newspapers: perspectives from humanities scholars, computer scientists, and librarians. J. Assoc. Inf. Sci. Technol. 73(2), 225–239 (2021)

  39. Riloff, E.: Automatically generating extraction patterns from untagged text. In: AAAI1996, pp. 1044–1049 (1996)

  40. Riloff, E.: An empirical study of automated dictionary construction for information extraction in three domains. Artif. Intell. 85(1), 101–134 (1996)

  41. Rodriquez, K.J., Bryant, M., Blanke, T., Luszczynska, M.: Comparison of named entity recognition tools for raw OCR text. In: Jancsary, J. (ed.) 11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing, 19–21 Sept 2012. Scientific series of the ÖGAI, vol. 5, pp. 410–414. ÖGAI, Wien, Österreich, Vienna, Austria (2012). https://www.oegai.at/konvens2012/proceedings/60_rodriquez12w/

  42. Rovera, M., Nanni, F., Ponzetto, S.P.: Event-based access to historical Italian war memoirs. J. Comput. Cult. Heritage 14(1), 1–23 (2021). https://doi.org/10.1145/3406210

  43. Saurí, R., Knippen, R., Verhagen, M., Pustejovsky, J.: Evita: a robust event recognizer for QA systems. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 700–707. Association for Computational Linguistics, Vancouver, British Columbia, Canada (2005). https://aclanthology.org/H05-1088

  44. Shaw, R.B.: Events and periods as concepts for organizing historical knowledge. University of California, Berkeley (2010)

  45. Sprugnoli, R.: Event Detection and Classification for the Digital Humanities. Ph.D. thesis, University of Trento (2018)

  46. van Strien, D., Beelen, K., Ardanuy, M.C., Hosseini, K., McGillivray, B., Colavizza, G.: Assessing the impact of OCR quality on downstream NLP tasks. In: ICAART 2020 - Proceedings of the 12th International Conference on Agents and Artificial Intelligence vol. 1, pp. 484–496 (2020)

  47. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  48. Walker, C., Strassel, S., Medero, J., Maeda, K.: ACE 2005 multilingual training corpus. Linguistic Data Consortium, Technical report (2005)

  49. Yang, S., Feng, D., Qiao, L., Kan, Z., Li, D.: Exploring pre-trained language models for event extraction and generation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5284–5294 (2019)

  50. Yangarber, R., Grishman, R., Tapanainen, P., Huttunen, S.: Automatic acquisition of domain knowledge for information extraction. In: 18th International Conference on Computational Linguistics (COLING 2000), pp. 940–946 (2000)

  51. Zhang, T., Ji, H., Sil, A.: Joint entity and event extraction with generative adversarial imitation learning. Data Intell. 1(2), 99–120 (2019)

Acknowledgments

This work has been supported by the European Union’s Horizon 2020 research and innovation program under grants 770299 (NewsEye) and 825153 (Embeddia). Also, it has been supported by the ANNA (2019-1R40226) and TERMITRAD (2020–2019-8510010) projects funded by the Nouvelle-Aquitaine Region, France.

Author information

Corresponding author

Correspondence to Emanuela Boros.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Boros, E., Cabrera-Diego, L.A., Doucet, A. (2022). Experimenting with Unsupervised Multilingual Event Detection in Historical Newspapers. In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_15

  • DOI: https://doi.org/10.1007/978-3-031-21756-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21755-5

  • Online ISBN: 978-3-031-21756-2

  • eBook Packages: Computer Science, Computer Science (R0)
