Abstract
Deep language models, such as ELMo, BERT and GPT, have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general-domain text and then trained with supervision on downstream tasks. An optional intermediate step consists of finetuning the language model on a large intradomain corpus of unlabeled text before training it on the final task. This step is not well explored in the current literature. In this work, we investigate its impact on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement in performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks, in order to understand whether these improvements were really due to domain similarity or simply due to more training data. The results indicate that finetuning on a legal-domain corpus hurts performance on the general-domain NER tasks. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese language NER corpus for the legal domain.
Notes
- 2. The public agency for law enforcement and prosecution of crimes in the Brazilian state of Mato Grosso do Sul.
- 9. The results presented in the LeNER-Br paper are based on the token-level evaluation, which is not standard in the literature and provides much higher numbers.
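Note 9 contrasts token-level evaluation with the entity-level evaluation that is standard for NER. The sketch below illustrates why the two can diverge. It assumes IOB2 tags, uses one common token-level definition (exact tag match on non-O tokens), and works on a made-up sentence with hypothetical ORG and PER labels. It is not necessarily the scoring procedure used in the LeNER-Br paper or in this work.

```python
from typing import List, Set, Tuple

def extract_entities(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) spans from an IOB2 tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        # A span continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # Open a span on B-, or leniently on an I- tag with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def f1(true_pos: int, n_pred: int, n_gold: int) -> float:
    """F1 from counts; defined as 0.0 when there are no true positives."""
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)

def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a prediction counts only if boundaries and type match."""
    g, p = extract_entities(gold), extract_entities(pred)
    return f1(len(g & p), len(p), len(g))

def token_f1(gold: List[str], pred: List[str]) -> float:
    """Token-level F1 (assumed definition): each non-O token scored on its own."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return f1(true_pos, n_pred, n_gold)

# Hypothetical example: the model truncates a three-token ORG entity.
gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
pred = ["B-ORG", "I-ORG", "O", "O", "B-PER"]

print(f"token-level F1:  {token_f1(gold, pred):.3f}")
print(f"entity-level F1: {entity_f1(gold, pred):.3f}")
```

In this example the truncated ORG entity still earns credit for two of its three tokens under the token-level score, while the entity-level score counts it as both a missed entity and a spurious one. This is one way a token-level number can end up well above the entity-level number for the same predictions.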
References
Alsentzer, E., et al.: Publicly available clinical BERT embeddings. CoRR, abs/1904.03323 (2019). http://arxiv.org/abs/1904.03323
Angelidis, I., Chalkidis, I., Koubarakis, M.: Named entity recognition, linking and generation for Greek legislation. In: Proceedings of JURIX 2018 (2018)
Badji, I.: Legal entity extraction with NER systems, June 2018. http://oa.upm.es/51740/
de Castro, P.V.Q.: Aprendizagem profunda para reconhecimento de entidades nomeadas em domínio jurídico. Master’s thesis, Programa de Pós-graduação em Ciência da Computação (INF) (2019). http://repositorio.bc.ufg.br/tede/handle/tede/10276. Instituto de Informática - INF (RG)
Quinta de Castro, P.V., Félix Felipe da Silva, N., da Silva Soares, A.: Portuguese named entity recognition using LSTM-CRF. In: Villavicencio, A., et al. (eds.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 83–92. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_9
de Castro, P.V.Q., da Silva, N.F.F., da Silva Soares, A.: Contextual representations and semi-supervised named entity recognition for Portuguese language. In: Proceedings of IberLEF@SEPLN 2019 (2019)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). http://arxiv.org/abs/1810.04805
do Amaral, D.O.F., Vieira, R.: NERP-CRF: uma ferramenta para o reconhecimento de entidades nomeadas por meio de conditional random fields. Linguamática 6, 41–49 (2014)
Dozier, C., Kondadadi, R., Light, M., Vachher, A., Veeramachaneni, S., Wudali, R.: Named entity recognition and resolution in legal text. In: Francesconi, E., Montemagni, S., Peters, W., Tiscornia, D. (eds.) Semantic Processing of Legal Texts. LNCS (LNAI), vol. 6036, pp. 27–43. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12837-0_2
Freitas, C., Mota, C., Santos, D., Oliveira, H.G., Carvalho, P.: Second HAREM: advancing the state of the art of named entity recognition in Portuguese. In: Proceedings of LREC 2010 (2010)
Hakala, K., Pyysalo, S.: Biomedical named entity recognition with multilingual BERT. In: Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, Hong Kong, China, November 2019, pp. 56–61. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-5709. https://www.aclweb.org/anthology/D19-5709
Howard, J., Ruder, S.: Fine-tuned language models for text classification. CoRR, abs/1801.06146 (2018). http://arxiv.org/abs/1801.06146
Lample, G., Conneau, A.: Cross-lingual language model pretraining. CoRR, abs/1901.07291 (2019). http://arxiv.org/abs/1901.07291
Luz de Araujo, P.H., de Campos, T.E., de Oliveira, R.R.R., Stauffer, M., Couto, S., Bermejo, P.: LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In: Villavicencio, A., et al. (eds.) PROPOR 2018. LNCS (LNAI), vol. 11122, pp. 313–323. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99722-3_32
Martin, L., et al.: CamemBERT: a tasty French language model. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.645
Peters, M.E., et al.: Deep contextualized word representations. CoRR, abs/1802.05365 (2018). http://arxiv.org/abs/1802.05365
Pirovani, J., Oliveira, E.: Portuguese named entity recognition using conditional random fields and local grammars. In: Proceedings of LREC 2018, May 2018
Polignano, M., Basile, P., de Gemmis, M., Semeraro, G., Basile, V.: AlBERTo - Italian BERT language understanding model for NLP challenging tasks based on tweets. In: CLiC-it (2019)
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
Rother, K., Rettberg, A.: ULMFiT at GermEval-2018: a deep neural language model for the classification of hate speech in German tweets. In: Proceedings of the GermEval 2018 Workshop, September 2018
Souza, F., Nogueira, R., Lotufo, R.: Portuguese named entity recognition using BERT-CRF. arXiv:1909.10649 (2020)
Vaswani, A., et al.: Attention is all you need. CoRR, abs/1706.03762 (2017). http://arxiv.org/abs/1706.03762
Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA) (2018). https://www.aclweb.org/anthology/L18-1686
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Bonifacio, L.H., Vilela, P.A., Lobato, G.R., Fernandes, E.R. (2020). A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_46
DOI: https://doi.org/10.1007/978-3-030-61377-8_46
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61376-1
Online ISBN: 978-3-030-61377-8
eBook Packages: Computer Science, Computer Science (R0)