Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units

da Silva, Joaquim Ferreira; Dias, Gaël; Guilloré, Sylvie; Pereira Lopes, José Gabriel

doi:10.1007/3-540-48159-1_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1695)

Included in the following conference series:

Portuguese Conference on Artificial Intelligence

Abstract

The availability of contiguous and non-contiguous multiword lexical units (MWUs) in Natural Language Processing (NLP) lexica enhances parsing precision, helps attachment decisions, improves indexing in information retrieval (IR) systems, and reinforces information extraction (IE) and text mining, among other applications. Unfortunately, their acquisition has long been a significant problem in NLP, IR and IE. In this paper we propose two new association measures, the Symmetric Conditional Probability (SCP) and the Mutual Expectation (ME), for the extraction of contiguous and non-contiguous MWUs. Both measures are used by a new algorithm, the LocalMaxs, that requires neither empirically obtained thresholds nor complex linguistic filters. We assess the results obtained by both measures by comparing them with reference association measures (Specific Mutual Information, φ², Dice and Log-Likelihood coefficients) over a multilingual parallel corpus. An additional experiment has been carried out over a part-of-speech tagged Portuguese corpus for extracting contiguous compound verbs.
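
As a rough illustration of the kind of association measure compared here, the sketch below scores a single bigram from raw counts. The abstract does not define SCP, so the bigram form used, SCP(x, y) = p(x, y)² / (p(x) p(y)), is an assumption taken from the usual statement of the measure, and Specific Mutual Information is assumed to be the pointwise form log₂(p(x, y) / (p(x) p(y))). The function names and example counts are hypothetical, and neither the ME measure nor the LocalMaxs selection step is reproduced.

```python
import math


def scp(f_xy: int, f_x: int, f_y: int) -> float:
    """Symmetric Conditional Probability of a bigram (assumed bigram form).

    p(x|y) * p(y|x) = p(x,y)^2 / (p(x) * p(y)); when all probabilities are
    estimated with the same corpus size N, N cancels and only counts remain.
    """
    return (f_xy ** 2) / (f_x * f_y)


def dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Dice coefficient of a bigram from raw counts."""
    return 2 * f_xy / (f_x + f_y)


def specific_mutual_information(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information in bits, with p(.) estimated as count / n."""
    return math.log2(n * f_xy / (f_x * f_y))


# Hypothetical counts: the pair occurs 8 times, its words 10 and 16 times,
# in a corpus of 1000 tokens.
example = {
    "scp": scp(8, 10, 16),
    "dice": dice(8, 10, 16),
    "smi": specific_mutual_information(8, 10, 16, 1000),
}
```

Under these assumptions SCP and Dice both lie in [0, 1] and reach 1 only when the two words always occur together, whereas the mutual information score is unbounded and depends on corpus size.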

Author information

Authors and Affiliations

Departamento de Informática, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Quinta da Torre, 2825-114, Monte da Caparica, Portugal
Joaquim Ferreira da Silva, Gaël Dias & José Gabriel Pereira Lopes
Laboratoire d’Informatique Fondamentale d’Orléans, Université d’Orléans, BP 6102, 45061, Orléans Cédex 2, France
Sylvie Guilloré

Editor information

Editors and Affiliations

Departamento de Informática, Universidade Nova de Lisboa, Quinta da Torre, P-2825-114, Caparica, Portugal
Pedro Barahona
Departamento de Matemática, Universidade de Évora, R. Romão Ramalho 59, P-7000, Évora, Portugal
José J. Alferes

About this paper

Cite this paper

da Silva, J.F., Dias, G., Guilloré, S., Pereira Lopes, J.G. (1999). Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units. In: Barahona, P., Alferes, J.J. (eds) Progress in Artificial Intelligence. EPIA 1999. Lecture Notes in Computer Science, vol 1695. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48159-1_9

DOI: https://doi.org/10.1007/3-540-48159-1_9
Published: 11 February 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66548-9
Online ISBN: 978-3-540-48159-1
eBook Packages: Springer Book Archive