Abstract
The availability of contiguous and non-contiguous multiword lexical units (MWUs) in Natural Language Processing (NLP) lexica enhances parsing precision, helps attachment decisions, improves indexing in information retrieval (IR) systems, and reinforces information extraction (IE) and text mining, among other applications. Unfortunately, their acquisition has long been a significant problem in NLP, IR and IE. In this paper we propose two new association measures, the Symmetric Conditional Probability (SCP) and the Mutual Expectation (ME), for the extraction of contiguous and non-contiguous MWUs. Both measures are used by a new algorithm, LocalMaxs, that requires neither empirically obtained thresholds nor complex linguistic filters. We assess the results obtained by both measures by comparing them with reference association measures (Specific Mutual Information, φ², Dice and Log-Likelihood coefficients) over a multilingual parallel corpus. An additional experiment has been carried out over a part-of-speech tagged Portuguese corpus for extracting contiguous compound verbs.
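The abstract names SCP without giving its formula. For the two-word case it is usually written as the product of the two conditional probabilities, SCP(x, y) = p(x|y) · p(y|x) = p(x, y)² / (p(x) · p(y)). The sketch below assumes that bigram definition and estimates every probability with the same corpus size, so the size cancels and only raw counts remain. The function name `scp_bigram_scores` is hypothetical, and the sketch does not cover the generalisation to longer or non-contiguous n-grams, the ME measure, or LocalMaxs itself.

```python
from collections import Counter


def scp_bigram_scores(tokens):
    """Score every adjacent word pair with a bigram SCP.

    Assumed definition: SCP(x, y) = p(x, y)^2 / (p(x) * p(y)).
    With one shared normaliser for all probabilities this reduces to
    f(x, y)^2 / (f(x) * f(y)), where f(.) is a raw frequency count.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), count_xy in bigram_counts.items():
        scores[(x, y)] = count_xy ** 2 / (unigram_counts[x] * unigram_counts[y])
    return scores


if __name__ == "__main__":
    toy_corpus = "hong kong is in hong kong".split()
    # Print pairs from the most to the least strongly associated.
    for pair, score in sorted(scp_bigram_scores(toy_corpus).items(),
                              key=lambda item: -item[1]):
        print(pair, round(score, 3))
```

A pair whose words always occur together scores 1, and the score drops as either word also appears in other contexts.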
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
da Silva, J.F., Dias, G., Guilloré, S., Pereira Lopes, J.G. (1999). Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units. In: Barahona, P., Alferes, J.J. (eds) Progress in Artificial Intelligence. EPIA 1999. Lecture Notes in Computer Science(), vol 1695. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48159-1_9
DOI: https://doi.org/10.1007/3-540-48159-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66548-9
Online ISBN: 978-3-540-48159-1
eBook Packages: Springer Book Archive