Abstract
In this article, we study the differences between European languages using statistical, unsupervised methods. The analysis is conducted at several levels of language: lexical, morphological and syntactic. Our premise is that the difficulty of translation can be understood in terms of the differences or similarities between languages at these levels. The analyses are based on the concept of Kolmogorov complexity, which is used to compare language structure at the syntactic and morphological levels. The way languages convey information at these levels is taken as a measure of similarity or dissimilarity between them, and the results are compared to classical linguistic classifications. The results are intended to serve as a tool in developing machine translation systems, for example in the following way: if the source language conveys more information at the morphological level and the target language more at the syntactic level, the (machine) translator must be able to transfer information from one level to the other.
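Kolmogorov complexity is not computable, so in practice it is usually approximated by the output size of a real compressor. The sketch below illustrates that general idea with the normalized compression distance (NCD), a standard compression-based similarity measure. It is an illustration only: the abstract does not state which compressor or distance formula the authors used, so the choice of zlib, the function names compressed_size and ncd, and the two sample sentences are all assumptions made here.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate the Kolmogorov complexity of data by its zlib-compressed size.

    zlib is an assumption of this sketch; any general-purpose compressor
    could be substituted.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Lower values mean that one text
    helps to compress the other, i.e. the texts are more similar.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the paper's corpus.
    english = ("The committee adopted the report on the protection of the "
               "environment and the sustainable use of natural resources "
               "in the member states.").encode("utf-8")
    finnish = ("Valiokunta hyväksyi mietinnön ympäristönsuojelusta ja "
               "luonnonvarojen kestävästä käytöstä "
               "jäsenvaltioissa.").encode("utf-8")

    print("NCD(english, english):", ncd(english, english))
    print("NCD(english, finnish):", ncd(english, finnish))
```

With inputs this short the compressor overhead dominates, so the values are only indicative; a meaningful comparison would need much longer parallel texts in each language.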
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kettunen, K., Sadeniemi, M., Lindh-Knuutila, T., Honkela, T. (2006). Analysis of EU Languages Through Text Compression. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_12
DOI: https://doi.org/10.1007/11816508_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science (R0)