Multilingualism and language diversity are the main keywords of my professional interests as a linguist. In addition to documenting and describing endangered languages with contemporary e-science methods, I specialize in the sociology of language and minority language policy, as well as language contact and linguistic typology.
This article introduces a novel and creative application of the Constraint Grammar formalism, by presenting an automated method for pseudonymising a Zyrian Komi spoken language corpus in an effective, reliable and scalable manner. The method is intended to be used to minimize various kinds of personal information found in the corpus in order to make spoken language data available while preventing the spread of sensitive personal data about the recorded informants or other persons mentioned in the texts. In our implementation, a Constraint Grammar based pseudonymisation tool is used as an automatically applied shallow layer that derives from the original corpus data a version which can be shared for open research use.
Objectives:
Distinguishing between language mixing and language fusion is a non-trivial task, particularly in situations of long-standing bilingualism. The main goal of this paper is thus to propose and test a methodology for discerning language fusion from conventionalized mixing. In addition, we examine the hypothesis that the fusion of unbound elements evolves from alternational mixing.
Design:
The paper addresses the goals through a distributional analysis of a vernacular variety of Kildin Saami, a seriously endangered East Saamic (Uralic) language spoken on the Kola Peninsula in Northwest Russia, as a partially fused lect due to contact with Russian.
Data and Analysis:
A one-hour recording of an informal group conversation with three native speakers, comprising some 10,000 word tokens, was transcribed and annotated for Russian-origin items. For comparison, other available speech samples, documenting the earlier stages of the language development, as well as the few existing grammatical descriptions and dictionaries were referred to.
Findings:
The paper develops and showcases three diagnostic criteria indicative of language fusion: (a) regularization of the donor language items’ usage patterns in the mixed variety; (b) functional reduction, or functional extension, of the donor language element, and/or of its inherited native equivalent; (c) the introduction of new constructions involving the donor language grammatical elements by way of loan translation. Finally, we report multiple parallels existing between the distribution of Russian-origin items in vernacular Kildin Saami and alternational mixing.
Originality:
This paper is the first to propose and systematically test diagnostic criteria indicative of language fusion in a situation of long-term bilingualism.
Significance:
The proposed criteria may reliably be employed as indicators of fusion in future studies of contact varieties with little, or undocumented, linguistic histories. Furthermore, in contrast to the mainstream assumption, this study also provides evidence for the claim that alternational mixing can be a starting point for the emergence of a fused lect.
Cultural and linguistic minorities in the Russian Federation and the European Union, ed. by Heiko F. Marten, Michael Rießler, Janne Saarikivi, Reetta Toivanen
The article describes the evolution of literary languages for four endangered indigenous languages. Different paths of language standardization and revitalization in the Soviet Russian minority context are illustrated with case studies from Dolgan (Turkic), Forest Enets (Uralic), and Kildin Sámi (Uralic). The three cases offer an excellent comparative view of the origin and progress of literacy creation for small indigenous languages in the Russian Federation. The fourth language, Skolt Sámi (Uralic), provides a comparative view beyond the border into the European Union. The different geographical and political settings of language planning attempts for the four languages have resulted in chronologically and substantially different developments. For Dolgan, Forest Enets and Kildin Sámi, the effect standardization has upon language survival has been very similar. In these languages, neither standardization nor the evolving written culture seems to inhibit language shift to any considerable degree. On the other hand, Skolt Sámi in Finland has undergone a slightly more successful process of revitalization, even though the language remains critically endangered.
Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages
The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which use similar data and technical frameworks and are carried out in Freiburg and in collaboration with Hamburg, Syktyvkar, Tromsø and Uppsala. Our projects work in the endangered language documentation framework and record new spoken language data, digitize available recordings and annotate these multimedia data in order to provide comprehensive language corpora as databases for future research on and for endangered and under-described Uralic speech communities. Applying NLP methods in language documentation – specifically rule-based morphological and syntactic analyzers – helps us to create more systematically annotated corpora, rather than eclectic data collections. We propose a step-by-step approach to reach higher-level annotations by using and improving truly computational methods. This is unlike the mainstream, which prefers manual or semi-manual work. Ultimately, the spoken corpora created by our projects will be useful for scientifically significant quantitative investigations on these languages in the future.
The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which record new spoken language data, digitize available recordings and annotate these multimedia data in order to provide comprehensive language corpora as databases for future research on and for endangered – and under-described – Uralic speech communities. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Specifically, we describe a script providing interactivity between different morphosyntactic analysis modules implemented as Finite State Transducers and ELAN, a Graphical User Interface tool for annotating and presenting multimodal corpora. Ultimately, the spoken corpora created in our projects will be useful for scientifically significant quantitative investigations on these languages in the future. * The order of the authors' names is alphabetical.
The paper describes work-in-progress by the Izhva Komi language documentation project, which records new spoken language data, digitizes available recordings and annotates these multimedia data in order to provide a comprehensive language corpus as a database for future research on and for this endangered – and under-described – Uralic speech community. While working with a spoken variety and in the framework of documentary linguistics, we apply language technology methods and tools which have so far been applied only to normalized written languages. Specifically, we describe a script providing interactivity between ELAN, a Graphical User Interface tool for annotating and presenting multimodal corpora, and different morphosyntactic analysis modules implemented as Finite State Transducers and Constraint Grammar for rule-based morphosyntactic tagging and disambiguation. Our aim is to challenge current manual approaches in the annotation of language documentation corpora.
The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which use similar data and technical frameworks and are carried out collaboratively in Uppsala, Tromsø, Syktyvkar and Freiburg. Our projects record and annotate spoken language data in order to provide comprehensive speech corpora as databases for future research on and for these endangered – and under-described – Uralic speech communities. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Ultimately, the multimodal corpora created by our projects will be useful for scientifically significant quantitative investigations on these languages in the future.
Semantic functions of complementizers in European languages, ed. by Kasper Boye and Petar Kehayov, 2016
The present chapter focuses on complementizer constructions with complement clauses and the morphosyntactic devices identifying these constructions in three Saamic languages. The focus is on canonical complementizers, defined here as a subclass of complementizers marking finite complements, and we look specifically at the equivalents of English that and if. Complementizers marking non-finite complements as well as non-overt complement marking remain outside of the scope of the present paper and will not be described systematically.
The languages under investigation are Kildin Saami, Skolt Saami and North Saami. Our data supports an analysis of complementizers in Kildin, Skolt and North Saami in terms of epistemic contrast. In Skolt and North Saami, we have on the one hand an epistemically neutral marker which does not pose any semantic restrictions upon the truth value of the proposition in the clause it introduces. On the other hand, we have an emerging complementizer used in cases where the truth of the proposition in its clause is uncertain. In Kildin Saami, the complementizers are better described as indicating certainty vs. uncertainty.
Medien für Minderheitensprachen: Mediensprachliche Überlegungen zur Entwicklung von Minderheitensprachen, ed. by Mará Alba Niño and Rolf Kailuweit, 2015
Cultural and linguistic minorities in the Russian Federation and the European Union, 2015
The chapter describes the evolution of literary languages for four endangered indigenous languages. Different paths of language standardization and revitalization in the Soviet Russian minority context are illustrated with case studies from Dolgan (Turkic), Forest Enets (Uralic), and Kildin Sámi (Uralic). The three cases offer an excellent comparative view of the origin and progress of literacy creation for small indigenous languages in the Russian Federation. The fourth language, Skolt Sámi (Uralic), provides a comparative view beyond the border into the European Union. The different geographical and political settings of language planning attempts for the four languages have resulted in chronologically and substantially different developments. For Dolgan, Forest Enets and Kildin Sámi, the effect standardization has upon language survival has been very similar. In these languages, neither standardization nor the evolving written culture seems to inhibit language shift to any considerable degree. On the other hand, Skolt Sámi in Finland has undergone a slightly more successful process of revitalization, even if the language remains critically endangered.
Purism has been described as a barrier to endangered language revitalization, but paradoxically purists are often also key revitalizers. Language choices and inner-group language conflicts are sociopsychological phenomena and thus outside the core of descriptive linguistic research. The evolution of linguistic structure is nevertheless clearly related to the social function of language. The present chapter discusses purism at the interface between language sociology and sociolinguistics, specifically from the perspective of documentary linguistics.
Based on data about the use of exclusive focus particles in the seriously endangered Kildin Saami language, we show how puristic attitudes can affect the actual language performance of recorded speakers and potentially give rise to language variation and change. We also discuss the question of how this kind of synchronic variation can be accounted for in the documentation and description of an endangered language.
In: Haspelmath, Martin & Tadmor, Uri (eds.) World Loanword Database, 2009
The vocabulary contains 1467 meaning-word pairs ("entries") corresponding to core LWT meanings from the recipient language Kildin Saami. The corresponding text chapter was published in the book Loanwords in the World's Languages. The language page Kildin Saami contains a list of all loanwords arranged by donor languoid.
Brief essay with a commentary on Elisabeth Scheller’s recent observations regarding the (non-)representation of partitive grammar in the limited teaching materials available for Standard Written Kildin Saami.
Christian texts are known to have been printed in Kola Saami languages since 1828; the most extensive publication is the Gospel of Matthew, which has been translated three times, most recently in 2022. The Lord's Prayer was translated in several more versions in Kildin Saami and Skolt Saami. All of these texts seem to go back to translations from Russian. These characteristics make Kola Saami Christian publications well suited for parallel text alignment. This paper describes my ongoing work on building a Kola Saami Christian Text Corpus, including conceptual and technical decisions as well as preliminary linguistic observations based on these data. Thus, I describe a resource for computational linguistics, rather than a computational study. However, computational studies based on these data will hopefully take place in the near future, after the Kildin Saami subset of this corpus is finished and published by the end of 2024. In addition to computation, this resource will also allow for comparative linguistic studies on diachronic and synchronic variation and change in the Kola Saami languages.
This chapter deals with digital visual methods in field research on indigenous Arctic societies. It discusses specifically how digital humanities can contribute to the work done by linguists and social scientists in ways to make their research material have a more lasting legacy.
This is the first thorough English language description of Kildin Saami and the first attempt to systematically describe the basic phonological, morphological and syntactic features of this language from the perspective of general comparative linguistics. Kildin Saami is a critically endangered language of the Kola Peninsula in Northwest Russia with only about 100 active speakers. The original dialect areas have fragmented during the 20th century. Within the Saami group and Uralic in general, a remarkable feature of Kildin Saami is the high number of consonant phonemes. This is mostly due to the existence of palatalization as a distinct phonological feature. Kildin Saami has developed a very high degree of fusion in inflectional morphology, including the occurrence of several different kinds of nonlinear morphological marking. In verb inflection, there is a special form for impersonal passive. Adjectives are marked for attributive and predicative state and the language has split marking of plural, where numerals above six govern partitive case. Russian influence is found in essentially all components of Kildin Saami language structure, but is especially strong in discourse pragmatics. The chapter includes a glossed text example and extensive references to earlier linguistic literature and other sources on Kildin Saami.
This is a brief linguistic introduction to Kildin Saami, a seriously endangered indigenous Uralic language of Russia. The list of references is not exhaustive, but includes the most important works on the topic for further reading.
This paper is a result of collaborative work between language documentation, language technology and corpus-based sociolinguistics and includes both applied and theoretical research aspects. We aim at contributing to diversity linguistics by improving current methodology for the building of comprehensive databases for future research on endangered and little-studied speech communities. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Ultimately, the multimodal corpora we create will be useful for quantitative investigations into synchronic variation and change in Kildin Saami.
The quantitative evaluation of data from our corpus provides significant evidence for proving preliminary claims that the use of borrowed vs. native function words is not dialectal or the result of individual speaker preferences, but determined by the choice of text modi or registers: whereas the native formatives are used consistently in formal written texts, borrowed formatives occur most typically in informal spoken texts. Furthermore, the degree of speakers’ language loss is also significant: fully competent speakers use borrowed function words much less regularly, which is significant even in informal speech.
This is the first comprehensive volume to compare the sociolinguistic situations of minorities in... more This is the first comprehensive volume to compare the sociolinguistic situations of minorities in Russia and in Western Europe. As such, it provides insight into language policies, the ethnolinguistic vitality and the struggle for reversal of language shift, language revitalization and empowerment of minorities in Russia and the European Union. The volume shows that, even though largely unknown to a broader English-reading audience, the linguistic composition of Russia is by no means less diverse than multilingualism in the EU. It is therefore a valuable introduction into the historical backgrounds and current linguistic, social and legal affairs with regard to Russia’s manifold ethnic and linguistic minorities, mirrored on the discussion of recent issues in a number of well-known Western European minority situations.
Content Level » Research
Keywords » Aboriginal culture in Northern Russia - Basque language - Finnic minorities of Ingria - Frisian - Global biodiversity in the early 21st century - Global extinction of languages - Languages in the Russian Federation - Latgalian - Linguistic Rights of National Groups - Minority languages and cultures - Scottish Gaelic - Sociolinguistic ethnolinguistic variation - Sorbian languages in Germany - Sámi languages in Finland - languages in Mari El - languages in Udmurtia - linguistic and cultural diversity - minority language speakers - revitalization of endangered languages - saami languages
The existence of a morphological partitive in the case inventory of the easternmost Saami varieties is a well-known phenomenon. Typically, however, descriptions focus on partitive use governed by higher numerals and other quantifiers. Existing descriptions are also incomplete regarding the morphology of this case and do not present complete inflectional paradigms for all types of nominals; partitive forms are missing particularly from the paradigms of various pronominal forms. The syntactic environments in which the partitive is governed have not yet been described systematically either. According to our data, the partitive is governed a) by quantifiers in noun phrases, b) by comparatives in adjective phrases, and c) by several adpositions. In addition, the partitive can occur as an agreement feature inside noun phrases. Whereas the main aim of this paper is linguistically sound descriptive grammaticography, the overlap between our work and prescriptive language planning will also be discussed.
Slides for a presentation at the Workshop on Endangered Languages and their Literatures in Bielefeld, 11 July 2016
This paper is a result of collaborative work between language documentation, language technology and corpus-based sociolinguistics and includes both applied and theoretical research aspects. We aim at contributing to diversity linguistics by improving current methodology for the building of comprehensive databases for future research on endangered and little-studied speech communities. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Ultimately, the multimodal corpora we create will be useful for quantitative investigations into synchronic variation and change in Kildin Saami.
Kildin Saami (Glottolog: kild1236) is a seriously endangered and under-described language spoken actively by no more than a few hundred speakers on the Kola Peninsula in North-West Russia. Although the number of speakers is decreasing rapidly, recent language planning and revitalization attempts have opened new domains of language use, for instance in print, radio and social media (Rießler 2014). Changes induced by this language shift to Russian have been documented on all levels of linguistic structure in Kildin Saami: phonology, morphology, syntax, discourse-pragmatics and lexicon (Rießler 2007; Rießler 2009; Blokland and Rießler 2011) and variation and change in function words specifically are dealt with in Karvovskaya (2011), Zhivotova (2014a), Zhivotova (2014b), and Kotcheva and Rießler (2015). Rießler and Karvovskaya (2013) have argued (based on the use of focus particles) that not only shift-induced language attrition is responsible for synchronic variation and change in Kildin Saami, but language planning and the introduction of new (mostly written) domains of language use, typically occupied by language activists with puristic attitudes, can lead to “revitalization-induced” language change, too.
Our investigation into variation and change in function words is based on a comprehensive corpus of spoken or written text modi of formal or informal registers originating from speakers for whom we also have comprehensive speaker biographies. Our spoken text data (transcribed in Standard Kildin Saami orthography) as well as the written text data are stored in a similar XML format, using ELAN. The program allows for transcription, calculation of basic frequency statistics and the creation of concordances in addition to extracting, coding and preparing data for statistical analysis using R. Unlike many endangered language documentation projects, which annotate manually or semi-manually, we apply more automated corpus data annotation, specifically a part-of-speech tagger based on finite state transducer technology and programmed in collaboration with the Center for Sámi Language Technology (cf. Blokland, Gerstenberger, Fedina, Partanen, Rießler, and Wilbur 2014+).
The quantitative evaluation of data from our corpus provides significant evidence supporting the preliminary claims by Zhivotova (2014a) and Zhivotova (2014b) that the use of borrowed vs. native function words is not dialectal or the result of individual speaker preferences, but determined by the choice of text modi or registers: whereas the native formatives are used consistently in formal written texts, borrowed formatives occur most typically in informal spoken texts. Furthermore, the degree of speakers’ language loss is also significant: fully competent speakers use borrowed function words much less regularly, a difference that is significant even in informal speech.
We can also show that “purist variants” (Rießler and Karvovskaya 2013) are relevant for sociolinguistic description. In small speech communities undergoing revitalization, such variants might, in fact, be the potential survivors of language variation and change. Therefore, the documentary linguist needs to be sensitive to all kinds of variation in the data in order to make the documentation as complete and useful as possible.
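The corpus workflow mentioned in this abstract (texts stored in ELAN's XML format, basic frequency statistics, concordances) can be illustrated with a minimal Python sketch. It is not the project's own tooling: the tier name `orth@spk1`, the placeholder tokens, and the helper functions `tier_tokens` and `concordance` are assumptions made for illustration only, and the embedded sample is a heavily reduced stand-in for a real `.eaf` file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Reduced stand-in for an ELAN .eaf document. The tier ID and the tokens
# ("fw" marks a hypothetical function word) are invented placeholders.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="orth@spk1" LINGUISTIC_TYPE_REF="orthT">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>alpha fw beta</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts3" TIME_SLOT_REF2="ts4">
        <ANNOTATION_VALUE>gamma fw fw delta</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def tier_tokens(eaf_xml, tier_id):
    """Return the lowercased, whitespace-split tokens of one tier."""
    root = ET.fromstring(eaf_xml)
    tokens = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            tokens.extend((value.text or "").lower().split())
    return tokens


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


tokens = tier_tokens(SAMPLE_EAF, "orth@spk1")
frequencies = Counter(tokens)          # basic frequency statistics
kwic = concordance(tokens, "fw", width=1)  # keyword-in-context lines
print(frequencies.most_common(3))
print(kwic)
```

Splitting on whitespace is a deliberate simplification; a real corpus would need tokenization that handles punctuation and the orthographic conventions of the transcription.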
The talk describes work-in-progress by the Kola Saami language documentation project. Our project records and annotates spoken language data in order to provide comprehensive speech corpora as databases for future research on and for this endangered – and under-described – Uralic speech community. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Ultimately, the multimodal corpora created by our project will be useful for scientifically significant quantitative investigations on this language in the future.
Papers by Michael Rießler
Objectives:
Distinguishing between language mixing and language fusion is a non-trivial task, particularly in situations of long-standing bilingualism. The main goal of this paper is thus to propose and test a methodology for discerning language fusion from conventionalized mixing. In addition, we examine the hypothesis that the fusion of unbound elements evolves from alternational mixing.
Design:
The paper addresses the goals through a distributional analysis of a vernacular variety of Kildin Saami, a seriously endangered East Saamic (Uralic) language spoken on the Kola Peninsula in Northwest Russia, as a partially fused lect due to contact with Russian.
Data and Analysis:
A one-hour recording of an informal group conversation with three native speakers, comprising some 10,000 word tokens, was transcribed and annotated for Russian-origin items. For comparison, other available speech samples, documenting the earlier stages of the language development, as well as the few existing grammatical descriptions and dictionaries were referred to.
Findings:
The paper develops and showcases three diagnostic criteria indicative of language fusion: (a) regularization of the donor language items’ usage patterns in the mixed variety; (b) functional reduction, or functional extension, of the donor language element, and/or of its inherited native equivalent; (c) the introduction of new constructions involving the donor language grammatical elements by way of loan translation. Finally, we report multiple parallels existing between the distribution of Russian-origin items in vernacular Kildin Saami and alternational mixing.
Originality:
This paper is the first to propose and systematically test diagnostic criteria indicative of language fusion in a situation of long-term bilingualism.
Significance:
The proposed criteria may reliably be employed as indicators of fusion in future studies of contact varieties with little, or undocumented, linguistic histories. Furthermore, in contrast to the mainstream assumption, this study also provides evidence for the claim that alternational mixing can be a starting point for the emergence of a fused lect.
Our projects work in the endangered language documentation framework and record new spoken language data, digitize available recordings and annotate these multimedia data in order to provide comprehensive language corpora as databases for future research on and for endangered and under-described Uralic speech communities. Applying NLP methods in language documentation – specifically rule-based morphological and syntactic analyzers – helps us to create more systematically annotated corpora, rather than eclectic data collections.
We propose a step-by-step approach to reach higher-level annotations by using and improving truly computational methods.
This is unlike the mainstream, which prefers manual or semi-manual work.
Ultimately, the spoken corpora created by our projects will be useful for scientifically significant quantitative investigations on these languages in the future.
While working with a spoken variety and in the framework of documentary linguistics, we apply language technology methods and tools that have so far been applied only to normalized written languages.
Specifically, we describe a script providing interactivity between ELAN, a Graphical User Interface tool for annotating and presenting multimodal corpora, and different morphosyntactic analysis modules implemented as Finite State Transducers and Constraint Grammar for rule-based morphosyntactic tagging and disambiguation.
Our aim is to challenge current manual approaches in the annotation of language documentation corpora.
The languages under investigation are Kildin Saami, Skolt Saami and North Saami. Our data supports an analysis of complementizers in Kildin, Skolt and North Saami in terms of epistemic contrast. In Skolt and North Saami, we have on the one hand an epistemically neutral marker which does not pose any semantic restrictions upon the truth value of the proposition in the clause it introduces. On the other hand, we have an emerging complementizer used in cases where the truth of the proposition in its clause is uncertain. In Kildin Saami, the complementizers are better described as indicating certainty vs. uncertainty.
Based on data about the use of exclusive focus particles in the seriously endangered Kildin Saami language, we show how puristic attitudes can affect the actual language performance of recorded speakers and potentially give rise to language variation and change. We also discuss the question of how this kind of synchronic variation can be accounted for in the documentation and description of an endangered language.