Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Rule-Based Machine Translation for the Italian–Sardinian Language Pair

The Praga Bulletin of Mathematical Linguistic, 2017
This paper describes the process of creation of the first machine translation system from Italian to Sardinian, a Romance language spoken on the island of Sardinia in the Mediterranean. The project was carried out by a team of translators and computational linguists. The article focuses on the technology used (Rule-Based Machine Translation) and on some of the rules created, as well as on the orthographic model used for Sardinian....Read more
The Prague Bulletin of Mathematical Linguistics NUMBER 108 JUNE 2017 221–232 Rule-Based Machine Translation for the Italian–Sardinian Language Pair Francis M. Tyers, ab Hèctor Alòs i Font, a Gianfranco Fronteddu, e Adrià Martín-Mor d a UiT Norgga árktalaš universitehta, Tromsø, Norway b Arvutiteaduse instituut, Tartu Ülikool, Tartu, Estonia c Universitat de Barcelona, Barcelona d Universitat Autònoma de Barcelona, Barcelona e Università degli Studi di Cagliari, Cagliari Abstract This paper describes the process of creation of the irst machine translation system from Italian to Sardinian, a Romance language spoken on the island of Sardinia in the Mediterranean. The project was carried out by a team of translators and computational linguists. The article focuses on the technology used (Rule-Based Machine Translation) and on some of the rules created, as well as on the orthographic model used for Sardinian. 1. Introduction This paper presents a shallow-transfer rule-based machine translation (MT) sys- tem from Italian to Sardinian, two languages of the Romance group. Italian is spoken in Italy, although it is an oicial language in countries like the Republic of Switzer- land, San Marino and Vatican City, and has approximately 58 million speakers, while Sardinian is spoken principally in Sardinia and has approximately one million speak- ers (Lewis, 2009). The objective of the project was to make a system for creating almost-translated text that needs post-editing before being publishable. For translating between closely- related languages where one language is a majority language and the other a minority or marginalised language, this is relevant as MT of post-editing quality into a lesser- resourced language can help with creating more text in that language. As described below, Sardinian is not a fully-standardised language. This means that linguistic resources are scarce, even if the orthographic norm chosen for this © 2017 PBML. Distributed under CC BY-NC-ND. Corresponding author: francis.tyers@uit.no Cite as: Francis M. Tyers, Hèctor Alòs i Font, Gianfranco Fronteddu, Adrià Martín-Mor. Rule-Based Machine Trans- lation for the Italian–Sardinian Language Pair. The Prague Bulletin of Mathematical Linguistics No. 108, 2017, pp. 221–232. doi: 10.1515/pralin-2017-0022.
PBML 108 JUNE 2017 project was the Limba Sarda Comuna (Common Sardinian Language, or LSC), the one oicially approved by the island’s autonomous government in 2006. In fact, the main aim of the project was to create a tool that would foster text production in Sardinian, especially in areas such as administration and Wikipedia. The remainder of the article is laid out as follows: In section 2 we provide some linguistic background to Sardinian. This is followed by a description of the platform used to build the MT system in section 3. Section 4 describes the development of the system, including resources that were reused. Then section 5 gives an evaluation of the system. Finally, we comment on possible future work in section 6 and give some conclusions in section 7. 2. Sardinian The Sardinian language is a Romance language spoken by approximately one mil- lion people on the island of Sardinia, together with other Romance languages such as Tabarchino Ligurian (on the islands of San Pé and Sant’Antióccu), Algherese Catalan (in the city of L’Alguer), Sassarese (in the city of Sassari) and Gallurese Corsican (in Gaddùra). 1 At the institutional level, some of these languages are recognised by the regional government. However, the use of Sardinian language is virtually non-existent at any educational level, as well as in many ields of the public sphere (media, newspapers, administration, etc.). Still, the use of Sardinian is widespread. According to (Oppo, 2007) only 2.7% per cent of the population in Sardinia does not have any competence (either active or passive) in “any local language”. Sardinian, classiied as “deinitely endangered” by UNESCO, 2 is spoken across most of the island despite the fact that, because of its great internal variety, two macro- varieties are often distinguished: northern (Logudorese and Nuorese) and southern (Campidanese). The existence of these two macro-varieties is one of the controversial factors when it comes to the standardisation of the language. At present, there are movements who advocate for diferent standardisation models and which, broadly, correspond to northern and southern regions. On the one hand, there is a group that defends a double standard, following the Norwegian model. This model, which is basically followed in the south, has received endorsement by the provincial government of Casteddu, which has oicially adopted a “southern” standard described in the document Arrègulas po ortograia, fonètica, mor- fologia e fueddàriu de sa Norma Campidanesa de sa Lìngua Sarda (Comitau Scientìicu po sa Norma Campidanesa de su Sardu Standard, 2009). On the other hand, the Limba Sarda Comuna (LSC) has been proposed as the standard form for all varieties of Sar- dinian. It is an evolved version of the Limba Sarda Uniicada (LSU), which was in turn the result of an experts’ committee called by the Sardinian government in 2001. 1 Toponyms are written in the local languages. There are, apart from these, other linguistic islands which result from migrations, such as Venetian and Romanisku. 2 http://www.unesco.org/languages-atlas/en/atlasmap/language-id-337.html and www.unesco.org/ culture/languages-atlas/en/atlasmap/language-id-381.html 222
The Prague Bulletin of Mathematical Linguistics NUMBER 108 JUNE 2017 221–232 Rule-Based Machine Translation for the Italian–Sardinian Language Pair Francis M. Tyers,ab Hèctor Alòs i Font,a Gianfranco Fronteddu,e Adrià Martín-Mord a b UiT Norgga árktalaš universitehta, Tromsø, Norway Arvutiteaduse instituut, Tartu Ülikool, Tartu, Estonia c Universitat de Barcelona, Barcelona d Universitat Autònoma de Barcelona, Barcelona e Università degli Studi di Cagliari, Cagliari Abstract This paper describes the process of creation of the irst machine translation system from Italian to Sardinian, a Romance language spoken on the island of Sardinia in the Mediterranean. The project was carried out by a team of translators and computational linguists. The article focuses on the technology used (Rule-Based Machine Translation) and on some of the rules created, as well as on the orthographic model used for Sardinian. 1. Introduction This paper presents a shallow-transfer rule-based machine translation (MT) system from Italian to Sardinian, two languages of the Romance group. Italian is spoken in Italy, although it is an oicial language in countries like the Republic of Switzerland, San Marino and Vatican City, and has approximately 58 million speakers, while Sardinian is spoken principally in Sardinia and has approximately one million speakers (Lewis, 2009). The objective of the project was to make a system for creating almost-translated text that needs post-editing before being publishable. For translating between closelyrelated languages where one language is a majority language and the other a minority or marginalised language, this is relevant as MT of post-editing quality into a lesserresourced language can help with creating more text in that language. As described below, Sardinian is not a fully-standardised language. This means that linguistic resources are scarce, even if the orthographic norm chosen for this © 2017 PBML. Distributed under CC BY-NC-ND. Corresponding author: francis.tyers@uit.no Cite as: Francis M. Tyers, Hèctor Alòs i Font, Gianfranco Fronteddu, Adrià Martín-Mor. Rule-Based Machine Translation for the Italian–Sardinian Language Pair. The Prague Bulletin of Mathematical Linguistics No. 108, 2017, pp. 221–232. doi: 10.1515/pralin-2017-0022. PBML 108 JUNE 2017 project was the Limba Sarda Comuna (Common Sardinian Language, or LSC), the one oicially approved by the island’s autonomous government in 2006. In fact, the main aim of the project was to create a tool that would foster text production in Sardinian, especially in areas such as administration and Wikipedia. The remainder of the article is laid out as follows: In section 2 we provide some linguistic background to Sardinian. This is followed by a description of the platform used to build the MT system in section 3. Section 4 describes the development of the system, including resources that were reused. Then section 5 gives an evaluation of the system. Finally, we comment on possible future work in section 6 and give some conclusions in section 7. 2. Sardinian The Sardinian language is a Romance language spoken by approximately one million people on the island of Sardinia, together with other Romance languages such as Tabarchino Ligurian (on the islands of San Pé and Sant’Antióccu), Algherese Catalan (in the city of L’Alguer), Sassarese (in the city of Sassari) and Gallurese Corsican (in Gaddùra).1 At the institutional level, some of these languages are recognised by the regional government. However, the use of Sardinian language is virtually non-existent at any educational level, as well as in many ields of the public sphere (media, newspapers, administration, etc.). Still, the use of Sardinian is widespread. According to (Oppo, 2007) only 2.7% per cent of the population in Sardinia does not have any competence (either active or passive) in “any local language”. Sardinian, classiied as “deinitely endangered” by UNESCO,2 is spoken across most of the island despite the fact that, because of its great internal variety, two macrovarieties are often distinguished: northern (Logudorese and Nuorese) and southern (Campidanese). The existence of these two macro-varieties is one of the controversial factors when it comes to the standardisation of the language. At present, there are movements who advocate for diferent standardisation models and which, broadly, correspond to northern and southern regions. On the one hand, there is a group that defends a double standard, following the Norwegian model. This model, which is basically followed in the south, has received endorsement by the provincial government of Casteddu, which has oicially adopted a “southern” standard described in the document Arrègulas po ortograia, fonètica, morfologia e fueddàriu de sa Norma Campidanesa de sa Lìngua Sarda (Comitau Scientìicu po sa Norma Campidanesa de su Sardu Standard, 2009). On the other hand, the Limba Sarda Comuna (LSC) has been proposed as the standard form for all varieties of Sardinian. It is an evolved version of the Limba Sarda Uniicada (LSU), which was in turn the result of an experts’ committee called by the Sardinian government in 2001. 1 Toponyms are written in the local languages. There are, apart from these, other linguistic islands which result from migrations, such as Venetian and Romanisku. 2 http://www.unesco.org/languages-atlas/en/atlasmap/language-id-337.html and www.unesco.org/ culture/languages-atlas/en/atlasmap/language-id-381.html 222 F. M. Tyers et al. SL text Apertium Sardinian–Italian (221–232) deformatter morph. analyser morph. disambig. lexical transfer lexical selection structural transfer morph. generator postgenerator reformatter TL text Figure 1. The modular architecture of the Apertium MT platform. Modules communicate using Unix text pipes. In 2006, the Sardinian government adopted the LSC as a co-oicial language, alongside Italian, for the publication of oicial documents. The LSC is also the form chosen by several publishing houses and websites. The existence of these two proposals implies that all initiatives concerning the Sardinian language must irst take a stand on the issue of the standardisation model. The Sardinian Wikipedia, for instance, allows its users to mark the variety in which they write by adding a lag. In October 2016, at the time of the writing of this article, the Sardinian Wikipedia has 5,230 content pages,3 out of which 1,525 are written in Logudorese,4 776 in LSC,5 and 295 in Campidanese.6 Other digital products, such as Facebook (Beccu and MartínMor, 2017), Telegram (Martín-Mor, 2017) and Ubuntu,7 have been partially localised into Sardinian basing mainly on the LSC model. Indeed, according to Cheratzu (2015), textual and literary production in LSC is clearly greater in number than any other. Therefore, basing on textual production and resource availability, we decided to use LSC as the standard form of the Sardinian language in our project. Italian was chosen as the source language for our project. Despite the fact that linguistic resources (and competent writers) are scarce even for LSC, it was deemed appropriate, given the fragile situation of the Sardinian language, to facilitate the creation of contents in LSC from Italian (i.e., documents issued by the government, websites, newspapers, etc.). 3. Platform The system is based on the Apertium machine translation platform (Forcada et al., 2011). The platform was originally aimed at the Romance languages of the Iberian peninsula, but has also been adapted for other, more distantly related, language pairs. The whole platform, both programs and data, are licensed under the Free Software 3 https://sc.wikipedia.org/wiki/Ispetziale:Statistics 4 https://sc.wikipedia.org/wiki/Categoria:Logudoresu 5 https://sc.wikipedia.org/wiki/Categoria:Limba_Sarda_Comuna 6 https://sc.wikipedia.org/wiki/Categoria:Campidanesu 7 https://wiki.ubuntu.com/Ubuntu-Sardu/ 223 PBML 108 JUNE 2017 Foundation’s General Public Licence (GPL)8 and all the software and data for the 43 supported language pairs (and the other pairs being worked on) is available for download from the project website. 3.1. Pipeline A typical translator built with Apertium consists of 9 modules which communicate between each other using standard Unix pipes. This eases diagnosis, the insertion of new modules, etc. The modules comprise of the following: • A deformatter which encapsulates any formatting (e.g. HTML or XML tags etc.) information in the input stream. • A morphological analyser which for each surface form in the stream returns a sequence of possible analyses. • A part-of-speech tagger which out of the possible analyses for a given word returns the most probable analysis. This is based on either irst-order HMM or on HMM in combination with Constraint Grammar (Bick and Didriksen, 2015). • A lexical transfer module which for each unambiguous source language lexical form returns one or more target language lexical forms. • A lexical selection module which for each source language lexical form with more than one target language translation uses a set of rules operating on sourcelanguage context to choose the most adequate translation in the target language. • A structural transfer module which performs syntactic and morphological operations to convert the source language intermediate representation into the target language intermediate representation. Common operations include insertion, deletion and substitution of lexical units, agreement between lexical units for e.g. gender, number and case, etc. The structural transfer module calls the lexical transfer module. • A morphological generator which for each target language lexical form returns a surface (inlected) form. • A postgenerator which performs orthographic operations, for example elision (such as da+il=dal in Italian). • A reformatter which de-encapsulates any formatting, leaving it untouched. Figure 1 gives an example pipeline. The data used by these modules are by and large speciied in XML iles and compiled into binary forms for use by the modules. 4. Development The development of the Italian-Sardinian pair owes a lot to previous work on other language pairs. In this case, most of the lexical and morphological resources for Italian were taken from the Italian–Catalan pair (Toral et al., 2011), while part of the lexical and morphological resources for Sardinian was taken from the Sardinian–Catalan pair existing a prototype in the Apertium project. In parallel to our development of the 8 https://www.gnu.org/licenses/gpl-3.0.en.html 224 F. M. Tyers et al. Apertium Sardinian–Italian (221–232) Italian–Sardinian pair, developers from Prompsit Language Engineering were working on an Italian–Spanish pair, so we cooperated in the improvement of the resources for Italian. 4.1. Analysis The development began with an analysis oriented to: • collecting free linguistic resources for the dictionaries; • collecting monolingual and bilingual corpora; • systematically comparing the source and the target languages in order to understand what structural changes exist between them. The contrastive analysis between Italian and Sardinian led to more than one hundred examples of translations the translator was expected to give, but a morphemeby-morpheme translation would not, e.g. • Nella mia terra. → In sa terra mea. (“In my land”) • Bellissimi. → Bellos a beru. (“Very beautiful’) • Darmi. → Mi dare. (“To give me”) These observed diferences were used in creating the transfer rules. 4.2. Morphological dictionaries The Italian morphological dictionary is, for the most part, the one used in the Italian–Catalan translator. However, some work has been done to extend and ix verbal paradigms. In addition, some 2,000 lemmas were added from the free/opensource resource Morph-it (Zanchetta and Baroni, 2005). A irst version of the Sardinian morphological dictionary already existed. It was based on the “experimental” norms of LSC (Regione Autonoma della Sardegna, 2006). It was augmented with data from the spell checker provided by the regional government of Sardinia.9 An important lack of proper nouns in the spell checker was detected, so we partially solved it adding a few hundreds of the most common person and family names in Sardinia, as well as the names of all Sardinian municipalities and Italian regions. It is worth adding that many place names are not yet standardised, e.g. the names of the countries and capitals. We added a few of the most common. 4.3. Morphological disambiguation Romance languages have a fair amount of morphological ambiguities. Fortunately for developers of rule-based machine translation systems between these languages, they share most ambiguities, so most of the time selecting the wrong morphological analysis does not imply a bad translation, a free ride. For instance, this is generally the case for words inishing in -ista (like comunista, ‘communist’) that may be both adjectives or nouns. Since this ambiguity happens to be in both the source and the 9 http://www.sardegnacultura.it/cds/cros/ 225 PBML 108 JUNE 2017 Dictionary Sardinian Italian Sardinian–Italian Entries 51,743 35,099 25,484 Table 1. Dictionaries in the MT system. The final translator is assembled as the intersection of the entries in these dictionaries. target language, e.g. a wrong analysis of comunista as a noun in il partito comunista would still give a good translation as su partidu comunista. Probably the most frequent ambiguity in Italian, which is shared by French, Spanish and Catalan too, is la that can be both a deinite article (feminine the) or a pronoun (her). In Sardinian these two analyses have diferent forms so it was necessary to resolve the ambiguity. In addition to training the tagger on a corpus of 17,000 words from TED talks and Wikinews,10 we added a set of 30 rules using rules written using Constraint Grammar (CG) (Bick and Didriksen, 2015). CG rules for Italian mainly deal with the disambiguation between imperative verbal forms with enclitic pronouns and adjectives (e.g. centrali as ‘central’, masculine plural, or ‘centre them’), and contractions of prepositions and determiners (e.g. dalle as ‘from the.f.pl’ or ‘give.imp.2.sg them’; dai, ‘from the.m.PL’, ‘give.imp.2.sg’ or ‘give.pri.2.sg’; dei, ‘of the.m.pl’ or ‘gods’).11 Not every morphological ambiguity can be easily solved. A clear case is sono, which can be “I am” or “they are”. This ambiguity does not exist in Sardinian: “I am” is so, while “they are” is sunt. Both Italian and Sardinian are pro-drop languages, the subject pronoun can be omitted since it can be almost always inferred from the context (especially from the verb form). So it happens that we often have to guess whether it is about “I” or “they” when dealing with sono. By default we assume third person based on our target domain of encyclopaedic texts. 4.4. Transfer lexicon The transfer lexicon was one of the tasks of the project that has taken longer because of the lack of free bilingual dictionary. In total 25,484 lemmas have been added to the bilingual dictionary, about a half of them by hand using frequency lists of words. Most of the time Antonino Rubattu’s Universal Dictionary Italian-Sardinian and Mario Casu’s Logudorese-Italian vocabulary were consulted. However, when using the dictionaries we made eforts to choose a form which was also found in the LSC spell checker. 10 Corpus provided by Prompsit Language Engineering, http://www.prompsit.com = masculine, f = feminine, sg = singular, PL = plural, imp = imperative, pri = present of indicative, 2 = second person. 11 226 F. M. Tyers et al. Apertium Sardinian–Italian (221–232) 4.5. Lexical selection Because of the short time in which the translator was developed only 35 lexical selection rules have been added. The lack of bilingual corpora did not allow us to automatically infer any rules. For instance, a diicult case is the word corso, which may be both “street” and “Corsican”. Both meanings are found often in similar contexts and have diferent translations in Sardinian. Rules deine that, if the noun is found in plural or is preceded by the preposition “in”, “Corsican” is preferred, otherwise “street” is chosen. 4.6. Structural transfer rules Apertium, as a rule, translates lemmas and morphemes one by one. Obviously, this does not always work, even for closely related languages. Structural transfer rules are responsible for modifying morphology or word order in order to produce “adequate” target language. In all, we have deined 89 such transfer rules. 4.6.1. Noun-phrase internal agreement Most of the rules deal with noun-phrase internal agreement both in gender and number. Two situations have to be distinguished. On one hand, the target language has combinations of gender and/or number that do not exist in the source language. About 8% of the nouns have been labelled in the bilingual dictionary as requiring that the gender or the number needs to be determined when translating from Italian into Sardinian. In this case, the actual gender and/or number is obtained from other words in the noun phrase. On the other hand, a noun in the target language may have a gender and/or a number diferent than in the source one. This is the case for 7% of the nouns in the bilingual dictionary. In this case, the gender and/or the number of the other words of the noun phrase must be modiied to agree with the name. 4.6.2. Possessives Possessives also require a correct delimitation of noun phrases since they must be moved from its beginning to the end (1). (1) La sua apparente indiferenza . S’ aparente indiferèntzia sua . “His apparent indiference.” 4.6.3. Tenses Tenses in Sardinian tend to be often analytical. A number of tenses which are synthetic in Italian, as well in most of the Romance languages, are conjugated in Sardinian 227 PBML 108 JUNE 2017 by means of verbal periphrasis, e.g. the future (2a) and conditional (2b) and historical. In addition, LSC does not have the absolute past tense of Italian, and uses the present perfect (2c). (2) a. Canterò Apo a cantare “I will sing” b. Canterei Dia cantare “I would sing” c. Cantai Aia cantadu “I sang” All these transformations have been done by means of speciic transfer rules. 4.6.4. Clitic pronouns In Italian clitic pronouns must be placed after the verbs in ininitive, imperative and gerund forms, as well as with past participles when used as past gerunds. Instead, in Sardinian in ininitive forms clitics should be placed before the verb. As a result, for instance cantarla (“to sing it”) must be translated as la cantare. 4.6.5. Change of the auxiliary verb In Italian the present continuous construction uses the auxiliary stare, while in Sardinian the auxiliary èssere is used instead of istare (3). (3) Io sto studiando. Deo so istudiende. “I am studying.” 4.7. Post-generation rules After the generation of the raw version of the translation some additional processing has to be done. In most of the cases, this means to apostrophise. For instance, l’accumulazione (“the accumulation”) is translated irst of all as sa acumulatzione, where a special symbol is produced by the morphological generator, warning that the word sa is liable to receive modiications. A set of rules deine in which case words in Sardinian are apostrophised. In the same way, the Sardinian words no and ne (“no” and “nor”) may be changed to non and nen according to the context. 5. Evaluation The system has been evaluated in two ways. The irst is its coverage.12 The second is the error rate of two pieces of text produced when comparing with a post-edited version of them. 12 Here coverage is deined as naïve coverage, that is for any given surface form at least one analysis is returned. This may not be complete. 228 F. M. Tyers et al. Apertium Sardinian–Italian (221–232) Corpus Wikipedia 10% UD Italian Tokens Coverage (%) 34,736,257 285,199 89.3 96.4 Table 2. Naïve vocabulary coverage. This is the percentage of tokens which receive at least one analysis from the morphological analyser. The coverage of Wikipedia is lower due to the large number of proper nouns and foreign words. Words Unknown words WER TER 9.4% 9.9% 6.3% 2,033 Table 3. Word Error Rate and unknown words over the 2,033 word test corpus. 5.1. Coverage Table 2 presents the lexical coverage of the system over two corpora. The irst was a subset of the Italian Wikipedia, which was created by randomly selecting 10% of the sentences from the Italian Wikipedia as of May 2016. The second corpus is the text from the Italian treebank in the Universal Dependencies project.13 5.2. Translation quality We measured translation quality using two metrics: Word error rate (WER), which is based on the Levenshtein distance (Levenshtein, 1966) and was calculated for using the apertium-eval-translator tool; and Translation Error Rate (TER, Snover et al. (2006)). Metrics based on word error rate have been chosen for a number of reasons. Firstly we would like to be to compare the system against systems based on similar technology, and to assess the usefulness of the system in a real setting, that is of translating for dissemination. Secondly, the reference translation is a postedition, whereas most MT evaluation metrics use pre-translated references. Using a more commonly used metric in an uncommon setting would give deceptively good results. A corpus of 2,033 words (53 sentences) was extracted from Wikipedia. The average length of a sentence was 42 words. This was the irst paragraphs of the last two texts put in the section “vetrina” (“showcase”) at the time of the GSoC inal evaluation (more or less 1000 words per text). Wikipedia texts were selected, as this is one of the major uses for Apertium translators, especially as they are used by the Wikimedia Content Translation Tool.14 The section “vetrina” is a pseudo-random selection (not done by the machine translator developers) of quality Wikipedia articles. 13 http://universaldependencies.org 14 https://www.mediawiki.org/wiki/Content_translation 229 PBML 108 JUNE 2017 The vast majority of unknown words are proper names (foreign person, family and place names) as well as foreign words (e.g. in French or English). The scores are similar to or slightly better than those for other translators in the Apertium platform for Romance languages, for example the Catalan–-Occitan system achieves a WER of 9.6% (Armentano-Oller and Forcada, 2006) and the Spanish– Aragonese 16.8%, (Martínez Cortés et al., 2012). 5.3. Qualitative evaluation Along with the quantitative evaluation of post-edition efort, we also performed a qualitative evaluation to determine where the system can be improved. Based on the inal evaluation text, we have detected two major issues: 1) incorrect disambiguation of the verb avere; and 2) the absolute past tense transfer rule. In the examples that follow, the Italian phrase is presented on the irst line, followed by the current translation into Sardinian produced by the system on the second, the correct translation on the third, and an English translation on the fourth. 5.3.1. Incorrect disambiguation of “avere” The Italian verb avere (“to have”) may be both an auxiliary and a lexical verb. These have diferent translations in Sardinian (4). The distinction between both verbs avere is done in the tagger. Nevertheless, it happens that when the auxiliary is separated from the participle by an adverb, avere is wrongly tagged as a lexical verb (5). (4) a. Ho cantato. Apo cantadu. “I have sung.” b. Ho un gatto. Tèngio unu gatu . “I have a cat.” (5) Non aver adeguatamente protetto la Francia. * Non tènnere in manera adeguada amparadu sa Frantza. Non àere in manera adeguada amparadu sa Frantza. “Not having adequately protected France”. This issue has to be solved in the morphological disambiguation step, for example using CG rules. 5.3.2. Absolute past tense As seen before, an absolute past tense exists in Italian, but not in LSC, in which the present perfect is used instead. A transfer rule constructs the past perfect adding the Sardinian auxiliary verb àere (“to have”) with the same person and number as the Italian verb and the past participle of the Sardinian translation of the lemma. Nevertheless, in Sardinian, as well as in Italian, several verbs are conjugated with the 230 F. M. Tyers et al. Apertium Sardinian–Italian (221–232) auxiliary verb “to be”, particularly the verbs of movement and the verb “to be” itself. The current transfer rule is too simple and does not take into account this fact 6a, so needs to be improved. (6) a. Sfuggì. * Aiat isfugidu. Fiat isfugidu. “He escaped.” b. Fu. * Aiat istadu. Fiat istadu. “He was.” 6. Future work Aside from ixing the problems outlined in section 5.3, we would also like to see more translation systems for Sardinian. We have an experimental system for Sardinian–Catalan which is particularly relevant as Catalan is one of the larger languages in direct contact with Sardinian. We are also interested in working on Corsican as it is also spoken in Sardinia. 7. Conclusions We have presented the irst ever MT system from Italian to Sardinian. The performance is similar to other translators created using the same technology. It translates texts suiciently well for post edition, although there remains a lot of work to do with respect to improving lexical coverage, and some work to do on improving the disambiguation and transfer rules. The system is available as free/open-source software under the GNU GPL and the may be downloaded from Apertium SVN.15 Acknowledgements We would like to thank Mikel Forcada for his constant encouragement and comments on an earlier version of this manuscript. Thanks also go out to Diegu Corràine for clariications on the LSC standard. The project was partially funded by a stipend from the Google Summer of Code. We would also like to thank Gema Ramírez Sánchez, and the anonymous reviewers. Bibliography Armentano-Oller, Carme and Mikel L. Forcada. Open-source machine translation between small languages: Catalan and Aranese Occitan. In 5th SALTMIL workshop on Minority Languages, pages 51–54, 2006. Beccu, A. and A. Martín-Mor. Sa localizatzione de Facebook in sardu. Revista Tradumàtica, 14, 2017. 15 http://www.apertium.org 231 PBML 108 JUNE 2017 Bick, Eckhard and Tino Didriksen. CG-3 – Beyond Classical Constraint Grammar. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania, pages 31–39. Linköping University Electronic Press, Linköpings universitet, 2015. Cheratzu, Francesco. Sa Chirca. In Mura, Riccardo and Maurizio Virdis, editors, Caratteri e strutture fonetiche, fonologiche e prosodiche della lingua sarda. Il sintetizzatore vocale SINTESA. 2015. Comitau Scientìicu po sa Norma Campidanesa de su Sardu Standard. Arrègulas po ortograia, fonètica, morfologia e fueddàriu de sa Norma Campidanesa de sa Lìngua Sarda, 2009. Forcada, M. L., M. Ginestí-Rosell, J. Nordfalk, J. O’Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, and F. M. Tyers. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2):127–144, 2011. Levenshtein, Vladimir I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, 1966. Lewis, M. Paul, editor. Ethnologue: Languages of the World. SIL International, Dallas, TX, USA, sixteenth edition, 2009. Martín-Mor, A. La localització de l’apli de missatgeria Telegram al sard: l’experiència de Sardware i una aplicació docent. Revista Tradumàtica, 14, 2017. Martínez Cortés, Juan Pablo, Jim O’Regan, and Francis Tyers. Free/Open Source ShallowTransfer Based Machine Translation for Spanish and Aragonese. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 2012. European Language Resources Association (ELRA). Oppo, Anna. Conoscere e parlare le lingue locali. In Oppo, Anna, editor, Le lingue dei sardi: una ricerca sociolinguistica, chapter 1, pages 6–45. Regione Autonoma della Sardegna, 2007. Regione Autonoma della Sardegna. Limba Sarda Comune. Norme linguistiche di riferimento a carattere sperimentale per la lingua scritta dell’Amministrazione regionale, 2006. URL http://www.regione.sardegna.it/documenti/1_72_20060418160308.pdf. Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the Conference of the Association for Machine Translation in the Americas, 2006. Toral, Antonio, Mireia Ginestí-Rosell, and Francis M. Tyers. An Italian to Catalan RBMT system reusing data from existing language pairs. In Sanchez-Martínez, F. and J.A. Perez-Ortiz, editors, Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation, pages 77–81, 2011. Zanchetta, Eros and Marco Baroni. Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1), 2005. ISSN 1747-9398. Address for correspondence: Francis M. Tyers francis.tyers@uit.no Giela ja kultuvvra instituhta UiT Norgga árktalaš universitehta, N-9018 Romsa, Norway 232
Keep reading this paper — and 50 million others — with a free Academia account
Used by leading Academics
Christopher Crick
Oklahoma State University
Dag I K Sjoberg
University of Oslo
Qusay F. Hassan
Álvaro Figueira, PhD
Faculdade de Ciências da Universidade do Porto