International communication requires the use of language technologies in the translation process and highlights the question of electronic resources and tools. Electronic dictionaries, corpora, translation memories, terminology bases and machine translation are some of the technologies used in translation in the EU. The paper aims to show the importance of language technologies for achieving multilingual standards and for including Croatian modules in the EU's multilingual communication.
The paper describes a methodology for bilingual terminology extraction and termbase building based on terminological, lexical and pragmatic criteria, along with the translator's knowledge and experience. The research is conducted on a sentence-aligned, million-word Croatian-English parallel corpus of legislative texts, the first larger corpus designed for this language pair so far. To assess the hybrid (statistical and linguistic) approach and the tools for automatic term extraction, the automatically obtained lists of term candidates are compared with a manually created reference list. The term extraction covers multi-word units and single-word units corresponding to multi-word ones. The tools used in this research are SDL Trados WinAlign (sentence alignment), SDL MultiTerm Extract and WordSmith (statistically based term extraction), and NooJ (linguistically based environment). The evaluation is reported using the statistical measures of precision, recall and F-measure. Language resources covering a specific domain speed up the translation process, reduce cost and time, and enable communication across different languages and cultures. Their application also greatly facilitates machine translation and computer-assisted translation, information retrieval, and the building of multilingual term bases, glossaries and other resources that are a prerequisite for the development of a language with insufficient linguistic resources, such as Croatian.
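The comparison of an automatically extracted candidate list with a manual reference list can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: the function name `evaluate_term_extraction`, the case-insensitive exact matching, and the sample terms are all assumptions made for the example.

```python
def evaluate_term_extraction(candidates, reference):
    """Return (precision, recall, f_measure) for a list of term candidates.

    Matching is exact after lowercasing and trimming whitespace; this is an
    assumption of the sketch, not a rule taken from the paper.
    """
    cand = {t.strip().lower() for t in candidates}
    ref = {t.strip().lower() for t in reference}
    true_positives = len(cand & ref)

    # Precision: share of extracted candidates that are valid terms.
    precision = true_positives / len(cand) if cand else 0.0
    # Recall: share of reference terms that the tool found.
    recall = true_positives / len(ref) if ref else 0.0
    # F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical sample data, for illustration only.
candidates = ["legal act", "member state", "the following", "court"]
reference = ["legal act", "member state", "court", "official gazette", "regulation"]
precision, recall, f_measure = evaluate_term_extraction(candidates, reference)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```

With this sample, three of the four candidates appear in the five-item reference list, so precision is 0.75 and recall is 0.60.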
This paper describes and evaluates the performance of a semi-automatic authoring tool (SAAT) for knowledge extraction in the AC&NL Tutor, highlighting its strengths and weaknesses. We assessed the accuracy of automatic annotation tasks (part-of-speech tagging, named entity recognition, dependency parsing, and coreference resolution) performed on a dataset of 160 sentences from unstructured Wikipedia text on a computer. We compared the automatic annotations to the gold standard, created after human post-editing and validation. Human-error analysis included 3769 words, 582 subsentences, 1129 questions, 917 propositions, 1020 concepts, and 667 relations. It resulted in an error type classification and a set of custom rules further used for automatic error identification and correction. The results showed that an average of 68.7% of the error corrections referred to CoreNLP performance and 31.3% to the SAAT extraction algorithms. Our main contributions include an integrated approach t...
Consistent terminology can positively influence communication, information transfer, and proper understanding. In multilingual written communication, the challenges are compounded by translation variants. The main aim of this study was to implement the Herfindahl-Hirschman Index (HHI) for the evaluation of translated terminology in parallel corpora. The research was conducted on three types of legal domain subcorpora dating from different periods: the Croatian-English parallel corpus (1991–2009), Latin-English and Latin-Croatian versions of the Code of Canon Law (1983), and English and Croatian versions of the EU legislation (2013). After the terminology extraction process, validation of term candidates was performed, followed by an evaluation. Terminology consistency was measured using the HHI, a commonly accepted measure of market concentration. Results show that the HHI can be used for measuring terminology consistency to ...
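The idea of applying a concentration index to translation variants can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: it uses the standard HHI definition (the sum of squared shares), treats each translation variant of one source term as a "competitor", and the function name `terminology_hhi` and the sample counts are hypothetical.

```python
def terminology_hhi(variant_counts):
    """Herfindahl-Hirschman Index over the translation variants of one term.

    variant_counts maps each translation variant to its corpus frequency.
    The result lies between 1/n (n equally frequent variants) and 1.0
    (a single variant used throughout, i.e. fully consistent terminology).
    """
    total = sum(variant_counts.values())
    if total <= 0:
        raise ValueError("at least one occurrence is required")
    # Square each variant's share of all occurrences and sum the squares.
    return sum((count / total) ** 2 for count in variant_counts.values())


# Hypothetical frequencies of two English variants for one source term.
counts = {"regulation": 8, "ordinance": 2}
print(round(terminology_hhi(counts), 2))
```

Here the shares are 0.8 and 0.2, so the index is 0.64 + 0.04 = 0.68; a term always rendered by the same variant would score 1.0.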
Angelina Gaspar, Faculty of Humanities and Social Sciences, Catholic Faculty of Theology, University of Split, Croatia. Abstract: This paper presents a corpus-based approach to the semi-automatic extraction of English phrasal verbs, which are very productive but complex and often non-transparent lexical units. Extraction proceeds via the particles (prepositions, adverbs) they consist of, which are among the top-ranking function words in the list of running words of the British National Corpus (BNC). The research is carried out on a comparable English corpus of publicly available legal texts consisting of 392 255 words, using WordSmith Tools 6.0. The efficiency of the system is evaluated via the statistical measures of precision, recall and F-measure, and the list of phrasal verbs is checked against the reference source Cambridge Phrasal Verbs Dictionary (2015). The results show that the process of semi-automatic extraction of phrasal verbs requires a considerable human intervention as well as control...
Papers by Angelina Gaspar