A translator on the trail of terms, or in search of uparciuchy, patrzydła and czyny przestępne

Abstract: One of the greatest challenges in translating specialised texts is the terminology of the given field. The chapter discusses terminology work from the translator's point of view: methods of creating and searching for terms, and the use of terminological resources available on the Internet. It also presents online tools helpful in building term lists, as well as terminological databases for selected fields.

Introduction: One of the greatest difficulties in the process of translating specialised texts is terminology. The problem is not only finding suitable equivalents in the target language, but also understanding the terms in the source language. In the case of specialised texts, e.g. legal ones, finding terminological equivalents is often impossible because of incompatible legal systems with distinct networks of concepts. Terminological competence therefore consists not only in skilfully locating terms, but also in placing them in a context comprehensible to the recipient. One often hears the question: who will translate a specialised text, e.g. a medical one, better: a physician with a command of a foreign language, or a professional translator who has studied the field and already has experience in translating medical texts? The answer is not clear-cut and depends on many factors, e.g. whether the recipient will be a specialist or a layperson, or whether the translation is to serve only the recipient's own needs or is to be the official version in the given language used in specialist communication. The best solution…
Worldwide, semi-automatically extracting terms from corpora is becoming the norm for the compilation of terminology lists, term banks or dictionaries for special purposes. If African-language terminologists are willing to take their rightful place in the new millennium, they must not only take cognisance of this trend but also be ready to implement the new technology. In this article it is advocated that the best way to do so at this stage is to opt for computationally straightforward alternatives (i.e. use 'raw corpora') and to make use of widely available software tools (e.g. WordSmith Tools). The main aim is therefore to discover whether or not the semi-automatic extraction of terminology from untagged and unmarked running text by means of basic corpus query software is feasible for the African languages. In order to answer this question, a full-blown case study revolving around Northern Sotho linguistic texts is discussed in great detail. The computational results are compared throughout with the outcome of a manual excerption, and vice versa. Attention is given to the concepts 'recall' and 'precision'; different approaches are suggested for the treatment of single-word terms versus multi-word terms; and the various findings are summarised in a Linguistics Terminology lexicon presented as an Appendix.
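The workflow this abstract describes can be sketched in a few lines: extract candidate terms from a raw (untagged) corpus by comparing word frequencies against a reference corpus, then evaluate the candidate list against a manual excerption using recall and precision. The toy corpora, thresholds and gold list below are invented for illustration, not the Northern Sotho data.

```python
# Hypothetical sketch: frequency-ratio candidate extraction from a raw corpus,
# evaluated with recall and precision against a manually excerpted gold list.
from collections import Counter

def candidates(domain_tokens, reference_tokens, min_ratio=2.0, min_freq=2):
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    d_total, r_total = sum(dom.values()), sum(ref.values())
    out = set()
    for word, f in dom.items():
        if f < min_freq:
            continue
        rel_dom = f / d_total
        rel_ref = (ref[word] + 1) / (r_total + 1)  # add-one smoothing
        if rel_dom / rel_ref >= min_ratio:
            out.add(word)
    return out

def recall_precision(extracted, gold):
    tp = len(extracted & gold)
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(extracted) if extracted else 0.0
    return recall, precision

# invented mini-corpora and gold standard
domain = "noun phrase noun verb morpheme noun phrase verb the the".split()
reference = "the a verb of and the to in the a of and".split()
gold = {"noun", "phrase", "morpheme"}
ext = candidates(domain, reference)
print(sorted(ext), recall_precision(ext, gold))
```

As in the article's comparison of automatic and manual results, the low-frequency gold term ("morpheme") is missed by the frequency filter, so recall suffers while precision stays high.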
Abstract: We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all ...
Knowledge of technical vocabulary has become increasingly important over the last few decades, along with advances in various subject disciplines. ESP teachers and book authors need to know which words count as technical vocabulary when creating ESP learning materials, and LSP lexicographers need to know how to determine technical vocabulary when deciding the entries for LSP dictionaries. This paper examines four methods that have been used for determining technical vocabulary: vocabulary classifications, keyword analysis, term extraction, and systematic classifications. A financial text taken from the Chartered Financial Analyst book is used to analyze the merits and demerits of each method. In view of the problems with these four methods, the paper concludes with a proposal for a hybrid method of determining technical vocabulary. This hybrid method not only presents a better way to determine technical vocabulary but also expands the concept of words in order to obtain a more comprehensive coverage of the items that constitute technical vocabulary.
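Of the four methods surveyed, keyword analysis is the most mechanical: a word is a keyword candidate if it is unusually frequent in the specialised text relative to a general reference corpus, commonly scored with Dunning-style log-likelihood. The sketch below illustrates that scoring; the frequencies for the word "equity" are invented, not taken from the paper's CFA text.

```python
# Hedged illustration of keyword analysis: Dunning log-likelihood keyness
# comparing a study corpus against a reference corpus. Counts are invented.
import math

def log_likelihood(a, b, c, d):
    """a, b: frequency of the word in study/reference corpus;
    c, d: total tokens in study/reference corpus."""
    e1 = c * (a + b) / (c + d)   # expected frequency in study corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# "equity": 40 hits in a 10,000-word financial text,
# 5 hits in a 1,000,000-word general reference corpus.
score = log_likelihood(40, 5, 10_000, 1_000_000)
print(round(score, 1))  # a high score flags the word as a keyword candidate
```

A score far above the chi-square critical value (about 3.84 at p < 0.05 with one degree of freedom) marks the word as key, which is exactly where the method's weakness lies: it flags frequent general words as readily as genuine technical terms.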
In this paper, we study how single-word term extraction and bilingual lexical alignment can be used and combined to assist terminologists when they compile bilingual specialized dictionaries. Two specific tools, a term extractor called TermoStat and a sentence and lexical aligner called Alinea, are tested in a specific project whose aim is the development of an English-French dictionary on climate change. We analyze the results of lexical alignment based on a typology of terminology equivalents. We first extracted French candidate terms that were then submitted to the lexical aligner. The results show that the use of these tools proves to be a valuable asset for compiling bilingual dictionaries. Most equivalents provided by the aligner were valid, and the tool was able to locate several valid English equivalents (some of which were structurally different) for candidate terms.
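The abstract does not spell out how Alinea proposes equivalents, so the following is a simplified stand-in for the lexical-alignment step: rank candidate English equivalents for a French term by the Dice coefficient of their co-occurrence across aligned sentence pairs. The tiny parallel "corpus" is invented for illustration.

```python
# Co-occurrence-based equivalent finding over aligned sentence pairs.
# This is a generic association measure, not TermoStat's or Alinea's
# actual algorithm.
from collections import Counter

def dice_equivalents(term, pairs):
    """pairs: list of (french_tokens, english_tokens) aligned sentences.
    Returns English words ranked by Dice overlap with the French term."""
    f_count = sum(1 for fr, _ in pairs if term in fr)
    cooc, e_count = Counter(), Counter()
    for fr, en in pairs:
        for w in set(en):
            e_count[w] += 1
            if term in fr:
                cooc[w] += 1
    scores = {w: 2 * c / (f_count + e_count[w]) for w, c in cooc.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

pairs = [
    ("le rechauffement climatique s accelere".split(),
     "global warming is accelerating".split()),
    ("les effets du rechauffement sont visibles".split(),
     "the effects of warming are visible".split()),
    ("le climat change".split(), "the climate is changing".split()),
]
ranked = dice_equivalents("rechauffement", pairs)
print(ranked[:3])
```

Here "warming" wins because it appears in every sentence pair containing "rechauffement" and nowhere else, which mirrors how valid equivalents surface at the top of an aligner's candidate list.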
Ontology learning has four main successive mining phases: term, concept, relation and axiom extraction. Since term extraction (TE) is the foundation of the ontology learning process, its inaccuracy will propagate to the following layers. Moreover, any general-purpose ontology system needs some domain-independent term extraction. To achieve domain-independent and precise TE in Farsi, this paper presents a new hybrid system based on the combination of a linguistic filter on the Ezafe construction in Farsi and a statistical filter based on the C-value algorithm.
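The statistical filter named here, C-value (Frantzi and Ananiadou), promotes candidates that occur frequently but not mainly as parts of longer candidates. A minimal sketch follows; the candidate frequencies are invented, and a real system (such as the paper's) would obtain candidates from a linguistic filter first. This variant uses log2(|a| + 1) so single-word candidates do not score zero.

```python
# Sketch of the C-value termhood measure over candidate terms
# (tuples of words) with invented frequencies.
import math

def c_value(cands):
    """cands: dict mapping candidate term (tuple of words) -> frequency."""
    scores = {}
    for term, freq in cands.items():
        # frequencies of longer candidates that contain this term
        nested_in = [f for other, f in cands.items()
                     if len(other) > len(term)
                     and any(other[i:i + len(term)] == term
                             for i in range(len(other) - len(term) + 1))]
        adj = freq - sum(nested_in) / len(nested_in) if nested_in else freq
        scores[term] = math.log2(len(term) + 1) * adj
    return scores

cands = {
    ("soft", "contact", "lens"): 5,
    ("contact", "lens"): 14,   # also occurs inside the longer candidate
    ("lens",): 20,
}
s = c_value(cands)
print(sorted(s.items(), key=lambda kv: -kv[1]))
```

"contact lens" outranks the more frequent "lens" because much of the latter's frequency comes from occurrences nested inside longer candidates, which is the behaviour the C-value filter is designed to capture.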
Background: Knowledge management in the European project Noesis addresses concept-based annotation and multilingual Information Retrieval of documents. Objective: Multilingual enrichment of a concept-based terminology in the medical field. Experience and evaluation in the domain of cardiovascular diseases by enriching a subset of the MeSH thesaurus in six European languages. This terminology, represented in the OWL standard ontology language, has been used for manual semantic annotation of medical texts, for ...
This work presents an external memory approach to extract the maximal repeats from whole genome sequences with the statistics of these repeats across classes, where the definition of a class is determined from the statistics to be computed. A heuristic method consisting of a bucket-sort-like approach and the Chinese term extraction approach is adopted. The bucket-sorting method is used to
In this paper, we investigate the use of a machine-learning based approach to the specific problem of scientific term detection in patient information. Lacking lexical databases which differentiate between the scientific and popular nature of medical terms, we used local context, ...
The advent of the WWW and of distributed information systems has made it possible to share documents between different users and organisations. However, this has created many problems related to the security, accessibility, rights and, most importantly, the consistency of documents. It is important that the people involved in the document management process have access to the most up-to-date versions of documents, retrieve the correct documents, and are able to update the document repository in such a way that their documents are known to others. In this paper we propose a method for organising, storing and retrieving documents based on content similarity. The method uses techniques based on information retrieval, document indexation, and term extraction and indexing. This methodology is developed for the E-Cognos project, which aims at developing tools for the management and sharing of documents in the construction domain.
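A common way to realise retrieval by content similarity over extracted and indexed terms (not necessarily E-Cognos's actual method) is cosine similarity over TF-IDF term vectors. The three miniature "documents" below are invented construction-domain examples.

```python
# TF-IDF vectors plus cosine similarity for content-based retrieval.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "concrete slab load bearing wall".split(),
    "load bearing wall design".split(),
    "project budget meeting notes".split(),
]
vecs = tfidf_vectors(docs)
# documents 0 and 1 share construction terms, so they should be most similar
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In a repository setting, a query is vectorised the same way and documents are returned in decreasing order of cosine similarity, which is how "retrieve the correct documents" is typically operationalised.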
A variety of methods exist for extracting terms and relations between terms from a corpus, each of them having strengths and weaknesses. Rather than just using the joint results, we apply different extraction methods in a way that the results of one method are input to another. This gives us the leverage to find terms and relations that otherwise
This paper shows how a Web-based environment developed for language teaching is currently being adapted and extended. It explores the implication, from the linguistic point of view, that if corpus users hope to extract interesting and useful information, working with corpora requires a firm grasp of linguistics. Corpora must thus be used to teach linguistics, especially linguistics related to NLP. A firm grounding in linguistics, acquired through intelligent use of corpora, can then lead users to work more efficiently in their specific fields, such as term extraction, translation, or language teaching with corpora.
This paper presents preliminary results of our current research project DiLiA (Digital Library Assistant). The goals of the project are twofold. One goal is the development of domain-independent information extraction methods. The other is the development of information visualization methods that interactively support researchers in time-consuming information discovery tasks. We first describe issues that contribute to high cognitive load during exploration of unfamiliar research domains. Then we present a domain-independent approach to technical term extraction from paper abstracts, describe the architecture of DiLiA, and illustrate an example co-author network visualization.
In this paper we will present Corpógrafo, a mature web-based environment for working with corpora, for terminology extraction, and for ontology development. We will explain Corpógrafo's workflow and describe the most important information extraction methods used, namely its term extraction and its definition/semantic relation identification procedures. We will describe current Corpógrafo users and present a brief overview of the XML format currently used to export terminology databases. Finally, we present future improvements for this tool.
A key data preparation step in Text Mining, term extraction selects the terms, or collocations of words, attached to specific concepts. In this paper, the task of extracting relevant collocations is achieved through a supervised learning algorithm, exploiting a few collocations manually labelled as relevant or irrelevant. The candidate terms are described by 13 standard statistical criteria. From these examples, an evolutionary learning algorithm termed Roger, based on the optimization of the Area Under the ROC Curve criterion, learns an ordering of the candidate terms. The robustness of the approach is demonstrated on two real-world applications in different domains (biology and human resources) and different languages (English and French).
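The criterion Roger optimises, the Area Under the ROC Curve, is simply the probability that a randomly chosen relevant candidate is ranked above a randomly chosen irrelevant one. The sketch below shows only the evaluation side of that objective, on an invented list of scored candidates, not the paper's learner itself.

```python
# AUC of a scoring over candidates labelled relevant/irrelevant,
# computed directly from its pairwise-ranking definition (ties count 0.5).
def auc(scored):
    """scored: list of (score, is_relevant) pairs."""
    pos = [s for s, y in scored if y]
    neg = [s for s, y in scored if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# invented candidate scores: three relevant, two irrelevant collocations
scored = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
print(auc(scored))
```

An AUC of 1.0 would mean every relevant collocation outranks every irrelevant one; an evolutionary learner like Roger searches for a scoring function that pushes this value as high as possible on the labelled examples.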