Constantin Orasan
  • Room 06LC03,
    Library and Learning Centre,
    University of Surrey, Guildford,
    Surrey, GU2 7XH, UK
As Machine Translation (MT) becomes increasingly ubiquitous, so does its use in professional translation workflows. However, its proliferation in the translation industry has brought about new challenges in the field of Post-Editing (PE). We are now faced with a need to find effective tools to assess the quality of MT systems, to avoid underpayment of and mistrust by professional translators. In this scenario, one promising field of study is MT Quality Estimation (MTQE), as it aims to determine the quality of an automatic translation and, indirectly, its degree of post-editing difficulty. However, its impact on translation workflows and on translators’ cognitive load is still to be fully explored. We report on the results of an impact study engaging professional translators in PE tasks using MTQE. To assess the translators’ cognitive load, we measure their productivity in terms of both time and effort (keystrokes) in three different scenarios: translating from scratch, post-editing...
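The study's own instrumentation is not described here, but the kind of productivity comparison it reports can be sketched as follows. This is a minimal illustration assuming hypothetical per-segment logs that record source word counts, editing time and keystrokes; the scenario labels and the `SegmentLog`/`productivity_by_scenario` names are invented for the example.

```python
# Illustrative sketch (not the study's actual tooling): compare translator
# productivity across scenarios from hypothetical per-segment logs.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class SegmentLog:
    scenario: str      # e.g. "scratch", "pe", "pe_with_qe" (assumed labels)
    src_words: int
    seconds: float
    keystrokes: int

def productivity_by_scenario(logs):
    """Return words/minute and keystrokes/word for each scenario."""
    totals = defaultdict(lambda: [0, 0.0, 0])   # words, seconds, keystrokes
    for log in logs:
        totals[log.scenario][0] += log.src_words
        totals[log.scenario][1] += log.seconds
        totals[log.scenario][2] += log.keystrokes
    report = {}
    for scenario, (words, seconds, keys) in totals.items():
        report[scenario] = {
            "words_per_minute": 60.0 * words / seconds if seconds else 0.0,
            "keystrokes_per_word": keys / words if words else 0.0,
        }
    return report

if __name__ == "__main__":
    sample = [
        SegmentLog("scratch", 20, 180.0, 240),
        SegmentLog("pe", 20, 95.0, 90),
        SegmentLog("pe_with_qe", 20, 80.0, 70),
    ]
    print(productivity_by_scenario(sample))
```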
In this paper we investigate the challenges involved in translating book reviews from Arabic into English, with particular focus on the errors that lead to incorrect translation of sentiment polarity. Our study points to the special characteristics of Arabic UGC, examines the sentiment transfer errors made by Google Translate when translating Arabic UGC into English, analyzes why the problem occurs, and proposes an error typology specific to the translation of Arabic UGC. Our analysis shows that the output of online translation tools for Arabic UGC can either fail to transfer the sentiment at all by producing a neutral target text, or completely flip the sentiment polarity of the target word or phrase and hence deliver a wrong affect message. We address this problem by fine-tuning an NMT model with respect to sentiment polarity, showing that this approach can significantly help to correct sentiment errors detected in the online translation of Arabic UGC.
This paper takes a preliminary look at the relation between verb pattern matches in the Pattern Dictionary of English Verbs (PDEV) and translation quality through a qualitative analysis of human-ranked sentences from 5 different machine translation systems. The purpose of the analysis is not only to determine whether verbs in the automatic translations and their immediate contexts match any pre-existing semanto-syntactic pattern in PDEV, but also to establish links between hypothesis sentences and the verbs in the reference translation. It attempts to answer the question of whether the semantic and syntactic information captured by Corpus Pattern Analysis (CPA) can indicate that a sentence is a “good” translation. Two human annotators manually identified the occurrence of patterns in 50 translations and indicated whether these patterns match any identified pattern in the corresponding reference translation. Results indicate that CPA can be used to distinguish between well ...
This paper explores automatic methods to identify relevant biography candidates in large databases and to extract biographical information from encyclopedia entries and databases. In this work, relevant candidates are defined as people who have made an impact in a certain country or region within a pre-defined time frame. We investigate the case of people who had an impact in the Republic of Austria and died between 1951 and 2019. We use Wikipedia and Wikidata as data sources and compare the performance of our information extraction methods on these two databases. We demonstrate the usefulness of a natural language processing pipeline to identify suitable biography candidates and, in a second stage, extract relevant information about them. Even though they are considered by many to be an identical resource, our results show that the data from Wikipedia and Wikidata differ in some cases and can be used in a complementary way, providing more data for the compilation of biographies.
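As an illustration of the candidate-selection step, the sketch below queries the public Wikidata SPARQL endpoint for people with Austrian citizenship who died between 1951 and 2019. The specific query (properties P27 and P570, item Q40) is an assumption made for this example and not necessarily the pipeline used in the paper.

```python
# Minimal sketch of candidate selection against the public Wikidata SPARQL endpoint.
# Assumed modelling: P27 = country of citizenship, Q40 = Austria, P570 = date of death.
import requests

QUERY = """
SELECT ?person ?personLabel ?dod WHERE {
  ?person wdt:P27 wd:Q40 ;          # citizenship: Austria
          wdt:P570 ?dod .           # date of death
  FILTER(YEAR(?dod) >= 1951 && YEAR(?dod) <= 2019)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 50
"""

def fetch_candidates():
    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "biography-candidates-sketch/0.1"},
        timeout=60,
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [(b["personLabel"]["value"], b["dod"]["value"]) for b in bindings]

if __name__ == "__main__":
    for name, dod in fetch_candidates():
        print(name, dod)
```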
Preface: Big cultural heritage data present an unprecedented opportunity for the humanities that is reshaping conventional research methods. However, digital humanities have grown past the stage where the mere availability of digital data was enough as a demonstrator of possibilities. Knowledge resource modeling, development, enrichment and integration are crucial for associating relevant information in pools of digital material which are not only scattered across various archives, libraries and collections, but also often lack relevant metadata. Within this research framework, NLP approaches originally stemming from lexico-semantic information extraction and knowledge resource representation, modeling, development and reuse have a pivotal role to play. From the NLP perspective, applications of knowledge resources for the Socio-Economic Sciences and Humanities present numerous interesting research challenges that relate, among others, to the development of historical lexico-semantic...
The ability to automatically detect human stress and relaxation is crucial for the timely diagnosis of stress-related diseases, ensuring customer satisfaction in services and managing human-centric applications such as traffic management. Traditional methods employ stress-measuring scales or physiological monitoring, which may be intrusive and inconvenient. Instead, the ubiquitous nature of social media can be leveraged to identify stress and relaxation, since many people habitually share their recent life experiences through social networking sites. This paper introduces an improved method to detect expressions of stress and relaxation in social media content. It uses word sense disambiguation based on word sense vectors to improve the performance of TensiStrength, the first and only lexicon-based stress/relaxation detection algorithm. Experimental results show that incorporating word sense disambiguation substantially improves the performance of the original TensiStrength. It also outperforms state-of-the-art machine learning methods in terms of Pearson correlation and percentage of exact matches. We also propose a novel framework for identifying the causal agents of stress and relaxation in tweets as future work.
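A much-simplified sketch of the underlying idea, not TensiStrength's actual implementation: choose the sense of an ambiguous word whose sense vector is closest to the averaged context vector, then use that sense's stress score. The toy vectors, sense inventory and scores below are invented for illustration.

```python
# Simplified illustration of sense-vector disambiguation for stress detection.
import numpy as np

# Toy sense vectors, stress scores and context embeddings (assumed, illustration only).
SENSE_VECTORS = {
    "pressure#physical": np.array([0.9, 0.1, 0.0]),
    "pressure#mental":   np.array([0.1, 0.8, 0.3]),
}
SENSE_STRESS_SCORES = {"pressure#physical": 0, "pressure#mental": 4}
WORD_VECTORS = {
    "deadline": np.array([0.0, 0.9, 0.2]),
    "tyre":     np.array([0.8, 0.0, 0.1]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(word_senses, context_words):
    """Return the sense whose vector best matches the averaged context vector."""
    context = np.mean([WORD_VECTORS[w] for w in context_words if w in WORD_VECTORS], axis=0)
    return max(word_senses, key=lambda s: cosine(SENSE_VECTORS[s], context))

if __name__ == "__main__":
    sense = disambiguate(list(SENSE_VECTORS), ["deadline"])
    print(sense, "stress score:", SENSE_STRESS_SCORES[sense])
```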
Quality Estimation (QE) predicts the quality of machine translation output without the need for a reference translation. This quality can be defined differently based on the task at hand. In an attempt to focus further on the adequacy and informativeness of translations, we integrate features of semantic similarity into QuEst, a framework for QE feature extraction. Using methods previously employed in Semantic Textual Similarity (STS) tasks, we exploit semantically similar sentences and their quality scores as features to estimate the quality of machine-translated sentences. Preliminary experiments show that finding semantically similar sentences for some datasets is difficult and time-consuming. Therefore, we opt to start from the assumption that we already have access to semantically similar sentences. Our results show that this method can improve the prediction of machine translation quality for semantically similar sentences.
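A minimal sketch of the feature idea under simplifying assumptions: sentences are represented by pre-computed vectors, and the quality score of the most similar labelled sentence (together with the similarity itself) is appended to the standard feature vector. The function name and toy numbers are hypothetical.

```python
# Sketch: derive two extra QE features from the most similar labelled sentence.
import numpy as np

def nearest_neighbour_features(sentence_vec, labelled):
    """labelled: list of (vector, quality_score) pairs with known quality."""
    best_sim, best_score = -1.0, 0.0
    for vec, score in labelled:
        sim = float(np.dot(sentence_vec, vec) /
                    (np.linalg.norm(sentence_vec) * np.linalg.norm(vec)))
        if sim > best_sim:
            best_sim, best_score = sim, score
    # Two extra features to append to the standard feature vector.
    return [best_sim, best_score]

if __name__ == "__main__":
    labelled = [(np.array([0.2, 0.9, 0.1]), 0.84), (np.array([0.9, 0.1, 0.0]), 0.35)]
    print(nearest_neighbour_features(np.array([0.25, 0.85, 0.05]), labelled))
```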
This paper gives a brief overview of the EXPloiting Empirical appRoaches to Translation (EXPERT) project, an FP7 Marie Curie Initial Training Network which is preparing the next generation of world-class researchers in the field of hybrid machine translation. The project employs 15 Marie Curie fellows who are working on 15 individual, but interconnected, projects, and organises local and consortium-wide training activities. The project has been running for three years and has already produced high-quality research. This paper presents the most important research achievements of the project.
The analysis of long sentences is a source of problems in advanced applications such as machine translation. With the aim of solving these problems, we have analysed long sentences in two corpora written in Standard Basque in order to perform syntactic simplification. The result of this analysis has led us to design a proposal for producing shorter sentences out of long ones. To perform this task we present an architecture for a text simplification system based on previously developed general-coverage tools (giving them a new utility) and on hand-written rules specific to syntactic simplification. As Basque is an agglutinative language, these rules are based on morphological features. In this work we focused on specific phenomena such as appositions, finite relative clauses and finite temporal clauses. The simplification proposed does not exclude any target audience, and it could be used for both humans and machines. This is the first proposal for Au...
This paper describes our participation in the First Shared Task on Aggression Identification. The method proposed relies on machine learning to identify social media texts which contain aggression. The main features employed by our method are information extracted from word embeddings and the output of a sentiment analyser. Several machine learning methods and different combinations of features were tried. The official submissions used Support Vector Machines and Random Forests. The official evaluation showed that for texts similar to the ones in the training dataset Random Forests work best, whilst for texts which are different from the training data SVMs are a better choice. The evaluation also showed that, despite its simplicity, the method performs well when compared with more elaborate methods.
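A rough sketch of this kind of set-up using scikit-learn; the toy embedding table, the stand-in sentiment analyser and the model parameters are illustrative and not the official submissions' configuration.

```python
# Sketch: averaged word-embedding features plus a sentiment score,
# fed to an SVM and a Random Forest (toy data, illustrative settings).
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def build_features(texts, embed, sentiment):
    """Concatenate an averaged word-embedding vector with a sentiment score."""
    rows = []
    for text in texts:
        vectors = [embed[w] for w in text.lower().split() if w in embed]
        avg = np.mean(vectors, axis=0) if vectors else np.zeros(3)
        rows.append(np.append(avg, sentiment(text)))
    return np.vstack(rows)

if __name__ == "__main__":
    # Toy embedding table and sentiment analyser stand-ins.
    embed = {"hate": np.array([0.9, 0.1, 0.0]), "love": np.array([0.1, 0.9, 0.0]),
             "you": np.array([0.3, 0.3, 0.3])}
    sentiment = lambda t: -1.0 if "hate" in t else 1.0
    X = build_features(["i hate you", "i love you"], embed, sentiment)
    y = ["aggressive", "non-aggressive"]
    for model in (SVC(kernel="linear"), RandomForestClassifier(n_estimators=100)):
        model.fit(X, y)
        print(type(model).__name__, model.predict(X))
```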
The EXPERT project (http://expert-itn.eu) is an Initial Training Network (ITN) supported by the People Programme (Marie Curie Actions) of the European Union’s Framework Programme (FP7/2007-2013) under REA grant agreement no 317471. By appointing 15 fellows to work on related projects, the project aims to train the next generation of world-class researchers in the field of data-driven translation technology.
In today’s world, large amounts of information have to be dealt with, regardless of the field involved. Most of this information comes in written format. Computers seem the right choice for making life easier by processing text automatically, but in many cases at least partial understanding (if not full understanding) is necessary in order to automate a process. Luhn (1958) proposed a method for producing abstracts which works regardless of the type of document. Although further research carried out in this area builds on his work, it has become apparent that general methods are not a solution. Instead, more and more research has been done on restricted domains, where certain particularities are used for “understanding”. A well-known case is that of DeJong (1982), where the structure of newspaper articles was used in order to get the gist of the articles and then generate summaries.
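Luhn's method can be illustrated with a much-simplified frequency-based scorer: sentences are ranked by the average frequency of their significant (non-stopword) words. This sketch omits Luhn's clustering of significant words within sentences and uses a toy stopword list.

```python
# A much-simplified rendering of Luhn's frequency-based extractive summarisation.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "it", "that", "by"}

def luhn_summary(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Preserve the original order of the selected sentences.
    return " ".join(s for s in sentences if s in ranked)

if __name__ == "__main__":
    sample = ("Summarisation reduces a document to its key points. "
              "Early systems scored sentences by word frequency. "
              "The weather was pleasant that day.")
    print(luhn_summary(sample))
```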
Automatic text summarisation is a topic that has been receiving attention from the research community since the early days of computational linguistics, but it really took off around 25 years ago. This article presents the main developments of the last 25 years. It starts by defining what a summary is and how its definition has changed over time as a result of the interest in processing new types of documents. The article continues with a brief history of the field and highlights the main challenges posed by the evaluation of summaries. It finishes with some thoughts about the future of the field.
This article presents a new method to automatically simplify English sentences. The approach is designed to reduce the number of compound clauses and nominally bound relative clauses in input sentences. The article provides an overview of a corpus annotated with information about various explicit signs of syntactic complexity and describes the two major components of a sentence simplification method that works by exploiting information on the signs occurring in the sentences of a text. The first component is a sign tagger which automatically classifies signs in accordance with the annotation scheme used to annotate the corpus. The second component is an iterative rule-based sentence transformation tool. Exploiting the sign tagger in conjunction with other NLP components, the sentence transformation tool automatically rewrites long sentences containing compound clauses and nominally bound relative clauses as sequences of shorter single-clause sentences. Evaluation of the different co...
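The transformation component itself is rule-based and sign-driven; the deliberately crude sketch below only illustrates the general flavour (splitting a compound sentence at coordinating signs), not the paper's actual sign tagger or rule set.

```python
# Crude illustration: rewrite a compound sentence as shorter single-clause sentences
# by splitting at a small set of coordinating signs (toy rules, for illustration only).
COMPOUND_SIGNS = [", and ", ", but ", ", or "]

def split_compound(sentence):
    """Repeatedly split at the leftmost compound sign until none remain."""
    parts = [sentence]
    changed = True
    while changed:
        changed = False
        new_parts = []
        for part in parts:
            hits = [(part.find(s), s) for s in COMPOUND_SIGNS if s in part]
            if hits:
                pos, sign = min(hits)
                left, right = part[:pos], part[pos + len(sign):]
                new_parts.extend([left.rstrip(" ,") + ".",
                                  right[0].upper() + right[1:]])
                changed = True
            else:
                new_parts.append(part)
        parts = new_parts
    return parts

if __name__ == "__main__":
    print(split_compound("The report was late, but the committee approved it, and the project continued."))
```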
This paper investigates to what extent the use of paraphrasing in translation memory (TM) matching and retrieval is useful for human translators. Current translation memories lack semantic knowledge such as paraphrasing in matching and retrieval. Because of this, paraphrased segments are often not retrieved. Lack of semantic knowledge also results in inappropriate ranking of the retrieved segments. Gupta and Orăsan (2014) proposed an improved matching algorithm which incorporates paraphrasing. Its automatic evaluation suggested that it could be beneficial to translators. In this paper we perform an extensive human evaluation of the use of paraphrasing in the TM matching and retrieval process. We measure post-editing time, keystrokes, two subjective evaluations, and HTER and HMETEOR to assess the impact on human performance. Our results show that paraphrasing improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
Current Translation Memory (TM) systems work at the surface level and lack semantic knowledge while matching. This paper presents an approach to incorporating semantic knowledge in the form of paraphrasing in matching and retrieval. Most TMs use Levenshtein edit-distance or some variation of it. Generating additional segments based on the paraphrases available in a segment results in exponential time complexity while matching, because a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. We propose an efficient approach to incorporating paraphrasing with edit-distance, based on greedy approximation and dynamic programming. We have obtained significant improvements in both retrieval and translation of retrieved segments for TM thresholds of 100%, 95% and 90%.
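A highly simplified illustration of combining paraphrasing with edit distance, not the paper's algorithm: a word-level Levenshtein distance computed by dynamic programming, plus a greedy pass that applies a paraphrase substitution whenever it lowers the distance to the TM segment. Single-word paraphrases are assumed here for brevity.

```python
# Sketch: word-level edit distance with a greedy paraphrase-substitution pass.
def edit_distance(a, b):
    """Standard word-level Levenshtein distance via dynamic programming."""
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[-1][-1]

def paraphrase_aware_distance(query, tm_segment, paraphrases):
    """Greedily rewrite the query with paraphrases while the distance improves."""
    tokens, best = query[:], edit_distance(query, tm_segment)
    improved = True
    while improved:
        improved = False
        for src, tgt in paraphrases:                      # e.g. ("purchase", "buy")
            candidate = [tgt if t == src else t for t in tokens]
            score = edit_distance(candidate, tm_segment)
            if score < best:
                tokens, best, improved = candidate, score, True
    return best

if __name__ == "__main__":
    query = "i would like to purchase a ticket".split()
    tm_segment = "i would like to buy a ticket".split()
    print(edit_distance(query, tm_segment),
          paraphrase_aware_distance(query, tm_segment, [("purchase", "buy")]))
```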
This article presents a new annotation scheme for syntactic complexity in text which has the advantage over other existing syntactic annotation schemes that it is easy to apply, reliable and able to encode a wide range of phenomena. It is based on the notion that the syntactic complexity of sentences is explicitly indicated by signs such as conjunctions, complementisers and punctuation marks. The article describes the annotation scheme developed to annotate these signs and evaluates three corpora containing texts from three genres that were annotated using it. Inter-annotator agreement calculated on the three corpora shows that there is at least “substantial agreement” and motivates directions for future work.
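Agreement for such a scheme is typically quantified with a chance-corrected coefficient; the sketch below computes Cohen's kappa over two annotators' labels for the same signs. The label set and toy annotations are invented, and the paper's own figures were of course obtained on its annotated corpora.

```python
# Sketch: Cohen's kappa between two annotators' sign labels (toy data).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    ann1 = ["coord", "coord", "subord", "punct", "coord", "subord"]
    ann2 = ["coord", "subord", "subord", "punct", "coord", "subord"]
    print(round(cohens_kappa(ann1, ann2), 3))   # values above 0.61 are often read as "substantial"
```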
This paper describes the system submitted by the University of Wolverhampton and the University of Malaga for SemEval-2015 Task 2: Semantic Textual Similarity. The system uses a Support Vector Machine approach based on a number of linguistically motivated features. Our system performed satisfactorily for English, obtaining a mean Pearson correlation of 0.7216. However, it performed less well for Spanish, obtaining only a mean correlation of 0.5158.
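For context, STS systems are scored by the Pearson correlation between predicted and gold similarity values; a minimal example with made-up numbers:

```python
# Minimal illustration of STS scoring: Pearson correlation between
# predicted and gold similarity values (numbers below are made up).
from scipy.stats import pearsonr

gold = [4.8, 3.2, 0.5, 2.9, 4.1]
predicted = [4.5, 3.0, 1.0, 3.2, 3.8]

correlation, _ = pearsonr(gold, predicted)
print(f"Pearson correlation: {correlation:.4f}")
```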