Tony McEnery is Distinguished Professor of English Language and Linguistics at Lancaster University and Director of the ESRC Centre for Corpus Approaches to Social Science, Lancaster University.
Corpus linguistics is the study of language data on a large scale: the computer-aided analysis of very extensive collections of transcribed utterances or written texts. This textbook outlines the basic methods of corpus linguistics, explains how the discipline of corpus ...
TITLES IN THE SERIES INCLUDE: Peter Trudgill, A Glossary of Sociolinguistics (0 7486 1623 3); Jean Aitchison, A Glossary of Language and Mind (0 7486 1824 4); Laurie Bauer, A Glossary of Morphology (0 7486 1853 8); Alan Davies, A Glossary of Applied Linguistics (0 7486 1854 ...
The corpus-driven revolution in applied linguistics continues apace, and along with it the paradox that while corpora are changing the face of applied linguistics (most dictionaries, grammars, and course books now claim to be corpus-based), this is occurring largely without the ...
The title of this work in combination with its inclusion in a corpus linguistics series seems to promise not only the chance to see actual swear words in print, but a whole lot of them as well. Reader beware. While Tony McEnery's Swearing in English ultimately does deliver to those ...
The appearance of not one but two introductions to corpus linguistics within the same series shows the maturation and diversification of this fledgling subdiscipline within linguistics. McEnery and Wilson offer an overview or annotated report on work done within the computer-corpus research paradigm, including computational linguistics, whereas Barnbrook offers a guide or manual on the procedures and methodology of corpus linguistics, particularly with regard to machine-readable texts in English and to the type of results thereby generated.
The article discusses epistemic stance in spoken L2 production. Using a subset of the Trinity Lancaster Corpus of spoken L2 production, we analysed the speech of 132 advanced L2 speakers from different L1 and cultural backgrounds taking part in four speaking tasks: one largely monologic presentation task and three interactive tasks. The study focused on three types of epistemic forms: adverbial, adjectival, and verbal expressions. The results showed a systematic variation in L2 speakers' stance-taking choices across the four tasks. The largest difference was found between the monologic and the dialogic tasks, but differences were also found in the distribution of epistemic markers in the three interactive tasks. The variation was explained in terms of the interactional demands of individual tasks. The study also found evidence of considerable inter-speaker variation, indicating the existence of individual speaker style in the use of epistemic markers. By focusing on the social use of language, this article seeks to contribute to our understanding of the communicative competence of advanced L2 speakers. This research is of relevance to teachers, material developers, as well as language testers interested in second language pragmatic ability.
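To illustrate the kind of counting such a study rests on, here is a minimal Python sketch that profiles epistemic marker types per speaking task and normalises them to a rate per 1,000 tokens. The marker lists and the example transcripts are invented for demonstration; they are not the categories or data used in the study.

import re
from collections import Counter

# Illustrative marker lists only; not the study's annotation scheme.
EPISTEMIC_MARKERS = {
    "adverbial": ["probably", "maybe", "perhaps", "certainly"],
    "adjectival": ["sure", "possible", "likely"],
    "verbal": ["think", "believe", "know", "suppose"],
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def stance_profile(text):
    """Return the rate (per 1,000 tokens) of each marker type in a text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {mtype: 1000 * sum(counts[m] for m in markers) / total
            for mtype, markers in EPISTEMIC_MARKERS.items()}

# Hypothetical snippets standing in for a monologic and an interactive task.
tasks = {
    "presentation": "I think the data probably shows a clear trend overall",
    "discussion": "Maybe, but I believe we know it is certainly possible",
}
for task, text in tasks.items():
    print(task, stance_profile(text))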
International Journal of Corpus Linguistics, 20:2, 2015
The idea that text in a particular field of discourse is organized into lexical patterns, which can be visualized as networks of words that collocate with each other, was originally proposed by . This idea has important theoretical implications for our understanding of the relationship between the lexis and the text and (ultimately) between the text and the discourse community/the mind of the speaker. Although the approaches to date have offered different possibilities for constructing collocation networks, we argue that they have not yet successfully operationalized some of the desired features of such networks. In this study, we revisit the concept of collocation networks and introduce GraphColl, a new tool developed by the authors that builds collocation networks from user-defined corpora. In a case study using data from the Society for the Reformation of Manners Corpus (SRMC), we demonstrate that collocation networks provide important insights into meaning relationships in language.
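For readers unfamiliar with the idea, the following is a minimal sketch of how a collocation network might be built: collocates of a seed word are identified within a fixed window using a simple Mutual Information score, and each collocate is then expanded in turn. This is not the GraphColl implementation, and the file name "corpus.txt", the seed word and all thresholds are placeholders chosen purely for illustration.

import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, window=5, min_mi=3.0, min_freq=2):
    """Return {collocate: MI} for words co-occurring with `node` in a +/- window."""
    freqs = Counter(tokens)
    total = len(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            co.update(tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window])
    result = {}
    for w, observed in co.items():
        if observed < min_freq:
            continue
        expected = freqs[node] * freqs[w] * (2 * window) / total
        mi = math.log2(observed / expected)
        if mi >= min_mi:
            result[w] = round(mi, 2)
    return result

def collocation_network(tokens, seed, depth=2, **kw):
    """Breadth-first expansion of collocates into a {node: {collocate: MI}} graph."""
    graph, frontier = {}, [seed]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            if node in graph:
                continue
            graph[node] = collocates(tokens, node, **kw)
            next_frontier.extend(graph[node])
        frontier = next_frontier
    return graph

tokens = tokenize(open("corpus.txt").read())   # any plain-text corpus
print(collocation_network(tokens, "manners", depth=2))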
International Journal of Corpus Linguistics, 17:2, 2012
This paper focuses upon two issues. The first is the question of identifying diachronic trends, and more importantly significant outliers, in corpora which permit the investigation of a feature at many sampling points over time. The second is how best to combine more qualitatively oriented approaches to corpus data with the type of trends that can be observed in a corpus using quantitative techniques. The work uses a recently completed ESRC-funded project, on the representation of Islam in the UK press, as a case study in order to demonstrate the potential of the approach taken to establishing significant peaks in diachronic frequency development, and the fruitful interface that may be created between qualitative and quantitative techniques.
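As a very simplified illustration of peak detection in a diachronic frequency series (not the statistical procedure developed in the project), one can fit a linear trend to per-period relative frequencies and flag points whose residual exceeds two standard deviations. The monthly figures below are invented.

import statistics

def peaks(freqs, threshold=2.0):
    """freqs: relative frequencies ordered by time. Returns indexes of outliers."""
    xs = list(range(len(freqs)))
    mean_x, mean_y = statistics.mean(xs), statistics.mean(freqs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, freqs)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, freqs)]
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]

monthly_freq = [11.2, 10.8, 12.1, 11.5, 34.7, 12.0, 11.8, 13.2, 12.5, 11.9]
print(peaks(monthly_freq))   # -> [4], the sampling point with an unusually high frequency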
This article uses methods from corpus linguistics and critical discourse analysis to examine patterns of representation around the word Muslim in a 143 million word corpus of British newspaper articles published between 1998 and 2009. Using the analysis tool Sketch Engine, an analysis of noun collocates of Muslim found that the following categories (in order of frequency) were referenced: ethnic/national identity, characterizing/differentiating attributes, conflict, culture, religion, and group/organizations. The ‘conflict’ category was found to be particularly lexically rich, containing many word types. It was also implicitly indexed in the other categories. Following this, an analysis of the two most frequent collocate pairs, Muslim world and Muslim community, showed that they were used to collectivize Muslims, both emphasizing their sameness to each other and their difference to ‘The West’. Muslims were also represented as easily offended, alienated, and in conflict with non-Muslims. The analysis additionally considered legitimation strategies that enabled editors to print more controversial representations, and concludes with a discussion of researcher bias and an extended notion of audience through online social networks.
This article explores negation in Chinese on the basis of spoken and written corpora of Mandarin Chinese. The use of corpus data not only reveals central tendencies in language based on quantitative data but also provides typical examples attested in authentic contexts. In this study we will first introduce the two major negators bu and mei (meiyou) and discuss their semantic and genre distinctions. Following this is an exploration of the interaction between negation and aspect marking. We will then move on to discuss the scope and focus of negation, transferred negation, and finally double negation and redundant negation.
In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English-Nepali and Nepali-English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
This paper explores the collocational behaviour and semantic prosody of near synonyms from a cross-linguistic perspective. The importance of these concepts to language learning is well recognized. Yet while collocation and semantic prosody have recently attracted much interest from researchers studying the English language, little work has been done on collocation and semantic prosody in languages other than English. Still less work has been undertaken contrasting the collocational behaviour and semantic prosody of near synonyms in different languages. In this paper, we undertake a cross-linguistic analysis of collocation, semantic prosody and near synonymy, drawing upon data from English and Chinese (pu3tong1hua4). The implications of the findings for language learning are also discussed.
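A rough sketch of the sort of comparison involved: collect window collocates for each of two near synonyms and contrast their most frequent collocates. The word pair and the file "corpus.txt" below are placeholders, not the items or data examined in the paper.

import re
from collections import Counter

def window_collocates(tokens, node, window=4):
    """Frequency of words co-occurring with `node` within a +/- window."""
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            co.update(tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window])
    return co

tokens = re.findall(r"[a-z']+", open("corpus.txt").read().lower())
for word in ("cause", "produce"):   # illustrative near-synonym pair
    top = window_collocates(tokens, word).most_common(15)
    print(word, [w for w, _ in top])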
Telicity is an important concept in the study of aspect. While the compatibility tests with completive and durative adverbials have long been in operation as a diagnostic for telicity, their validity and reliability have rarely been questioned. This article critically explores the validity and ...
HELP is a frequent verb of English, with a distinctive syntax, that has generated ongoing debate amongst language researchers. As such, it is a verb that is often given some prominence in textbooks and grammars, though the treatment of the verb can be poor. For example, all of the authors who provide a poor account of HELP maintain that the choice of a full or bare infinitive after HELP is determined by a semantic distinction between the two; this is not the case (cf. the section 'Semantic Distinction'). In this paper, we will take a corpus-based approach to improve the description of the verb and to test claims made about the verb in the literature. We will also explore variation in that description between two major varieties of English, British English (BrE) and American English (AmE). In addition, we will investigate how HELP has varied diachronically and by register in these varieties. First, however, the claim that HELP is a frequent verb of English with distinctive syntactic properties must be justified.
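A rough, hedged approximation of how one might begin counting the full versus bare infinitive choice after HELP on untagged text: simply check whether the word following a form of "help" is "to". This is not the paper's method; it over-counts (noun uses of "help", "help with", etc. are not filtered out), and a POS-tagged corpus would be needed for a proper count.

import re
from collections import Counter

def help_complements(text):
    counts = Counter()
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, tok in enumerate(tokens[:-1]):
        if tok in {"help", "helps", "helped", "helping"}:
            counts["full (help to V)" if tokens[i + 1] == "to" else "bare (help V)"] += 1
    return counts

sample = "She helped to organise it, and he helped carry the boxes."
print(help_complements(sample))   # Counter({'full (help to V)': 1, 'bare (help V)': 1})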
This article compares two approaches to genre analysis: Biber’s multidimensional analysis (MDA) and Tribble’s use of the keyword function of WordSmith. The comparison is undertaken via a case study of conversation, speech, and academic prose in modern American English. The terms conversation and speech as used in this article correspond to the demographically sampled and context-governed spoken data in the British National Corpus. Conversation represents the type of communication we experience every day, whereas speech is produced in situations in which there are few producers and many receivers (e.g., classroom lectures, sermons, and political speeches). Academic prose is a typical formal written genre that differs markedly from the two spoken genres. The results of the MDA and keyword approaches both on similar genres (conversation vs. speech) and different genres (the two spoken genres vs. academic prose) show that a keyword analysis can capture important genre features revealed by MDA.
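For orientation, a sketch of one common keyness statistic used by keyword tools such as WordSmith: the log-likelihood score comparing a word's frequency in a study corpus against a reference corpus. The frequency lists and corpus sizes below are invented placeholders, not figures from the article.

import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """2 * sum(O * ln(O/E)) for a word in two corpora; higher = more 'key'."""
    total = size_study + size_ref
    e1 = size_study * (freq_study + freq_ref) / total
    e2 = size_ref * (freq_study + freq_ref) / total
    ll = 0.0
    for o, e in ((freq_study, e1), (freq_ref, e2)):
        if o > 0:
            ll += o * math.log(o / e)
    return 2 * ll

def keywords(study, size_study, ref, size_ref, top=10):
    scored = {w: log_likelihood(f, size_study, ref.get(w, 0), size_ref)
              for w, f in study.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top]

conversation = {"know": 950, "erm": 800, "research": 5}   # per-word counts (invented)
academic = {"know": 120, "erm": 2, "research": 640}
print(keywords(conversation, 100_000, academic, 100_000, top=3))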
Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Indeed, although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction still remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, Lisbon, Portugal, pp. 7-12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale semantically classified multi-word expression template database, the system is also capable of detecting many multiword expressions, as well as assigning semantic field information to the MWEs extracted. Whilst USAS therefore offers a unique tool for MWE extraction, allowing us to both extract and semantically classify MWEs, it can sometimes suffer from low recall. Consequently, we have been comparing USAS, which employs a symbolic approach, to a statistical tool, which is based on collocational information, in order to determine the pros and cons of these different tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the different tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur at low frequencies (lower than three in this case). Due to their complementary relation, we are proposing that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach and a collocation-based statistical approach.
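A schematic illustration of the combination strategy discussed above: take the union of MWE candidates proposed by a lexicon-based (symbolic) extractor and a collocation-based (statistical) extractor, recording which component proposed each candidate. The candidate sets here are invented placeholders, not output from USAS or the statistical tool.

symbolic = {"magistrates court", "in terms of", "by and large"}
statistical = {"magistrates court", "plea bargain", "grievous bodily harm"}

combined = {
    mwe: [src for src, found in (("symbolic", mwe in symbolic),
                                 ("statistical", mwe in statistical)) if found]
    for mwe in symbolic | statistical
}
for mwe, sources in sorted(combined.items()):
    print(f"{mwe:25s} {'+'.join(sources)}")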
A corpus-based analysis of discourses of refugees and asylum seekers was carried out on data taken from a range of British newspapers and texts from the Office of the United Nations High Commissioner for Refugees website, both published in 2003. Concordances of the terms ...
In this paper we describe the Lancaster Speech, Thought and Writing Presentation (ST&WP) Spoken Corpus. We have constructed this corpus to investigate the ways in which speakers present speech, thought and writing in contemporary spoken British English, with the associated aim of comparing our findings with the patterns revealed by the previous Lancaster corpus-based investigation of ST&WP in written texts. We describe the structure of the corpus, the archives from which its composite texts are taken, the decisions that we made concerning the selection of suitable extracts from the archives, and the problems associated with the original archived transcripts. We then move on to consider issues surrounding the mark-up of our data with TEI-conformant SGML, and explain the tagging format we adopted in annotating our data for ST&WP.
In this paper we will extend Smith's (1997) two-component aspect theory to develop a two-level model of situation aspect in which situation aspect is modelled as verb classes at the lexical level and as situation types at the sentential level. Situation types are the composite result of the rule-based interaction between verb classes and complements, arguments, peripheral adjuncts and viewpoint aspect at the nucleus, core and clause levels. With a framework consisting of a lexicon, a layered clause structure and a set of rules mapping verb classes onto situation types, the model is developed and tested using an English corpus and a Chinese corpus.
Fuertes Olivera, P.A. (ed). Lengua y sociedad: Aportaciones recientes en lingüística cognitiva, lenguas en contacto, lenguajes de especialidad y lingüística del corpus, 2005
Proceedings of the Corpus Linguistics 2005 Conference
Semantic lexical resources play an important part in both corpus linguistics and NLP. Over the past 14 years, a large semantic lexical resource has been built at Lancaster University. Unlike other major semantic lexicons in existence, such as WordNet, EuroWordNet and HowNet, in which lexemes are clustered and linked via the relationship between word/MWE senses or definitions of meaning, the Lancaster semantic lexicon employs a semantic field taxonomy and maps words and multiword expression (MWE) templates to their potential semantic categories, which are disambiguated according to their context in use by a semantic tagger called USAS (UCREL semantic analysis system). The lexicon is classified with a set of broadly defined semantic field categories, which are organised in a thesaurus-like structure. The Lancaster semantic taxonomy provides a conception of the world that is as general as possible, as opposed to a semantic network for specific domains. This paper describes the Lancaster semantic lexicon in terms of its semantic field taxonomy, its lexical distribution across the semantic categories and its lexeme/tag type ratio. As will be shown, the Lancaster semantic lexicon is a unique and valuable lexical resource that offers a large-scale, general-purpose, semantically structured lexicon, which can have various applications in corpus linguistics and NLP. The semantic lexicon and the USAS tagger are accessible for academic research as part of the Wmatrix tool; for more details see ...
Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP communities. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and some corpora of newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76%-97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them 'future proof', we need to evaluate their potential both synchronically and diachronically across genres.
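A minimal sketch of the kind of lexical-coverage figure reported above: the percentage of corpus tokens whose word form is listed in the lexicon. The file names ("semantic_lexicon.tsv", "test_corpus.txt") and the one-entry-per-line, tab-separated lexicon format are assumptions made purely for illustration.

import re

# Assumed format: word form in the first tab-separated field of each line.
lexicon = {line.strip().split("\t")[0].lower()
           for line in open("semantic_lexicon.tsv", encoding="utf-8")}

tokens = re.findall(r"[A-Za-z']+", open("test_corpus.txt", encoding="utf-8").read())
covered = sum(1 for t in tokens if t.lower() in lexicon)
print(f"Lexical coverage: {100 * covered / len(tokens):.2f}% "
      f"({covered}/{len(tokens)} tokens)")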
Annotation schemes for semantic field analysis use abstract concepts to classify words and phrases in a given text. The use of such schemes within lexicography is increasing; indeed, our own UCREL semantic annotation system (USAS) is to form part of a web-based 'intelligent' dictionary (Herpio 2002). As USAS was originally designed to enable automatic content analysis (Wilson and Rayson 1993), we have been assessing its usefulness in a lexicographical setting, and also comparing its taxonomy with schemes developed by lexicographers. This paper initially reports the comparisons we have undertaken with two dictionary taxonomies: the first was designed by Tom McArthur for use in the Longman Lexicon of Contemporary English, and the second by Collins Dictionaries for use in their Collins English Dictionary. We then assess the feasibility of mapping USAS to the CED tagset, before reporting our intentions to also map to WordNet (a reasonably comprehensive machine-usable database of the meanings of English words) via WordNet Domains (which augments WordNet 1.6 with 200+ domains). We argue that this type of research can provide a practical guide for tagset mapping and, by so doing, bring lexicographers one step closer to using the semantic field as the organising principle for their general-purpose dictionaries.
Proceedings of the Beyond Named Entity Recognition Semantic Labeling for NLP Tasks Workshop, 2004
The UCREL semantic analysis system (USAS) is a software tool for undertaking the automatic semantic analysis of English spoken and written data. This paper describes the software system, and the hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained semantic field tags. We discuss the manually constructed lexical resources on which the system relies, and the seven disambiguation methods including part-of-speech tagging, general likelihood ranking, multi-word-expression extraction, domain of discourse identification, and contextual rules. We report an evaluation of the accuracy of the system compared to a manually tagged test corpus, on which the USAS software obtained a precision value of 91%. Finally, we make reference to the applications of the system in corpus linguistics, content analysis, software engineering, and electronic dictionaries.
Proceedings of the 4th Workshop on Asian Language Resources, 2004
This paper first discusses standards for developing Asian language corpora so as to facilitate international data exchange. Following this, we present two corpora of Asian languages developed at Lancaster University: the EMILLE Corpus, which contains 14 South Asian languages, and the Lancaster Corpus of Mandarin Chinese. Finally, we will demonstrate how to explore these corpora using Xara and other corpus tools.
As reported by Wilson and Rayson (1993) and Rayson and Wilson (1996), the UCREL semantic analysis system (USAS) has been designed to undertake the automatic semantic analysis of present-day English (henceforth PresDE) texts. In this paper, we report on the feasibility of (re)training the USAS system to cope with English from earlier periods, specifically the Early Modern English (henceforth EModE) period. We begin by describing how effectively the existing system tagged a training corpus prior to any modifications. The training corpus consists of newsbooks dating from December 1653 to May 1654, and totals approximately 613,000 words. We then document the various adaptations that we made to the system in an attempt to improve its efficiency, and the results we achieved when we applied the modified system to two newsbook texts and an additional text from the Lampeter Corpus (i.e. a text that was not part of the original training corpus). To conclude, we propose a design for a modified semantic tagger for EModE texts that contains an 'intelligent' spelling regulariser, that is, a system designed to regularise spellings in their 'correct' context.
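A toy illustration of what context-sensitive spelling regularisation can mean in practice: variant forms are mapped to modern forms, but a variant that is also a valid modern word (e.g. "doe", "bee") is only regularised when a simple contextual cue applies. The mappings and the cue used here are invented for the example and are far cruder than the regulariser proposed in the paper.

# Invented variant table and context cue, for illustration only.
VARIANTS = {"vnto": "unto", "haue": "have", "doe": "do", "bee": "be", "yt": "it"}
AMBIGUOUS = {"doe", "bee"}                   # also valid modern nouns
VERB_CUES = {"i", "we", "they", "you", "to"} # preceding word suggests a verb reading

def regularise(tokens):
    out = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in VARIANTS:
            if low in AMBIGUOUS and (i == 0 or tokens[i - 1].lower() not in VERB_CUES):
                out.append(tok)              # keep possible noun reading untouched
            else:
                out.append(VARIANTS[low])
        else:
            out.append(tok)
    return out

print(" ".join(regularise("They doe not haue leave to bee idle".split())))
# -> "They do not have leave to be idle"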
Semantic annotation is an important and challenging issue in corpus linguistics and language engineering. While such a tool is available for English in Lancaster (Wilson and Rayson 1993), few such tools have been reported for other languages. In a joint Benedict project funded by the European Community under the 'Information Society Technologies Programme', we have been working towards developing a Finnish semantic tagger that will parallel the existing English semantic tagger. The intention is to avoid building a completely new system but to bootstrap using the existing software and the largely hand-constructed English lexical resources. In this paper, we report on our work to date, which includes (i) a comparative study of the grammar of English and Finnish, (ii) the tagging of an English-Finnish parallel corpus, and (iii) the building of a Finnish lexicon using existing lexicons and software such as a Finnish-English-Finnish machine-translation system, a Finnish dependency parser and a morphological analyser. This paper also discusses some challenging issues that have arisen during the construction of the parallel semantic tagging system between English and Finnish, namely, the complications caused by the widely different grammatical systems of the two languages. We believe that our work will provide valuable experience for the community working on cross-language annotation schemes.
Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003
Automatic extraction of multiword expressions (MWE) presents a tough challenge for the NLP community and corpus linguistics. Although various statistically driven or knowledge-based approaches have been proposed and tested, efficient MWE extraction still remains an unsolved issue. In this paper, we present our research work in which we tested approaching the MWE issue using a semantic field annotator. We use an English semantic tagger (USAS) developed at Lancaster University to identify multiword units which depict single semantic concepts. The Meter Corpus built in Sheffield was used to evaluate our approach. In our evaluation, this approach extracted a total of 4,195 MWE candidates, of which, after manual checking, 3,792 were accepted as valid MWEs, producing a precision of 90.39% and an estimated recall of 39.38%. Of the accepted MWEs, 68.22% or 2,587 are low frequency terms, occurring only once or twice in the corpus. These results show that our approach provides a practical solution to MWE extraction.
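The evaluation arithmetic quoted above can be checked directly: precision is accepted candidates over all extracted candidates, and the estimated recall implies an assumed total of valid MWEs in the corpus, which the short calculation below back-derives purely as a sanity check.

extracted = 4195
accepted = 3792

precision = accepted / extracted
print(f"precision = {precision:.2%}")          # ~90.39%, matching the figure above

estimated_recall = 0.3938
implied_gold_total = accepted / estimated_recall
print(f"implied number of valid MWEs in the corpus = {implied_gold_total:.0f}")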
The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so, a number of issues for South Asian corpus building will be highlighted.
Text reuse is commonplace in academia and the media. An efficient algorithm for automatically det... more Text reuse is commonplace in academia and the media. An efficient algorithm for automatically detecting and measuring similar/related texts would have applications in corpus linguistics, historical studies and natural language engineering.
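As a small sketch of one common approach to detecting text reuse (a generic illustration, not the algorithm developed in this work): word n-gram overlap between a candidate pair of texts, scored with a containment measure, i.e. shared n-grams as a proportion of the shorter text's n-grams.

import re

def ngrams(text, n=3):
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(a, b, n=3):
    """Shared word n-grams as a proportion of the smaller n-gram set."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))

source = "The minister announced a new inquiry into press standards yesterday."
rewrite = "Yesterday the minister announced a new inquiry into standards of the press."
print(f"containment = {containment(source, rewrite):.2f}")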
The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.
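A schematic example of the 8-bit-to-Unicode conversion problem mentioned above: legacy data encoded with a proprietary 8-bit font is converted by looking each byte up in a mapping table. The two-entry table below is an invented fragment; a real converter needs a complete, font-specific table (and reordering rules where the font encoding and Unicode sequence order differ).

# Hypothetical fragment of a legacy-font-to-Unicode mapping table.
FONT_TO_UNICODE = {
    0x61: "\u0905",   # assumed: byte 0x61 in the legacy font -> DEVANAGARI LETTER A
    0x6B: "\u0915",   # assumed: byte 0x6B -> DEVANAGARI LETTER KA
}

def convert(raw: bytes) -> str:
    """Map each legacy byte to its Unicode character, passing unmapped bytes through."""
    return "".join(FONT_TO_UNICODE.get(b, chr(b)) for b in raw)

legacy_bytes = b"ka"          # stand-in for text saved with the legacy font
text = convert(legacy_bytes)
print(text, [hex(ord(c)) for c in text])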
GATE is a Unicode-aware architecture, development environment and framework for building systems that process human language. It is often thought that the character sets problem has been solved by the arrival of the Unicode standard. This standard is an important advance, but in practice the ability to process text in a large number of the world's languages is still limited. This paper describes work done in the context of the GATE project that makes use of Unicode and plugs some of the gaps for language processing R&D. First we look at the storing and decoding of Unicode-compliant linguistic resources. The new capabilities for processing textual data and taking advantage of the Unicode standard are detailed next. Finally, the solutions used to add Unicode displaying and editing capabilities to the graphical interface are described.
M. Gavrilidou, G. Carayannis, S. Markantontou, S. Piperidis and G. Stainhauoer (eds). Proceedings of the Second International Conference on Language Resources and Evaluation, 2000
Low density languages are typically viewed as those for which few language resources are available. Work relating to low density languages is becoming a focus of increasing attention within language engineering (e.g. ...
This chapter reports on research carried out by academics from linguistics, history and geography working together to cast light upon the geography of prostitution in seventeenth-century Britain. We will demonstrate the usefulness and untapped potential of combining corpus linguistics and Geographical Information Systems (GIS) as an approach to researching historical texts. Corpus linguists are beginning to pursue new methodological advances which encourage them to “think geographically” and provide opportunities to enrich their understanding of a body of texts by uncovering spatial patterns in types of discourse (Gregory & Hardie 2011: 298-299, 309). The ability to move from corpus text to a visual mapping of geographical data and then back into the corpus text provides rich opportunities for humanities scholars in general, and corpus linguists in particular.
Introduction

The role of corpus data in linguistics has waxed and waned over time. Prior to the mid-twentieth century, data in linguistics was a mix of observed data and invented examples. There are some examples of linguists relying almost exclusively on observed language data in this period. Studies in field linguistics in the North American tradition (e.g. Boas ) often proceeded on the basis of analysing bodies of observed and duly recorded language data. Similarly, studies of child language acquisition often proceeded on the basis of the detailed observation and analysis of the utterances of individual children (e.g. Stern and Stern ) or else were based on large-scale studies of the observed utterances of many children (Templin ).

From the mid-twentieth century, the impact of Chomsky's views on data in linguistics promoted introspection as the main source of data in linguistics at the expense of observed data. Chomsky (interviewed by Andor : 97) clearly disfavours the type of observed evidence that corpora consist of: 'Corpus linguistics doesn't mean anything. It's like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they're going to do is take videotapes of things happening in the world and they'll collect huge videotapes of everything that's happening and from that maybe they'll come up with some generalizations or insights. Well, you know, sciences don't do this. But maybe they're wrong. Maybe the sciences should just collect lots and lots of data and try to develop the results from them. Well if someone wants to try that, fine. They're not going to get much support in the chemistry or physics or biology department. But if they feel like trying it, well, it's a free country, try that. We'll judge it by the results that come out.'

The impact of Chomsky's ideas was a matter of degree rather than absolute. Linguists did not abandon observed data entirely – indeed, even linguists working broadly in a Chomskyan tradition would at times use what might reasonably be described as small corpora to support their claims. For example, in the period from 1980 to 1999, most of the major linguistics journals carried articles which were to all intents and purposes corpus-based, though often not self-consciously so. Language carried nineteen such articles, The Journal of Linguistics seven, and Linguistic Inquiry four. But even so there is little doubt that introspection became the dominant, indeed for some the only permissible, source of data in linguistics in the latter half of the twentieth century.

However, after 1980, the use of corpus data in linguistics was substantially rehabilitated, to the degree that in the twenty-first century, using corpus data is no longer viewed as unorthodox and inadmissible. For an increasing number of linguists, corpus data plays a central role in their research. This is precisely because they have done what Chomsky suggested – they have not judged corpus linguistics on the basis of an abstract philosophical argument but rather have relied on the results the corpus has produced.
Corpora have been shown to be highly useful in a range of areas of linguistics, providing insights in areas as diverse as contrastive linguistics (Johansson ), discourse analysis (Aijmer and Stenstrom ; Baker ), language learning (Chuang and Nesi ; Aijmer ), semantics (Ensslin and Johnson ), sociolinguistics (Gabrielatos et al. ) and theoretical linguistics (Wong ; Xiao and McEnery ). As a source of data for language description, they have been of significant help to lexicographers (Hanks ) and grammarians (see sections 4.2, 4.3, 4.6, 4.7). This list is, of course, illustrative – it is now, in fact, difficult to find an area of linguistics where a corpus approach has not been taken fruitfully.