Tony McEnery is Distinguished Professor of English Language and Linguistics at Lancaster University and Director of the ESRC Centre for Corpus Approaches to Social Science, Lancaster University.
Corpus linguistics is the study of language data on a large scale: the computer-aided analysis of very extensive collections of transcribed utterances or written texts. This textbook outlines the basic methods of corpus linguistics, explains how the discipline of corpus ...
TITLES IN THE SERIES INCLUDE: Peter Trudgill, A Glossary of Sociolinguistics (0 7486 1623 3); Jean Aitchison, A Glossary of Language and Mind (0 7486 1824 4); Laurie Bauer, A Glossary of Morphology (0 7486 1853 8); Alan Davies, A Glossary of Applied Linguistics (0 7486 1854 ...
The corpus-driven revolution in applied linguistics continues apace, and along with it the paradox that while corpora are changing the face of applied linguistics (most dictionaries, grammars, and course books now claim to be corpus-based), this is occurring largely without the ...
The title of this work in combination with its inclusion in a corpus linguistics series seems to promise not only the chance to see actual swear words in print, but a whole lot of them as well. Reader beware. While Tony McEnery's Swearing in English ultimately does deliver to those ...
The appearance of not one but two introductions to corpus linguistics within the same series shows the maturation and diversification of this fledgling subdiscipline within linguistics. McEnery and Wilson offer an overview or annotated report on work done within the computer-corpus research paradigm, including computational linguistics, whereas Barnbrook offers a guide or manual on the procedures and methodology of corpus linguistics, particularly with regard to machine-readable texts in English and to the type of results thereby generated.
The article discusses epistemic stance in spoken L2 production. Using a subset of the Trinity Lancaster Corpus of spoken L2 production, we analysed the speech of 132 advanced L2 speakers from different L1 and cultural backgrounds taking part in four speaking tasks: one largely monologic presentation task and three interactive tasks. The study focused on three types of epistemic forms: adverbial, adjectival, and verbal expressions. The results showed a systematic variation in L2 speakers' stance-taking choices across the four tasks. The largest difference was found between the monologic and the dialogic tasks, but differences were also found in the distribution of epistemic markers in the three interactive tasks. The variation was explained in terms of the interactional demands of individual tasks. The study also found evidence of considerable inter-speaker variation, indicating the existence of individual speaker style in the use of epistemic markers. By focusing on the social use of language, this article seeks to contribute to our understanding of the communicative competence of advanced L2 speakers. This research is of relevance to teachers, material developers, as well as language testers interested in second language pragmatic ability.
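To illustrate the kind of counting such a study rests on, here is a minimal Python sketch that profiles epistemic marker types per speaking task and normalises them to a rate per 1,000 tokens. The marker lists and the example transcripts are invented for demonstration; they are not the categories or data used in the study.

import re
from collections import Counter

# Illustrative marker lists only; not the study's annotation scheme.
EPISTEMIC_MARKERS = {
    "adverbial": ["probably", "maybe", "perhaps", "certainly"],
    "adjectival": ["sure", "possible", "likely"],
    "verbal": ["think", "believe", "know", "suppose"],
}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def stance_profile(text):
    """Return the rate (per 1,000 tokens) of each marker type in a text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {mtype: 1000 * sum(counts[m] for m in markers) / total
            for mtype, markers in EPISTEMIC_MARKERS.items()}

# Hypothetical snippets standing in for a monologic and an interactive task.
tasks = {
    "presentation": "I think the data probably shows a clear trend overall",
    "discussion": "Maybe, but I believe we know it is certainly possible",
}
for task, text in tasks.items():
    print(task, stance_profile(text))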
International Journal of Corpus Linguistics, 20:2, 2015
The idea that text in a particular field of discourse is organized into lexical patterns, which can be visualized as networks of words that collocate with each other, was originally proposed by . This idea has important theoretical implications for our understanding of the relationship between the lexis and the text and (ultimately) between the text and the discourse community/the mind of the speaker. Although the approaches to date have offered different possibilities for constructing collocation networks, we argue that they have not yet successfully operationalized some of the desired features of such networks. In this study, we revisit the concept of collocation networks and introduce GraphColl, a new tool developed by the authors that builds collocation networks from user-defined corpora. In a case study using data from the Society for the Reformation of Manners Corpus (SRMC), we demonstrate that collocation networks provide important insights into meaning relationships in language.
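For readers unfamiliar with the idea, the following is a minimal sketch of how a collocation network might be built: collocates of a seed word are identified within a fixed window using a simple Mutual Information score, and each collocate is then expanded in turn. This is not the GraphColl implementation, and the file name "corpus.txt", the seed word and all thresholds are placeholders chosen purely for illustration.

import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, window=5, min_mi=3.0, min_freq=2):
    """Return {collocate: MI} for words co-occurring with `node` in a +/- window."""
    freqs = Counter(tokens)
    total = len(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            co.update(tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window])
    result = {}
    for w, observed in co.items():
        if observed < min_freq:
            continue
        expected = freqs[node] * freqs[w] * (2 * window) / total
        mi = math.log2(observed / expected)
        if mi >= min_mi:
            result[w] = round(mi, 2)
    return result

def collocation_network(tokens, seed, depth=2, **kw):
    """Breadth-first expansion of collocates into a {node: {collocate: MI}} graph."""
    graph, frontier = {}, [seed]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            if node in graph:
                continue
            graph[node] = collocates(tokens, node, **kw)
            next_frontier.extend(graph[node])
        frontier = next_frontier
    return graph

tokens = tokenize(open("corpus.txt").read())   # any plain-text corpus
print(collocation_network(tokens, "manners", depth=2))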
International Journal of Corpus Linguistics, 17:2, 2012
This paper focuses upon two issues. The first is the question of identifying diachronic trends, and more importantly significant outliers, in corpora which permit the investigation of a feature at many sampling points over time. The second is how best to combine more qualitatively oriented approaches to corpus data with the type of trends that can be observed in a corpus using quantitative techniques. The work uses a recently completed ESRC-funded project, on the representation of Islam in the UK press, as a case study in order to demonstrate the potential of the approach taken to establishing significant peaks in diachronic frequency development, and the fruitful interface that may be created between qualitative and quantitative techniques.
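As a very simplified illustration of peak detection in a diachronic frequency series (not the statistical procedure developed in the project), one can fit a linear trend to per-period relative frequencies and flag points whose residual exceeds two standard deviations. The monthly figures below are invented.

import statistics

def peaks(freqs, threshold=2.0):
    """freqs: relative frequencies ordered by time. Returns indexes of outliers."""
    xs = list(range(len(freqs)))
    mean_x, mean_y = statistics.mean(xs), statistics.mean(freqs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, freqs)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, freqs)]
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]

monthly_freq = [11.2, 10.8, 12.1, 11.5, 34.7, 12.0, 11.8, 13.2, 12.5, 11.9]
print(peaks(monthly_freq))   # -> [4], the sampling point with an unusually high frequency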
This article uses methods from corpus linguistics and critical discourse analysis to examine patterns of representation around the word Muslim in a 143 million word corpus of British newspaper articles published between 1998 and 2009. Using the analysis tool Sketch Engine, an analysis of noun collocates of Muslim found that the following categories (in order of frequency) were referenced: ethnic/national identity, characterizing/differentiating attributes, conflict, culture, religion, and group/organizations. The ‘conflict’ category was found to be particularly lexically rich, containing many word types. It was also implicitly indexed in the other categories. Following this, an analysis of the two most frequent collocate pairs, Muslim world and Muslim community, showed that they were used to collectivize Muslims, both emphasizing their sameness to each other and their difference to ‘The West’. Muslims were also represented as easily offended, alienated, and in conflict with non-Muslims. The analysis additionally considered legitimation strategies that enabled editors to print more controversial representations, and concludes with a discussion of researcher bias and an extended notion of audience through online social networks.
This article explores negation in Chinese on the basis of spoken and written corpora of Mandarin Chinese. The use of corpus data not only reveals central tendencies in language based on quantitative data but also provides typical examples attested in authentic contexts. In this study we will first introduce the two major negators bu and mei (meiyou) and discuss their semantic and genre distinctions. Following this is an exploration of the interaction between negation and aspect marking. We will then move on to discuss the scope and focus of negation, transferred negation, and finally double negation and redundant negation.
In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English-Nepali and Nepali-English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
This paper explores the collocational behaviour and semantic prosody of near synonyms from a cross-linguistic perspective. The importance of these concepts to language learning is well recognized. Yet while collocation and semantic prosody have recently attracted much interest from researchers studying the English language, little work has been done on collocation and semantic prosody in languages other than English. Still less work has been undertaken contrasting the collocational behaviour and semantic prosody of near synonyms in different languages. In this paper, we undertake a cross-linguistic analysis of collocation, semantic prosody and near synonymy, drawing upon data from English and Chinese (pu3tong1hua4). The implications of the findings for language learning are also discussed.
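A rough sketch of the sort of comparison involved: collect window collocates for each of two near synonyms and contrast their most frequent collocates. The word pair and the file "corpus.txt" below are placeholders, not the items or data examined in the paper.

import re
from collections import Counter

def window_collocates(tokens, node, window=4):
    """Frequency of words co-occurring with `node` within a +/- window."""
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            co.update(tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window])
    return co

tokens = re.findall(r"[a-z']+", open("corpus.txt").read().lower())
for word in ("cause", "produce"):   # illustrative near-synonym pair
    top = window_collocates(tokens, word).most_common(15)
    print(word, [w for w, _ in top])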
Telicity is an important concept in the study of aspect. While the compatibility tests with completive and durative adverbials have long been in operation as a diagnostic for telicity, their validity and reliability have rarely been questioned. This article critically explores the validity and ...
HELP is a frequent verb of English, with a distinctive syntax, that has generated ongoing debate amongst language researchers. As such, it is a verb that is often given some prominence in textbooks and grammars, though the treatment of the verb can be poor. For example, all of the authors who provide a poor account of HELP maintain that the choice of a full or bare infinitive after HELP is determined by a semantic distinction between the two; this is not the case (cf. the section 'Semantic Distinction'). In this paper, we will take a corpus-based approach to improve the description of the verb and to test claims made about the verb in the literature. We will also explore variation in that description between two major varieties of English, British English (BrE) and American English (AmE). In addition, we will investigate how HELP has varied diachronically and by register in these varieties. First, however, the claim that HELP is a frequent verb of English with distinctive syntactic properties must be justified.
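A rough, hedged approximation of how one might begin counting the full versus bare infinitive choice after HELP on untagged text: simply check whether the word following a form of "help" is "to". This is not the paper's method; it over-counts (noun uses of "help", "help with", etc. are not filtered out), and a POS-tagged corpus would be needed for a proper count.

import re
from collections import Counter

def help_complements(text):
    counts = Counter()
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, tok in enumerate(tokens[:-1]):
        if tok in {"help", "helps", "helped", "helping"}:
            counts["full (help to V)" if tokens[i + 1] == "to" else "bare (help V)"] += 1
    return counts

sample = "She helped to organise it, and he helped carry the boxes."
print(help_complements(sample))   # Counter({'full (help to V)': 1, 'bare (help V)': 1})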
This article compares two approaches to genre analysis: Biber’s multidimensional analysis (MDA) and Tribble’s use of the keyword function of WordSmith. The comparison is undertaken via a case study of conversation, speech, and academic prose in modern American English. The terms conversation and speech as used in this article correspond to the demographically sampled and context-governed spoken data in the British National Corpus. Conversation represents the type of communication we experience every day, whereas speech is produced in situations in which there are few producers and many receivers (e.g., classroom lectures, sermons, and political speeches). Academic prose is a typical formal written genre that differs markedly from the two spoken genres. The results of the MDA and keyword approaches both on similar genres (conversation vs. speech) and different genres (the two spoken genres vs. academic prose) show that a keyword analysis can capture important genre features revealed by MDA.
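For orientation, a sketch of one common keyness statistic used by keyword tools such as WordSmith: the log-likelihood score comparing a word's frequency in a study corpus against a reference corpus. The frequency lists and corpus sizes below are invented placeholders, not figures from the article.

import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """2 * sum(O * ln(O/E)) for a word in two corpora; higher = more 'key'."""
    total = size_study + size_ref
    e1 = size_study * (freq_study + freq_ref) / total
    e2 = size_ref * (freq_study + freq_ref) / total
    ll = 0.0
    for o, e in ((freq_study, e1), (freq_ref, e2)):
        if o > 0:
            ll += o * math.log(o / e)
    return 2 * ll

def keywords(study, size_study, ref, size_ref, top=10):
    scored = {w: log_likelihood(f, size_study, ref.get(w, 0), size_ref)
              for w, f in study.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top]

conversation = {"know": 950, "erm": 800, "research": 5}   # per-word counts (invented)
academic = {"know": 120, "erm": 2, "research": 640}
print(keywords(conversation, 100_000, academic, 100_000, top=3))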
Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Indeed, although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction still remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, Lisbon, Portugal, pp. 7-12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale semantically classified multi-word expression template database, the system is also capable of detecting many multiword expressions, as well as assigning semantic field information to the MWEs extracted. Whilst USAS therefore offers a unique tool for MWE extraction, allowing us to both extract and semantically classify MWEs, it can sometimes suffer from low recall. Consequently, we have been comparing USAS, which employs a symbolic approach, to a statistical tool, which is based on collocational information, in order to determine the pros and cons of these different tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we have found a highly complementary relation between the different tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur at low frequencies (lower than three in this case). Due to their complementary relation, we are proposing that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach and a collocation-based statistical approach.
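A schematic illustration of the combination strategy discussed above: take the union of MWE candidates proposed by a lexicon-based (symbolic) extractor and a collocation-based (statistical) extractor, recording which component proposed each candidate. The candidate sets here are invented placeholders, not output from USAS or the statistical tool.

symbolic = {"magistrates court", "in terms of", "by and large"}
statistical = {"magistrates court", "plea bargain", "grievous bodily harm"}

combined = {
    mwe: [src for src, found in (("symbolic", mwe in symbolic),
                                 ("statistical", mwe in statistical)) if found]
    for mwe in symbolic | statistical
}
for mwe, sources in sorted(combined.items()):
    print(f"{mwe:25s} {'+'.join(sources)}")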
A corpus-based analysis of discourses of refugees and asylum seekers was carried out on data taken from a range of British newspapers and texts from the Office of the United Nations High Commissioner for Refugees website, both published in 2003. Concordances of the terms ...
In this paper we describe the Lancaster Speech, Thought and Writing Presentation (ST&WP) Spoken Corpus. We have constructed this corpus to investigate the ways in which speakers present speech, thought and writing in contemporary spoken British English, with the associated aim of comparing our findings with the patterns revealed by the previous Lancaster corpus-based investigation of ST&WP in written texts. We describe the structure of the corpus, the archives from which its composite texts are taken, the decisions that we made concerning the selection of suitable extracts from the archives, and the problems associated with the original archived transcripts. We then move on to consider issues surrounding the mark-up of our data with TEI-conformant SGML, and explain the tagging format we adopted in annotating our data for ST&WP.
In this paper we will extend Smith's (1997) two-component aspect theory to develop a two-level model of situation aspect in which situation aspect is modelled as verb classes at the lexical level and as situation types at the sentential level. Situation types are the composite result of the rule-based interaction between verb classes and complements, arguments, peripheral adjuncts and viewpoint aspect at the nucleus, core and clause levels. With a framework consisting of a lexicon, a layered clause structure and a set of rules mapping verb classes onto situation types, the model is developed and tested using an English corpus and a Chinese corpus.
Fuertes Olivera, P.A. (ed). Lengua y sociedad: Aportaciones recientes en lingüística cognitiva, lenguas en contacto, lenguajes de especialidad y lingüística del corpus, 2005
Proceedings of the Corpus Linguistics 2005 Conference
Semantic lexical resources play an important part in both corpus linguistics and NLP. Over the past 14 years, a large semantic lexical resource has been built at Lancaster University. Unlike other major semantic lexicons in existence, such as WordNet, EuroWordNet and HowNet, in which lexemes are clustered and linked via the relationship between word/MWE senses or definitions of meaning, the Lancaster semantic lexicon employs a semantic field taxonomy and maps words and multiword expression (MWE) templates to their potential semantic categories, which are disambiguated according to their context in use by a semantic tagger called USAS (UCREL semantic analysis system). The lexicon is classified with a set of broadly defined semantic field categories, which are organised in a thesaurus-like structure. The Lancaster semantic taxonomy provides a conception of the world that is as general as possible, as opposed to a semantic network for specific domains. This paper describes the Lancaster semantic lexicon in terms of its semantic field taxonomy, its lexical distribution across the semantic categories and its lexeme/tag type ratio. As will be shown, the Lancaster semantic lexicon is a unique and valuable lexical resource that offers a large-scale, general-purpose, semantically structured lexicon, which can have various applications in corpus linguistics and NLP. The semantic lexicon and the USAS tagger are accessible for academic research as part of the Wmatrix tool; for more details see ...
Semantic lexical resources play an important part in both linguistic study and natural language engineering. In Lancaster, a large semantic lexical resource has been built over the past 14 years, which provides a knowledge base for the USAS semantic tagger. Capturing semantic lexicological theory and empirical lexical usage information extracted from corpora, the Lancaster semantic lexicon provides a valuable resource for the corpus research and NLP communities. In this paper, we evaluate the lexical coverage of the semantic lexicon both in terms of genres and time periods. We conducted the evaluation on test corpora including the BNC sampler, the METER Corpus of law/court journalism reports and some corpora of newsbooks, prose and fictional works published between the 17th and 19th centuries. In the evaluation, the semantic lexicon achieved a lexical coverage of 98.49% on the BNC sampler, 95.38% on the METER Corpus and 92.76%-97.29% on the historical data. Our evaluation reveals that the Lancaster semantic lexicon has a remarkably high lexical coverage of the modern English lexicon, but needs expansion with domain-specific terms and historical words. Our evaluation also shows that, in order to make claims about the lexical coverage of annotation systems as well as to render them 'future proof', we need to evaluate their potential both synchronically and diachronically across genres.
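A minimal sketch of the kind of lexical-coverage figure reported above: the percentage of corpus tokens whose word form is listed in the lexicon. The file names ("semantic_lexicon.tsv", "test_corpus.txt") and the one-entry-per-line, tab-separated lexicon format are assumptions made purely for illustration.

import re

# Assumed format: word form in the first tab-separated field of each line.
lexicon = {line.strip().split("\t")[0].lower()
           for line in open("semantic_lexicon.tsv", encoding="utf-8")}

tokens = re.findall(r"[A-Za-z']+", open("test_corpus.txt", encoding="utf-8").read())
covered = sum(1 for t in tokens if t.lower() in lexicon)
print(f"Lexical coverage: {100 * covered / len(tokens):.2f}% "
      f"({covered}/{len(tokens)} tokens)")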
Annotation schemes for semantic field analysis use abstract concepts to classify words and phrases in a given text. The use of such schemes within lexicography is increasing; indeed, our own UCREL semantic annotation system (USAS) is to form part of a web-based 'intelligent' dictionary (Herpio 2002). As USAS was originally designed to enable automatic content analysis (Wilson and Rayson 1993), we have been assessing its usefulness in a lexicographical setting, and also comparing its taxonomy with schemes developed by lexicographers. This paper initially reports the comparisons we have undertaken with two dictionary taxonomies: the first was designed by Tom McArthur for use in the Longman Lexicon of Contemporary English, and the second by Collins Dictionaries for use in their Collins English Dictionary. We then assess the feasibility of mapping USAS to the CED tagset, before reporting our intentions to also map to WordNet (a reasonably comprehensive machine-usable database of the meanings of English words) via WordNet Domains (which augments WordNet 1.6 with 200+ domains). We argue that this type of research can provide a practical guide for tagset mapping and, by so doing, bring lexicographers one step closer to using the semantic field as the organising principle for their general-purpose dictionaries.
Proceedings of the Beyond Named Entity Recognition Semantic Labeling for NLP Tasks Workshop, 2004
The UCREL semantic analysis system (USAS) is a software tool for undertaking the automatic semantic analysis of English spoken and written data. This paper describes the software system, and the hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained semantic field tags. We discuss the manually constructed lexical resources on which the system relies, and the seven disambiguation methods including part-of-speech tagging, general likelihood ranking, multi-word-expression extraction, domain of discourse identification, and contextual rules. We report an evaluation of the accuracy of the system compared to a manually tagged test corpus, on which the USAS software obtained a precision value of 91%. Finally, we make reference to the applications of the system in corpus linguistics, content analysis, software engineering, and electronic dictionaries.
Proceedings of the 4th Workshop on Asian Language Resources, 2004
This paper first discusses standards for developing Asian language corpora so as to facilitate international data exchange. Following this, we present two corpora of Asian languages developed at Lancaster University: the EMILLE Corpus, which contains 14 South Asian languages, and the Lancaster Corpus of Mandarin Chinese. Finally, we will demonstrate how to explore these corpora using Xara and other corpus tools.
As reported by Wilson and Rayson (1993) and Rayson and Wilson (1996), the UCREL semantic analysis system (USAS) has been designed to undertake the automatic semantic analysis of present-day English (henceforth PresDE) texts. In this paper, we report on the feasibility of (re)training the USAS system to cope with English from earlier periods, specifically the Early Modern English (henceforth EModE) period. We begin by describing how effectively the existing system tagged a training corpus prior to any modifications. The training corpus consists of newsbooks dating from December 1653 to May 1654, and totals approximately 613,000 words. We then document the various adaptations that we made to the system in an attempt to improve its efficiency, and the results we achieved when we applied the modified system to two newsbook texts and an additional text from the Lampeter Corpus (i.e. a text that was not part of the original training corpus). To conclude, we propose a design for a modified semantic tagger for EModE texts that contains an 'intelligent' spelling regulariser, that is, a system designed to regularise spellings in their 'correct' context.
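A toy illustration of what context-sensitive spelling regularisation can mean in practice: variant forms are mapped to modern forms, but a variant that is also a valid modern word (e.g. "doe", "bee") is only regularised when a simple contextual cue applies. The mappings and the cue used here are invented for the example and are far cruder than the regulariser proposed in the paper.

# Invented variant table and context cue, for illustration only.
VARIANTS = {"vnto": "unto", "haue": "have", "doe": "do", "bee": "be", "yt": "it"}
AMBIGUOUS = {"doe", "bee"}                   # also valid modern nouns
VERB_CUES = {"i", "we", "they", "you", "to"} # preceding word suggests a verb reading

def regularise(tokens):
    out = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in VARIANTS:
            if low in AMBIGUOUS and (i == 0 or tokens[i - 1].lower() not in VERB_CUES):
                out.append(tok)              # keep possible noun reading untouched
            else:
                out.append(VARIANTS[low])
        else:
            out.append(tok)
    return out

print(" ".join(regularise("They doe not haue leave to bee idle".split())))
# -> "They do not have leave to be idle"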
Semantic annotation is an important and challenging issue in corpus linguistics and language engineering. While such a tool is available for English in Lancaster (Wilson and Rayson 1993), few such tools have been reported for other languages. In a joint Benedict project funded by the European Community under the 'Information Society Technologies Programme', we have been working towards developing a Finnish semantic tagger that will parallel the existing English semantic tagger. The intention is to avoid building a completely new system but to bootstrap using the existing software and the largely hand-constructed English lexical resources. In this paper, we report on our work to date, which includes (i) a comparative study of the grammar of English and Finnish, (ii) the tagging of an English-Finnish parallel corpus, and (iii) the building of a Finnish lexicon using existing lexicons and software such as a Finnish-English-Finnish machine-translation system, a Finnish dependency parser and a morphological analyser. This paper also discusses some challenging issues that have arisen during the construction of the parallel semantic tagging system between English and Finnish, namely, the complications caused by the widely different grammatical systems of the two languages. We believe that our work will provide valuable experience for the community working on cross-language annotation schemes.
Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, 2003
Automatic extraction of multiword expressions (MWE) presents a tough challenge for the NLP community and corpus linguistics. Although various statistically driven or knowledge-based approaches have been proposed and tested, efficient MWE extraction still remains an unsolved issue. In this paper, we present our research work in which we tested approaching the MWE issue using a semantic field annotator. We use an English semantic tagger (USAS) developed at Lancaster University to identify multiword units which depict single semantic concepts. The Meter Corpus built in Sheffield was used to evaluate our approach. In our evaluation, this approach extracted a total of 4,195 MWE candidates, of which, after manual checking, 3,792 were accepted as valid MWEs, producing a precision of 90.39% and an estimated recall of 39.38%. Of the accepted MWEs, 68.22% or 2,587 are low frequency terms, occurring only once or twice in the corpus. These results show that our approach provides a practical solution to MWE extraction.
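The evaluation arithmetic quoted above can be checked directly: precision is accepted candidates over all extracted candidates, and the estimated recall implies an assumed total of valid MWEs in the corpus, which the short calculation below back-derives purely as a sanity check.

extracted = 4195
accepted = 3792

precision = accepted / extracted
print(f"precision = {precision:.2%}")          # ~90.39%, matching the figure above

estimated_recall = 0.3938
implied_gold_total = accepted / estimated_recall
print(f"implied number of valid MWEs in the corpus = {implied_gold_total:.0f}")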
The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so, a number of issues for South Asian corpus building will be highlighted.
Text reuse is commonplace in academia and the media. An efficient algorithm for automatically det... more Text reuse is commonplace in academia and the media. An efficient algorithm for automatically detecting and measuring similar/related texts would have applications in corpus linguistics, historical studies and natural language engineering.
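As a small sketch of one common approach to detecting text reuse (a generic illustration, not the algorithm developed in this work): word n-gram overlap between a candidate pair of texts, scored with a containment measure, i.e. shared n-grams as a proportion of the shorter text's n-grams.

import re

def ngrams(text, n=3):
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(a, b, n=3):
    """Shared word n-grams as a proportion of the smaller n-gram set."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))

source = "The minister announced a new inquiry into press standards yesterday."
rewrite = "Yesterday the minister announced a new inquiry into standards of the press."
print(f"containment = {containment(source, rewrite):.2f}")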
The paper describes developments to date on the EMILLE Project (Enabling Minority Language Engineering) being carried out at the Universities of Lancaster and Sheffield. EMILLE was established to construct a 67 million word corpus of South Asian languages. In addition to undertaking this corpus construction, the project has had to address a number of related issues in the context of establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools on EMILLE has contributed to the on-going development of the LE architecture GATE.
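A schematic example of the 8-bit-to-Unicode conversion problem mentioned above: legacy data encoded with a proprietary 8-bit font is converted by looking each byte up in a mapping table. The two-entry table below is an invented fragment; a real converter needs a complete, font-specific table (and reordering rules where the font encoding and Unicode sequence order differ).

# Hypothetical fragment of a legacy-font-to-Unicode mapping table.
FONT_TO_UNICODE = {
    0x61: "\u0905",   # assumed: byte 0x61 in the legacy font -> DEVANAGARI LETTER A
    0x6B: "\u0915",   # assumed: byte 0x6B -> DEVANAGARI LETTER KA
}

def convert(raw: bytes) -> str:
    """Map each legacy byte to its Unicode character, passing unmapped bytes through."""
    return "".join(FONT_TO_UNICODE.get(b, chr(b)) for b in raw)

legacy_bytes = b"ka"          # stand-in for text saved with the legacy font
text = convert(legacy_bytes)
print(text, [hex(ord(c)) for c in text])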
GATE is a Unicode-aware architecture, development environment and framework for building systems that process human language. It is often thought that the character sets problem has been solved by the arrival of the Unicode standard. This standard is an important advance, but in practice the ability to process text in a large number of the world's languages is still limited. This paper describes work done in the context of the GATE project that makes use of Unicode and plugs some of the gaps for language processing R&D. First we look at the storing and decoding of Unicode-compliant linguistic resources. The new capabilities for processing textual data and taking advantage of the Unicode standard are detailed next. Finally, the solutions used to add Unicode displaying and editing capabilities to the graphical interface are described.
M. Gavrilidou, G. Carayannis, S. Markantontou, S. Piperidis and G. Stainhauoer (eds). Proceedings of the Second International Conference on Language Resources and Evaluation, 2000
Low density languages are typically viewed as those for which few language resources are available. Work relating to low density languages is becoming a focus of increasing attention within language engineering (e.g. ...
This chapter reports on research carried out by academics from linguistics, history and geography working together to cast light upon the geography of prostitution in seventeenth-century Britain. We will demonstrate the usefulness and untapped potential of combining corpus linguistics and Geographical Information Systems (GIS) as an approach to researching historical texts. Corpus linguists are beginning to pursue new methodological advances which encourage them to “think geographically” and provide opportunities to enrich their understanding of a body of texts by uncovering spatial patterns in types of discourse (Gregory & Hardie 2011: 298-299, 309). The ability to move from corpus text to a visual mapping of geographical data and then back into the corpus text provides rich opportunities for humanities scholars in general, and corpus linguists in particular.
Introduction

The role of corpus data in linguistics has waxed and waned over time. Prior to the mid-twentieth century, data in linguistics was a mix of observed data and invented examples. There are some examples of linguists relying almost exclusively on observed language data in this period. Studies in field linguistics in the North American tradition (e.g. Boas ) often proceeded on the basis of analysing bodies of observed and duly recorded language data. Similarly, studies of child language acquisition often proceeded on the basis of the detailed observation and analysis of the utterances of individual children (e.g. Stern and Stern ) or else were based on large-scale studies of the observed utterances of many children (Templin ).

From the mid-twentieth century, the impact of Chomsky's views on data in linguistics promoted introspection as the main source of data in linguistics at the expense of observed data. Chomsky (interviewed by Andor : 97) clearly disfavours the type of observed evidence that corpora consist of: 'Corpus linguistics doesn't mean anything. It's like saying suppose a physicist decides, suppose physics and chemistry decide that instead of relying on experiments, what they're going to do is take videotapes of things happening in the world and they'll collect huge videotapes of everything that's happening and from that maybe they'll come up with some generalizations or insights. Well, you know, sciences don't do this. But maybe they're wrong. Maybe the sciences should just collect lots and lots of data and try to develop the results from them. Well if someone wants to try that, fine. They're not going to get much support in the chemistry or physics or biology department. But if they feel like trying it, well, it's a free country, try that. We'll judge it by the results that come out.'

The impact of Chomsky's ideas was a matter of degree rather than absolute. Linguists did not abandon observed data entirely – indeed, even linguists working broadly in a Chomskyan tradition would at times use what might reasonably be described as small corpora to support their claims. For example, in the period from 1980 to 1999, most of the major linguistics journals carried articles which were to all intents and purposes corpus-based, though often not self-consciously so. Language carried nineteen such articles, The Journal of Linguistics seven, and Linguistic Inquiry four. But even so there is little doubt that introspection became the dominant, indeed for some the only permissible, source of data in linguistics in the latter half of the twentieth century.

However, after 1980, the use of corpus data in linguistics was substantially rehabilitated, to the degree that in the twenty-first century, using corpus data is no longer viewed as unorthodox and inadmissible. For an increasing number of linguists, corpus data plays a central role in their research. This is precisely because they have done what Chomsky suggested – they have not judged corpus linguistics on the basis of an abstract philosophical argument but rather have relied on the results the corpus has produced.
Corpora have been shown to be highly useful in a range of areas of linguistics, providing insights in areas as diverse as contrastive linguistics (Johansson ), discourse analysis (Aijmer and Stenstrom ; Baker ), language learning (Chuang and Nesi ; Aijmer ), semantics (Ensslin and Johnson ), sociolinguistics (Gabrielatos et al. ) and theoretical linguistics (Wong ; Xiao and McEnery ). As a source of data for language description, they have been of significant help to lexicographers (Hanks ) and grammarians (see sections 4.2, 4.3, 4.6, 4.7). This list is, of course, illustrative – it is now, in fact, difficult to find an area of linguistics where a corpus approach has not been taken fruitfully.