Chapter 1 Hoffmann, Evert, Smith, Lee, Berglund-Prytz (2008) Corpus Linguistics With BNCweb
Foreword
Acknowledgments
1 Looking at language in use: some preliminaries
1.1 Introduction
1.1.1 Thinking about goalless, shall and cars
1.1.2 Clues from a corpus: the BNC
1.2 Why read this book?
1.3 Organization of the book
1.4 How to use this book
Corpus linguistics: some basic principles
2.1 Outline
2.2 Introduction
2.3 Representativeness in corpora
2.4 What is corpus linguistics? Why use a corpus?
2.5 A brief (and more advanced) excursion: description vs. theory
2.6 Types of corpora
2.7 Further reading
Introducing the British National Corpus
3.1 Outline
3.2 Introduction
3.3 Written material
3.4 Spoken material
3.5 More than text
3.5.1 Part-of-speech tags
3.5.2 Headwords and lemmas
3.5.3 Words & sentences versus w-units & s-units
3.6 Format
3.7 Errors
3.8 More information
3.9 Is it Present-day English?
3.10 Exercise
First queries with BNCweb
4.1 Outline
4.2 Introduction
4.3 Getting started: your first query
4.3.1 Planning your query
4.3.2 Running the query
4.3.3 Getting basic frequency information
4.4 Exploring the concordance
4.4.1 Navigating through a query result
4.4.2 KWIC view and Sentence view
4.4.3 Random order and corpus order
4.4.4 Viewing the larger context of an example
4.4.5 Obtaining more information about the source of an example
4.5 Running a query for a word sequence
4.6 Restricting your query to selected portions of the BNC
4.7 Accessing previous queries
4.7.1 Query history
4.7.2 Save current set of hits
4.8 Browse a text
4.9 Exercises
Some further aspects of corpus-linguistic methodology
5.1 Outline
5.2 Introduction
5.3 Comparing results: normalized frequencies
5.4 Normalized frequencies: some further issues
5.5 Precision and recall
5.6 Statistical significance
5.6.1 Confidence intervals
5.6.2 Hypothesis tests for frequency comparison
5.6.3 Using statistical software
5.7 Further reading
5.8 Exercises
The Simple Query Syntax
6.1 Outline
6.2 Introduction
6.3 Basic queries: searching words and phrases
Using wildcards
A short tour of the Simple Query Syntax
Advanced wildcard queries
Queries based on part-of-speech and headword/lemma
Matching lexico-grammatical patterns
Proximity queries
Matching special characters
Exercises
Automated analyses of concordance lines, Part I: Distribution and Sorting
7.1 Outline
7.2 Distribution
7.2.1 A lovely example: distributional facts about the users of lovely
7.2.2 Frequency distribution by genre
7.2.3 Dispersion & File-frequency extremes: checking the influence of idiosyncratic texts on frequencies
7.3 Sort
7.3.1 Sorting a query result on preceding or following context
7.3.2 The Frequency breakdown function
7.3.3 Sorting on the query hit
7.4 Exercises
Automated analyses of concordance lines, Part II: Collocations
8.1 Outline
8.2 Introduction
8.3 Understanding the concept of collocational strength
8.4 Steps in collocation analysis
8.5 Which association measure should I use?
8.6 Calculating collocations in sub-sections of the BNC
8.7 Further reading
8.8 Exercises
"Adding value" to a concordance using customized annotations
9.1 Outline
9.2 Introduction: why annotate your concordance data?
9.3 Annotation within BNCweb: using the "Categorize hits" function
9.3.1 Setting up a category for analysis
Categorizing concordance hits
Analyzing data categorized in BNCweb
Re-editing your annotations
Advantages and disadvantages of categorizing queries within BNCweb
9.4 Summarizing and presenting results of customized annotations
9.5 Exporting a BNCweb query result to an external database
9.5.1 Downloading from BNCweb
9.5.2 Importing into database software
9.5.3 Annotating the database
9.5.4 Analyzing the database
9.5.5 Advantages and disadvantages of the database approach
9.6 Reimporting an analyzed database into BNCweb
9.7 Further reading
9.8 Exercises
10 Creating and using subcorpora
10.1 Outline
10.2 Introduction: why create subcorpora?
10.3 Basic steps for creating and using a subcorpus
10.3.1 Defining a new subcorpus via Written metatextual categories
10.3.2 Running a query on your subcorpus
10.4 More on methods for creating subcorpora
10.4.1 Selecting a narrower range of texts for a subcorpus
10.4.2 Defining a new subcorpus via Spoken metatextual categories
10.4.3 Defining a new subcorpus via Genre labels
10.4.4 Defining a new subcorpus via Keyword/title scan
10.4.5 Defining a new subcorpus via manual entry of text IDs or speaker IDs
10.4.6 Modifying your subcorpora
10.5 Saving time by using subcorpora
10.6 Exercises
11 Keywords and frequency lists
11.1 Outline
11.2 Introduction
11.3 The Keywords function
11.3.1 About keywords
11.3.2 Producing keyword lists
11.3.3 Interpreting and adjusting keyword list settings
11.3.4 Finding items contained in only one frequency list
11.4 The Frequency lists function
11.5 Exercises
12 Advanced searches with the CQP Query Syntax
12.1 Outline
12.2 Introduction
12.3 From Simple queries to CQP syntax: a primer
12.4 Regular expressions
12.5 Part-of-speech and headword/lemma queries
12.6 Lexico-grammatical patterns and text structure
12.7 Advanced features of CQP queries
12.8 Exercises
13 Understanding the internals of BNCweb: user types, the cache system and some notes about installation
13.1 Outline
13.2 BNCweb users: standard users and administrators
13.3 Additional information available to administrator users
13.3.1 Overview
13.3.2 Administrator access to the Query history feature
13.3.3 Administrator access to user-specific data stored by other features
13.4 Customizable settings in BNCweb
13.4.1 Configuration settings available to standard users
13.4.2 Configuration settings available to administrator users
13.5 The cache system
13.5.1 General description
13.5.2 Maintenance of the cache system
13.6 Installation of BNCweb
13.6.1 Prerequisites
13.6.2 Time and disk-space required
13.6.3 Configuration of the Perl library bncConfigXML.pm
References
Glossary
Appendix 1: Genre classification scheme
Appendix 2: Part-of-speech tags
Appendix 3: Quick reference to the Simple Query Syntax
Appendix 4: HTML-entities for less common characters
Index
Foreword
This book is about two electronic objects: the BNC and BNCweb. The BNC (in full, the British National Corpus) is a vast collection of over 4,000 English texts, providing a unique record of contemporary spoken and written English. BNCweb is a piece of software (a search and retrieval system) to enable the student or researcher to extract or derive information from the BNC. Together the BNC (http://www.natcorp.ox.ac.uk/) and BNCweb (http://www.bncweb.info) form an arguably unparalleled combination of facilities for finding out about the English language of the present day (or, more exactly, of the very recent past). To put this claim in context, let me trace the history of these two remarkable objects.
First, the BNC, consisting of nearly a hundred million words in all, was created in 1991-5 by a consortium consisting of three publishers (Oxford University Press, Longman Group Ltd, and Chambers), two universities (Oxford University and Lancaster University) and one library (the British Library). It is by no means an accident that the three publishers involved were (and still are) leading dictionary makers in the UK: computerized lexicography was the foremost application that the commercial members of the consortium had in mind. On the other hand, the corpus is open to all uses and all users. Since its initial release many hundreds of copies have been licensed worldwide to universities, colleges, schools, and industrial organizations, making the corpus available to thousands of users. Why does the BNC contain only British English material? The simple answer is that it was financed 50 per cent by British government grants, and was intended to be an investment in British industry. In fact, industrial users have been a small minority, and the corpus has proved its value as a research resource mainly for increasing numbers of universities and other educational users. (The corpus is described in detail in Chapter 3.)
The fact that the texts and transcriptions included in the corpus were collected in the early 1990s and are now at least 15 years old has scarcely affected its usefulness. In fact, since that time, the increasing availability of computer power and the increasing use of corpora in linguistics have if anything increased its popularity. Naturally an electronic corpus, however vast and varied, is virtually useless without the right software for searching it to investigate the language. One answer to this need is Xaira (in its former incarnation known as Sara), the retrieval tool issued as part of the official BNC distribution, but another is the BNCweb software, which is the topic of this book. In an introductory article on its most recent version (the CQP-edition), Hoffmann & Evert (2006) refer to BNCweb as "a user-friendly and feature-rich corpus tool". The term tool here might strike one as a terminological understatement, given that a Swiss Army knife would
give a poor idea of the multi-functional nature of this interface. However, the terms user-friendly and feature-rich are more than justified. The important point is that this software, which was already a breakthrough in corpus retrieval technology when it was first launched (see Lehmann et al. 2000), has been developed and enhanced over a period of more than ten years during which it has been thoroughly trialed and explored by its authors drawing on the experience of users. The uninitiated corpus-user, with very little introduction and practice, will already find BNCweb a revealing window onto the possibilities of corpus linguistics. As experience with the tool deepens, more and more features or functions can be added to one's repertoire with little fuss. In fact, BNCweb provides an enlightening educational experience as one progresses, until a whole range of things one can do with a corpus are at one's finger-tips. Working through BNCweb is the best guide to corpus linguistics that I can think of, extending the range of questions to which one can seek answers far beyond what most users could imagine. The present guide to the BNC and BNCweb is structured to provide this kind of educational experience, and can be treated as a general textbook introducing corpus linguistics. From the introductory "taster" of Chapter 1, showing how the combination of BNCweb and the BNC can answer some simple English-language problems of interest, the book adds step-by-step to the breadth and depth of inquiries one can make of a richly varied corpus.
Among the topics covered are:
- concordances of words and phrases
- the design and composition of the BNC
- its flexible division into subcorpora which can be studied as corpora in their own right, or used for both quantitative and qualitative comparisons
- quantitative aspects of corpus investigation: relative frequency, statistical significance, and precision and recall, together with the interpretation and evaluation of these features
- annotation: part-of-speech tags and lemmas, and ways of using them for more sophisticated searches
- the versatile search potential opened up by BNCweb's Simple Query Syntax and the even more advanced CQP Syntax
- the ability to compare subcorpora in terms of keywords
- the ability to explore collocational strength
- the ability to make one's own subcorpora from the corpus data
- the ability to add one's own linguistic categorizations to the corpus data.
Although the BNC's built-in linguistic annotation (part-of-speech tagging and lemmatization) is essentially word-based, BNCweb's query syntax gives you the power to search on syntactic and lexico-syntactic patterns, while the customizing of your own annotations enables you to extend your analyses beyond the tagging information provided with the corpus. The explanation and illustration of this rich array of features provide a clear and engaging up-to-date introduction to the methodology of corpus linguistics, leading the user through manageable tasks or exercises at each step. BNCweb originated in 1996 at the University of Zurich, where Hans-Martin Lehmann collaborated with Sebastian Hoffmann and Peter Schneider. The latest CQP version of BNCweb has been authored by Sebastian Hoffmann and Stefan Evert. Of the other members of the present authorial team, all have engaged in close work on the BNC. Ylva Berglund Prytz began research on the BNC about ten years ago, and has been more directly involved with the corpus in recent years, owing to her position at the Oxford University Computer Services, where she performs, among other functions, that of BNC Communications Officer. Nick Smith was part of a team at Lancaster University involved in the grammatical tagging of the BNC during the 1990s, and has considerable experience of teaching and tutoring in corpus linguistics using BNCweb. David Lee has also worked on the BNC from a different angle: he undertook his PhD research at Lancaster using the BNC, and on the basis of this experience, later compiled a genre classification scheme for the corpus, which provides a more complete and detailed classification of the texts of the corpus than was available in the original BNC release. This scheme is now officially issued with the corpus, and its text classification has been incorporated into the search parameters of BNCweb. Writing this book has been a collaborative enterprise in which all authors are jointly involved. 
While Sebastian Hoffmann has been the lead author for the whole project, it may be of interest to know which authors bore particular responsibility for which parts of the book. The following list indicates who were the main authors of individual chapters (the authors are identified by initials):
Ch. 1: SH
Ch. 2: SH
Ch. 3: YB
Ch. 4: NS
Ch. 5: SH & SE
Ch. 6: SE
Ch. 7: SH
Ch. 8: SH & SE
Ch. 9: NS
Ch. 10: DL
Ch. 11: DL
Ch. 12: SE
Ch. 13: SH
In addition, David Lee and Nicholas Smith contributed extensively to the revision of the manuscript after the completion of the first draft.
This book will be of benefit to a variety of users. Perhaps the group that springs to mind most immediately consists of researchers and students working on the English language, especially those focusing on lexical and grammatical usage. But people working within areas as diverse as discourse analysis, stylistics, psycholinguistics, semantics and pragmatics, for instance, are increasingly recognizing the value of the corpus as an empirical basis for answering questions within linguistics more generally. Another prominent group of users, who likewise may benefit from this book, are those concerned with teaching and learning English, especially English as a Foreign Language (EFL) and English as a Second Language (ESL). There have already been some pioneering studies comparing native-speaker usage, as reflected in the BNC, with non-native usage, as reflected in corpora produced by non-native speakers. Teachers, textbook writers and even learners of English (especially at the more advanced levels) will find out how to retrieve authentic examples illustrating points of usage, and invaluable information on the frequency of target structures in different varieties of spoken and written British English. In short, this book will no doubt provide an invaluable foundation for novices in the field of corpus exploration. But seasoned corpus linguists, too, are likely to benefit from its extensive discussions of methodological issues and the detailed description of the more advanced features of BNCweb. Whatever your interest, the practical orientation and thorough coverage of this book will equip you with the necessary skills for conducting well-informed corpus-based investigations into the workings of the English language.
Geoffrey Leech
Lancaster University
June 2008
1.1 Introduction
1.1.1 Thinking about goalless, shall and cars
Let's start by having a look at the following three questions:
a) What is the meaning of goalless?
b) How is the word shall used in Present-day British English? Suggest one or two typical examples to illustrate your description.
c) Who talks more about cars, British men or British women?
Question a) concerns the meaning of a single word. This type of question could, for example, be asked by a learner of English as a foreign language who has come across goalless without sufficient context to fully understand its meaning. In contrast, the second question goes beyond lexical meaning; shall is a modal verb (like will, must and can) and is therefore normally used together with other verb forms (like run, sing and be). In other words, rather than simply asking a question about the meaning of a certain word, question b) is about how this word can be combined with other elements of the English language to express a particular grammatical relationship or function. This question might for example be asked by an English teacher who is preparing a lesson on modal verbs. Question c), finally, broadly deals with the relationship between language and society. It is admittedly a bit of an odd question (calling up common clichés and stereotypes about the difference between the two sexes), and you are probably more likely to meet questions of this form during a dinner table conversation than as part of a linguistic enquiry. But there's a deeper reason for asking this question here, which will become apparent when we discuss possible answers, so let's just for the time being assume that this is a perfectly sensible thing to ask.
Task: Spend a few moments thinking about possible answers to the questions above. Then ask some fellow students or friends the same questions and compare their answers to yours. Do you all agree on what the correct answers are? If not, think about the reasons why these differences may have occurred.
If you are a native speaker of English (or a highly proficient speaker of English as a second or foreign language), you may feel that your intuitions about the language will be fully sufficient to provide answers to all three questions. However, and this may have been confirmed if you did the above task as a group of people, even native speakers quite often disagree about certain aspects of language and its use, and these three questions may be no exception. For example, when answering question a), many people immediately think of goalless as meaning 'aimless, purposeless; having no destination'. Interestingly, typically only a few people think of a second meaning of the word, namely that which is used in football to refer to 'a game in which no goals were scored on either side'.
Moving on to question b), your intuition may have told you that shall is quite old-fashioned and slowly dying out, while speakers nowadays prefer will and other future time expressions such as going to or gonna. You may also have worked out that the modal auxiliary shall is followed by the infinitive without to, and perhaps even that shall is used most frequently when the subject is a first person pronoun (that is, I or we). As a result, the typical example you gave might have looked something like this:
(1) I shall ring you up as soon as I arrive.
Alternatively, you might also have thought of a use of shall in offers, suggestions, requests for instructions, and requests for advice. This use takes the form of a question, i.e. the subject (e.g. I) follows the modal shall. A typical sentence is shown in (2).
(2) Shall I carry your bag?
When asked about the level of formality of this second type of use, people are usually quite undecided. However, the majority have the impression that this is a particularly polite (and therefore formal) usage. Furthermore, when asked about which of the two structures is more frequent, people often don't feel confident in providing a clear answer.
As for question c), most people would answer this by stating that men talk more about cars than women. This quick summary clearly shows that the intuition-based approach can result in a considerable range of possible answers, and it is not clear how close to the "truth" (or perhaps better, how close to actual usage) they really are. In order to determine this, you may therefore want to find independent confirmation. Let us consider some ways in which this could be done. For example, dictionaries will easily help you with question a). Indeed, the Oxford English Dictionary (OED) lists both of the meanings of goalless that were mentioned above. Yet
you may also want to know which of the two senses is more common in Present-day English: unfortunately, the OED does not give you any help there.¹ For the second question, grammar books are an obvious source of additional information. However, in this context it is important to ask what authority the author of a particular grammar book has for writing up his or her description. If its contents are heavily based on the author's intuitions about the English language, they may in fact also not fully reflect actual usage, even considering that an author of a grammar book is likely to be very knowledgeable about such matters.² Another way of trying to find answers to at least the first two questions is by asking a wide range of informants who are native speakers of English. This is best done by giving them apparently unrelated questions whose context will trigger the use of the feature in question (e.g. shall vs. will). This method of "informant testing" is often more accurate than a direct appeal to native speaker intuitions, as the information provided is less likely to be influenced by factors such as self-censorship or accommodation. For example, when asked directly, an informant may opt to use I will or I'll (instead of I shall) because he or she does not want to give the impression of being old-fashioned. However, the same informant may not have any problems with using I shall in situations where they are not aware of the fact that the questions or tasks are designed to extract information about their use of will vs. shall. Although this informant-based method is clearly more informative than relying purely on the intuitions of a single speaker, it is obviously also much more difficult and time-consuming to carry out. Finally, you could simply decide to observe what's happening around you and draw your conclusions on the basis of the data you collect. Every time someone talks about a car, you take note of the speaker's sex.
Every time someone uses shall, you look at the type of construction in which it is used. And every time you read or hear goalless, you use the context to find out more about the meaning of this word. Once you have noted down a sufficient number of instances, you will have a reliable basis for a description of what is really going on with goalless, shall and talk about cars in today's English. However, there are two major problems with this method. First, with fairly infrequent words and expressions (e.g. goalless), you will have to wait a very long time before you have enough data to make any general claims. Secondly, and more importantly, your language experience may differ dramatically from that of other people who also use English. If, for example, you are a student at a British university, a large
1 However, some learner dictionaries (e.g. the Collins COBUILD Advanced Learner's English Dictionary 2006) do indicate whether certain senses are particularly common or rare.
2 It has to be pointed out, however, that many modern descriptions of English are no longer purely intuition-based. Instead, grammar books nowadays are often based on exactly the kind of data and methodology that we will describe in this book.
part of your language use will take place in interactions with other students and a considerable part of what you read will be academic texts (like the one you are reading right now). This is very different from the language experience of an average coal miner, lawyer, or jazz musician. And maybe the experience of these other types of language users will be particularly different from yours just in the context of the three questions you are trying to answer. This book is about a method (and a tool) that will allow you to eliminate these two major problems to a very large extent. Suppose you had access to a huge collection of texts and conversations produced by a cross-section of today's population in Britain, i.e. by students, lawyers, jazz musicians, coal miners and a whole range of other types of language users. Further suppose that you would have access in such a way that it is possible to easily search the complete collection in a matter of seconds, and that you would also be able to get further information about the search results that are retrieved (e.g. about the type of speaker or writer, the kind of context in which it was produced, etc.). This is exactly what the British National Corpus (BNC) and BNCweb will give you.
1.1.2 Clues from a corpus: the BNC
The BNC is a 100-million-word collection of samples of written and spoken language from a wide range of sources. It was put together to represent a wide cross-section of current British English, and contains a large number of language samples from different kinds of texts, produced by different kinds of language users and made available in different ways. A more detailed description of the corpus (including an account of how it was compiled, what type of texts it contains and what additional information is available about these texts) will be given in Chapter 3.
BNCweb is a user-friendly web-based interface that was created to search (or as we say, to query) the data contained in the BNC. It gives you easy access to a wide range of functions that allow you to linguistically analyze the results of your queries. Originally developed at the University of Zurich by Hans-Martin Lehmann, Sebastian Hoffmann and Peter Schneider (see Lehmann et al. 2000), BNCweb is nowadays maintained and further extended by Sebastian Hoffmann and Stefan Evert. The functionality of BNCweb is described in detail in the remaining chapters of the book.
To whet your appetite, let us quickly return to our three questions and see what clues we can find with the help of the BNC and BNCweb. A quick search for goalless shows that there are only 86 instances in the whole corpus. So on average, the word occurs less than once in every million words. Figure 1.1 displays how BNCweb will present the results of the search (or query) to you. This kind of output is generally referred to as a concordance.
Figure 1.1: The first 15 hits of a search for goalless in the BNC (cropped view)
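To make the idea of a concordance more concrete, here is a minimal Python sketch of a keyword-in-context (KWIC) display. It is only an illustration of the layout, not a description of how BNCweb works internally: the function name `kwic`, its `width` parameter and the three sample sentences are all assumptions invented for this example, and the sentences are not taken from the BNC.

```python
import re


def kwic(texts, node, width=30):
    """Return one keyword-in-context line per whole-word match of `node`.

    `width` is the number of characters of context kept on each side.
    (Hypothetical helper for illustration; not part of BNCweb.)
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so that all hits line up in a column.
            lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample sentences, NOT taken from the BNC.
sample = [
    "The match ended in a goalless draw after extra time.",
    "United were held to a goalless first half at home.",
    "Nobody mentioned the score.",
]

for line in kwic(sample, "goalless"):
    print(line)
```

Because the search word is aligned in a single column, recurring neighbours (such as draw in the first invented sentence) are easy to spot. Figure 1.1 shows the same kind of layout as produced by BNCweb for the real corpus data.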
Looking at this concordance, it is immediately obvious that football appears to be the predominant context in which British English speakers make use of the word goalless. In fact, if you were to look at all 86 instances in more detail, you would find that every single one is from the field of sports. Now, this does not of course mean that the other meaning of goalless (i.e. 'aimless') does not exist at all in Present-day English. After all, although the BNC contains nearly 100 million words, it is actually quite tiny in comparison with the totality of language use in Britain, and it is entirely possible that some very infrequent features are not represented at all in the corpus. However, you can now safely say that the 'aimless' meaning of goalless is very marginal indeed. The other obvious point to note from this list of results is that goalless often co-occurs with draw, referring to a game during which no goals are scored.³ Of the total of 86 instances, 51 (59 per cent) co-occur with draw. If you are a learner of English as a foreign language, this is useful information because it will not only allow you to understand the most common meaning of the word but it will also give you the opportunity to notice how it is used idiomatically by native speakers. What can the BNC tell us about the second question, i.e. how shall is used in Present-day English? A simple lexical search of shall gives you many more hits than you will want to look at: there are 19,505 instances of shall in the whole
3 At least this is the case in British English. Speakers of other varieties of English may prefer the expression goalless tie instead.
BNC. However, we could restrict our investigation by looking at the spoken part of the corpus only. A good reason for doing this is that we suspect that shall is becoming less common nowadays: it is widely assumed in linguistics that when something changes in a language, that change generally starts in the spoken rather than the written variety. With BNCweb, it is easy to restrict searches to sub-parts of the corpus, e.g. spoken texts only. This part of the BNC contains about 10 million words, but shall still occurs 2,735 times. This suggests that shall is still in common use in Present-day English (compare this to the 86 instances of goalless in the whole corpus) and that it is still a long way from vanishing from the language altogether. Figure 1.2 shows a screenshot of the first five hits that are returned by BNCweb. As you can see, both types of uses mentioned above are found in these first few sentences, e.g. shall we listen to you (no. 1, where the personal pronoun follows shall) and I shall be contacting him (no. 4, where the personal pronoun is placed first). But which of the two patterns is more frequent, and can we find out more about preferences among particular (types of) speakers?
Figure 1.2:
One way of proceeding from here would be to look at every single one of the 2,735 instances of shall returned by the search, always noting down information about the speaker (if available) and the grammatical pattern in which it is used. However, this would be very tedious and time-consuming. Fortunately there are quicker and more convenient ways of seeing patterns in the way shall is used. Let's for example consider the age of speakers who use shall. Our intuition might tell us that older speakers are typically more conservative and might therefore be more likely to use an old-fashioned form. If this were true, we might then expect the use of shall to be more frequent among older speakers than among younger ones. BNCweb allows you to test this hypothesis in just a few simple clicks (using the so-called DISTRIBUTION feature).
Figure 1.3: Distribution of shall over the category "Age of speaker" in the spoken component of the BNC
As you can see in Figure 1.3, the data is not conclusive: older speakers do not use shall more frequently than younger ones; in fact, it is the youngest group that can be seen to use this modal most often, while the oldest age group is found somewhere in the middle of the table. Clearly, this finding does not support the view that shall is archaic and in the process of dying out. But let's dig a little deeper. Another thing you can do with BNCweb is to find out which words occur particularly often before or after shall. In this way, you could confirm your hunch (if this is what you came up with in response to question b) that the first person pronoun subjects I and we are very frequent both before and immediately after shall. It turns out that nine out of every ten instances of shall occur together with I or we. The interesting question now is whether there are any differences among the various age groups with respect to the two possible sentence types, i.e. I/we shall vs. shall I/we. Again, BNCweb gives you this type of information very quickly: the results are shown in Figures 1.4a and 1.4b.
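The distinction between the two sentence types can be sketched in a few lines of Python. This is a deliberately simplified illustration and not BNCweb's method: the function name `shall_pattern` is an assumption made up for this example, it looks only at the words directly next to the first shall in a string, and it ignores punctuation and part-of-speech information. The four test strings are the examples quoted earlier in this chapter.

```python
import re
from collections import Counter


def shall_pattern(sentence):
    """Classify the first 'shall' in `sentence` by its immediate neighbours.

    Returns 'I/we shall', 'shall I/we', 'other', or None if there is no
    'shall'. (Hypothetical helper for illustration only.)
    """
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    if "shall" not in tokens:
        return None
    i = tokens.index("shall")
    before = tokens[i - 1] if i > 0 else ""
    after = tokens[i + 1] if i + 1 < len(tokens) else ""
    if before in {"i", "we"}:
        return "I/we shall"
    if after in {"i", "we"}:
        return "shall I/we"
    return "other"


# The four examples quoted in this chapter.
examples = [
    "I shall ring you up as soon as I arrive.",
    "Shall I carry your bag?",
    "shall we listen to you",
    "I shall be contacting him",
]

counts = Counter(shall_pattern(s) for s in examples)
for pattern, n in counts.items():
    print(pattern, n)
```

Counting the labels per speaker group, rather than over the whole list, would give the kind of breakdown by age that the text goes on to discuss.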
Figure 1.4a:
Figure 1.4b:
As you can see, the two patterns show an opposite trend: I/we shall is most often used by older speakers (182 instances, on average 160 times per million words), but the same group of speakers uses shall I/we the least (103 instances, or about 91 instances per million words). The reverse is true for the youngest speakers, who use shall I/we most often (175 instances, 454 instances per million words) but hardly use I/we shall at all (only 11 instances). Now that you have obtained these findings, or DESCRIPTIVE STATISTICS, you have quite a good foundation for answering the second of the three questions at the start of this chapter. First of all, you can say that shall is still quite frequent in Present-day English, although of course you haven't yet checked how much more frequent will is. Secondly, you can say that one of the two uses, i.e. I shall or we shall, is predominantly used by older speakers, suggesting that the declarative form may indeed be old-fashioned. Furthermore, you can say that the other type of use, which includes offers, suggestions and requests for instructions expressed by shall I? or shall we?, is mainly used by younger speakers. Finally, and most crucially, you could look at this age distribution as a snapshot of a change in the English language that is still ongoing, and from this predict what the future of this use might be. Think about it: what will happen
when the young speakers represented in the BNC are sixty or older? Will they have started using I/we shall more frequently by then because that's simply what older speakers do? Probably not. A much more likely interpretation of the data is that the declarative use is slightly dated and indeed slowly leaving the language: it is dying out. The use of shall for offers and suggestions, on the other hand, is probably going to increase even further. If this is true, perhaps it would make sense for teachers of English as a second or foreign language to introduce this type of use first, and only later go on to present the more marginal and archaic uses. Even though we have extracted all sorts of information from the corpus, we have of course not yet answered the question whether the use of shall in offers and suggestions is particularly polite or not. Unfortunately, the tables we have compiled so effortlessly do not help us find this answer. Instead, we will have to look more closely at a sufficient number of instances of this particular use of shall in context. Descriptive statistics are almost always only one side of the coin, and a comprehensive description of a linguistic phenomenon will often require both a quantitative and a qualitative analysis of the data. Finally, let's have a quick look at the third question. But how do we do this? How can we really answer the question whether men or women talk more about cars? A very basic approach would be simply to look for the word car and to have BNCweb calculate the same kind of distributional statistics as for shall above, just this time for the sex of speakers rather than age. Figure 1.5 displays the result of this calculation. Interestingly, women seem to use the word car more often than men. Notice, by the way, that the number of actual hits is higher for men (1,789 male vs. 1,597 female uses), but we need to take into account that there are more words in this corpus uttered by men than by women.
This is why measuring the frequency across the same amount of text (as occurrences per million words, for example) is important: 485 instances per million words (pmw) for women vs. 361 pmw for men. We will (or we shall?) return to this issue again in later chapters.
Figure 1.5: Distribution of the word car over male and female speakers in the spoken component of the BNC
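The normalization behind the per-million-word figures quoted above can be sketched in a few lines of Python. This is only an illustration, not BNCweb's own code: the function names per_million and implied_corpus_words are hypothetical, and the subcorpus sizes are back-calculated from the rounded values given in the text, so they are rough estimates rather than official BNC word counts.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalized frequency: occurrences per million words (pmw)."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits / corpus_words * 1_000_000


def implied_corpus_words(hits: int, pmw: float) -> int:
    """Estimate the size of a subcorpus from raw hits and a reported pmw value."""
    return round(hits / pmw * 1_000_000)


# Raw hits and pmw values for "car" as quoted in the text: (hits, pmw).
reported = {"male": (1789, 361), "female": (1597, 485)}

for sex, (hits, pmw) in reported.items():
    # Assumption: this word count is derived from rounded figures, not looked up.
    words = implied_corpus_words(hits, pmw)
    print(f"{sex}: {hits} hits in roughly {words:,} words "
          f"= {per_million(hits, words):.0f} pmw")
```

The point of the sketch is that the ranking can flip once the amount of text is taken into account: the male speakers produce more raw hits, but they also contribute considerably more words, so their normalized frequency is lower.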
But what have we actually answered by looking at Figure 1.5? If you think about it, not all that much. First of all, we have forgotten an important part of the use of the word car: the plural form cars. Secondly, and much more importantly, what does it actually mean to "talk about cars"? Do you always need the lexical item car to do so? If someone says I bought a Merc yesterday, clearly this is also talking about a car. Conversely, what about mentioning a car boot sale? The word car is used here, too, but is the speaker really talking about cars? You can probably see that finding a reliable answer to the third question involves much more than a simple search and a few clicks in BNCweb, and this is a valuable insight. Some research questions are much easier to answer with the help of corpora than others, and it is important to know both the opportunities and the limitations that the use of corpora involves.
1.2 Why read this book?
This book is mainly about the practical steps involved in answering relevant linguistic research questions with the help of the BNC and BNCweb. As you will quickly realize, BNCweb is a very user-friendly tool: it is easy to perform a simple search of the corpus, and a few mouse-clicks are usually sufficient to give you lots of further information about your query. You might therefore wonder: is it really necessary to read a detailed manual? Our answer to this is: first, this book is not just a software manual: it was written by linguists interested in language study, and goes beyond a description of what the software can do. It is focused on what linguistic questions you can answer using the software and how you can go about interpreting the data generated by it in a meaningful way. The ease of use of BNCweb makes corpus-based language study appear simpler and more straightforward than it really is, and masks some considerations that should be part of every enquiry. First and foremost, it is necessary to know more about the corpus: What is actually in the BNC? How did the compilers of the BNC choose the texts? How much do we know about the speakers and writers of the texts and the conditions of their production? Second, it is necessary to learn the theoretical bases and methodological steps in corpus-based research: How do I interpret the results presented by BNCweb? What do they tell me about British English as a whole or the text varieties that I chose to examine? What do they not tell me? How do I compare results from different searches? How can I be sure the results are reliable? How do I know that my searches really are relevant to answering my research questions? This book will help you answer these important questions, and you will learn about theory and methods as you work your way through the chapters. It will help you avoid the potential problems and pitfalls that could turn the first
steps of a novice corpus user into a potentially frustrating or misguided experience. In this book, methodological points are addressed and illustrated in the context of actual investigations of language use. It is this combination of theory with extensive hands-on practice that makes the book different from others in the field of corpus linguistics. The functionality of the various features of BNCweb is explained through "real-life" examples of linguistic issues, combining "how-to" with a discussion of theoretical and methodological considerations.
1.3 Organization of the book
The organization of the chapters is as follows: Chapter 2 introduces some of the fundamental concepts of corpus-linguistic methodology. This is followed by a detailed description of the British National Corpus in Chapter 3. In Chapter 4, we then illustrate the basic search functionality of BNCweb and show how a query result (in the form of concordance lines) can be investigated to gain insights into the use of a particular word or phrase. This is followed by a second methodology chapter, Chapter 5, which covers a number of important issues relating to the comparability and reliability of findings made through BNCweb. We focus on why normalized frequencies are important (and how they are calculated), introduce the concepts of "precision" and "recall", and discuss testing for statistical significance. In Chapter 6, we offer a detailed description of BNCweb's "Simple Query Syntax" and show how it can be used to perform highly sophisticated searches of the corpus. The next three chapters are then devoted to various ways of further manipulating and analyzing your query result. Chapters 7 (DISTRIBUTION and SORT) and 8 (COLLOCATIONS) cover ways of exploring your query results automatically, i.e. without the need to look at concordance lines individually (or, as it is often called, manually). In Chapter 9, we then turn to the manual annotation of concordance lines and guide you through the process of adding your own classifications to a query result (either within BNCweb itself or with the help of third-party programs such as Microsoft Excel). For many research questions, it will be necessary to restrict searches to a subsection of the whole BNC, a so-called "subcorpus". Chapter 10 illustrates the various ways in which subcorpora can be defined. Furthermore, we will show how user-defined subcorpora can be employed to make repeated searches of (sub-parts of) the BNC more efficient.
BNCweb also offers two additional functions, the FREQUENCY LIST and KEYWORD features, that can be used to explore the corpus data from a more "whole-text" or macro perspective (i.e. without starting from a concordance); these will be covered in Chapter 11.
In addition to the Simple Query Syntax introduced in Chapter 6, BNCweb also accepts queries in something called "CQP Query Syntax", whose advanced features allow users to perform even more powerful and flexible searches of the corpus. Given the much less intuitive nature of this query syntax, however, the description offered in Chapter 12 is likely to appeal predominantly to more advanced users. Chapter 13, finally, concerns practical issues in the running of BNCweb. It covers such aspects as the difference between standard users and users with administrator rights, and it also describes some internal aspects of the workings of the software that have been designed to optimize access by whole groups of users. The chapter concludes by outlining some issues relating to the installation and maintenance of BNCweb.
1.4 How to use this book
This book is probably best read while sitting in front of a computer with access to BNCweb. This will make it possible for readers to gain hands-on experience in using the tool by following the step-by-step descriptions of the many sample analyses. Each chapter also contains a number of tasks and exercises that will offer further opportunities for enhancing and broadening the practical skills of readers. However, the book has been written in such a way as to make independent reading of its contents a worthwhile experience. Several of the chapters contain a considerable amount of information; in fact, it may be too much to fully "digest" everything in one sitting. This especially applies to the two chapters which introduce the Simple Query Syntax and the CQP Query Syntax (Chapters 6 and 12), as their descriptions are designed to be useful as a comprehensive reference to the query language. Although it may be informative to read these chapters in one go, you will probably find yourself returning to their contents at some stage in the future, as your need to make more complex searches arises. A similar comment applies to the chapter describing the BNC (Chapter 3) and to the methodologically oriented Chapters 2 and 5. While we recommend that you consult these chapters thoroughly before you conduct any serious studies on the BNC, we would like to encourage you to explore the different features and options of BNCweb at your own pace, so don't worry if you don't fully understand everything the first time around. As you become more experienced and more familiar with the output provided by BNCweb, you will likely get a better grasp of the more theoretical aspects of corpus linguistic methods that we discuss in these chapters. They are therefore well worth revisiting. In sum, we are confident that this book will give you a thorough grounding in corpus linguistic theory and methods, as you learn by doing while we guide you through this powerful yet user-friendly program.