

Volume 5, Number 3, September 2001
Using Corpora in Language Teaching and Learning

Columns

From the Editors: Welcome to LLT, by Mark Warschauer, Dorothy Chun & Pamela DaGrossa, p. 1
From the Special Issue Editors: Introducing This Issue, by Chris Tribble & Michael Barlow, pp. 2-3
On the Net: Finding Song Lyrics Online, by Jean W. LeLoup & Robert Ponterio, pp. 4-6
Emerging Technologies: Tools and Trends in Corpora Use for Teaching and Learning, by Bob Godwin-Jones, pp. 7-12
Announcements: News from Sponsoring Organizations, pp. 13-18

Reviews
Edited by Jennifer Leeman

Multilingual Corpora in Teaching and Research, Simon P. Botley, Anthony M. McEnery, & Andrew Wilson (Eds.). Reviewed by John M. Lawler, pp. 19-23
Patterns and Meanings: Using Corpora for English Language Research and Teaching, Alan Partington. Reviewed by József Horváth, pp. 24-27
Exploring Academic English: A Workbook for Student Essay Writing, Jennifer Thurstun & Christopher Candlin. Reviewed by Paul Thompson, pp. 28-31
MonoConc Pro and WordSmith Tools. Reviewed by Randi Reppen, pp. 32-36

Feature Articles

Genres, Registers, Text Types, Domain, and Styles: Clarifying the Concepts and Navigating a Path Through the BNC Jungle. David YW Lee, Lancaster University, pp. 37-72
Text Categories and Corpus Users: A Response to David Lee (Commentary). Guy Aston, University of Bologna, Italy, pp. 73-76
An Evaluation of Intermediate Students' Approaches to Corpus Investigation. Claire Kennedy & Tiziana Miceli, Griffith University, Brisbane, pp. 77-90
Looking at Citations: Using Corpora in English for Academic Purposes. Paul Thompson, Reading University; Chris Tribble, King's College London University & Reading University, pp. 91-105
Lexical Behaviour in Academic and Technical Corpora: Implications for ESP Development. Alejandro Curado Fuentes, University of Extremadura, Spain, pp. 106-129
Teaching German Modal Particles: A Corpus-Based Approach. Martina Mollering, Macquarie University, Sydney, pp. 130-151
The Emergence of Texture: An Analysis of the Functions of the Nominal Demonstratives in an English Interlanguage Corpus. Terry Murphy, Yonsei University, Seoul, pp. 152-173
A Case for Using a Parallel Corpus and Concordancer for Beginners of a Foreign Language. Elke St.John, University of Sheffield, pp. 174-184
Exploring Parallel Concordancing in English and Chinese. Wang Lixun, The Open University of Hong Kong, pp. 185-203

Call for Papers. Theme: Distance Learning
Corpora Research Bibliography

Copyright © 2001 Language Learning & Technology, ISSN 1094-3501. Articles are copyrighted by their respective authors.

About Language Learning & Technology

Language Learning & Technology is a refereed journal which began publication in July 1997. The journal seeks to disseminate research to foreign and second language educators in the U.S. and around the world on issues related to technology and language education.

• Language Learning & Technology is sponsored and funded by the University of Hawai'i National Foreign Language Resource Center (NFLRC) and the Michigan State University Center for Language Education and Research (CLEAR), and is co-sponsored by Apprentissage des Langues et Systèmes d'Information et de Communication (ALSIC), the Australian Technology Enhanced Language Learning Consortium (ATELL), the Center for Applied Linguistics (CAL), the Computer Assisted Language Instruction Consortium (CALICO), the European Association for Computer Assisted Language Learning (EUROCALL), the International Association for Language Learning Technology (IALLT), and the University of Minnesota Center for Advanced Research on Language Acquisition (CARLA).
• Language Learning & Technology is a fully-refereed journal with an editorial board of scholars in the fields of second language acquisition and computer-assisted language learning. The focus of the publication is not technology per se, but rather issues related to language learning and language teaching, and how they are affected or enhanced by the use of technologies.
• Language Learning & Technology is published exclusively on the World Wide Web. In this way, the journal seeks to (a) reach a broad audience in a timely manner, (b) provide a multimedia format which can more fully illustrate the technologies under discussion, and (c) provide hypermedia links to related background information.
• Language Learning & Technology is currently published three times per year (January, May, September).

Sponsors, Board, Editors, and Designers

Sponsoring Organizations

Sponsors
University of Hawai'i National Foreign Language Resource Center (NFLRC)
Michigan State University Center for Language Education and Research (CLEAR)

Co-Sponsors
Apprentissage des Langues et Systèmes d'Information et de Communication (ALSIC)
Australian Technology Enhanced Language Learning Consortium (ATELL)
Center for Advanced Research on Language Acquisition, University of Minnesota (CARLA)
Center for Applied Linguistics, Washington, DC (CAL)
Computer Assisted Language Instruction Consortium (CALICO)
European Association for Computer Assisted Language Learning (EUROCALL)
International Association for Language Learning Technology (IALLT)

Advisory and Editorial Boards

Advisory Board
Susan Gass, Michigan State University (gass@msu.edu)
Richard Schmidt, University of Hawai'i (schmidt@hawaii.edu)

Editorial Board
James D. Brown, University of Hawai'i at Manoa (brownj@hawaii.edu)
Anna Uhl Chamot, The George Washington Univ. (auchamot@gwu.edu)
Thierry Chanier, Université de Franche-Comte (thierry.chanier@univ-fcomte.fr)
Carol Chapelle, Iowa State University (carolc@iastate.edu)
Graham Crookes, University of Hawai'i at Manoa (crookes@hawaii.edu)
Martha E. Crosby, University of Hawai'i at Manoa (crosby@ics.hawaii.edu)
Graham Davies, Thames Valley University (grahamdavies1@compuserve.com)
Robert Debski, University of Melbourne (robert@genesis.language.unimelb.edu.au)
Robert Godwin-Jones, Virginia Commonwealth Univ. (rgjones@atlas.vcu.edu)
Lucinda Hart-González, Univ. of MD, University College (lhart@umuc.edu)
Joan Jamieson, Northern Arizona University (joan.jamieson@nau.edu)
Batia Laufer, University of Haifa (batialau@research.haifa.ac.il)
Allan Luke, University of Queensland (a.luke@mailbox.uq.edu.au)
Mary Ann Lyman-Hager, San Diego State University (mlymanha@mail.sdsu.edu)
Alison Mackey, Georgetown University (mackeya@gusun.georgetown.edu)
Carla Meskill, SUNY-Albany (cmeskill@uamail.albany.edu)
Denise Murray, San Jose State University (denise.murray@mq.edu.au)
Noriko Nagata, University of San Francisco (nagatan@usfca.edu)
David G. Novick, University of Texas at El Paso (novick@cs.utep.edu)
Patricia Paulsell, Michigan State University (paulsell@msu.edu)
Jill Pellettieri, CA State Univ., San Marcos (pjill@csusm.edu)
Joy Kreeft Peyton, Center for Applied Linguistics, Washington, DC (joy@cal.org)
Jenise Rowekamp, University of Minnesota (rowek001@tc.umn.edu)
Rafael Salaberry, Rice University (salaberry@rice.edu)
Larry Selinker, University of London (l.selinker@app-ling.book.ac.uk)
Maggie Sokolik, University of Cal., Berkeley (sokolik@socrates.berkeley.edu)
Seppo Tella, University of Helsinki (seppo.tella@helsinki.fi)
Leo van Lier, Monterey Institute of International Studies (lvanlier@miis.edu)
Yong Zhao, Michigan State University (zhaoyo@msu.edu)

Editorial Staff

Editors
Mark Warschauer, University of CA, Irvine (markw@uci.edu)
Dorothy Chun, University of CA, Santa Barbara (dchun@humanitas.ucsb.edu)

Associate Editors
Irene Thompson, The George Washington University (Emerita) (napooka@aloha.net)
Richard Kern, Univ. of CA, Berkeley (kern@socrates.berkeley.edu)

Managing Editor
Pamela DaGrossa, University of Hawai'i (dagrossa@hawaii.edu)

Web Production Editor
Dennie Hoopingarner, Michigan State University (hooping4@msu.edu)

Book & Software Review Editor
Jennifer Leeman, George Mason University (leemanj@georgetown.edu)

On the Net Editors
Jean LeLoup, SUNY at Cortland (leloupj@cortland.edu)
Robert Ponterio, SUNY at Cortland (ponterior@cortland.edu)

Emerging Technologies Editor
Robert Godwin-Jones, Virginia Commonwealth University (rgjones@atlas.vcu.edu)

Copyeditors
Scott Armstrong, Harvard University (scott9@mediaone.net)
Jan McNeil, National University of Singapore (janamerican@yahoo.com)
Scott Petersen, Meitoku Junior College (rv5s-ptrs@asahi-net.or.jp)
John Rylander, University of Hawai'i (rylander@hawaii.edu)
Anthony Silva, Chaminade University (a.silva@att.net)

The contents of this publication were developed under a grant from the Department of Education (CFDA 84.229, P229A6001296 and P229A6007).
However, the contents do not necessarily represent the policy of the Department of Education, and one should not assume endorsement by the Federal Government.

Information for Contributors

Language Learning & Technology is seeking submissions of previously unpublished manuscripts on any topic related to the area of language learning and technology. Articles should be written so that they are accessible to a broad audience of language educators, including those individuals who may not be familiar with the particular subject matter addressed in the article. General guidelines are available for reporting on both quantitative and qualitative research.

Manuscripts are being solicited in the following categories: Articles, Commentaries, and Reviews.

Articles

Articles should report on original research or present an original framework that links previous research, educational theory, and teaching practices. Full-length articles should be no more than 8,500 words in length and should include an abstract of no more than 200 words. We encourage articles that take advantage of the electronic format by including hypermedia links to multimedia material both within and outside the article.

All article manuscripts submitted to Language Learning & Technology go through a two-step review process.

Step 1: Internal Review. The editors of the journal first review each manuscript to see if it meets the basic requirements for articles published in the journal (i.e., that it reports on original research or presents an original framework linking previous research, educational theory, and teaching practices), and that it is of sufficient quality to merit external review. Manuscripts which do not meet these requirements or are principally descriptions of classroom practices or software are not sent out for further review, and authors of these manuscripts are encouraged to submit their work elsewhere. This internal review takes about 1-2 weeks. Following the internal review, authors are notified by e-mail as to whether their manuscript has been sent out for external review or, if not, why not.

Step 2: External Review. Submissions which meet the basic requirements are then sent out for blind peer review by 2-3 experts in the field, either from the journal's editorial board or from our larger list of reviewers. This second review process takes 2-3 months. Following the external review, the authors are sent copies of the external reviewers' comments and are notified as to the decision (accept as is, accept pending changes, revise and resubmit, or reject).

Commentaries

Commentaries are short articles, usually no more than 2,000 words, discussing material previously published in Language Learning & Technology or otherwise offering interesting opinions on theoretical and research issues related to language learning and technology. Commentaries which comment on previous articles should do so in a constructive fashion. Hypermedia links to additional information may be included. Commentaries go through the same two-step review process described above for articles.

Submission Guidelines for Articles and Commentaries

Please list the names, institutions, e-mail addresses, and if applicable, World Wide Web addresses (URLs), of all authors. Also include a brief biographical statement (maximum 50 words, in sentence format) for each author. (This information will be temporarily removed when the articles are distributed for blind review.)

Articles and commentaries can be transmitted in either of the following ways:
(a) By electronic mail, send the main document and any accompanying files (images, etc.) to llt-editors@hawaii.edu
(b) By mail, send the material on a Macintosh or IBM diskette to the following address:

LLT
NFLRC
University of Hawai'i at Manoa
1859 East-West Road, #106
Honolulu, HI 96822
USA

Please check the General Policies below for additional guidelines.

Reviews

Language Learning & Technology publishes reviews of professional books, classroom texts, and technological resources related to the use of technology in language learning, teaching, and testing. Reviews should normally include references to published theory and research in SLA, CALL, pedagogy, or other relevant disciplines. Reviewers are encouraged to incorporate images (e.g., screen shots or book covers) and hypermedia links that provide additional information, as well as specific ideas for classroom or research-oriented implementations.

Reviews of individual books or software are generally 1,200-1,600 words long, while comparative reviews of multiple products may be 2,000 words or longer. They can be submitted in ASCII, Rich Text Format, Word, or HTML. Accompanying images should be sent separately as jpeg or gif files. Reviews should include the name, institutional affiliation, e-mail address, URL (if applicable), and a short biographical statement (maximum 50 words) of the reviewer(s). In addition, the following information should be included in a table at the beginning of the review:

Books
• Author(s)
• Title
• Series (if applicable)
• Publisher
• City and country
• Year of publication
• Number of pages
• Price
• ISBN

Software
• Title (including previous titles, if applicable) and version number
• Platform
• Minimum hardware requirements
• Publisher (with contact information)
• Support offered
• Target language
• Target audience (type of user, level, etc.)
• Price
• ISBN (if applicable)

LLT does not accept unsolicited reviews. Contact Jennifer Leeman if you are interested in having material reviewed or in serving as a reviewer (leemanj@georgetown.edu).

Jennifer Leeman
Dept. of Modern and Classical Languages
Mail Stop #3E5
George Mason University
Fairfax, VA 22030

General Policies

The following policies apply to all articles, reviews, and commentaries:

1. All submissions should conform to the requirements of the Publication Manual of the American Psychological Association (4th edition). Authors are responsible for the accuracy of references and citations, which must be in APA format.
2. Manuscripts that have already been published elsewhere or are being considered for publication elsewhere are not eligible to be considered for publication in Language Learning & Technology. It is the responsibility of the author to inform the editor of any similar work that is already published or under consideration for publication elsewhere.
3. Authors of accepted manuscripts will assign to Language Learning & Technology the permanent right to electronically distribute their article, but authors will retain copyright and, after the article has appeared in Language Learning & Technology, authors may republish their text (in print and/or electronic form) as long as they clearly acknowledge Language Learning & Technology as the original publisher.
4. The editors of Language Learning & Technology reserve the right to make editorial changes in any manuscript accepted for publication for the sake of style or clarity. Authors will be consulted only if the changes are major.
5. Authors of published articles, commentaries, and reviews will receive 10 free hard-copy offprints of their articles upon publication.
6. Articles and reviews may be submitted in the following formats: (a) HTML files, (b) Microsoft Word documents, (c) RTF documents, (d) ASCII text. If a different format is required in order to better handle foreign language fonts, please consult with the editors.

Language Learning & Technology
http://llt.msu.edu/vol5num3/from_the_editors.html
September 2001, Vol. 5, Num. 3, p. 1

From the Editors

This is a special issue of Language Learning & Technology on using corpora in language teaching and learning. The Guest Editors, Christopher Tribble and Michael Barlow, have written an Introduction to the issue.
In addition to the fine collection of articles and reviews in this issue, we are delighted to announce the addition to the LLT site of a bibliography focused on language corpora. The bibliography is maintained by LLT, and your contributions to it are welcome.

Although the journal is free and available to anyone with Internet access, subscriptions are important. The information obtained through subscriptions allows us to demonstrate to our funders the primary reason to continue supporting the journal, namely, our broad readership. If you have not already done so, please take a moment to subscribe to the journal. If you are already a subscriber, we appreciate your continued support and welcome your feedback.

Finally, we are pleased to announce an upcoming special issue on Distance Learning, to be guest edited by Margo Glew of Michigan State University. Given the rate at which distance learning is being embraced around the world, we anticipate an exciting issue and look forward to your contributions.

Mark Warschauer & Dorothy Chun, Editors
Pamela DaGrossa, Managing Editor

Language Learning & Technology
http://llt.msu.edu/vol5num3/from_the_spec_issue_ed.html
September 2001, Vol. 5, Num. 3, pp. 2-3

From the Special Issue Editors

This Special Issue of Language Learning & Technology has been in the making for many months. We feel it has been worth the effort, and hope that our readers do, too. If you've never used corpus tools in your teaching or learning, we hope that the Special Issue inspires you to investigate further (the research bibliography that has been launched with this special edition should be helpful to this end). If you have been working with this kind of resource for some time, we are sure that you will find articles here that will help you extend and deepen your understanding of the potential of corpora and corpus tools.

The Articles

There are nine major articles in this edition of LLT -- making it one of the largest that the Journal has produced -- and they cover four broad areas of interest to language teachers and students. These concern the kinds of corpus that are most helpful for language learning and teaching; practical applications of corpus resources in special purposes teaching; using corpora in grammar teaching and language awareness raising; and finally the value of parallel aligned corpora (multi-lingual resources which are receiving growing interest in teaching and translation studies) in language learning and teaching.

In the first section, Lee's piece on problems that can arise for teachers and researchers who want to use the British National Corpus (BNC) is of particular relevance, as his account of the problematic area of genre offers a comprehensive guide to the topic. The article is not uncontentious, however, as is made clear by Aston's response in which, while valuing Lee's contribution, he also points out reasons why the BNC has been structured as it is, and gives insights into how teachers can make fuller use of what it offers.

Following this account of issues associated with one of the most important English language corpora, Kennedy and Miceli discuss some of the ways in which language learners can benefit from the investigative approaches which corpus use encourages in language education. Thompson and Tribble then outline a practical application of corpus research methods in helping learners gain mastery of a central skill in academic writing -- citation. These two articles are followed by a further practically oriented paper in which Curado demonstrates the value of corpus-informed teaching and learning in ESP, in particular in relation to vocabulary development.

The third section of the Special Issue considers matters more closely related to the research/language teaching interface. Mollering's article on German modal particles provides a very clear account of ways in which a corpus can be used in language description. Murphy's paper on "emergent texture" demonstrates how a corpus-based approach can provide significant information about interlanguage development.

Finally, in section four there are two papers dealing with applications of parallel aligned corpora. Wang's innovative piece shows that what might be considered a purely academic resource can offer learners very real benefits, and St.John's article provides a neat demonstration of the practical relevance of parallel-corpus-informed teaching with beginner students of German.

The Columns

In On the Net, Jean LeLoup and Robert Ponterio provide guidance for "Finding Song Lyrics Online," a wonderful way to bring authentic language materials into the classroom for use in learning vocabulary, grammar, and topical information. And in keeping with our Special Issue topic, Robert Godwin-Jones brings us information on "Tools and Trends in Corpora Use for Teaching and Learning" in his Emerging Technologies column.

The Journal's sponsors are key in publicizing and otherwise supporting the journal. Please take a moment to find out, under Announcements, what these organizations do and what they are contributing to the field of language learning and technology.

Jennifer Leeman, the Reviews Editor, brings us reviews of three books and one software program in this issue. John Lawler reviews Botley, McEnery, & Wilson's Multilingual Corpora in Teaching and Research; József Horváth comments on Patterns and Meanings: Using Corpora for English Language Research and Teaching by Alan Partington; and Paul Thompson reviews Exploring Academic English: A Workbook for Student Essay Writing. Finally, Randi Reppen appraises MonoConc Pro and WordSmith Tools, software programs which are mentioned throughout this issue.

As editors, we have had the difficult task of selecting from a large number of contributions -- an indication in itself of the growing interest in this area. However, we have had wonderful support from the LLT team -- in particular Pamela DaGrossa, Managing Editor, and, of course, the Journal's General Editor Mark Warschauer, so many thanks to them. Also, we wish to thank the anonymous reviewers who have so generously given their time and professional insight. We hope that they (and you) feel that this special edition justifies their support.

Christopher Tribble
King's College London University (UK)
School of Linguistics and Applied Language Studies, Reading University (UK)

Michael Barlow
Rice University, Texas (USA)

Language Learning & Technology
http://llt.msu.edu/vol5num3/onthenet
September 2001, Vol. 5, Num. 3, pp. 4-6

ON THE NET
Finding Song Lyrics Online

Jean W. LeLoup, SUNY Cortland
Robert Ponterio, SUNY Cortland

Most foreign language teachers enjoy studying song lyrics as authentic text in their classes. Songs can be used at all levels and for a wide variety of activities and purposes such as comprehension, vocabulary introduction, illustration or recognition of grammar structures, and reinforcement of topics. Traditional or new children's songs, musical classics, or the latest pop hits are all fair game. The rhythm and melody of songs can make the words and expressions easier to remember and more enjoyable for students than other sorts of texts.

But providing written support for the lyrics can sometimes be a problem. Photocopying the lyrics from the album cover might not meet the needs of a specific activity if some modification, such as blanking out some words or adding definitions, is required. Retyping or transcribing the lyrics takes time that the teacher might not be able to spare, though, of course, transcribing lyrics is a good listening activity for us teachers as well as for our students.
The Internet has become a useful source of song lyrics that can be copied into a word processor and transformed into an activity for class use. Sometimes these lyrics can be easy to find, but teachers often ask us for help locating songs that they have searched for in vain. We will explore some of the kinds of sites where song lyrics may be found and describe some techniques that can help teachers use WWW search engines to locate the lyrics to a particular song more quickly.

When searching for song lyrics, one needs to think a bit differently from the way one might approach searching for other kinds of information online. Many teachers begin by looking for a good Web site for song lyrics. Although there are some sites that do present a selection of lyrics as a corpus, in most cases this is not a productive search strategy because the songs are generally not collected in one place but rather distributed across millions of different sites around the Internet.

Where can one find these songs? Record labels often have official Web sites for their artists that provide a variety of information about their activities and usually add a "discography" and/or "lyrics" section that might include song lyrics. This site for Patricia Kaas is managed by Sony Music:

http://www.sonymusic.fr/kaas/

Some companies seem to be very protective of their control of the lyrics and have even closed down private sites that put lyrics online.

Official and unofficial fan club sites sometimes duplicate or replace the function of the record label in promoting the artist. For example, discography and lyrics pages for Mecano can be found at the MecanoWeb site:

http://www.geocities.com/~mecanoweb/LETRAS.html
http://www.geocities.com/mecanoweb/DISCOGRAFIAmecano.html

Private sites run by individual music fans are another option, and these might be located anywhere in the world.

http://members.es.tripod.de/Ananta/letras/mecano.htm

Many individuals might have a reason to include the text of a particular song on a Web page. If you need the lyrics for all of the songs on an album, the most efficient search strategy will likely differ from the one to use if you simply need to find a particular song.

There are many search engines for the Web whose results will be similar, so it is not necessary to use any particular site. Some people have a preference for a certain search engine, and this is fine. A few favorites are Altavista.com, Google.com, Snap.com, Yahoo.com, and Lycos.com. Many search engines allow the user to specify sites in a particular language, but this is generally not useful, as few Web sites bother to label their language. Including a language in the search might therefore even prevent finding the pages you need.

The most important technique when searching for songs is to use quotation marks to identify a string of words that go together. "Twinkle, twinkle little star" should locate the title that we intend to find, but without the quotation marks we might also find "The little star will twinkle brightly." Careful use of quotation marks will eliminate false hits -- pages that match the search criteria even though they are not what we want. The more false hits we get, the harder it is and the longer it takes to track down what we really need. But quoting strings that are too long can have the opposite result if some small difference in the text makes the string in the Web page slightly different from the search string. For example, if the title in the Web page appears on two separate lines:

Twinkle, Twinkle
Little Star

our search might miss the very page we are looking for. Every search is a matter of narrowing or widening the search parameters depending on whether we are getting too many false hits or not enough good hits. Quoting strings tends to narrow the search, so use fewer quotes if the search results seem too narrow, more quotes if the results seem too wide.

But just what should we be searching for? A problem for many novice Web searchers is that they begin by searching for words that identify the topic rather than words that will appear on the pages they hope to find. For example, very few pages of song lyrics include the word "lyrics," so do not use the word "lyrics" in the search for the words of a particular song. However, the word "lyrics" might be effective in looking for a collection of lyrics of many songs. Of course, a page in Spanish will probably use the word "letras" rather than "lyrics," so don't forget to consider the various possibilities in the languages that you use.

Most song lyrics pages include the name of the artist and the song title, but not all of them do. In addition, the artist name and song title are the elements most likely to be presented in some fancy format that might prevent the search engine from seeing them correctly. Clearly, the words that will always be on any page containing the lyrics of a song are the words of the song itself, and these are invariably the most effective search parameters.

The words "Twinkle, twinkle little star" are in the song but are also the title, so that search will bring up many pages that include only the titles of songs and not the lyrics. The search string "how I wonder what you are" will be more likely to find only pages with the lyrics of the song. Be sure to consider how common an expression is when selecting search criteria. For instance, "what you are" is a string that we can expect to find in many contexts other than this song. Less common expressions from the song will be more effective: "above the world," "diamond in the sky."

Some songs also have different versions whose lyrics may vary. This is something to consider depending on whether one is looking for a particular version or all versions of a song.
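
The quoting advice above can be illustrated with a minimal sketch in Python. The helper names build_query and page_matches are hypothetical and belong to no search engine's API. The matcher assumes a deliberately naive engine that looks for each term as an exact, case-insensitive substring; real search engines may treat line breaks and punctuation differently.

```python
def build_query(terms):
    """Join search terms into one query string, quoting multi-word phrases.

    Single words are left bare; phrases containing spaces are wrapped in
    double quotation marks so they are searched as a unit.
    """
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue
        parts.append(f'"{term}"' if " " in term else term)
    return " ".join(parts)


def page_matches(page_text, terms):
    """Assumed naive matcher: every term must occur as an exact substring."""
    text = page_text.lower()
    return all(t.strip().lower() in text for t in terms if t.strip())


# A hypothetical lyrics page whose title is split across two lines.
page = (
    "Twinkle, Twinkle\n"
    "Little Star\n"
    "How I wonder what you are\n"
    "Up above the world so high\n"
    "Like a diamond in the sky"
)

short_terms = ["Twinkle", "how I wonder", "above the world", "diamond"]
long_terms = ["Twinkle, twinkle little star"]

print(build_query(short_terms))         # Twinkle "how I wonder" "above the world" diamond
print(page_matches(page, short_terms))  # True
print(page_matches(page, long_terms))   # False: the line break splits the long phrase
```

Under these assumptions, the long quoted title misses the page because of the line break, while several short phrases taken from the body of the song still match it.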
A search for Twinkle "how I wonder" "above the world" diamond is likely to locate the pages we want very effectively. The addition of words from other stanzas might help us eliminate pages that only include the first stanza. In short, the best search strategy is to include only words and short phrases that must appear in the pages we hope to find. To locate sites that provide the lyrics of many songs -- for example, all the songs on a particular album -a different approach is required. In this case one might find either a page with a list of song titles and links Language Learning & Technology 5 Jean LeLoup and Robert Ponterio On the Net to the words of each song, or a long page with the lyrics of many songs. In the first case a search for a couple of titles might work; in the second case, expressions from the lyrics of several songs will be more effective. The problem with searching for titles is that far too many pages will be found that list titles without providing the lyrics. In this case, adding the search term "lyrics" or an appropriate substitute in the targeted language might help. One example of a useful collection of lyrics is the "comptines" page of the "Premiers pas sur Internet" site for French children: http://www.momes.net/comptines/index.html, including the words and often the music for hundreds of children's songs. There are times, though, when the list of titles can be of use. Some sites that sell CDs online also provide audio excerpts of individual songs. This can be a useful tool for the language teacher in search of new music in the target language, especially for teachers who do not often get to travel to countries where the language is spoken. A caveat: once you find the lyrics, check them out carefully before using them. Many Web pages contain errors and misspellings. The lyrics on many pages will require corrections before they are shared with students. 
Copy and paste them into your favorite word processor; read them carefully while listening to the song, and use the spell check. Now that we've discovered how to find these lyrics, what can we do with them in the FL classroom? As was previously noted, many FL teachers like to use songs as authentic materials in their curriculum. Songs can be used in a variety of ways for FL instruction. A search of the FLTEACH archives from January 1, 1999 to the present using the keywords "song lyrics" yields at least 146 hits, ranging from postings that are requesting aid in finding lyrics and using them to detailed messages describing grammar and other language lessons that are enhanced by the use of songs and their lyrics. One example of a lesson that uses lyrics for literacy in the L2 was profiled in a previous column: Literacy: Reading on the Net. A sample message by Kathy White from the FLTEACH archives offers nearly 40 suggestions for activities using music and songs in the FL classroom. Another FLTEACH post by Claudia Irigoin offers song activities from a workshop presentation given in Argentina. The purpose of the workshop was to help teachers motivate students in writing in English (an L2 there). You might also wish to expand repeated portions of songs to make it easier for students to follow along. For listening comprehension, some words or phrases may be replaced by underlining to allow students to fill in the blanks: a cloze task. Definitions or translations of phrases may be added in the margins or footers. Grammatical elements may be highlighted. The text, thus modified, can become a useful tool for language study. Using songs is a wonderful way to make the target language accessible to language learners. It is a universal medium, and speaks volumes about cultural origin, language patterns, and usage. The power that songs contain is underscored by George Jellinek (WQXR-FM): "The history of a people is found in its songs." 
On a more basic level, music and songs are simply the stuff that life is made of: "Give me a laundry list and I'll set it to music" (Gioacchino Antonio Rossini).

Language Learning & Technology http://llt.msu.edu/vol5num3/emerging/ September 2001, Vol. 5, Num. 3 pp. 7-12

EMERGING TECHNOLOGIES

Tools and Trends in Corpora Use for Teaching and Learning

Bob Godwin-Jones
Virginia Commonwealth University

INTRODUCTION

Language corpora have long been exploited for language instruction. Vocabulary lists for learners, for example, have been generated from corpora, and word counts derived from corpus analysis have helped in defining goals for vocabulary acquisition. Dictionary and textbook creators have used corpora extensively. In recent years, the move to the use of authentic language materials in language pedagogy has enhanced the role collections of spoken or written language can play in language learning. Corpora are, after all, huge storehouses of real language use. The interest in languages for special purposes further favors the use of corpora, as a means to identify the specific language components to be taught. Technology enhancements have made corpora more widely available, as well as provided more powerful tools for their use. In particular, the Internet is playing a steadily growing role in the dissemination of corpora and corpus-based teaching materials. Corpora are no longer the exclusive domain of lexicographers and computational linguists.

ACCESS TO CORPORA

Corpora are of interest today to professionals in a wide variety of fields, from ethnologists to telecommunication conglomerates. Creating a language corpus is a major undertaking, both time-consuming and expensive. This is all the more the case for collections which include multiple languages and/or audio/video recordings. Given the cost and the growing interest, it makes little sense for corpora not to be made widely accessible.
In fact, a large number of corpora in many different languages have become available over the Internet in the last few years. Good starting points for finding them are Michael Barlow's Corpus Linguistics page, the Linguistic Exploration page (at the LDC - Linguistic Data Consortium) or the Tractor page (the "Telri Research Archive of Computational Tools and Resources"). These pages in many cases link to direct corpus access, including a number of parallel corpora of particular interest in translation studies and language learning for specific purposes. There are also a substantial number of text collections of literary works in a variety of languages. Some include comprehension aids and annotations for use in language learning. As the number of language archives grows, locating the specific resources needed for a project will become more problematic. One can only go so far with lists of Web links (even when annotated) or traditional Web searching. There is a recently launched international project, the Open Language Archives Community (OLAC), to build an infrastructure linking language archives of all types together. OLAC builds on the Open Archives Initiative and on the Dublin Core Metadata Initiative. The Dublin Core project began in 1995 to develop conventions for resource searching on the Web. OLAC uses the core 15 elements of the Dublin Core and extends them through the use of qualifiers to fit the needs of the language community. The use of a controlled vocabulary of descriptors should allow more efficient searching of archives. The consistent use of meta-data in language resources is likely to become increasingly important in the language community. There has not been a standard way to include information about a resource, such as the participants in an interview included in a corpus (e.g., age, nationality, first language, education, etc.).
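To make the idea of a controlled set of descriptors concrete, here is a minimal Python sketch. It checks a resource description against the 15 core Dublin Core element names. The sample record, the helper name `unknown_elements`, and the dotted "element.qualifier" notation are illustrative assumptions and do not reproduce OLAC's actual qualifier scheme.

```python
# The 15 core Dublin Core element names.
DC_ELEMENTS = frozenset([
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
])


def unknown_elements(record):
    """Return the sorted keys of `record` that are not core elements.

    Assumption: a qualified key is written "element.qualifier", so only
    the part before the first dot is checked against the core set.
    """
    return sorted(key for key in record if key.split(".")[0] not in DC_ELEMENTS)


# Hypothetical description of one recorded interview in a corpus.
record = {
    "title": "Interview 12 (hypothetical example)",
    "language": "es",
    "type": "Sound",
    "contributor.speaker": "adult learner, first language Italian",
    "speaker_age": "34",  # ad hoc field, not a Dublin Core element
}

if __name__ == "__main__":
    print(unknown_elements(record))
```

A check of this kind flags ad hoc fields such as `speaker_age`, which a search tool built on the shared vocabulary would not know how to interpret. Participant details like age, first language, and education are exactly the kind of information a resource description needs to carry in an agreed form.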
Such information is typically included in a "header" which is either part of the resource file itself or stored separately. Because of the different ways such meta-information has been stored, there has been a proliferation of tools and approaches for the user to access that information. It would be very helpful for both researchers and users to have a common approach to resource description, not only for corpora, but for all language resources such as text collections, lexicons, grammar tutorials, multimedia files, Web lessons, and so forth. This would in turn facilitate the development of universal tools.

ENCODING AND ANNOTATION

Standardization, or at least inter-operability, is needed not only in resource description but, of course, also in the encoding and annotation of language resources. Increasingly, corpus creators have moved from proprietary systems to standards-based approaches. Given the effort behind corpus creation and the longevity of most corpora, the challenge is to design an environment which is adaptable over time as technologies evolve. It also needs to be flexible enough to be extensible to include classification categories for which a need may arise in the future. In recent years the Text Encoding Initiative (TEI) has provided a standard used by a large number of language and literature resources. TEI uses SGML ("standard generalized markup language") and provides for an extensive header containing meta-data. The header information is included within the annotation files. While the most widespread use has been in the encoding of literary texts, there is also an extensive list of projects using TEI encoding in corpora in a variety of languages. The TEI standard is part of the Corpus Encoding Standard (CES) proposed by the EAGLES group ("Expert Advisory Group on Language Engineering Standards").
CES specifies a minimal encoding level to be "standard" and provides encoding specifications for linguistic annotation. TEI is also used in the MATE project ("Multilevel Annotation Tools Engineering"), designed for the encoding and annotation of spoken dialogue corpora. The TEI standard, however, has some drawbacks as well. SGML is highly complex (as experienced by anyone having tried to decipher the intricacies of the TEI header), and SGML documents are not directly accessible from standard Web browsers. While the TEI is extensible, customizing it for an individual project is a daunting enterprise. Some projects have done so, such as the BBAW digital dictionary of German, adding custom headers in separate files. In fact, there are a number of advantages to a "stand-off" data architecture in which the annotations and meta-data are stored in separate files from the data itself. This allows for considerable flexibility in adding and changing the annotation categories and information as needed, without having to revise the data files themselves. The encoding system that best lends itself to this is XML ("extensible markup language"), the widely acclaimed successor to HTML and slimmed-down version of SGML. Recent Web browsers have native support for XML documents, but more importantly, there are standards and methods for transforming XML documents on the fly into a variety of formats. For a corpus, annotation can be stored in separate XML documents from the data itself, which are linked in hypertext to the documents. XML enables such linking to be one-way or two-way, useful for parallel corpora. A number of the most recent corpus projects are beginning to use XML, which in fact is being supported by the EAGLES group in an XML version of CES (XCES). There is also an XML version of TEI forthcoming. One of the advantages of XML is that there need not be uniformity in the precise tags used, as long as there is an available description of each tag.
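The "stand-off" arrangement described above can be illustrated with a small Python sketch using the standard library's XML parser. The element and attribute names (`annotations`, `span`, `start`, `end`, `pos`) and the file name are invented for this illustration and are not taken from TEI, CES, or XCES. The point is only that the text stays untouched while the annotation refers to it by character offsets.

```python
import xml.etree.ElementTree as ET

# The primary data: plain text, never modified by annotation.
TEXT = "Twinkle, twinkle, little star"

# Hypothetical stand-off annotation, which would normally live in its own file.
# Offsets are character positions in TEXT (end is exclusive).
ANNOTATIONS = """<annotations source="song.txt">
  <span start="0" end="7" pos="VERB"/>
  <span start="9" end="16" pos="VERB"/>
  <span start="18" end="24" pos="ADJ"/>
  <span start="25" end="29" pos="NOUN"/>
</annotations>"""


def resolve(text, xml_string):
    """Pair each annotated span with the stretch of text it points to."""
    root = ET.fromstring(xml_string)
    return [
        (text[int(span.get("start")):int(span.get("end"))], span.get("pos"))
        for span in root.iter("span")
    ]


if __name__ == "__main__":
    for word, tag in resolve(TEXT, ANNOTATIONS):
        print(word, tag)
```

Because the tags live apart from the text, a second annotation layer (say, syntactic functions) could be added as another XML document without revising either the text or the part-of-speech layer.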
Through XSLT ("extensible stylesheet language transformations"), information from XML documents can be retrieved and reformatted in a variety of ways, providing a powerful means for delivering data to a variety of users and browsers. Of course, a common data model for language resources would make it much easier to standardize access. Points (discrete objects) and spans (strings of objects) must be identified and tagged, with a common level of granularity (i.e., detail), and a means provided of identifying structure, class membership, and inheritance. There have been several large-scale projects, such as Tipster, to provide such a data model. The Atlas project also aims to provide an extensible architecture for linguistic annotation, through use of an "annotation graph model".

RETRIEVAL TOOLS

A common (or at least exchangeable) data model would facilitate the use and development of tools for corpus extraction. In the past, new tools were often developed for the processing of each new corpus created. New projects needed to budget time and money not only for data collection but also for creating an encoding/annotation system as well as a set of tools for accessing the data. Many of these tailor-made systems replicated functionality available elsewhere but not usable due to differences in software, platform or data architecture. Fortunately, there are tool projects underway which are designed with reusability as a major goal. They tend to use a modular, building blocks approach, rather than a monolithic all-or-nothing design, allowing for more flexible use as well as future extensibility. Among such projects is GATE ("General Architecture for Text Engineering"), which sets as its goal a set of infrastructure tools for natural language processing which can accommodate models written for a variety of programming and scripting languages.
The Multext project, similarly, encompasses a series of projects whose goals are to develop standards and specifications for the encoding of corpora and to develop tools and resources using these standards. Multext projects are underway in at least 18 different languages. One of the other positive developments in the area of tools is the Natural Language Software Registry (NLSR), which collects and makes available over the Web detailed information on a wide variety of natural language processing software, including annotation tools (taggers, parsers), speech analysis, machine learning, evaluation tools, corpus analysis, translation, etc. The fourth edition of the NLSR provides for both browsing and searching, using a taxonomy based on the "State of the Art in Language Technology", edited by G.B. Varile and A. Zampolli. Many of the tools listed in the Registry are Internet-based, which is increasingly the case in tool creation. Most use Web forms to provide an access interface, as in sample collections in French, German, Spanish, Chinese, or Japanese. An interesting approach is provided by a service from the University of Leeds, which accepts email to amalgamtagger@comp.leeds.ac.uk containing English text, which is then parsed and tagged for parts of speech and sent back by email.

OUTLOOK ON LANGUAGE LEARNING

One of the more frequently used tools in working with corpora for language learning is the concordance. A concordance is an alphabetical listing of words in a text or collection of texts, together with the contexts in which they appear. Typically concordances are in KWIC format ("key word in context") in which each word is centered in a fixed field, and each occurrence of the word is listed on a separate line. Good concordancers do more than simply index words to lines: they can sort in a variety of ways, search for collocations, and produce extensive statistics.
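The KWIC layout just described can be sketched in a few lines of Python. This is a bare-bones illustration rather than a description of any particular concordancer: the function name `kwic` and the default context width are arbitrary choices, and matching is a simple case-insensitive whole-word search.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, aligned in a fixed field.

    Each line holds `width` characters of left context (right-justified),
    the keyword as it appears in the text, and up to `width` characters
    of right context.
    """
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b{}\b".format(re.escape(keyword)), re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append("{} {} {}".format(left.rjust(width), match.group(0), right))
    return lines


if __name__ == "__main__":
    sample = ("Twinkle, twinkle, little star, how I wonder what you are. "
              "Up above the world so high, like a diamond in the sky.")
    for line in kwic(sample, "twinkle", width=20):
        print(line)
```

Because the keyword always starts in the same column, the lines can be scanned vertically. Sorting the returned lines by their right-hand context would be one simple way to bring recurring collocations together.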
Concordances have been used extensively in literary studies and stylistic analysis, but less frequently in language learning. An extensive linguistic corpus is a gold mine of authentic language use and mining that through KWIC concordances can provide students with multiple contexts from which to learn new vocabulary. An interesting example of this use of concordances is providing contextual help in the reading of second-language texts. This approach seems to work best when students try computer-aided contextual inferences first (through the concordance) which can then be confirmed through on-line dictionary access. Concordances can also be very useful in providing assessment items. Cloze exercises, for example, can easily be generated from KWIC concordances. Corpora, of course, can provide much more than just lexical information; they are invaluable in supplying syntactical examples. One of the caveats in using corpora in this way is that for the most part corpora have been created for research purposes, rather than for language learners, and as a consequence may not supply the needed information. Not all corpora, for example, are annotated for syntactic functions. Most of the parallel corpora available are restricted to narrow, often technical, language uses, thus making them less useful for contrastive analysis or translation studies. Such corpora can on the other hand be invaluable in language learning for special purposes. There have been experiments using syntactically annotated corpora in providing grammar help for learners. The Cytor project at the University of Lancaster showed interesting results in providing students access to concordances, which led to improvement in their categorization of part-of-speech distinctions.
This kind of activity provides a means of putting research tools into the hands of students, and working towards shifting some of the responsibility for learning on to their shoulders. Collections of recorded speech preserved as audio or video are an area of significant interest to language educators. This adds an entirely new dimension to corpora, with the addition of gestures, intonation, and facial expressions, but also adds a challenge in terms of encoding and annotation. There are several projects underway to help in establishing standards for such resources. The ISLE Meta Data Initiative is seeking to create a standard for meta-data description of multimedia language resources. The Talkbank project is an interdisciplinary project hosted by Carnegie Mellon University to provide standards and tools for the study of human (and animal) communication. EUDICO ("European Distributed Corpora Project"), from the Max Planck Institute, is looking at ways to categorize and search collections of annotations on digital video and audio recordings. One of the corpus needs for developers of CALL applications is for collections of non-native speech. Large corpora of transcribed speech data from language learners, for example, could be very useful in efforts to improve the understanding of the speech patterns of language learners necessary for interactive voice applications. There are databases of telephone speech available (from LDC) in a variety of languages. The European Science Foundation Second Language Data Bank consists of data obtained over a 3-year period from adult migrant workers in five European countries with a focus on language learning in the absence of formal instruction. Clearly, creating such non-native language collections is a huge task, complicated by the fact that there should be separate databases for different kinds of non-native speakers (according to country of origin, amount and nature of language exposure, nature of need for language ability, etc.).
The needs of the telecommunication industry for reliable voice-based applications might be helpful in finding funding for such large-scale projects. It would be useful as well to have a corpus of email messages, from both native and non-native speakers, to provide a basis for evaluating the transformation of language through technology, and how that might affect language teaching and learning. A significant impediment in the use of corpora in teaching and learning is the form in which most corpora are stored. Most are annotated in SGML and housed on large Unix servers. In most cases, it is not practical to store such large amounts of data locally. Thus access is provided remotely, which may present performance issues. The other barrier, of course, is the proliferation of different formats for accessing corpora and the bewildering array of tools available. The growth in Web access to corpora and tools is helpful, but often the interfaces are poorly designed. The corpus linguistics community has recognized this issue, as well as the need for greater consideration of teaching needs in corpus design, and the situation looks likely to improve in the future.
RESOURCE LIST

General Corpus Information

• Language Software Helpdesk from the Language Technology Group (Edinburgh)
• Corpora List archive in Hypermail: excellent source of up-to-date info on corpora
• Multilingual Theory & Technology from Xerox
• Corpus Linguistics: Michael Barlow's extensive listing

Corpora Access

• Projects using the TEI
• English Language Corpora and Corpus resources from the British National Corpus
• Corpora, Text Resources: good list from Kita Lab (Japan)
• CobuildDirect Corpus Access Information: commercial site with trial access available
• TRACTOR Network of multilingual resources: corpora in multiple languages listed
• COMPARA: Portuguese-English parallel translation corpus
• COSMAS: access to the Mannheim corpus of German
• LAPT&DA: access to special vocabulary lexica in German (Erlangen)
• Digital Dictionary of the 20th Century German: BBAW project
• Archives for Language Documentation and Description from the University of Pennsylvania
• Linguistic Exploration: list of resources from the University of Pennsylvania
• Web EuroWordNet Interface: access to multilingual lexical knowledge bases
• European Literature - Electronic Texts: comprehensive listing

Standards and Projects

• Open Language Archives Community
• TEI: Text Encoding Initiative
• MATE: Multilevel Annotation, Tools Engineering
• The GATE project: ambitious project for building a NLP infrastructure (Sheffield)
• XML from the W3C (World Wide Web Consortium)
• XSLT from the W3C (World Wide Web Consortium)
• EAGLES: Expert Advisory Group on Language Engineering Standards
• The XML Cover Pages - Home Page: excellent resource list by Robin Cover
• Multext: large-scale corpora and tools project from the Centre National de la Recherche Scientifique (France)
• Talkbank: multimedia database project from Carnegie Mellon University
• Synchronized Multimedia Integration Language from the W3C
• Survey of the State of the Art in Human Language Technology
• EAGLES/ISLE Meta Data Initiative
• Corpus Encoding Standard: part of the EAGLES initiative
• XCES: XML version of CES
• Tipster: main site
• Tipster Architecture info
• ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation
• EUDICO: European Distributed Corpora Project

Corpus Retrieval Tools

• Concordancers: FTP downloads
• TACT (Text Analysis Computing Tools): DOS concordancer from the University of Toronto
• LTG Software: tools for text processing (including XML) from Edinburgh
• On-line corpus analysis: Web-based concordance generator (in German) for texts in French, Italian and Spanish
• Software Tools for NLP: list from Kita Lab (Japan)
• NLSR: Natural Language Software Registry
• CRATER: tools and resources for multilingual corpus work

Teaching and Learning

• Teaching and Language Corpora: article in ReCALL by T. McEnery and A. Wilson (PDF)
• Tutorial: Concordances and Corpora: Web-based introduction by Catherine Ball (Georgetown)
• Corpora in the Teaching of Languages and Linguistics
• Can the rate of lexical acquisition from reading be increased?: case study in concordance use in reading
• Pruebas de PHP-KWIC: Web-based concordance generator for Spanish texts (in Spanish)
• Corpus of Historical and Modern Spanish: Web-based access to large Spanish corpus (from Mark Davies)
• VLC Web Concordancer: search options in Chinese, English, French, Japanese, as well as parallel texts

Language Learning & Technology http://llt.msu.edu/vol5num3/announcements/ September 2001, Vol. 5, Num. 3 pp. 13-18

NEWS FROM SPONSORING ORGANIZATIONS

This page includes announcements from the organizations sponsoring LLT.
University of Hawai'i National Foreign Language Resource Center (NFLRC)

Less commonly taught languages, particularly those of Asia and the Pacific, are the focus of the University of Hawai'i National Foreign Language Resource Center, which engages in research and materials development projects and conducts Summer Institutes for language professionals among its many activities.

PACIFIC SECOND LANGUAGE RESEARCH FORUM (PacSLRF) 2001

The NFLRC is pleased to co-sponsor the upcoming PacSLRF 2001 Conference, which will be held at the Imin Conference Center on the University of Hawai'i at Manoa campus October 4-7, 2001. This international conference will focus on the acquisition of second languages in instructed and naturalistic settings, particularly in East Asian, Southeast Asian, and Pacific languages. Questions? Contact us at pacslrf@hawaii.edu.

NEW PUBLICATIONS FROM THE UH NFLRC

• A Focus on Language Test Development: Expanding the Language Proficiency Construct Across a Variety of Tests by T. Hudson & J. D. Brown (Eds.). This volume presents eight research studies which introduce a variety of novel, non-traditional forms of second and foreign language assessment. To the extent possible, the studies also show the entire test development process, warts and all. These language testing projects not only demonstrate many of the types of problems that test developers run into in the real world but also afford the reader unique insights into the language test development process.
• Motivation and Second Language Learning by Z. Dörnyei & R. Schmidt (Eds.). This volume, the second in this series concerned with motivation and foreign language learning, includes papers presented in a state-of-the-art colloquium on L2 motivation at the American Association for Applied Linguistics Conference (Vancouver, 2000) and a number of specially commissioned studies. The 20 chapters, written by some of the best known researchers in the field, cover a wide range of theoretical and research methodological issues, and also offer empirical results (both qualitative and quantitative) concerning the learning of many different languages (Arabic, Chinese, English, Filipino, French, German, Hindi, Italian, Japanese, Russian, and Spanish) in a broad range of learning contexts (Bahrain, Brazil, Canada, Egypt, Finland, Hungary, Ireland, Israel, Japan, Spain, and the US).

Additions have also been made to the NFLRC NetWorks collection of online publications. Check out our other publications at http://www.LLL.hawaii.edu/nflrc/publication.html.

Michigan State University Center for Language Education and Research (CLEAR)

CLEAR’s mission is to promote foreign language education in the United States. To meet its goals, CLEAR’s projects focus on foreign language research, materials development, and teacher training.

FOREIGN LANGUAGE RESEARCH

• Acquisition of Prosody by English-Speaking Learners of French
• Feedback and Interaction
• Longitudinal Analysis of Foreign Language Writing Development

MATERIALS DEVELOPMENT

Products

• Business Chinese (CD-ROM)
• Modules for Assessing Socio-Cultural Competence for German (CD-ROM)
• Pronunciación y fonética (CD-ROM)
• African Language Tutorial Guide (guide and video)
• Foreign Languages: Doors to Opportunity (video and discussion guide)
• Task-based Communicative Grammar Activities for Japanese and Thai (workbook)
• Test Development (workbook and video)
• The Internet Sourcebook for Business German
• Business Language Packets for High School Classrooms (French, German, and Spanish)

Coming Soon!
• Portuguese Pronunciation and Phonetics CD-ROM
• Modules for Assessing Socio-Cultural Competence for Russian (CD-ROM)
• Thai Tutorial Guide
• The Internet Sourcebook for Business Spanish

Game-O-Matic

The Game-O-Matic is a suite of wizards that create Web-based activities for language learning and practice. Teachers can make original Game-O-Matic games by visiting http://clear.msu.edu/dennie/matic/. Have a new idea for a Game-O-Matic activity? Contact Dennie Hoopingarner at hooping4@msu.edu.

Newsletter

CLEAR News is a biyearly publication covering FL teaching techniques, research, and materials. Contact the CLEAR office to join the mailing list or see it on the Web at http://clear.msu.edu/clearnews/.

TEACHER TRAINING

Summer Workshops

Every summer, CLEAR offers teacher development workshops for foreign language educators to help strengthen and expand their teaching skills. CLEAR offers stipends to help defray the workshop fees and travel/accommodation expenses. For more information, see CLEAR’s Web site at http://clear.msu.edu.

For more information, contact

Center for Language Education And Research (CLEAR)
A712 Wells Hall
Michigan State University
East Lansing, MI 48824-1027
Phone: 517/432-2286
Fax: 517/432-0473
Email: clear@msu.edu

Apprentissage des Langues et Systèmes d'Information et de Communication (ALSIC)

ALSIC (Language Learning and Information and Communication Systems) is an electronic journal in French for researchers and practitioners in fields related to applied linguistics, didactics, psycholinguistics, educational sciences, computational linguistics, and computer science. The journal gives priority to papers from the French-speaking community and/or in French, but it also regularly invites papers in other languages so as to strengthen scientific and technical exchanges between linguistic communities that too often remain separate.
The editorial board of ALSIC invites you to contact them for any prospective contributions at the following electronic address: alsic@lifc.univ-fcomte.fr.

The Australian Technology Enhanced Language Learning Consortium (ATELL)

Contacts:
Dr. Mike Levy, The University of Queensland (mlevy@lingua.arts.uq.edu.au)
Prof. Roly Sussex, The University of Queensland (sussex@lingua.arts.uq.edu.au)

ATELL is an informal collaboration of Australian language teachers involved in technology-enhanced language learning and teaching. It has recently been moved to The University of Queensland, where Dr. Mike Levy and Professor Roly Sussex are developing the concept in collaboration with Mr. Greg Dabelstein, coordinator of the CALL special interest group of the Australian Federation of Modern Language Teachers Associations (AFMLTA). We intend to establish a network of complementary and collaborating resources for teachers and learners in the TELL domain in schools and tertiary institutions. There will be a Web site, which will include information, collaboration, and resources such as

• a register of Australian TELL experts
• links to other sites with TELL-related information and materials
• links to reviews of hardware, software, courseware
• a section for FAQs (Frequently Asked Questions)
• what's new -- ideas, research, materials
• a register of projects, current and past, in TELL research, development, implementation
• software modules, libraries, and related resources for developers
• audio and video files for language learning support
• policies and discussion
• special interest groups

In addition, we are reviving the ATELL mailing list, whose e-mail location is atell@lingua.arts.uq.edu.au. ATELL is supported by the Language Laboratory at the University of Queensland.
Center for Advanced Research on Language Acquisition, University of Minnesota (CARLA)

CARLA is one of nine National Language Resource Centers whose role is to improve the nation's capacity to teach and learn foreign languages effectively. Launched in 1993 with funding from the national Title VI Language Resource Center program of the U.S. Department of Education, CARLA's mission is to study multilingualism and multiculturalism, to develop knowledge of second language acquisition, and to advance the quality of second language teaching, learning, and assessment by

• conducting research and action projects
• sharing research-based and other forms of knowledge across disciplines and education systems
• extending, exchanging, and applying this knowledge in the wider society.

CARLA's research and action initiatives include a focus on the articulation of language instruction, content-based language teaching through technology, culture and language studies, less commonly taught languages, language immersion education, second language assessment, second language learning strategies, and technology and second language learning. To share its latest research and program opportunities with language teachers around the country, CARLA offers the following resources: a summer institute program for teachers; a database which lists where less commonly taught languages are taught throughout the country; listservs for teachers of less commonly taught languages and immersion educators; a working paper series; conferences and workshops; and a battery of instruments in French, German, and Spanish for assessing learners' proficiency in reading, writing, speaking, and listening at the intermediate-low level on the ACTFL scale. Check out these and other CARLA resources on the CARLA Web site at http://carla.acad.umn.edu.
The Center for Applied Linguistics (CAL)

The Center for Applied Linguistics is a private, nonprofit organization that promotes and improves the teaching and learning of languages, identifies and solves problems related to language and culture, and serves as a resource for information about language and culture. CAL carries out a wide range of activities in the fields of English as a second language, foreign languages, cultural education, and linguistics. These activities include research, teacher education, information dissemination, instructional design, conference planning, technical assistance, program evaluation, and policy analysis. Publications include books on language education, online databases of language programs and assessments, curricula, research reports, teacher training materials, and print and online newsletters.

Major CAL projects include the following:

• ERIC Clearinghouse on Languages and Linguistics
• National Clearinghouse for ESL Literacy Education
• Refugee Service Center
• Pre-K-12 School Services

CAL collaborates with other language education organizations on the following projects:

• Center for Research on Education, Diversity & Excellence
• Improving Foreign Languages in the Schools Project of the Northeast and Island Regional Laboratory at Brown University
• National Capital Language Resource Center
• National K-12 Foreign Language Resource Center
• National Network for Early Language Learning

News from the ERIC Clearinghouse on Languages and Linguistics

• ERIC/CLL’s quarterly online newsletter, ERIC/CLL Language Link, covers current topics in language education. Recent articles in Language Link include a review of the 2000 US Census and its implications for language educators, CoBaLTT (computer-assisted language learning), profiles of effective Early Foreign Language Programs, and a Language Policy update.
• Recent ERIC/CLL Digests cover a range of topics in ESL, foreign language, and bilingual education, including our newest Digest, Lexical Approach to Second Language Teaching.

News from the National Center for ESL Literacy Education

Facts and Statistics Related to Adult ESL provides links to resources that NCLE most often consults for statistics on adult ESL and the populations served by adult ESL programs. The latest NCLE Digest, Reflective Teaching Practice in Adult ESL Settings, offers the adult ESL practitioner background information and step-by-step suggestions for using reflective processes as a tool for professional development.

Computer Assisted Language Instruction Consortium (CALICO)

Since its inception in 1983, CALICO has served as an international forum for language teachers who want to develop and utilize the potential of advanced technology to support their teaching and research needs. Through its Annual Symposia, Special Interest Groups (SIGs), CALICO Journal, CALICO Monograph Series, CALICO Resource Guide, and numerous other publications, CALICO provides both leadership and perspective in the ever-changing field of computer-assisted instruction. The strength of CALICO derives from the enthusiasm, creativity, and diversity of its members. It comprises language teachers and researchers from universities, military academies, community colleges, K-12 schools, government agencies, and commercial enterprises. To learn more about CALICO activities and how to participate in them, visit the CALICO homepage at http://www.calico.org.
European Association for Computer Assisted Language Learning (EUROCALL)

EUROCALL is an association of language teaching professionals from Europe and worldwide aiming to
• Promote the use of foreign languages within Europe
• Provide a European focus for all aspects of the use of technology for language learning
• Enhance the quality, dissemination, and efficiency of CALL materials

EUROCALL's journal, ReCALL, published by Cambridge University Press, is one of the leading academic journals covering research into computer-assisted and technology-enhanced language learning. The association organises special interest meetings and annual conferences, and works towards the exploitation of electronic communications systems for language learning. For those involved in education and training, EUROCALL provides information and advice on all aspects of the use of technology for language learning.

Forthcoming EUROCALL conferences
• EUROCALL 2001 will be at the University of Nijmegen, The Netherlands, 30 August to 1 September 2001.
• EUROCALL 2002 will be at the University of Jyväskylä, Finland, 14 - 17 August 2002.

For full details, contact us at http://www.eurocall.org.

International Association for Language Learning Technology (IALLT)

Established in 1965, IALLT (formerly IALL) is a professional organization whose members provide leadership in the development, integration, evaluation, and management of instructional technology for the teaching and learning of language, literature, and culture. Its strong sense of community promotes the sharing of expertise in a variety of educational contexts. Members include directors and staff of language labs, resource or media centers, language teachers at all levels, developers and vendors of hardware and software, grant project developers and others.
IALLT offers biennial conferences, regional groups and meetings, the LLTI listserv (Language Learning Technology International), and key publications such as the IALL Journal, the IALL Language Center Design Kit, and the IALL Lab Management Manual. The 2003 IALLT conference will be held at the University of Michigan, June 17 - 21. For information, visit the IALLT Web site at www.iallt.org/.

Language Learning & Technology http://llt.msu.edu/vol5num3/review1/ September 2001, Vol. 5, Num. 3 pp. 19-23

REVIEW OF MULTILINGUAL CORPORA IN TEACHING AND RESEARCH

Multilingual Corpora in Teaching and Research (From the series Language and Computers: Studies in Practical Linguistics, No 22)
Simon P. Botley, Anthony M. McEnery, and Andrew Wilson, Eds.
2000
ISBN: 90-420-0541-6
US $19.00 (Paperback)
208 + vi
Editions Rodopi B.V.
Amsterdam (Netherlands) and Atlanta, GA (USA)

Reviewed by John M. Lawler, University of Michigan.

Multilingual corpora are those consisting of texts in more than one language, often a monolingual original and a translation. These translations vary greatly in their faithfulness, accuracy, style, and order of presentation, as well as in granularity of translation, that is, the size of the chunks being translated (e.g., word-to-word, sentence-to-sentence, paragraph-to-paragraph, or idea-to-idea). Since the reasons for constructing multilingual corpora include being able to correlate individual pieces of one text with corresponding parts of another, their use immediately raises the problem of text alignment, or computing which chunk of a text in one language corresponds to a given chunk of the parallel text in another language. This is the major focus of Multilingual Corpora in Teaching and Research. Indeed, this book could more accurately have been titled Text Alignment in Multilingual Corpora: Overview and Case Studies.
Text alignment, it quickly becomes clear, is the outstanding problem in research on multilingual corpora, and thus -- to the extent that progress has been made in its solution -- its outstanding success story. The problems that arise in alignment research reprise practically every issue in Natural Language Processing (NLP) and Automatic Translation (e.g., sentence division, anaphor tracking, ambiguity resolution), and the peculiar limitations of the alignment task make the application of alignment strategies to these broader problems surprisingly productive, as is discussed in detail in this volume. Multilingual Corpora consists of two introductory chapters, covering theoretical and methodological issues, the literature, and the state of the art (up to early 1998), as well as 10 individual case studies, each describing an existing corpus project, 2 in the US and the rest in Europe. All the case studies except the last (on problems aligning English and Chinese texts) deal strictly with Indo-European languages (Danish, English, French, German, Greek, Italian, Norwegian, Spanish, and Swedish) and most of the corpora discussed contain texts in just two languages. Chapter 1, "Bilingual Text Alignment -- An Overview," by Michael Oakes and Tony McEnery (one of the editors) of Lancaster University, is typical of recent work in CL/NLP in that it distinguishes sharply between statistical and linguistic methods of text alignment. As these authors put it (p. 4), "Statistical methods tend to work better for large corpora, since they are relatively rapid, while linguistic methods can be better for small corpora." The vast majority of the article is a survey of the statistical methods used in various alignment projects, including formulae and discussion of results, although three varieties of linguistic techniques are also covered.
This disparity reflects the simple fact that statistically-based NLP has been far more successful overall than linguistically-based approaches, especially in tasks involving corpora (see Bayer, Aberdeen, Burger, Hirschman, Palmer, and Vilain [1998] and Hoard [1998] for discussion). Chapter 2, "Bilingual Text Alignment: Where Do We Draw the Line?" by Michel Simard, George Foster, Marie-Louise Hannan, Elliott Macklovitch, and Pierre Plamondon of Canada's Centre d'Innovation en Technologies de l'Information, takes up the question of granularity in the context of Isabelle's (1993) concept of Translation Analysis (TA), that is, "the reconstruction of the correspondences between segments of a source text and segments of its translation" (p. 39), a principled approach to alignment. Before concluding on a generally sanguine note, they discuss three alignment programs at different granularity levels: JACAL (Just Another Cognate ALignment program), a character-level program; Salign, a sentence-level program that can be used in conjunction with JACAL (though it need not be); and TMAlign, a lexical-level alignment program. Chapter 3, "Corpus and Terminology: Software for the Translation Program at Göteborgs Universitet, or Getting Students to Do the Work," by Pernilla Danielsson and Daniel Ridings, deals with a suite of programs developed for training translators. This is one of the most obvious educational uses of multilingual corpora; the software described here is designed to be used by future translators to pick out "terminology" (i.e., technical terms that may be unfamiliar outside a particular specialty) in context, and create their own personal terminology bank for future use, in the process learning a great deal about translation. It is built from more or less off-the-shelf software (i.e., Microsoft Access) and is seen to be robust, simple, and easy to use, as well as meeting the needs of students.
Chapter 4, "Parallel and Comparable Bilingual Corpora in Language Teaching and Learning," by Carol Peters, Eugenio Picchi, and Lisa Biagini of Istituto di Linguistica Computazionale in Pisa, discusses the interesting distinction between parallel corpora, or "translationally equivalent texts," and comparable corpora, for which they adopt Laffling's (1992) description: "texts which, though composed independently in their respective language communities, have the same communicative function." PiSystem DBT, an Italian/English bilingual text query program implemented for language learners, is used to highlight these issues in this chapter. A demo version is available on the Web at http://www.ilc.pi.cnr.it/pisystem/demo/demo_dbt/demo_bilingui/index.htm (this is a different URL from the one given in the book, which now returns an error message). As expected, analyses of comparable corpora are more difficult and pose unique problems. Thus, the implementation discussed is still experimental. In chapter 5, "Using Authentic Corpora and Language Tools for Adult-Centred Learning," Renée Meyer, Mary Ellen Okurowski, and Thérèse Hand of New Mexico State University explore an application, OLEADA (not an acronym, but rather the Spanish word for "tidal wave"), developed at NMSU. OLEADA is a complete learning environment, integrating "three language technologies: on-line text corpora, information retrieval, and language analysis tools. A single user interface allows seamless access to the texts and tools in ten languages" (p. 87). This short chapter doesn't go into design or performance specifics, but rather concentrates on the varying uses of OLEADA's three customer groups: language training developers, classroom developers, and independent students. 
Chapter 6, "Teaching Terminology Using Electronic Resources," by Jennifer Pearson of Dublin City University, is concerned, like Chapter 3, with an application designed to help future translators experience and learn to handle real use of technical jargon and phrases of art in a realistic context. This is an extremely interesting chapter, with many examples of terminological variation, and especially of culture-specific terms for which there are usually no good equivalents. Chapter 7, "Parallel Texts in Language Teaching," by Michael Barlow of Rice University, shows how even a simple concordance program (ParaConc, a simple parallel version of Barlow's MonoConc, reviewed in this issue and by Lawler, 2000) can be of great use to teachers and students for exploring the wide variety of ways in which a single word or phrase gets translated, especially as part of an idiomatic or metaphoric expression. The result, as anyone who's spent enough time with a good bilingual dictionary can attest, can be eye-opening. David Woolls of Birmingham University extends this concept in a different direction in Chapter 8, "From Purity to Pragmatism; User-Driven Development of a Multilingual Parallel Concordancer." The software involved, part of the European Union's LINGUA project, produces various types of concordances over parallel texts in Danish, English, French, German, Greek, and Italian. Rather than focusing on its usage and applications, the chapter is a developmental history of the program, from initial specifications through iterative cycles of construction, testing, and revision of the corpus and the various software tools associated with it, and the inevitable problems that arose at each stage, and how they were handled -- generally by downsizing expectations.
This is an article that can be read with sympathy and profit by anyone involved in large-scale distributed development schemes. Chapter 9, "The English-Norwegian Parallel Corpus: Current Work and New Directions," by Stig Johansson and Knut Hofland of the University of Oslo, is a progress report on an ongoing project, with sections on its uses and recent multilingual extensions to French and German parallel corpora. Of particular linguistic interest are the extensive discussions, with examples, of the occurrence of the Norwegian modals skal (p. 135) and nok (p. 137); modals are often problematic, but examples like this can help understand something of their vagaries. The section on multilingual extensions is highlighted by an equally extensive and equally interesting discussion of cleft sentences ("That's what I meant," and its ilk) and other clausal anaphora, and their translated equivalents; any syntactician reading this section would yearn for such a tool. This is a good example of how corpus linguistics can inform theoretical linguistics, as well as language learning. Chapter 10, "Unlocking the power of the SMEMUC," by Raphael Salkie, of the University of Brighton, coins what the author admits is an "ugly acronym" for Small and MEdium-sized MUltilingual Corpus. He argues that such corpora are "a good way forward for those of us who want to take corpora out of the computer laboratory and into the hands of teachers, students, and language researchers," (p. 148) and goes on to describe the step-by-step development and subsequent pedagogic uses of INTERSECT, a French-English parallel corpus massaged to fit the needs of ParaConc (discussed in Chapter 7). His conclusion is one that is easy to agree with: "Sometime in the future, when today's computers seem like little toys and the Internet is fast and freely available, large multilingual corpora will be available for everyone.
For now, it is corpora like INTERSECT which can take a lead in convincing linguists, language teachers and translators that multilingual corpora have a lot to offer them" (p. 156). Chapter 11, "Corpus-Based Contrastive Lexicography: The Case of English with and its German Translation Equivalents," by Josef Schmied and Barbara Fink of the University of Chemnitz, focuses on the use of a bilingual parallel corpus to research the syntax and semantics of the preposition with, in all its uses and collocations. The lexicographic results are the stars here, while the software plays a supporting role; this is a good example of the kind of research that would have been impossible even to conceive of, let alone carry out, before the advent of aligned multilingual corpora. It will be of interest not only to computational linguists, but also to translators, semanticists, lexicographers, and language teachers. Finally, Chapter 12, "Parallel Alignment in English and Chinese," by Tony McEnery, Scott Piao, and Xu Xin of the University of Lancaster, addresses the challenges for multilingual parallel corpus research posed by non-European and non-Indo-European languages. Many new methods are still needed, and so far the work is largely experimental and the results rather sketchy. Nevertheless, the authors produce a useful discussion of the problems they encountered and report on one alignment method, based on bivariate distribution, that they tried out on a sample corpus. They conclude, "Aligning languages which are not genetically related is a challenge for computational linguists, and may well stretch the 'language independence' claim of some current alignment algorithms to the breaking point." The chapter includes an appendix containing a short set of tags that were used in the alignment task.
Overall, this is a really interesting book for a linguist to read. All the articles are well-written and accessible at any level of knowledge about corpora (although readers of chapters 1 and 12 might benefit from familiarity with Oakes, 1998), and the problems encountered are diverse and challenging enough to engage anyone with an interest in language. This would serve nicely as a source of additional readings for courses in corpus linguistics, translation theory, or software design, as well as being a good source of good ideas and potential pitfalls for corpus and software designers themselves. For such a useful book, though, it is a shame that the index is so sparse, consisting of only seven pages, each of which is mostly white space, with one 12-character-wide column on either side. The index could have been printed in three pages with more appropriate use of space, especially when one considers that the entry for "standard error," a cross-reference to the immediately preceding entry for "standard deviation" on page 207, takes up an entire quarter-page (see Figure 1).

Figure 1. Index entries

Indexes are hard to make, and good quality control is often outside the reach even of editors, but a well-made index repays an editor's labor in the form of usefulness for readers. There are a few other infelicities; in addition to the ones remarked on in Dash (2001), such as the absence of Section 3.1.1 mentioned on page 179, I might add the running head for chapter 7, which renames the chapter to "Parallel texts in English teaching." But all these are very minor matters; this is a really good book, worth its price and bound to be useful for a long time to come.
ABOUT THE REVIEWER

John Lawler, Associate Professor of Linguistics at the University of Michigan, Ann Arbor, former chair of the LSA Computer Committee, and software author (MONOSYL, A World of Words, The Chomskybot), has published on topics including metaphor, Acehnese syntax, generic reference, second-language learning, English syntax and semantics, negation and logic, sound symbolism, UNIX, and popular English usage, and has consulted on software development for industry and academia.

E-mail: jlawler@umich.edu

REFERENCES

Bayer, S., Aberdeen, J., Burger, J., Hirschman, L., Palmer, D., & Vilain, M. (1998). Theoretical and computational linguistics: Toward a mutual understanding. In J. Lawler & H. Dry (Eds.), Using Computers in Linguistics (pp. 231-255). New York: Routledge. A chapter overview is available on the Web: http://www.routledge.com/linguistics/introduction.html#chapter.8.

Dash, N. S. (2001). Review of Botley, McEnery, & Wilson (2000), Multilingual Corpora in Teaching and Research. LINGUIST, 11(2537). Retrieved June 1, 2001 from the World Wide Web: http://linguistlist.org/issues/11/11-2537.html.

Hoard, J. E. (1998). Language Understanding and the Emerging Alignment of Linguistics and Natural Language Processing. In J. Lawler & H. Dry (Eds.), Using Computers in Linguistics (pp. 197-230). New York: Routledge. A chapter overview is available on the Web: http://www.routledge.com/linguistics/introduction.html#chapter.7.

Laffling, J. (1992). On Constructing a Transfer Dictionary for Man and Machine. Target 4(1), 17-31.

Lawler, J. M. (2000). Review of MonoConc Pro 2.0 Concordancing Software. LINGUIST, 11(1411). Retrieved June 1, 2001 from the World Wide Web: http://linguistlist.org/issues/11/11-1411.html.

Oakes, M. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.

Copyright © 2001, ISSN 1094-3501

Language Learning & Technology http://llt.msu.edu/vol5num3/review2/ September 2001, Vol. 5, Num. 3 pp. 24-27

REVIEW OF PATTERNS AND MEANINGS: USING CORPORA FOR ENGLISH LANGUAGE RESEARCH AND TEACHING

Patterns and Meanings: Using Corpora for English Language Research and Teaching
Alan Partington
Studies in Corpus Linguistics
Elena Tognini-Bonelli, series editor
1998
ISBN 1 55619 396 3
US $ 27.95 (paperback)
163 + vii pp.
John Benjamins Publishing Company
Amsterdam, The Netherlands

Reviewed by József Horváth, University of Pécs

Patterns and Meanings: Using Corpora for English Language Research and Teaching, Partington's slim but thorough monograph, is a welcome contribution to the field of corpus linguistics. It illustrates how using computer corpora in the study of language phenomena can enhance the internal validity and reliability of linguistic findings. The volume (the second in the Studies in Corpus Linguistics series) represents an example of the uses of corpora for practical purposes, following, in part, the paradigm established by Johns (1991) and Leech (1997): to exploit corpora for language teaching and learning. Partington, together with colleagues, assembled an unannotated corpus of 5 million words of journalistic texts for the case studies at the University of Bologna. The English component of the corpus was derived from The Independent, The Telegraph and The Times, with what the author calls the "sister" subcorpus accessed from an Italian broadsheet, Il Sole 24 Ore. Two heavy-weight concordancers, Microconcord (Scott & Johns, 1993) and WordSmith Tools (Scott, 1996, reviewed in this issue), were used for the analyses. The eight chapters of Patterns and Meanings report on what the author labels case studies, addressing different levels or aspects of language use. The coverage is broad: discussions of collocation, translation, connotation, syntax, cohesion, metaphor, and phraseology signify the main stages of the effort, each of which combines "language description with suggestions for pedagogical application" (p. 1).
The style and presentation are superb, with only a few slips and minor typos (such as the one on p. 63, "the number of wholly reliable true friends … is probable fewer than is … imagined"). The author ties in his observations with a useful and clearly presented review of the literature, which presents some contrasting views. Further, the reader is given a concrete description of the methods, procedures, and techniques applied in the analyses. In the concluding sections of most chapters, Partington charts directions for further study, and frequently offers useful and original tips for both teachers and students in corpus linguistics courses. As the intended audience of the volume includes newcomers to the field of corpus linguistics, the Introduction defines most basic terms and issues, referring mainly to studies from the 1990s. Special emphasis is given to areas where the application of corpus linguistics in language pedagogy can plug the gap that some practitioners perceive between theory and practice, and between teaching and learning. It includes a relevant description of the data-driven learning (DDL) approach as well as details about the corpus for the studies and the methods applied. To ensure that the following chapters are accessible to all readers, there is also a brief illustrative section on the keyword-in-context (KWIC) concordance output, a description of the sorting features of concordance programs, and a technical how-to for dealing with corpus files on a computer. In the first three chapters ("Collocation and Phrase Patterns," "Collocation and Synonymy," and "True and False Friends") the focus is on lexical issues.
The author gives a splendid introduction to how concordance samples can enhance our understanding of denotation and contrasts data from the corpus with information from dictionaries to highlight the interaction of collocation, text types, and stylistic variations. The definitions are solid, and the examples carefully selected. Furthermore, principles underlying the phenomena are always explained clearly, assisting the reader in discovering the significance of the findings. Especially revealing is the study of collocation and synonymy (chapter 2), in which one of the problems that many EFL learners and professional translators face is tackled: choosing among seemingly similar vocabulary items. Partington provides a detailed study of the collocates of the adjectives sheer, pure, complete, and absolute, as he investigates the many different lexical choice patterns, making this part of the book a commendable resource aiding translation theory and practice. Although the approach and the findings of chapter 2 are valid, I have difficulty seeing the relevance to Partington's claim in the conclusion that a thesaurus "is positively dangerous for the non-native speaker." For one thing, the use of the term "non-native speaker" is problematic. We all are native speakers of one language or another. The author may be alluding to the dichotomy of the L1 and L2 speaker, and the point he is making appears to be that, because of the intricacies of collocation and synonymy, the use of a thesaurus may result in strange, non-native-like language. Does this suggest, then, that Partington would not sanction the use of the thesaurus in any EFL course? Whether or not Partington would go this far, the claim appears to be based on a limited view of both the foreign language learner and the pedagogical context of using a thesaurus. Students can learn how to use a thesaurus for specific purposes, the same way as they learn to use a dictionary -- either traditional or corpus-based. 
Also, suggesting that thesauruses represent a danger raises the issue of equal linguistic rights, to which learners are entitled as much as native speakers. It would be interesting to conduct an empirical comparison of the naturalness of the writing of learners who used a thesaurus and those who did not for a given writing task. In addition, learner use of a thesaurus during DDL work may result in improvement in range and appropriateness of vocabulary, making this another area worthy of empirical research. Chapter 4 continues the exploration of the corpus by examining connotation in terms of semantic prosody: It investigates the connotational significance of lexis. Definitions and examples were extracted from 10 dictionaries, both traditional and corpus-based, so that the corpus findings could be contrasted with how the lexical and connotational features of set in, peddle, and dealings are presented in the two groups of dictionaries. Partington notes that even current non-learner dictionaries have little place for information of this kind (p. 72) and suggests that cross-linguistic prosodic differences require further study, which will be especially beneficial in terms of raising translators' awareness of them. Chapters 5 and 6 ("Syntax" and "Cohesion in Text") serve two purposes: first, to identify further features of patterns and meanings; second, to demonstrate that with a corpus one can go beyond the lexical domain and look at other chunks of text. It is also here that the author makes his theoretical position explicit: He belongs to the school that investigates "the interface between lexis and syntax" (p. 79). Partington refers to Francis's claim that the lexical and the syntactic domains are mutually dependent on each other: "It is impossible to look at one independently of the other …. The interdependence of syntax and lexis is such that they are ultimately inseparable" (1993, p. 147). 
Partington's analysis reveals that what is taught about conditionals, for example, is not always what the corpus attests. At one point (p. 84), he suggests that when students can review and analyze a large number of concordance citations for "If," they may realize that what underlies the syntax of conditionals is best viewed as a model, rather than a constraint. By analyzing the corpus, Partington reveals and groups conditional and non-conditional dependencies in if constructions, and suggests that similar investigations could be carried out on other conditional markers and subordinators. He also states that this DDL approach can help students "more clearly understand the distinctions highlighted in grammars and textbooks" (p. 87). Unfortunately, however, there are no concrete tips on the format and content of this procedure, although many readers may have found such a practical element interesting. "Metaphor" and "'Unusuality'" come toward the end of the book (chapters 7 and 8). The former applies frequency and concordance data drawn from the business journalism section of the English corpus (about 800,000 words from The Independent and The Times), the latter undertakes to highlight not the typical, but the figurative, in language. In chapter 7, the author first provides a succinct summary of three theories of metaphor, and then analyzes dead and dying metaphors, metaphorical intent, collocation, and fossilized collocations. In chapter 8, he presents unusual newspaper headlines from five sections of The Independent: home news, international news, arts, business, and sports. Clearly, headline language is text that many EFL students will skim. Here we get a scanning of this sample: The focus is on preconstructed word strings (proverbs, quotations, expressions, and the like).
The list of headlines assembled is a rich resource: examples such as "Prints Charming," "Sail of the Quincentenary," "You Could Hear A Superlative Drop," and "Industrial Resolution" are just four of the scores of examples that have been classified and interpreted by the author (and his associates). In addition to presenting examples of collocational patterns, Partington shares his view of the sociolinguistic and psycholinguistic nature of these journalistic chunks. The concluding chapter addresses some of the limitations of corpus-based studies, with Partington synthesizing the most common critiques leveled against the approach. Issues discussed include the difficulty of establishing external validity of corpus-based studies, which results from the fact that any findings can be interpreted only within the context of the given corpus. It seems, however, that the theoretical dilemmas of representativity would have been better placed, and more thoroughly analyzed, in an earlier section, before the case studies. This would have helped readers new to the field keep the limitation in mind. That Partington chose to feature this subject at the end of the book suggests that he had found no easy answer to a question of corpus linguistics: Why bother analyzing any corpus, however large, if what is found can be claimed to characterize only one single corpus? For the time being, one needs to be content with exploring general and specialized corpora assembled using clear principles and be cautious in drawing conclusions from such studies (Clear, 1992). Overall, Partington's richly illustrated studies, the relevance of research questions to language pedagogy, and the new knowledge that this volume offers, especially about collocation, synonymy, phraseology, and unusual language, make Patterns and Meanings a very well focused and engaging read, one that has already found its way into several corpus linguistics courses worldwide -- rightfully so. 
ABOUT THE REVIEWER

József Horváth holds a PhD in Applied Linguistics from the University of Pécs (Hungary). He has developed the JPU Corpus, a collection of over 400,000 words of Hungarian EFL students' writing. He teaches Writing and Research Skills, Corpus Linguistics, and Translation Studies courses at the Department of English Applied Linguistics, University of Pécs.

E-mail: joe@btk.pte.hu

REFERENCES

Clear, J. (1992). Corpus sampling. In G. Leitner (Ed.), New directions in English language corpora: Methodology, results, software development (pp. 21-31). Berlin: Mouton de Gruyter.

Francis, G. (1993). A corpus-driven approach to grammar: Principles, methods and examples. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology: In honour of John Sinclair (pp. 137-156). Amsterdam: John Benjamins.

Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. ELR Journal, 4, 1-16.

Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 2-23). London: Longman.

Scott, M. (1996). Wordsmith tools [Computer software]. Oxford, UK: Oxford University Press.

Scott, M., & Johns, T. (1993). Microconcord [Computer software]. Oxford, UK: Oxford University Press.

Language Learning & Technology
http://llt.msu.edu/vol5num3/review3/
September 2001, Vol. 5, Num. 3
pp. 28-31

REVIEW OF EXPLORING ACADEMIC ENGLISH

Exploring Academic English: A Workbook for Student Essay Writing
Jennifer Thurstun and Christopher Candlin
1997
ISBN 1-864083-74-3
AU $23.95
144 pp.
NCELTR
Sydney, Australia

Reviewed by Paul Thompson, University of Reading

Exploring Academic English is an innovative concordance-based workbook for use either in an English for Academic Purposes (EAP) writing class or for independent learning.
What makes it innovative is that it is the first workbook to utilize corpus study methods to systematically introduce and explore the use of certain words to perform rhetorical functions in academic written English. Thus, it should be of interest to both native and non-native speakers of English who have at least intermediate proficiency in English, and who are preparing to enter, or already have entered, tertiary education.

There are two ways that linguistic corpora can be exploited for pedagogical purposes (Partington, 1998): teachers can either analyse corpora for material/syllabus design (Flowerdew, 1993), or they can train students to use corpora directly. The latter use is designed to promote what Tim Johns has described as data-driven learning, or DDL.1 Exploring Academic English offers an interesting combination of both methods. The authors have used a specialized corpus of academic English which they first analyzed in order to determine the syllabus of the book, and they have also presented selected output from the same corpus as data for learner activities in which the learner acts as language researcher.

Exploring Academic English is a methodical and clearly presented workbook. Each of its six units deals with a "rhetorical function," as follows:

• stating the topic
• referring to the literature
• reporting the research of others
• discussing processes undertaken in the study
• expressing opinions tentatively
• drawing conclusions

For each function, three or four lexical items are focussed on, with each unit following the same four-stage path. Firstly, in the "Look" stage, a set of concordance lines is presented, sorted by the first word to the right of the search term. In the case of analysis, for example, this means that all the concordance lines containing "analysis of" are placed together, and they appear after "…any analysis must …" (see Figure 1).
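The sorting used in the "Look" stage can be illustrated with a short sketch. This is not the Microconcord programme or the authors' procedure; it is a minimal Python illustration, and the function names (`kwic_lines`, `sort_by_first_right`, `format_hit`), the four-word context span, and the sample sentences are all assumptions made for the example.

```python
import re

def kwic_lines(text, keyword, span=4):
    """Collect (left context, node word, right context) for every hit of keyword."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            hits.append((left, tok, right))
    return hits

def sort_by_first_right(hits):
    """Sort concordance lines by the first word to the right of the node word."""
    return sorted(hits, key=lambda h: h[2][0].lower() if h[2] else "")

def format_hit(hit, width=30):
    """Right-align the left context so the node words line up in a column."""
    left, node, right = hit
    return f"{' '.join(left):>{width}}  {node}  {' '.join(right)}"

# Invented sample text, used only to show the ordering.
sample = ("Any analysis must begin with the data. The analysis of variance was used. "
          "A careful analysis of the texts followed. This analysis is preliminary.")

hits = sort_by_first_right(kwic_lines(sample, "analysis"))
for hit in hits:
    print(format_hit(hit))
```

With this ordering, the two "analysis of" lines end up next to each other and follow the "analysis must" line, which is the grouping the workbook relies on when it asks learners to notice what follows the search term.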
As concordances can be difficult to read for first-time readers (key word in context, or KWIC, concordances are incomplete sentences), the learner is advised not to try to understand every word, but rather, to concentrate on the words around the search term.

Figure 1. KWIC concordance of analysis

In the second, or "Familiarize," stage, students are given a set of tasks related to these concordance lines, in which they identify lexical patterns around the key word: which prepositions follow the word, and in what contexts; which words commonly precede the word; and so on. They are also asked to decide which of a number of suggested senses the word can have based on the evidence available from the data, and this often involves interpreting possible gradations of meaning. In the third stage, "Practise," students are asked to do gap-fill and matching exercises without referring back to the concordances. Finally, in the "Create" stage, they write a sentence or paragraph on a specified topic in which they practise the use of the key word. In the chapter on "Drawing Conclusions and Summarising," for example, the learner is asked to write a paragraph summarising the main differences between the terms conclusions and summaries which they have studied in the Look, Familiarise, and Practise stages.

An important point to note is that students work throughout not on concocted examples, but on data drawn from a corpus of authentic academic texts, whether these be concordance lines or the sentences for the gap-fill exercises. Suggested answers to all the exercises are given at the end of the book with commentary provided where appropriate.

The corpus used by Thurstun and Candlin is the Microconcord Corpus of Academic Texts,2 an electronic collection of academic books and papers from a range of disciplines, with a total word count of over one million words.
The authors first identified words in the University Word List (see Nation, 1990) that could be used in the performance of the specific rhetorical functions outlined above. Using the Microconcord programme, the authors then produced sets of KWIC concordance lines in order to observe frequencies of use as well as the lexical patterning around these words. Finally, they extracted those lines that concisely represent the most common collocational features surrounding the word that had been searched for (a full account of the procedures and the principles underlying them can be found in Thurstun & Candlin, 1998).

Three points are worth making. Firstly, the corpus used is broadly representative of academic writing. Against this, it can be argued convincingly that the texts chosen should have been closer to the types of texts that students themselves will have to write, rather than a collection of texts written by expert writers for a general academic audience, but such a corpus, of sufficient size, was not available at the time that the book was developed. The Microconcord Corpus of Academic Texts is a far more relevant source of data for EAP teaching than any of the large general corpora, and there are, to date, no large corpora of native speaker student-generated academic text publicly available. Secondly, the words included are all what can be termed "semi-technical vocabulary": lexical items that are more likely to appear in scientific or academic than in more general texts, and also likely to appear in a wide range of academic texts. Thirdly, the authors have sifted the concordance lines to reduce the number of lines that learners will have to look through. One of the criticisms of Data-Driven Learning is that students can be overwhelmed by the sheer quantity of information if they are asked to investigate corpora by themselves.
This workbook circumvents the problem by sorting through the raw data in advance and distilling the output to a manageable level.

I found some of the so-called "Create" exercises, especially in the earlier parts of the book, mechanical, and felt that they were neither creative nor a test of the learner's understanding. In Unit 3, for example, to practise the use of the verb claim, the learner is asked to report the following statement (and two others) using claim: "Even today, Canadians are not nearly so far away from the tradition of Victorian gentility as we imagine (Waddington, 1989)." There is no need for the learner to invest any original thought in this exercise in syntactic manipulation. In the unit on "Expressing Ideas Tentatively," the manipulation required is less demanding; the learner is asked to rewrite three sentences, using may in order to make the sentences tentative, for example, "This alteration is excitatory or inhibitory: that is, it makes the receiving cell more or less likely to emit impulses itself." The rewriting involves changing the first "is" to "may be" and "makes" to "may make," which is a simple task and does not require the writer to demonstrate an understanding of what tentativeness is, nor when it is necessary to be tentative. A further problem is that if the learner does think about the sentence analytically, he/she will note that the insertion of "may" does not actually make the statement tentative: it remains a factive statement, explaining the two (known) types of alteration. In such cases, the teacher might decide to leave out those exercises and devise their own activities in their place.

Generally speaking, though, the book is an excellent implementation of corpus-informed (and informing) insights. Because the concordances are already sifted and are available in a paper format, they are immediately accessible. The repeated use of the four-stage approach also trains the learner in effective corpus analysis skills.
As the authors themselves acknowledge in their article, a possible criticism of the approach is that a great deal of time is invested in a relatively small number of words (19 in all). Learners may well feel that they could invest their energies more profitably in acquiring a larger vocabulary in the same period of time, with a little less depth. For example, the three exponents of one particular function dealt with in the book, that of "Reporting the Research of Others," do not provide sufficient lexical resources for the developing academic writer: "according to," "claim," and "suggest" are a beginning but will soon prove painfully restricting unless the repertoire is supplemented. If this book is to be used as part of a writing course, therefore, learners will need extension activities, so that they can explore other related key vocabulary items for each function. Provided that they have access to appropriate corpora facilities (an academic text corpus of adequate size, with good documentation, and concordancing software), they could be asked to work in groups on different lexical items (hedging words, for example, as listed in the appendices of Hyland, 2000, pp. 188-189) and present reports of their findings to the whole class. A wealth of ideas for using concordancing in the classroom can be found in Tribble and Jones (1990) and in Partington (1998).

It should also be pointed out that a concordance-driven approach is primarily inductive: Learners are invited to look for patterns in the data, and to form generalisations that can account for the patterns they find. Not all learners like such an approach (Dudley-Evans & St John, 1998, p. 86), and teachers need to consider whether such consciousness-raising activities are appropriate for their learners.
For those who are attracted to such an approach, however, using concordances in the classroom is a stimulating and highly rewarding experience, both for teachers and learners. Exploring Academic English makes the use of concordances in the EAP classroom much easier by following a highly systematic approach and presenting sets of ready-made concordance lines, and is an impressive new departure both in the field of EAP writing teaching materials, and of foreign language teaching materials writing in general. It is reasonably priced and can either be used as a classroom textbook (I would see it as most useful as a supplementary workbook), or for self-study, provided that learners are given some training in working with concordance lines first.

NOTES

1. For the definitive bibliography on DDL, see Tim Johns' Web site (http://web.bham.ac.uk/johnstf/biblio.htm).

2. Originally sold by Oxford University Press as an optional companion to the Microconcord concordancing programme (Scott & Johns, 1993), but sadly now out of print.

ABOUT THE REVIEWER

Paul Thompson is a Research Fellow in the School of Linguistics and Applied Language Studies, at the University of Reading, UK. His research interests are: second language writing pedagogy, the corpus-based analysis of academic discourse, and applications of Information Technology to language teaching.

E-mail: p.a.thompson@reading.ac.uk

REFERENCES

Dudley-Evans, A., & St John, M. (1998). Developments in English for specific purposes. Cambridge, UK: Cambridge University Press.

Flowerdew, J. (1993). Concordancing as a tool in course design. System, 21(2), 231-244.

Hyland, K. (2000). Disciplinary discourses: Social interactions in academic writing. London: Longman.

Nation, P. (1990). Teaching and learning vocabulary. New York: Newbury House.

Partington, A. (1998). Patterns and meanings: Using corpora for English language research and teaching. Amsterdam: John Benjamins.

Scott, M., & Johns, T. (1993). Microconcord. Oxford, UK: Oxford University Press.

Thurstun, J., & Candlin, C. (1998). Concordancing and the teaching of the vocabulary of academic English. English for Specific Purposes, 17(3), 267-280.

Tribble, C., & Jones, G. (1990). Concordances in the classroom. London: Longman.

Language Learning & Technology
http://llt.msu.edu/vol5num3/review4/
September 2001, Vol. 5, Num. 3
pp. 32-36

REVIEW OF MONOCONC PRO AND WORDSMITH TOOLS

MonoConc Pro
Title: MonoConc Pro, Version 2.0
Platform: PC
Hardware/System Requirements: Windows 95 or higher
Publisher and Program Information: Athelstan, info@athel.com
Support: On-line help and a small manual
Languages: Can be used with different languages
Audience: Beginning to advanced users
ISBN: not applicable
Price: US $85 single user; US $550 15 user site

WordSmith Tools
Title: WordSmith Tools, Version 3.0
Developer: Mike Scott
Platform: PC
Hardware/System Requirements: Minimum 80386 processor, VGA display or better, Windows 3.1x or Windows 95, minimum 4 MB RAM (8 MB if used with Windows 95)
Program Information: http://www.oup.com:8080/elt/global/catalogue/multimedia/wordsmithtools3/
Publisher: Oxford University Press
Support: On-line help and an extensive manual
Languages: Can be used with different languages
Audience: Beginning to advanced users
ISBN: 0-19-45-92863
Price: 51.95 British pounds

Reviewed by Randi Reppen, Northern Arizona University

The recent interest in corpus linguistics and the use of authentic materials has created a need for software packages that allow teachers and researchers to carry out corpus-based investigations. These corpus-based investigations can be used to augment classroom instruction so that ESL/EFL students are exposed to real language rather than artificial texts and made-up examples. Teachers and researchers can also begin to explore some of the more subtle areas of language use where our intuitions often lead us in the wrong direction.
In this review, I will take a close look at WordSmith Tools (Version 3) and MonoConc Pro (Version 2), two of the more readily available and reasonably priced packages for working with corpora, in order to contrast the different options that they offer teachers and researchers. As with any software purchase, the needs of the user should play a key role in deciding which program is most appropriate. Both programs include many of the same features, such as the ability to create word lists (in both alphabetical order and frequency order), generate concordance output, and give collocation information. Both programs easily handle large corpora and work with either tagged or untagged texts. As with any software package, the user needs to check the default settings (e.g., minimum or maximum number of hits displayed) to make certain that they are set according to the users' desires. In the following paragraphs, I describe the major features shared by the two programs as well as some of the more specialized features offered by only one or the other. One of the major innovations of these packages is that they allow users to analyze any collection of ASCII texts. This is in marked contrast to earlier concordancing packages which required the user to build a database of texts before using the program for analyses. This was usually an elaborate process, and sometimes required sending texts to the software author or publisher before the concordancing tools could be used. Further, the database normally could not be modified once it was constructed. Thus, the database needed to be rebuilt any time additional texts were added. WordSmith and MonoConc Pro differ from these earlier packages in that they allow the user to select any group of texts for analysis every time the system is started. Better yet, additional texts can be added "on the fly," so that the corpus being analyzed can be tailored to directly fit the immediate research questions. 
The primary research use of both software packages is to generate concordances, or listings of all the occurrences of any given word in a given text, with words shown in context. Concordance listings can be useful for exploring the use and meanings of specific words. Often when looking at concordance lines, users may want to expand the context so that they can get a better sense of the meaning or use. Here is one area where the two programs differ quite a bit. Both programs allow the user to adjust the settings of the concordance program to display more or less text on the concordance screen. However, MonoConc Pro has an additional feature that is especially attractive for researchers: the split screen display allows users to expand the context of an entry line simply by highlighting the line, which displays the fuller context in the upper window (see Figure 1). In WordSmith, the entire display must be expanded or reduced, so the context is expanded for all of the entries being viewed rather than for a single highlighted entry.

Figure 1. MonoConc Pro screen display of concordance lines

Another nice feature of MonoConc Pro is that the total number of words in the corpus is always displayed in the lower right hand corner (as shown in Figure 1). This information is vital for comparisons of texts of unequal lengths, as the normalization of counts of linguistic features, a process that allows such comparisons to be carried out accurately, relies on text length (for more information, see Methodology Box 6 in Biber, Conrad, & Reppen, 1998, pp. 263-264).

Both programs have sort functions that allow users to sort concordance lines in several ways (e.g., by search word, then first word right; or by first word).
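The normalization of counts mentioned above amounts to simple arithmetic: a raw count is divided by the length of the text and multiplied by a common basis. The sketch below is an illustration only; the function name `normalized_rate`, the basis of 1,000 words, and the counts are assumptions chosen for the example, not figures taken from either program.

```python
def normalized_rate(raw_count, text_length, basis=1000):
    """Return occurrences per `basis` words, so texts of unequal length can be compared."""
    if text_length <= 0:
        raise ValueError("text_length must be a positive number of words")
    return raw_count * basis / text_length

# Hypothetical counts: the same raw frequency in two texts of different size.
short_text = normalized_rate(30, 20_000)   # 30 hits in a 20,000-word text
long_text = normalized_rate(30, 60_000)    # 30 hits in a 60,000-word text
print(short_text, long_text)
```

The raw counts are identical, yet the feature is three times as frequent in the shorter text once length is taken into account, which is why a running total of corpus size is useful to have on screen.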
Sorting words and seeing the collocation immediately to the left or right of the target word can provide insights on word senses and uses. Another feature found in both programs is the ability to "blank out" target words in the concordance output, which can be useful to teachers for the development of vocabulary activities and cloze tests. By using corpora, rather than teacher-made examples, teaching and testing materials reflect the language found in authentic texts and thus provide learners with more exposure to real language. Concordance displays are quite similar in both programs.

In addition to the functions that these programs have in common, WordSmith is able to perform a number of useful tasks that MonoConc Pro is not. For example, WordSmith can provide information about the distribution of a feature in a single text or across texts. Distributions are shown with a graph that plots the occurrences of the target item in the text or corpus (see Figure 2). The distribution of a particular lexical or grammatical feature across a text or series of texts can provide interesting information about the text structure and also about how the feature functions across various texts. A similar tool is available in MonoConc Pro; however, I was unable to interpret the bar graph display used in MonoConc Pro.

Figure 2. WordSmith plot distribution by text for the occurrence of thank

WordSmith also allows the user to compare word lists. The Key Word function allows the user to compare a given text to a target text or target register, which can be particularly useful for cross-register comparisons. For example, a teacher or researcher could compare biology textbooks to geology textbooks in order to see what lexical similarities or differences occur.
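The kind of word-list comparison just described can be sketched in a few lines. The review does not say how WordSmith scores key words, so the log-likelihood statistic used here is an assumption, chosen because it is a common measure for comparing word frequencies across two corpora; the function names and the toy "biology" and "geology" token lists are likewise invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood score for a word occurring a times in a study corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference corpus
    score = 0.0
    if a:
        score += a * math.log(a / e1)
    if b:
        score += b * math.log(b / e2)
    return 2 * score

def keywords(study_tokens, ref_tokens, top=5):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = len(study_tokens), len(ref_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:        # keep only words over-represented in the study corpus
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]

# Toy token lists standing in for two textbooks.
biology = "the cell divides and the cell membrane of the cell".split()
geology = "the rock layer and the rock of the strata".split()
print(keywords(biology, geology))
```

Because the ranking depends on relative rather than absolute frequency, a very common word such as "the" does not appear in the output even though it is as frequent as "cell" in the toy biology text.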
The Key Word function provides a quick glimpse of what the text is about, since the list is not based on absolute frequency but rather the unique words that are frequent in the particular text.

The Cluster function is the WordSmith feature that is perhaps most innovative since it is quite powerful and can be very useful. With this function, the user can specify from two to eight word clusters from a concordance list and then see which words tend to co-occur (see Figure 3). Co-occurring words are often idioms or set phrases.

Figure 3. WordSmith screen with clusters

WordSmith also has a feature that allows the user to align two texts and create a new file that contains one displayed over the other. This is extremely useful for comparing translations or two versions of the same text. The texts are displayed in different colors for ease of reading. See Figure 4 for an example of this feature used to check a translation against the original text.

Figure 4. Aligning two texts to check a translation (excerpt from WordSmith on-line manual)

The main advantage of MonoConc Pro over WordSmith is that it is much easier to use. For example, when MonoConc Pro is launched, a clear easy-to-use screen appears with a bar across the top, providing the options available. On the other hand, when WordSmith is launched there are many screens that appear, and until the user becomes familiar with the program, just getting the program going can be a bit of a challenge. For someone starting out with corpus analysis, and wanting to focus mostly on concordancing, MonoConc Pro is more user-friendly. The screens are clearer, and since they resemble the screens of many word processing programs, users may feel more comfortable.
In summary, both programs offer users powerful tools for searching texts and exploring how language is used in natural settings, thus providing valuable resources for teachers and researchers. However, the two programs have different strengths: for users who are less comfortable with computers, MonoConc Pro's interface is much more user-friendly than that of WordSmith, whereas for those who are comfortable with computers and plan to carry out more powerful text analysis, WordSmith would be a better choice. So, while both MonoConc Pro and WordSmith offer attractive options for exploring texts, the best choice will depend on the specific goals and experience of the user.

ABOUT THE REVIEWER

Randi Reppen is an Assistant Professor in Northern Arizona University's MA-TESL/PhD-Applied Linguistics Program, and the Director of the Program in Intensive English. She is co-author of Corpus Linguistics: Investigating Language Structure and Use with Douglas Biber and Susan Conrad (1998). Her research interests include corpus linguistics and the use of corpora in materials development.

E-mail: Randi.Reppen@NAU.EDU

REFERENCE

Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating language structure and use. Cambridge, UK: Cambridge University Press.

Language Learning & Technology
http://llt.msu.edu/vol5num3/lee/
September 2001, Vol. 5, Num. 3
pp. 37-72

GENRES, REGISTERS, TEXT TYPES, DOMAINS, AND STYLES: CLARIFYING THE CONCEPTS AND NAVIGATING A PATH THROUGH THE BNC JUNGLE

David YW Lee
Lancaster University, UK

ABSTRACT

In this paper, an attempt is first made to clarify and tease apart the somewhat confusing terms genre, register, text type, domain, sublanguage, and style.
The use of these terms by various linguists and literary theorists working under different traditions or orientations will be examined and a possible way of synthesising their insights will be proposed and illustrated with reference to the disparate categories used to classify texts in various existing computer corpora. With this terminological problem resolved, a personal project which involved giving each of the 4,124 British National Corpus (BNC, version 1) files a descriptive "genre" label will then be described. The result of this work, a spreadsheet/database (the "BNC Index") containing genre labels and other types of information about the BNC texts will then be described and its usefulness shown. It is envisaged that this resource will allow linguists, language teachers, and other users to easily navigate through or scan the huge BNC jungle more easily, to quickly ascertain what is there (and how much) and to make informed selections from the mass of texts available. It should also greatly facilitate genre-based research (e.g., EAP, ESP, discourse analysis, lexicogrammatical, and collocational studies) and focus everyday classroom concordancing activities by making it easy for people to restrict their searches to highly specified sub-sets of the BNC using PC-based concordancers such as WordSmith, MonoConc, or the Web-based BNCWeb. INTRODUCTION Most corpus-based studies rely implicitly or explicitly on the notion of genre or the related concepts register, text type, domain, style, sublanguage, message form, and so forth. There is much confusion surrounding these terms and their usage, as anyone who has done any amount of language research knows. The aims of this paper are therefore two-fold. I will first attempt to distinguish among the terms because I feel it is important to point out the different nuances of meaning and theoretical orientations lying behind their use. 
I then describe an attempt at classifying the 4,124 texts in the British National Corpus (BNC) in terms of a broad sense of genre, in order to give researchers and language teachers a better avenue of approach to the BNC for doing all kinds of linguistic and pedagogical research.

Categorising Texts: Genres, Registers, Domains, Styles, Text Types, & Other Confusions

Why is it important to know what these different terms mean, and why should corpus texts be classified into genres? The short answer is that language teachers and researchers need to know exactly what kind of language they are examining or describing. Furthermore, most of the time we want to deal with a specific genre or a manageable set of genres, so that we can define the scope of any generalisations we make. My feeling is that genre is the level of text categorisation which is theoretically and pedagogically most useful and most practical to work with, although classification by domain is important as well (see discussion below). There is thus a real need for large-scale general corpora such as the BNC to clearly label and classify texts in a way that facilitates language description and research, beyond the very broad classifications currently in place. It is impossible to make many useful generalisations about "the English language" or "general English" since these are abstract constructions. Instead, it is far easier and theoretically more sound to talk about the language of different genres of text, or the language(s) used in different domains, or the different types of register available in a language, and so forth.
Computational linguists working in areas of natural language processing/language engineering have long realised the need to target the scope of their projects to very specific areas, and hence they talk about sublanguages such as air traffic control talk, journal articles on lipoprotein kinetics, navy telegraphic messages, weather reports, and aviation maintenance manuals (see Grishman & Kittredge, 1986; Kittredge & Lehrberger, 1982, for detailed discussions of "sublanguages").

The terminological issue I grapple with here is a very vexing one. Although not all linguists will recognise or actively observe the distinctions I am about to make (in particular, the use of the term text type, which can be used in a very vague way to mean almost anything), I believe there is actually more consensus on these issues than users of these terms themselves realise, and I hope to show this below.

Internal Versus External Criteria: Text Type & Genre

One way of making a distinction between genre and text type is to say that the former is based on external, non-linguistic, "traditional" criteria while the latter is based on the internal, linguistic characteristics of texts themselves (Biber, 1988, pp. 70 & 170; EAGLES, 1996).1 A genre, in this view, is defined as a category assigned on the basis of external criteria such as intended audience, purpose, and activity type; that is, it refers to a conventional, culturally recognised grouping of texts based on properties other than lexical or grammatical (co-)occurrence features, which are, instead, the internal (linguistic) criteria forming the basis of text type categories. Biber (1988) has this to say about external criteria:

Genre categories are determined on the basis of external criteria relating to the speaker's purpose and topic; they are assigned on the basis of use rather than on the basis of form. (p.
170)

However, the EAGLES (1996)2 authors would quibble somewhat with the inclusion of the word topic above and argue that one should not think of topic as being something to be established a priori, but rather as something determined on the basis of internal criteria (i.e., linguistic characteristics of the text):

Topic is the lexical aspect of internal analysis of a text. Externally the problem of classification is that there are too many possible methods, and no agreement or stability in societies or across them that can be built upon ... The boundaries between ... topics are ultimately blurred, and we would argue that in the classification of topic for corpora, it is best done on a higher level, with few categories of topic which would alter according to the language data included. There are numerous ways of classifying texts according to topic. Each corpus project has its own policies and criteria for classification … The fact that there are so many different approaches to the classification of text through topic, and that different classificatory topics are identified by different groups indicates that existing classification[s] are not reliable. They do not come from the language, and they do not come from a generally agreed analysis. However they are arrived at, they are subjective, and … the resulting typology is only one view of language, among many with equal claims to be the basis of a typology. (p. 17)

So perhaps it is best to disregard the word "topic" in the quote from Biber above, and take genres simply as categories chosen on the basis of fairly easily definable external parameters. Genres also have the property of being recognised as having a certain legitimacy as groupings of texts within a speech community (or by sub-groups within a speech community, in the case of specialised genres). This is essentially the view of genre taken by Swales (1990, pp.
24-27), who talks about genres being "owned" (and, to varying extents, policed) by particular discourse communities. Without going into the minutiae of the EAGLES' recommendations, all I will say is that detailed, explicit recommendations do not yet exist in terms of identifying text types or, indeed, any so-called "internal criteria." That is, there are as yet, no widely-accepted or established text-type-based categories consisting of texts which cut across traditionally recognisable genres on the basis of internal linguistic features (see discussion below). On the subject of potentially useful internal classificatory criteria, the EAGLES authors mention the work of Phillips (1983) under the heading of topic (the "aboutness" or "intercollocation of collocates" or "lexical macrostructures" of texts), and the work of Biber (1988, 1989) and Nakamura (1986, 1987, 1992, 1993) under the heading of style (which the EAGLES' authors basically divide into "formal/informal," combining this with parameters such as "considered/impromptu" and "one-way/interactive"). However, the authors offer no firm recommendations, merely the observation that "these are only shafts of light in a vast darkness" (p. 25), and they do not mention what a possible text type could be (in fact, no examples are even given of possible labels for text types). At present, all corpora use only external criteria to classify texts. Indeed, as Atkins, Clear, & Ostler (1992, p. 5) note, there is a good reason for this: The initial selection of texts for inclusion in a corpus will inevitably be based on external evidence primarily … A corpus selected entirely on internal criteria would yield no information about the relation between language and its context of situation. 
The EAGLES (1996) authors add that

[the] classification of texts based purely on internal criteria does not give prominence to the sociological environment of the text, thus obscuring the relationship between the linguistic and non-linguistic criteria. (p. 7)

Coming back to the distinction between genre and text type, therefore, the main thing to remember here is what the two different approaches to classification mean for texts and their categorisation. In theory, two texts may belong to the same text type (in Biber's sense) even though they may come from two different genres because they have some similarities in linguistic form (e.g., biographies and novels are similar in terms of some typically "past-tense, third-person narrative" linguistic features). This highly restricted use of text type is an attempt to account for variation within and across genres (and hence, in a way, to go "above and beyond" genre in linguistic investigations). Biber's (1989, p. 6) use of the term, for example, is prompted by his belief that "genre distinctions do not adequately represent the underlying text types of English …; linguistically distinct texts within a genre represent different text types; linguistically similar texts from different genres represent a single text type."

Paltridge (1996), in an article on "Genre, Text Type, and the Language Learning Classroom," makes reference to Biber (1988; but, crucially, not to Biber 1989)3 and proposes a usage of the terms genre and text type which he claims is in line with Biber's external/internal distinction, as delineated above. It is clear from the article, however, that what Paltridge means by "internal criteria" differs considerably from what Biber meant. Paltridge proposes the following distinction:

Table 1.
Paltridge's Examples of Genres and "Text Types" (based on Hammond, Burns, Joyce, Brosnan, & Gerot, 1992)

Genre                 Text Type
Recipe                Procedure
Personal letter       Anecdote
Advertisement         Description
Police report         Description
Student essay         Exposition
Formal letter         Exposition
Format letter         Problem–Solution
News item             Recount
Health brochure       Procedure
Student assignment    Recount
Biology textbook      Report
Film review           Review

As can be seen, what Paltridge calls "text types" are probably better termed "discourse/rhetorical structure types," since the determinants of his "text types" are not surface-level lexicogrammatical or syntactic features (Biber's "internal linguistic features"), but rhetorical patterns (which is what Hoey, 1986, p. 130, for example, calls them). Paltridge's sources, Meyer (1975), Hoey (1983), Crombie (1985) and Hammond et al. (1992) are all similarly concerned with text-level/discoursal/rhetorical structures or patterns in texts, which most linguists would probably not consider as constituting "text types" in the more usual sense.

Returning to Biber's distinction between genre and text type, then, what we can say is that his "internal versus external" distinction is attractive. However, as noted earlier, the main problem is that linguists have still not firmly decided on or enumerated or described in concrete terms the kinds of text types (in Biber's sense) we would profit from looking at. Biber's (1989) work on text typology (see also Biber & Finegan, 1986) using his factor-analysis-based multi-dimensional (MD) approach is the most suggestive work so far in this area, but his categories do not seem to have been taken up by other linguists. His eight text types (e.g., "informational interaction," "learned exposition," "involved persuasion") are claimed to be maximally distinct in terms of their linguistic characteristics.
The classification here is at the level of individual texts, not groups such as "genres," so texts which nominally "belong together" in a "genre" (in terms of external criteria) may land up in different text types because of differing linguistic characteristics. An important caveat to mention, however, is that there are many questions surrounding the statistical validity, empirical stability, and linguistic usefulness of the linguistic "dimensions" from which Biber derives these "text types," or clusters of texts sharing internal linguistic characteristics (see Lee, 2000, for a critique) and hence these text typological categories should be taken as indicative rather than final. Kennedy (1998) has said, for example, that

Some of the text types established by the factor analysis do not seem to be clearly different from each other. For example, the types "learned" and "scientific" exposition … may differ only in some cases because of a higher incidence of active verbs in the "learned" text type. (p. 188)

One could also question the aptness or helpfulness of some of the text type labels (e.g., how useful is it to know that 29% of "official documents" belong to the text type "scientific exposition"?). It therefore still remains to be seen if stable and valid dimensions of (internal) variation, which can serve as useful criteria for text typology, can be found.

At the risk of rocking the boat, I would also like to say that, personally, I am not convinced that there is a pressing need to determine "all the text types in the English language" or to balance corpora on the basis of these types.
Biber (1993) notes that it is more important as a first step in compiling a corpus to focus on covering all the situational parameters of language variation, because they can be determined prior to the collection of texts, whereas there is no a priori way to identify linguistically defined types ... [however,] the results of previous research studies, as well as on-going research during the construction of a corpus, can be used to assure that the selection of texts is linguistically as well as situationally representative [italics added]. (p. 245) My question, however, is: what does it mean to say that a corpus is "linguistically representative" or linguistically balanced? Also, why should this be something we should strive towards? The EAGLES' (1996) authors say that we should see progress in corpus compilation and text typology as a cyclical process: The internal linguistic criteria of the text [are] analysed subsequent to the initial selection based on external criteria. The linguistic criteria are subsequently upheld as particular to the genre … [Thus] classification begins with external classification and subsequently focuses on linguistic criteria. If the linguistic criteria are then related back to the external classification and the categories adjusted accordingly, a sort of cyclical process ensues until a level of stability is established. (p. 7) Or, as the authors say later, this process is one of "frequent cross-checking between internal and external criteria so that each establishes a framework of relevance for the other" (p. 25). Beyond these rather abstract musings, however, there is not enough substantive discussion of what text types or other kinds of internally-based criteria could possibly look like or how exactly they would be useful in balancing corpora. 
In summary, with text type still being an elusive concept which cannot yet be established explicitly in terms of linguistic features, perhaps the looser use of the term by people such as Faigley and Meyer (1983) may be just as useful: they use text type in the sense of the traditional four-part rhetorical categories of narrative, description, exposition and argumentation. Steen (1999, p. 113) similarly calls these four classes "types of discourse."4 Stubbs (1996, p. 11), on the other hand, uses text type and genre interchangeably, in common, perhaps, with most other linguists. At present, such usages of text type (which do not observe the distinctions Biber and EAGLES try to make) are perhaps as consistent and sensible as any, as long as people make it clear how they are using the terms. It does seem redundant, however, to have two terms, each carrying its own historical baggage, both covering the same ground.

"Genre," "Register," and "Style"

Other terms often used in the literature on language variation are register and style. I will now walk into a well-known quagmire and try to distinguish between the terms genre, register, and style. In his Dictionary of Linguistics and Phonetics, Crystal (1991, p. 295) defines register as "a variety of language defined according to its use in social situations, e.g. a register of scientific, religious, formal English." (Presumably these are three different registers.) Interestingly, Crystal does not include genre in his dictionary, and therefore does not try to define it or distinguish it from other similar/competing terms. In Crystal & Davy (1969), however, the word style is used in the way most other people use register: to refer to particular ways of using language in particular contexts. The authors felt that the term register had become too loosely applied to almost any situational variety of language of any level of generality or abstraction, and distinguished by too many different situational parameters of variation.
(Using style in the same loose fashion, however, hardly solves anything, and, as I argue below, goes against the usage of style by most people in relation to individual texts or individual authors/speakers.)

The two terms genre5 and register are the most confusing, and are often used interchangeably, mainly because they overlap to some degree. One difference between the two is that genre tends to be associated more with the organisation of culture and social purposes around language (Bhatia, 1993; Swales, 1990), and is tied more closely to considerations of ideology and power, whereas register is associated with the organisation of situation or immediate context.

Some of the most elaborated ideas about genre and register can be found within the tradition of systemic functional grammar. The following diagram (Martin & Matthiessen, 1991, reproduced in Martin, 1993, p. 132), shows the relation between language and context, as viewed by most practitioners of systemic-functional grammar:

Figure 1. Language and context in the systemic functional perspective

In this tradition, register is defined as a particular configuration of field, tenor, and mode choices (in Hallidayan grammatical terms), in other words, a language variety functionally associated with particular contextual or situational parameters of variation and defined by its linguistic characteristics. The following diagram illustrates this more clearly:

Figure 2. Metafunctions in relation to register and genre6

Genre, on the other hand, is more abstractly defined:

A genre is known by the meanings associated with it. In fact the term "genre" is a short form for the more elaborate phrase "genre-specific semantic potential" … Genres can vary in delicacy in the same way as contexts can.
But for some given texts to belong to one specific genre, their structure should be some possible realisation of a given GSP [Generic Structure Potential] … It follows that texts belonging to the same genre can vary in their structure; the one respect in which they cannot vary without consequence to their genre-allocation is the obligatory elements and dispositions of the GSP. (Halliday & Hasan, 1985, p. 108)

[T]wo layers of context are needed -- with a new level of genre [italics added] posited above and beyond the field, mode and tenor register variables … Analysis at this level has concentrated on making explicit just which combinations of field, tenor and mode variables a culture enables, and how these are mapped out as staged, goal-oriented social processes [italics added]. (Eggins & Martin, 1997, p. 243)

These are rather theory-specific conceptualisations of genre, and are therefore a little opaque to those not familiar with systemic-functional grammar. The definition of genre in terms of "staged, goal-oriented social processes" (in the quote above, and in Martin, Christie, & Rothery, 1987), is, in particular, slightly confusing to those who are more concerned (or familiar) with genres as products (i.e., groupings of texts).

Ferguson (1994), on the other hand, offers a less theory-specific discussion. However, he is rather vague, and talks about (and around) the differences between the two terms while never actually defining them precisely: he seems to regard register as a "communicative situation that recurs regularly in a society" (p. 20) and genre as a "message type that recurs regularly in a community" (p. 21). Faced with such comparable definitions, readers will be forgiven for becoming a little confused. Also, is register only a "communicative situation," or is it a variety of language as well? In any case, Ferguson also seems to equate sublanguage with register (p.
20) and offers many examples of registers (e.g., cookbook recipes, stock market reports, regional weather forecasts) and genres (e.g., chat, debate, conversation, recipe, obituary, scientific textbook writing) without actually saying why any of the registers cannot also be thought of as genres or vice versa. Indeed, sharp-eyed readers will have noted that recipes are included under both register and genre.

Coming back to the systemic-functional approach, it will be noted that even among subscribers to the "genre-based" approach in language pedagogy (Cope & Kalantzis, 1993), opinions differ on the definition and meaning of genre. For J. R. Martin, as we have seen, genre is above and beyond register, whereas for Gunther Kress, genre is only one part of what constitutes his notion of register (a superordinate term). The following diagram illustrates his use of the terms:

Figure 3. Elements of the composition of text (Kress, 1993, p. 35)

Kress (1993) appears to dislike the fact that genre is made to carry too much baggage or different strands of information:

There is a problem in using such a term [genre] with a meaning that is relatively uncontrollable. In literary theory, the term has been used with relative stability to describe formal features of a text -- epitaph, novel, sonnet, epic -- although at times content has been used to provide a name, [e.g.] epithalamion, nocturnal, alba. In screen studies, as in cultural studies, labels have described both form and content, and at times other factors, such as aspects of production. Usually the more prominent aspect of the text has provided the name. Hence "film noir"; "western" or "spaghetti western" or "psychological" or "Vietnam western"; "sci-fi"; "romance"; or "Hollywood musical"; and similarly with more popular print media. (pp.
31-2)

In other words, Kress is complaining about the fact that

a great complex of factors is condensed and compacted into the term -- factors to do with the relations of producer and audience, modes of production and consumption, aesthetics, histories of form and so on. (p. 32)

He claims that many linguists, educators, and literacy researchers, especially those working within the Australian-based "genre theory/school" approach, use the term in the same all-encompassing way. Also, he is concerned that the work of influential people like Martin and Rothery has been focussed too much on presenting ideal generic texts and on the successive "unfolding" of "sequential stages" in texts (which are said to reflect the social tasks which the text producers perform; Paltridge, 1995, 1996, 1997):

The process of classification … seems at times to be heading in the direction of a new formalism, where the 'correct' way to write [any particular text] is presented to students in the form of generic models and exegeses of schematic structure. (Kress, 1993, p. 12)

Those familiar with Kress' work in critical discourse analysis (e.g., Kress & Hodge, 1979) should not be surprised to learn, however, that in his approach to genre the focus is instead:

… on the structural features of the specific social occasion in which the text has been produced [, seeing] these as giving rise to particular configurations of linguistic factors in the text which are realisations of, or reflect, these social relations and structures [,…e.g.] who has the power to initiate turns and to complete them, and how relations of power are realised linguistically. In this approach "genre" is a term for only a part of textual structuring, namely the part which has to do with the structuring effect on text of sets of complex social relations between consumers and producers of texts. [all italics added] (p.
33) As can be seen, therefore, there is a superficial terminological difference in the way genre is used by some theorists, but no real, substantive disagreement because they both situate it within the broader context of situational and social structure. While genre encompasses register and goes "above and beyond" it in Martin's (1993, Eggins & Martin, 1997) terms, it is only one component of the larger overarching term register in Kress' approach. My own preferred usage of the terms comes closest to Martin's, and will be described below. Before that, however, I will briefly consider two other attempts at clearing up the terminological confusion. Sampson (1997) calls for re-definitions of genre, register, and style and the relationships among them, but his argument is not quite lucid or convincing enough. In particular, his proposal for register to be recognised as fundamentally to do with an individual's idiolectal variation seems to go against the grain of established usage, and is unlikely to catch on. Biber (Finegan & Biber, 1994, pp. 51-53; 1995, pp. 7-10) does a similar survey, looking at the use of the terms register, genre, style, sublanguage, and text type in the sociolinguistic literature, and despairingly comes to the conclusion that register and genre, in particular, cannot be teased apart. He settles on register as "the general cover term associated with all aspects of variation in use" (1995, p. 9), but in so doing reverses his choice of the term genre in his earlier studies, as in Biber (1988) and Biber & Finegan (1989). (Further, as delineated in Finegan & Biber, 1994, Biber also rather controversially sees register variation as a very fundamental basis or cause of social dialect variation.) While hoping not to muddy the waters any further, I shall now attempt to state my position on this terminological issue. My own view is that style is essentially to do with an individual's use of language. 
So when we say of a text, "It has a very informal style," we are characterising not the genre to which it belongs, but rather the text producer's use of language in that particular instance (e.g., "It has a very quirky style"). The EAGLES (1996) authors are not explicit about their stand on this point, but say they use style to mean:

the way texts are internally differentiated other than by topic; mainly by the choice of the presence or absence of some of a large range of structural and lexical features. Some features are mutually exclusive (e.g. verbs in the active or passive mood), and some are preferential, e.g. politeness markers and mitigators. (p. 22)

As noted earlier, the main distinction they recommend for the stylistic description of corpus texts is formal/informal in combination with parameters such as the level of preparation (considered/impromptu), "communicative grouping" (conversational group; speaker/writer and audience; remote audiences) and "direction" (one-way/interactive). This chimes with my suggestion that we should use the term style to characterise the internal properties of individual texts or the language use by individual authors, with "formality" being perhaps the most important and fundamental one. Joos's (1961) five famous epithets "frozen," "formal," "informal," "colloquial," and "intimate" come in handy here, but these are only suggestive terms, and may be multiplied or sub-divided endlessly, since they are but five arbitrary points on a sliding scale. On a more informal level, we may talk about speakers or writers having a "humorous," "ponderous," or "disjointed" style, or having a "repertoire of styles." Thus, describing one text as "informal" in style is not to say the speaker/writer cannot also write in a "serious" style, even within the same genre.
The two most problematic terms, register and genre, I view as essentially two different points of view covering the same ground. In the same way that any stretch of language can simultaneously be looked at from the point of view of form (or category), function, or meaning (by analogy with the three sides of a cube), register and genre are in essence two different ways of looking at the same object.7 Register is used when we view a text as language: as the instantiation of a conventionalised, functional configuration of language tied to certain broad societal situations, that is, variety according to use. Here, the point of view is somewhat static and uncritical: different situations "require" different configurations of language, each being "appropriate" to its task, being maximally "functionally adapted" to the immediate situational parameters of contextual use. Genre is used when we view the text as a member of a category: a culturally recognised artifact, a grouping of texts according to some conventionally recognised criteria, a grouping according to purposive goals, culturally defined. Here, the point of view is more dynamic and, as used by certain authors, incorporates a critical linguistic (ideological) perspective: Genres are categories established by consensus within a culture and hence subject to change as generic conventions are contested/challenged and revised, perceptibly or imperceptibly, over time. Thus, we talk about the existence of a legal register (focus: language), but of the instantiation of this in the genres of "courtroom debates," "wills" and "testaments," "affidavits," and so forth (focus: category membership). We talk about a formal register, where "official documents" and "academic prose" are possible exemplar genres. 
In contrast, there is no literary register, but, rather, there are literary styles and literary genres, because the very essence of imaginative writing is idiosyncrasy or creativity and originality (focus on the individual style). My approach here thus closely mirrors that of Fairclough (2000, p. 14) and Eggins & Martin (1997). The latter say that "the linguistic features selected in a text will encode contextual dimensions, both of its immediate context of production (i.e., register) and of its generic identity (i.e., genre), what task the text is achieving in the culture" (p. 237), although they do not clearly set out the difference in terms of a difference in point of view, as I have done above. Instead, as we have seen, they attempt in rather vague terms to define register as a variety "organised by metafunction" (Field, Tenor, Mode) and genre as something "above and beyond metafunctions." In Biber's (1994) survey of this area of terminological confusion, he mentions the use of terminology by Couture (1986), but fails to note a crucial distinction apparently made by the author: Couture's examples of genres and registers seem to be more clearly distinguished than in other studies of this type. For example, registers include the language used by preachers in sermons, the language used by sports reporters in giving a play-by-play description of a football game, and the language used by scientists reporting experimental research results. Genres include both literary and non-literary text varieties, for example, short stories, novels, sonnets, informational reports, proposals, and technical manual. [all italics added] (Finegan & Biber, 1994, p. 
52)

Biber does not point out that a key division of labour between the two terms is being made here which has nothing to do with the particular examples of activity types, domains, topics, and so forth: whenever register is used, Couture is talking about "the language used by…", whereas when genre is used, we are dealing with "text varieties" (i.e., groupings of texts).

I contend that it is useful to see the two terms genre and register as really two different angles or points of view, with register being used when we are talking about lexico-grammatical and discoursal-semantic patterns associated with situations (i.e., linguistic patterns), and genre being used when we are talking about memberships of culturally-recognisable categories. Genres are, of course, instantiations of registers (each genre may invoke more than one register) and so will have the lexico-grammatical and discoursal-semantic configurations of their constitutive registers, in addition to specific generic socio-cultural expectations built in.

Genres can come and go, or change, being cultural constructs which vary with the times, with fashion, and with ideological movements within society. Thus, some sub-genres of "official documents" in English have been observed to have changed in recent times, becoming more conversational, personal, and familiar, sometimes in a deliberate way, with manipulative purposes in mind (Fairclough, 1992). The genres have thus changed in terms of the registers invoked (an aspect of intertextuality), among other changes, but the genre labels stay the same, since they are descriptors of socially constituted, functional categories of text.

Much of the confusion comes from the fact that language itself sometimes fails us, and we end up using the same words to describe both language (register or style) and category (genre).
For example, "conversation" can be a register label ("he was talking in the conversational register"), a style label ("this brochure employs a very conversational style"), or a genre label ("the [super-]genre of casual/face-to-face conversations," a category of spoken texts). Similarly, weather reports are cited by Ferguson (1994) as forming a register (from the point of view of the language being functionally adapted to the situational purpose), but they are surely also a genre (a culturally recognised category of texts). Ferguson gives "obituaries" as an example of a genre, but fails to recognise that there is not really a recognisable "register of obituaries" only because the actual language of obituaries is not fixed or conventionalised, allowing considerable variation ranging from humorous and light to serious and ponderous.

Couture (1986) also offers an additional angle on the distinction between register and genre:

While registers impose explicitness constraints at the level of vocabulary and syntax, genres impose additional explicitness constraints at the discourse level … Both literary critics and rhetoricians traditionally associate genre with a complete, unified textual structure. Unlike register, genre can only be realized in completed texts or texts that can be projected as complete, for a genre does more than specify kinds of codes extant in a group of related texts; it specifies conditions for beginning, continuing, and ending a text. (p. 82)

The important point being made here is that genres are about whole texts, whereas registers are about more abstract, internal/linguistic patterns, and, as such, exist independently of any text-level structures.

In summary, I prefer to use the term genre to describe groups of texts collected and compiled for corpora or corpus-based studies.
Such groups are all more or less conventionally recognisable as text categories, and are associated with typical configurations of power, ideology, and social purposes, which are dynamic/negotiated aspects of situated language use. Using the term genre will focus attention on these facts, rather than on the rather static parameters with which register tends to be associated. Register has typically been used in a very uncritical fashion, to invoke ideas of "appropriateness" and "expected norms," as if situational parameters of language use have an unquestionable, natural association with certain linguistic features and that social evaluations of contextual usage are given rather than conventionalised and contested. Nevertheless, the term has its uses, especially when referring to that body of work in sociolinguistics which is about "registral variation," where the term tells us we are dealing with language varying according to socio-situational parameters. In contrast, the possible parallel term "genre/generic variation" does not seem to be used, because while you can talk about "language variation according to social situations of use," it makes no sense to talk about "categories of texts varying according to the categories they belong to." Of course, I am not saying that genres do not have internal variation (or sub-genres). I am saying that "genre variation" makes no sense as a parallel to "register variation" because while you can talk about language (registers) varying across genres, it is tautologous to talk about genres (text categories) varying across genres or situations. In other words, when we study differences among genres, we are actually studying the way the language varies because of social and situational characteristics and other genre constraints (registral variation), not the way texts vary because of their categorisation.
Genres as Basic-Level Categories in a Prototype Approach

One problem with genre labels is that they can have so many different levels of generality. For example, some genres such as "academic discourse" are actually very broad, and texts within such a high-level genre category will show considerable internal variation: that is, individual texts within such a genre can differ significantly in their use of language (as, for example, Biber, 1988, has shown). A second problem, as Kress noted, is that different "genres" can be based on so many different criteria (domain, topic, participants, setting, etc.).

There is a possible solution to this. Steen (1999) is an interesting attempt at applying prototype theory (Rosch, 1973a, 1973b, 1978; Taylor, 1989) to the conceptualisation of genre (and hence to the formalisation of a taxonomy of discourse; cf. also Paltridge, 1995, who made a similar argument but from a different perspective). Basically, the prototype approach can be summarised by Table 2 (which represents my understanding of Steen's ideas; my own suggestions are marked by "?"):

Table 2. A Prototype Approach to Genre

SUPERORDINATE                 BASIC-LEVEL                   SUBORDINATE [PROTOTYPE]
Mammal                        Dog/Cat                       Cocker spaniel / Siamese
Literature ["SUPERGENRE"?]    Novel, Poem, Drama [GENRE]    Western, Romance, Adventure [SUB-GENRE]
Advertising ["SUPERGENRE"]    Advertisement [GENRE]         Print ad, Radio ad, TV ad, T-shirt ad [SUB-GENRE]

Basic-level categories are those which are in the middle of a hierarchy of terms. They are characterised as having the maximal clustering of humanly-relevant properties (attributes), and are thus distinguishable from superordinate and subordinate terms: "It is at the basic level of categorization that people conceptualize things as perceptual and functional gestalts" (Taylor, 1989, p. 48). A basic-level category, therefore, is one for which human beings can easily find prototypes or exemplars, as well as less prototypical members.
Subordinate-level categories, therefore, operate in terms of prototypes or fuzzy boundaries: some are better members than others, but all are valid to some degree because they are cognitively salient along a sliding scale. We can also extend this fuzzy-boundary approach to the other levels (basic-level and superordinate) to account for all kinds of mixed genres and super-genres (e.g., to what degree can Shakespeare's dramas be said to be different from poetry? When does good advertising become a form of literature or vice versa?). Steen (1999) applies the idea of basic-level categories and their prototypes to the conceptualisation of genre as follows:

It is presumably the level of genre that embodies the basic level concepts, whereas subgenres are the conceptual subordinates, and more abstract classes of discourse are the superordinates. Thus the genre of an advertisement is to be contrasted with that of a sermon, a recipe, a poem, and so on. These genres differ from each other on a whole range of attributes … The subordinates of the genre of the advertisement are less distinct from each other. The press advertisement, the radio commercial, the television commercial, the Internet advertisement, and so on, are mainly distinguished by one feature: their medium. The superordinate of the genre of the ad, advertising, is also systematically distinct from the other superordinates by means of only one principal attribute, the one of domain: It is "business" for advertising, but it exhibits the respective values of "religious", "domestic" and "artistic" for the other examples. [all italics added] (p. 112)

Basically, Steen is proposing that we can recognise genres by their cognitive basic-level status: True genres, being basic-level, are maximally distinct from one another (in terms of certain "attributes" to be discussed below), whereas members at the level of sub-genre (which operate on a prototype basis) or "super-genre"8 have fewer distinctions among themselves. The proposal is for genres to be treated as basic-level categories which are characterised by (provisionally) a set of seven attributes: domain (e.g., art, science, religion, government), medium (e.g., spoken, written, electronic), content (topics, themes), form (e.g., generic superstructures, à la van Dijk (1985), or other text-structural patterns), function (e.g., informative, persuasive, instructive), type (the rhetorical categories of "narrative," "argumentation," "description," and "exposition") and language (linguistic characteristics: register/style[?]). Steen offers only a preliminary sketch of this approach to genre (and hence to a taxonomy of discourse), and, as it stands, it appears to be too biased towards written genres. Other attributes can (and should) be added: for example, setting or activity type, to distinguish a broadcast interview from a private interview; or audience level, to distinguish public lectures from university lectures (and both attributes to distinguish the latter from school classroom lessons). Another point is that dependencies among the attributes exist (many values for domain, medium, and content are typically co-selected, for instance). Nevertheless, the approach looks like a promising one, and when fully developed will help us sort out genres from sub-genres.

"GENRES" IN CORPORA

Applying this "fuzzy categories" way of looking at genre to corpus studies, we can see that the categories to which texts have been assigned in existing corpora are sometimes genres, sometimes sub-genres, sometimes "super-genres" and sometimes something else altogether.
(This is undoubtedly why the catchall term "text category" is used in the official documentation for the LOB and ICE-GB corpora. Most of these "text categories" are equivalent to what I am calling "genres" in the BNC Index.) For example, consider ICE-GB corpus categories in Table 3.

Table 3. Text Categories in ICE-GB (figures in parentheses indicate the number of 2,000-word texts in each category)

Medium I | Medium II (?) or Interaction Type (?) | Super-genre or Function | Genres or Sub-genres
SPOKEN (300) | Dialogue (180) | Private (100) | face-to-face conversations (90); phone calls (10)
 | | Public (80) | classroom lessons (20); broadcast discussions (20); broadcast interviews (10); parliamentary debates (10); legal cross-examinations (10); business transactions (10)
 | Monologue (100) | Unscripted (70) | spontaneous commentaries (20); unscripted speeches (30); demonstrations (10); legal presentations (10)
 | | Scripted (30) | broadcast talks (20); non-broadcast speeches (10)
 | Mixed (20) | | broadcast news (20)
WRITTEN (200) | Non-Printed (50) | Non-professional writing (20) | student essays (10); student examination scripts (10)
 | | Correspondence (30) | social letters (15); business letters (15)
 | Printed (150) | Academic writing (40) | humanities (10); social sciences (10); natural sciences (10); technology (10)
 | | Non-academic writing (40) | humanities (10); social sciences (10); natural sciences (10); technology (10)
 | | Reportage (20) | press news reports (20)
 | | Instructional writing (20) [shaded] | administrative/regulatory (10); skills/hobbies (10)
 | | Persuasive writing (10) [shaded] | press editorials (10)
 | | Creative writing (20) | novels/stories (20)

The top row of the table is my attempt at describing what attribute(s) or levels the terms within each column represent.
The terms within the last column are what end-users of the corpus normally work with, and can be seen to be either genres or sub-genres, viewed from a prototype perspective (e.g., "broadcast interview" is probably best seen as a sub-genre of "interview," differing mainly in terms of the setting, and business letters differ from social letters mainly in terms of domain). Most of the terms in the third column can be said to describe "super-genre" or "super-super-genres," with the exception of "instructional writing" and "persuasive writing" (shaded), which seem more like functional labels.9

The British National Corpus (BNC), in contrast, has no text categorisation for written texts beyond that of domain, and no categorisation for spoken texts except by "context" and demographic/socio-economic classes. The following diagram shows the breakdown of the BNC:

Figure 4. Domains in the British National Corpus (BNC)

It can be seen that for the written texts, domains are broad "subject fields" (see Burnard, 1995). These are closely paralleled for the spoken texts by even broader "context" categories covering the major spheres of social life (leisure, business, education, and institutional/public contexts). Apart from considering all the demographically sampled conversations as constituting one super-genre of "casual conversation" and all the written imaginative texts as forming a super-genre "literature," genres cannot easily be found at all under the current domain scheme. More about these BNC categories and their (non-) usefulness will be said in later sections.

Moving on to the LOB corpus (Table 4), we see that it is mostly composed of a mixture of genre and sub-genre labels:

Table 4. Genres in the LOB Corpus

LOB Corpus (Written)
Press: reportage
Press: editorial
Press: reviews
Religion
Skills, trades & hobbies
Popular Lore
Belles lettres, biography, essays
Misc (gov docs, foundation reports, industry reports, college reports, inhouse organ)
Learned/scientific writings
General fiction
Mystery & detective fiction
Science fiction
Adventure & western fiction
Romance & love story
Humour

Examined in terms of Steen's genre attributes, the shaded cells in Table 4 above are clearly sub-genres of some general super-genre of "fiction" (both "novels" and "short stories" -- the basic-level genres in Steen's taxonomy -- are included). "Religion," on the other hand, appears to be a domain label since it brings together disparate books, periodicals and tracts whose principal common feature is that they are concerned with religion (in this case Christianity).10

Why do we have all these different levels or types of categorisation? It is tempting to believe that this is the case because the corpus compilers felt that these were the most useful, salient, or interesting categories -- perhaps these are basic-level genres, or prototypical sub-genres (especially those which keep appearing in different corpora). But is it a problem that the categories differ in terms of their defining attributes and in terms of generality? My personal opinion is that it is not. Cranny-Francis (1993, p. 109) touches on this point and asks:

If "genre" has this range of different meanings and classificatory procedures -- by formal characteristics, by field -- we might ask what is its value? Why is it so useful to educators, linguists and critics, as well as to publishers, filmmakers, booksellers, readers and viewers?

She suggests that the reason is simply because genre "is never simply formal or semantic [based on field or subject area] and it is not even simply textual."
Using the terms as defined in this paper, we could paraphrase this to read, "genre is never just about situated linguistic patterns (register), functional co-occurrences of linguistic features (text types), or subject fields (domain), and it is not even simply about text-structural/discoursal features (e.g., Martin's [1993] generic stages, Halliday & Hasan's [1985] GSPs, van Dijk's [1985] macrostructures, etc.)." It is, in fact, all of these things. This makes it a messy and complex concept, but it is also what gives it its usefulness and meaningfulness to the average person. They are all genres (whether sub- or super-genres or just plain basic-level genres). The point of all this is that we need not be unduly worried about whether we are working with genres, sub-genres, domains, and so forth, as long as we roughly know what categories we are working with and find them useful. We have seen that the categories used in various corpora are not necessarily all "proper" genres in a traditional/rhetorical sense or even in terms of Steen's framework, but they can all be seen as "genres" at some level in a fuzzy-category, hierarchical approach. A genre is a basic-level category, which has specified values for most of the seven attributes suggested above and which is maximally distinct from other categories at the same level. "Sub-genres" and "super-genres" are simply other (fuzzy) ways of categorising texts, and have their uses too. The advantages of the prototype approach are that (a) gradience or fuzziness between and within genres is accorded proper theoretical status, and (b) overlapping of categories is not a problem (thus texts can belong to more than one genre).
From one point of view, until we have a clear taxonomy of genres, it may be advisable to put most of our corpus genres in quotation marks, because genre is also often used in a folk linguistic way to refer to any more-or-less coherent category of text which a mature, native speaker of a language can easily recognise (e.g., newspaper articles, radio broadcasts), and there are no strict rules as to what level of generality is allowable when recognising genres in this sense. In a prototype approach, however, it does not seriously matter. Some text categories may be based more on the domain of discourse (e.g., "business" is a domain label in the BNC for any spoken text produced within a business context, whether it is a committee meeting or a monologic presentation). Spoken texts, which tend to be even more loosely classified in corpus compilations, may simply be categorised on whether they are spontaneous or planned, broadcast or spoken face-to-face, as in the London-Lund Corpus, for instance, which means the categories are "genres" only in a very loose sense. This goes to show that there are still serious issues to grapple with in the conceptualisation of spoken genres (written ones are, in contrast, typically easier to deal with) but that a prototype approach, with its many levels of generality and a set of defining attributes, may help to tighten up our understanding. These brief visits to the various corpora suggest that there should not be any serious objections (theoretical or otherwise) to the use of the term genre to describe most of the corpus categories we have seen. Such usage reflects a looser approach, but there is no requirement for genres to actually be established literary or non-literary genres, only for them to be culturally recognisable as groupings of texts at some level of abstraction. The various corpora also show us that the recognition of genres can be at different levels of generality (e.g., "sermons" vs. "religious discourse"). 
In the LOB corpus, the category labels appear to be a mix: some are sub-genre labels (e.g., "mystery fiction" and "detective fiction"), while others are more properly seen as domain labels ("Skills, trades, & hobbies," "Religion"). My own preferred approach with regard to developing a categorisation scheme is to use genre categories where possible, and domain categories where they are more practical (e.g., "Religion"11).

THE BNC JUNGLE: THE NEED FOR A PROPER NAVIGATIONAL MAP

Having clarified some of the terminology and concepts and looked at the categories used in a few existing corpora, I want to move on to consider some of the problems with the British National Corpus as it now stands, and then introduce a new resource called the BNC Index which (it is hoped) will make it easier for researchers and language learners/teachers to navigate through the numerous texts to find what they need.

Some Existing Problems

Overly Broad Categories. The first problem that prompts the need for a navigational map has to do with the broadness and inexplicitness of the BNC classification scheme. For example, academic and non-academic texts under the domains "Applied Science," "Arts," "Pure/Natural Science," "Social Science," and so forth, are not explicitly differentiated. (It is interesting to note, in this connection, that under the attribute of "genre" in the "text typology" of Atkins et al., 1992, p. 7, no mention is made of the useful distinction between academic and non-academic prose, even though this is employed in one of the earliest corpora, the LOB corpus, where the "learned" category has proved to be among the most popular with linguists.) Another example that points to the inadequacy of the BNC's categorisation of texts is the way "imaginative" texts are handled.
A wide variety of imaginative texts (novels, short stories, poems, and drama scripts) is included in the BNC, which is a good thing because the LOB, for example, does not contain poetry or drama. However, such inclusions are practically wasted if researchers are not actually able to easily retrieve the sub-genres on which they want to work (e.g., poetry) because this information is not recorded in the file headers or in any documentation associated with the BNC. There is at present no way to know whether an "imaginative" text actually comes from a novel, a short story, a drama script or a collection of poems (unless the title actually reflexively includes the words "a novel" or "poems by XYZ"). For example, given text files with titles like "For Now" or "The kiosk on the brink," there is no way of knowing that both of these are actually collections of poems. All the BNC bibliography and file headers tell us is that these are "imaginative" texts, taken from "books." Classification Errors and Misleading Titles. In the process of some previous research, I found that there were many classificatory mistakes in the BNC (and also in the BNC Sampler): some texts were classified under the wrong category, usually because of a misleading title. For the same reason, even though a limited, computer-searchable bibliographical database of the BNC texts exists12 (compiled by Adam Kilgarriff), not enough information is included there, and researchers cannot always rely on the titles of the files as indications of their real contents: For example, many texts with "lecture" in their title are actually classroom discussions or tutorial seminars involving a very small group of people, or were popular lectures (addressed to a general audience rather than to students at an institution of higher learning). 
A good reason for a navigational map, then, is that it lets us go beyond the existing information we have about the BNC files (and beyond the mistakes) and provide genre classifications, so that researchers do not have just the titles of files to go on.

Sub-Genres Within a Single File. Another problem, which will only be touched on briefly because there is no real solution, is that some BNC files are too big and ill-defined in that they contain different genres or sub-genres. For example, newspaper files described in the title as containing "editorial material" include letters-to-the-editor, institutional editorials (those written by the editor), and personal editorials (commentaries/personal columns written by journalists or guest writers), and some courtroom files contain both legal cross-examinations (which are dialogic) and legal presentations (summing-up monologues by barristers or judges). This is a problem for lines of linguistic enquiry that rely on relatively homogeneous genres. It is a problem, however, which cannot be solved easily because the splitting of files is beyond the scope of most end-users of the BNC. The problem is just mentioned here as a caution to researchers.

Domains Versus Genres: The BNC Sampler & Why We Need Genre Information

The BNC Users' Reference Guide states that only three criteria were used to "balance" the corpus: domain, time, and medium. In choosing texts for inclusion into the BNC Sampler (the 2-million word sub-set of the BNC), domain was probably the most important criterion used to ensure a wide-enough coverage of a variety of texts. On the BNC Web page for the Sampler, the following comment on its representativeness is made:

In selecting from the BNC, we tried to preserve the variety of text-types represented, so the Sampler includes in its 184 texts many different genres [italics added] of writing and modes of speech.
It should be noted that no real claim to representativeness is made, and that what they really meant was that many different texts were chosen on the basis of domain and other criteria.13 The fact that the Sampler contains many different genres is not in doubt, but the texts were not chosen on this basis, since they had no genre classification, and hence the Sampler cannot (and, indeed, it does not) claim to be representative in terms of "genre." It is my belief that it is because "domain" is such a broad classification in the BNC that the Sampler turned out to be rather unrepresentative of the BNC and of the English language. Anyone wishing to use the Sampler should be under no illusion that it is a balanced corpus or that it represents the full range of texts as in the full BNC. The Sampler may be broadly balanced in terms of the domains, but when broken down by genre, a truer picture emerges of exactly how (un)representative it really is. Appendix A lists missing or unrepresentative genres in the BNC Sampler which demonstrate this. "Genre" is perhaps a more insightful classification criterion than "domain," at least as far as getting a representatively balanced corpus is concerned. If the compilers of the BNC Sampler had known the genre membership of each BNC text, they would probably have created a more balanced and representative subcorpus. As things stand, however, any conclusions about "spoken English" or "written English" made on the basis of the BNC Sampler will have to be evaluated very cautiously indeed, bearing in mind the genres missing from the data. There is another example of how large, undifferentiated categories similar to domain can unhelpfully lump disparate kinds of text together. Wikberg (1992) criticises the LOB text category E ("Skills, trades, and hobbies") as being too baggy or eclectic.
He demonstrates how, on the evidence of both external and internal criteria, the texts in Category E can actually be better sub-classified into "procedural" versus "non-procedural" discourse. He also notes that it is not just text categories that can be heterogeneous. Sometimes texts themselves are "multitype" or mixed in terms of having different stages with different rhetorical or discourse goals. He thus concludes with the following comment: An important point that I have been trying to make is that in the future we need to pay more attention to text theory when compiling corpora. For users of the Brown and the LOB corpora, and possibly other machine-readable texts as well, it is also worth noting the multitype character of certain text categories. (p. 260) This is a piece of advice worth noting. THE BNC (BIBLIOGRAPHICAL) INDEX The BNC Index spreadsheet I am about to describe was created as one solution to the previously mentioned problems and difficulties. It is similar to the plain text ones prepared by Adam Kilgarriff that I have benefited from and found rather useful.14 However, those files do not contain all the details which are needed for compiling your own sub-corpus (author type, author age, author sex, audience type, audience sex, section of text sampled, [topic] keywords, etc.). Sebastian Hoffmann's files were useful too, in a complementary way, but these do not include (a) keywords and (b) the full bibliographical details of files. A third existing resource, the "bncfinder.dat" file that comes with the standard distribution of the BNC (version 1) has most of the header information, but in the form of highly abbreviated numeric codes, and also does not include any bibliographical information about the files or keywords. 
The BNC Index consolidates the kinds of information available in the above three resources, but, in addition, includes (a) BNC-supplied keywords (as entered in the file headers by the compilers); (b) COPAC keywords15 for published non-fiction texts16 (topic keywords entered by librarians); (c) full bibliographical details (including title, date and publisher for written texts, and number of participants for spoken files); (d) an extra level of text categorisation, "genre," where each text is assigned to one of the 70 genres or sub-genres (24 spoken and 46 written) developed for the purposes of this Index; (e) a column supplying "Notes & Alternative Genres," where texts which are interdisciplinary in subject matter or which can be classified under more than one genre are given alternative classifications. Also entered here are extra notes about the contents of files (e.g., where a single BNC file contains several sub-genres within it, such as postcards, letters, faxes, etc., these are noted). These extra notes are the result of random, manual checks: not all files have been subjected to such detailed analysis. For some written texts taken from books, the title of the book series is also given under this column (e.g., file BNW, "Problems of unemployment and inflation," is part of the Longman book series "Key issues in economics and business"). It is hoped that this will be a comprehensive, user-friendly, "one-stop" database of information on the BNC. All the information is presented using a minimum of abbreviations or numeric codes, for ease of use. For example, m_pub (for "miscellaneous published") is used instead of a cryptic numeric code for the medium of the text, and domains are likewise indicated by abbreviated strings (e.g., W_soc_science, S_Demog_AB) rather than numbers.
It should be noted that I carried out the genre categorisation of all the texts by myself: This ensures consistency, but it also means that some decisions may be debatable. The pragmatic point of view I am taking is that something is better than nothing, and that it is beneficial to start with a reasonable genre categorisation scheme and then let end-users report problems/errors and dictate future updates and improvements. When compiling a sub-corpus for the purpose of research, classroom concordancing, genre-based learning, and so forth, you need all the available information you can get. With the BNC Index, it is now possible, for example, to separate children's prose fiction from adult prose fiction by combining information from the "audience age" field and the newly introduced "genre" field (using domain alone would have included poems as well). All the information in the spreadsheet is up-to-date and as accurate as possible, and supersedes the information given in the actual file headers and the "bncfinder.dat" file distributed with the BNC (version 1), both of which are known to contain many errors. Changes and corrections to erroneous classifications were made both after extensive manual checks and on the basis of error reports made by others. The following section lists and explains all the columns/fields of information given in the BNC Index. Some of the genre categories are still being worked on, however, and may change in the final release of the Index.

Notes on the BNC Index

For spoken files, there are only eight relevant fields of information, giving the following self-explanatory details (abbreviations are explained in Table 6):17

Field | Value
File ID | FLX
Domain | S_cg_education
Genre | S_classroom
Keywords | natural & pure science; chemistry
Word Total | 5,142
Interaction Type | Dialogue
Mode | S
Bibliographical Details | 11th year science lesson: lecture in chemistry of metal processing (Edu/inf). Rec. on 23 Mar 1993 with 2 partics, 381 utts

Note that Mode only distinguishes broadly between spoken (S) and written (W). To further restrict searches to only "demographic" files or only "context-governed" files, the Domain field should be used.

For written files, there can be up to 19 fields of information (depending on the file: fields which do not apply to a particular file are left blank). As an example, the entry for AE7 is as follows:

Field | Value
File ID | AE7
Medium | book
Domain | W_nat_science
Genre | W_non_ac_nat_science
Notes & Alternative Genres | Also W_non_ac_humanities_arts
COPAC Keywords | Biology; Philosophy
Keywords | molecular genetics
Total Words | 36,115
Circulation Status | M
Period Composed | 1985-1994
Sampling | mid
Bibliographical details | The problems of biology. Maynard Smith, John. Oxford: OUP, 1989, pp. 9-109. 1686 s-units.
Audience Age | adult
Audience Sex | mixed
Audience Level | high
Mode | W
Author Age | 60+ yrs
Author Sex | Male
Author Type | Sole

The information fields are explained more fully in the BNC Users' Reference Guide, but here is a brief explanation of some of them: The table above tells us that file AE7 is a sample extracted from the middle (Sample Type) of a book (Medium), whose Circulation Status is Medium (this refers to the number of receivers of the text),18 whose author (Author Age/Sex/Type) is 60+ yrs old (age band 6 in terms of BNC codes), is Male and is the Sole author of the text. The text has been manually classified as "non_academic prose, natural sciences" (Genre), although it also deals with philosophical issues (COPAC Keywords) and thus may also be considered under "W_non_ac_humanities_arts." The target audience for the text is adult, of both sexes (mixed), and high-level (original BNC numerical code="level 3").
The BNC compilers have classified it under "natural sciences" (Domain),19 and the text was composed in the period 1985-1994 (Period Composed).20 The Bibliographical Details field gives us the title of the text (The Problems of Biology), its author, publisher, and so forth, and an indication of the number of sentences ("s-units"), while the (BNC compilers') Keywords field supplies the detail that the book is about molecular genetics (COPAC and BNC keywords tend to be about topic, and are sometimes useful for sub-genre identification). The page numbers under Bibliographical Details were, in this case and many others, not actually given in the original BNC bibliography, but were manually added to the Index after I had searched in the file for the page break SGML elements. This is to allow proper, complete referencing (the original bibliographical reference would have been "pp. ??"). However, some files did not have page breaks encoded at all, and thus their bibliographical references remain incomplete. A list of all possible values for the closed-set fields (the keyword fields are open-ended) is given in Appendix B. With all these fields of information put together in one database/spreadsheet, where they can be combined with one another, it becomes easy to scan the BNC for whatever particular kinds of text you are interested in.

Further Notes on the Genre Classifications

The genre categories used in the BNC Index were chosen after a survey of the genre categorisation schemes of other existing corpora (e.g., LLC, LOB, ICE-GB) and will thus be familiar to users and compatible with these other corpora, allowing comparative studies based on genres taken from different corpora. These genre labels have been carefully selected to capture as wide a range as possible of the numerous types of spoken and written texts in the English language, and the divisions are more fine-grained than the domain categories used in the BNC itself.
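As a rough illustration of the kind of field combination described above (for example, separating children's prose fiction from adult prose fiction), the following Python sketch filters a handful of hand-typed rows. It is only a sketch under stated assumptions: the file IDs X01 to X03, the lower-case field names, and the audience-age value "child" are invented for the example and are not the Index's documented layout; only the genre labels and the AE7 values are taken from this article.

```python
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical rows: X01-X03 are made-up file IDs, and "child" is an assumed
# audience-age value. Only the AE7 values are taken from the article.
ROWS: List[Row] = [
    {"file_id": "X01", "genre": "W_fict_prose", "audience_age": "adult"},
    {"file_id": "X02", "genre": "W_fict_prose", "audience_age": "child"},
    {"file_id": "X03", "genre": "W_fict_poetry", "audience_age": "child"},
    {"file_id": "AE7", "genre": "W_non_ac_nat_science", "audience_age": "adult"},
]


def select(rows: List[Row], **criteria: str) -> List[Row]:
    """Return the rows whose fields equal every given criterion."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Combining genre with audience age keeps poems out of the fiction sub-corpus.
children_prose = select(ROWS, genre="W_fict_prose", audience_age="child")
adult_prose = select(ROWS, genre="W_fict_prose", audience_age="adult")

print([row["file_id"] for row in children_prose])
print([row["file_id"] for row in adult_prose])
```

Filtering on the genre field rather than on domain is what excludes the poetry file here, which is the point made earlier about domain alone being too coarse.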
Note that some genre labels are hierarchically nested so that, for example, if you simply want to study "prototypical academic English" and are not concerned with the sub-divisions into social sciences, humanities, and so forth, you can find all such files by searching for "W_ac*" and specifying "high" for "audience level."21 Or if you are interested in the Language Learning & Technology 56 David Lee Genres, Registers, Text Types, Domains, and Styles language of the social sciences, whether spoken or written, you can similarly use wildcards to search for "*_soc_science." In general, where further sub-genres can be generated on-the-fly through the use of other classificatory fields, they are not given their own separate genre labels, to avoid clutter. For instance, "academic texts" can be further sub-divided into" (introductory) textbooks" and "journal articles," but since this can very easily be done by using the medium field (i.e., by choosing either "book" or "periodical"), the sub-genres have not been given their own separate labels. Instead, end-users are encouraged to use available fields to create their own sub-classificatory permutations. The "genre" labels here are therefore meant to provide starting points, not a definitive taxonomy. Table 5 shows the breakdown of the genre categories used in the BNC Index spreadsheet more clearly than in the earlier table, and also shows the super-genres that some researchers may want to study (made possible by the use of hierarchical genre labels). Table 5. 
Breakdown of BNC Genres in proposed classificatory scheme22 BNC SPOKEN Super Genre S_brdcast_discussn S_brdcast_documentar Broadcast y S_brdcast_news S_classroom S_consult S_conv S_courtroom S_demonstratn S_interview Interviews S_interview_oral_histor y S_lect_commerce S_lect_humanities_arts S_lect_nat_science Lectures S_lect_polit_law_edu S_lect_soc_science S_meeting S_parliament S_pub_debate S_sermon S_speech_scripted Speeches S_speech_unscripted S_sportslive S_tutorial S_unclassified Language Learning & Technology BNC WRITTEN W_ac_humanities_arts W_ac_medicine W_ac_nat_science W_ac_polit_law_edu W_ac_soc_science W_ac_tech_engin W_admin W_advert W_biography W_commerce W_email W_essay_sch W_essay_univ W_fict_drama W_fict_poetry W_fict_prose W_hansard W_institut_doc W_instructional W_letters_personal W_letters_prof W_misc W_news_script W_newsp_brdsht_nat_arts W_ newsp_brdsht_nat _commerce W_ newsp_brdsht_nat _editorial W_ newsp_brdsht_nat _misc W_ newsp_brdsht_nat _reportage W_ newsp_brdsht_nat _science W_ newsp_brdsht_nat _social Super Genre Academic prose Non-printed essays Fiction23 Letters Broadsheet national newspapers 57 David Lee Genres, Registers, Text Types, Domains, and Styles W_ newsp_brdsht_nat _sports W_newsp_other_arts W_newsp_other_commerce W_newsp_other_report W_newsp_other_science W_newsp_other_social W_newsp_other_sports W_newsp_tabloid W_non_ac_ humanities_arts W_non_ac_medicine W_non_ac_nat_science W_non_ac_polit_law_edu W_non_ac_soc_science W_non_ac_tech_engin W_pop_lore W_religion Regional & local newspapers Tabloid newspapers Non-academic prose (non-fiction) It will be noted that aspects of this genre classification scheme mirror the ICE-GB corpus (see Table 5 for the ICE-GB categories), although I have made finer distinctions in some cases (e.g., the lecture and broadsheet sub-genres) and grouped texts differently (e.g., I have "nested" all broadsheet newspaper material together rather than into separate functional groups as in the ICE-GB 
(cf. "Reportage" and "Persuasive writing" in Table 5)). In some respects, the scheme also follows the Lancaster-Oslo/Bergen (LOB) corpus quite closely. This was done deliberately, to facilitate diachronic/comparative research.24 For example, here is how the various subject disciplines are categorised in the LOB corpus and in the BNC Index:

Table 6. LOB Corpus Categories Broken Down into Component Disciplines

| LOB (& BNC Index) Category | Subjects/Disciplines |
| --- | --- |
| Humanities | Philosophy, History, Literature, Art, Music |
| Social sciences | Psychology, Sociology, Linguistics, Social Work |
| Natural sciences | Physics, Chemistry, Biology |
| Medicine | -- |
| Politics, Law, Education | -- |
| Technology & Engineering | Computing, Engineering |

One difference from the LOB corpus is that economics texts in the BNC Index are not put under "politics, law and education," but are instead put under the "W_commerce" genre. Also, archaeology and architecture have been classified as humanities or arts subjects under the present scheme, while geography is classed as either a social or a natural science depending on the branch of geography. Geology has been classed as a natural science. One mathematics textbook file for primary/elementary schools was simply put under "miscellaneous," while university-level mathematical texts were put under either "natural_sciences" or "technology & engineering" depending on whether they were pure or applied.25

It should also be noted that some texts are a mixture of disciplines (e.g., history and politics often go hand in hand, but the two are separate categories under this scheme). In such cases, a more or less arbitrary assignment was made, based on what was judged to be the dominant point of view in the text, and, in the case of printed publications, after consultation of the keywords for the text in library catalogues (see discussion which follows).
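The wildcard searches and super-genres that the hierarchical genre labels make possible can be sketched in a few lines of code. The following is only a minimal illustration in Python, not part of the BNC Index or of any program distributed with it: the labels and super-genre names are taken from Table 5, while the function names (`match_genres`, `super_genre`) and the use of shell-style patterns from the standard `fnmatch` module are assumptions made for this example.

```python
from fnmatch import fnmatchcase

# A handful of genre labels copied from Table 5 (not the full set of 70).
GENRES = [
    "S_lect_soc_science",
    "S_lect_nat_science",
    "S_conv",
    "W_ac_soc_science",
    "W_ac_medicine",
    "W_non_ac_soc_science",
    "W_fict_prose",
]

# Super-genres from Table 5, expressed as wildcard patterns over the labels.
# Writing them as patterns is an assumption of this sketch.
SUPER_GENRES = {
    "S_brdcast_*": "Broadcast",
    "S_interview*": "Interviews",
    "S_lect_*": "Lectures",
    "S_speech_*": "Speeches",
    "W_ac_*": "Academic prose",
    "W_essay_*": "Non-printed essays",
    "W_fict_*": "Fiction",
    "W_letters_*": "Letters",
    "W_newsp_brdsht_nat_*": "Broadsheet national newspapers",
    "W_newsp_other_*": "Regional & local newspapers",
    "W_newsp_tabloid": "Tabloid newspapers",
    "W_non_ac_*": "Non-academic prose (non-fiction)",
}


def match_genres(pattern, genres=GENRES):
    """Return the labels that match a shell-style wildcard pattern."""
    return [g for g in genres if fnmatchcase(g, pattern)]


def super_genre(label):
    """Return the super-genre for a label, or None if Table 5 lists none."""
    for pattern, name in SUPER_GENRES.items():
        if fnmatchcase(label, pattern):
            return name
    return None


print(match_genres("W_ac*"))          # all academic prose labels in the sample
print(match_genres("*_soc_science"))  # social science, spoken or written
print(super_genre("S_lect_nat_science"))
```

Labels for which Table 5 gives no super-genre, such as "S_conv", simply return `None` in this sketch.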
Some genres are deliberately broad because they can be easily sub-divided using other fields. For example, "institutional documents" includes government publications (including "low-brow" informational booklets and leaflets/brochures), company annual reports, and university calendars and prospectuses. However, these texts can be fairly easily separated out using "Medium," "Audience level," or "Keywords."

The "non-academic" genres relate to written texts (mainly books) sometimes called "non-fiction" which have subject matters belonging to one of the disciplines listed above. They are usually texts written for a general audience, or "popularisations" of academic material, and are thus distinguished from texts in the parallel academic genres (which are targeted at university-level audiences, insofar as this can be determined). In deciding whether a text was academic or not, a variety of cues was used:
(a) the "audience level (of difficulty)" estimated by the BNC compilers (coded in the file headers);
(b) whether COPAC lists the book as being in the "short loan" collections of British universities (this works in one direction only: absence is not indicative of a work not being academic);
(c) the publisher and publication series (academic publishers form a small and recognisable set, and some books have academic series titles, which help to place them in context).

The spoken "lecture" genres in the Index refer only to university lectures. Thus, many "A"-level or non-university lectures are classified as "S_speech_unscripted." Similarly, "S_tutorial" refers only to university-level tutorials or classroom "seminars." Other non-tertiary-level or home tutorial sessions are classified under "S_classroom."

Genre labels are deliberately non-overlapping for spoken and written texts.
For example, parliamentary speeches audio-transcribed by the BNC transcribers are labelled "S_parliament" for the spoken corpus, whereas the parallel, official/published version is labelled "W_hansard" for the written corpus. Also, for spoken texts, the "leftover" files (which do not really belong to any of the other spoken genres used in this scheme, e.g., baptism ceremony, auctions, air-traffic control discourse, etc.) are labelled as "S_unclassified," whereas leftover written files are labelled "W_misc."

As mentioned in the first part of this paper, deciding what a coherent genre or sub-genre is can be far from easy in practice, as (sub-)genres can be endlessly multiplied or sub-divided quite easily. Moreover, the classificatory decisions of corpus compilers may not necessarily be congruent with those of researchers. For example, what is considered "applied science"? In the present scheme, "applied science" excludes medicine (which is instead placed in a category of its own), engineering (which is put under "technology"), and computer science (also under "technology"). For the purposes of the BNC Index, a particular "level of delicacy" has been decided on for the genre scheme, based on categories already in use in existing corpora and in the research literature. Users may further sub-divide or collapse/combine genres as they see fit. The present scheme is only an aid; it helps to narrow down the scope of any subcorpus building task.

In this connection, it should be noted that due to the way the material was recorded and collated, many of the spoken files (especially "conversation") are less well-defined than the written ones because they are made up of different task and goal types, as well as varying topics and participants (e.g., a single "conversation" file can contain casual talk between both equals and unequals, and "lecture" files often contain casual preambles and concluding remarks in addition to the actual lectures themselves).
Researchers wanting discoursally well-defined and homogeneous texts will have to sub-divide texts themselves.

If the distribution of linguistic features among "genres" is important to a particular piece of research, then obviously the research can be affected or compromised by the definition/constitution of the "genres" in the first place. For this reason, users of the BNC Index are advised to read the notes/documentation given here, and to be clear what the various domain and genre labels mean.26 To illustrate: the BNC compilers have classified some texts into the "natural/pure sciences" domain (e.g., text CNA, which is taken from the British Medical Journal), which I would consider as belonging to "applied science" or else simply "medicine" as a separate category. On the other hand, the BNC compilers appear to have a rather loose definition of "applied science." Anything which is not directly classifiable or recognisable as being purely about theoretical physics, chemistry, biology or medicine is apparently considered "applied." For example, consider:

| Text ID | Medium | Domain | Bibliographical Details |
| --- | --- | --- | --- |
| FYX | book | W_app_science | Black holes and baby universes. Hawking, Stephen W. London: Bantam (Corgi), 1993, pp. 1-139. 1927 s-units. |
| AMS | book | W_app_science | Global ecology. Tudge, Colin. London: Natural History Museum Pub, 1991, pp. 1-98. 1816 s-units. |
| AC9 | book | W_app_science | Science and the past. London: British Museum Press, 1991, pp. ??. 1696 s-units. |

The first book is a popularisation by Stephen Hawking and is an application of physics to the study of the universe or outer space. In the BNC Index genre scheme, I would consider this to be part of the "non-academic natural sciences" genre (rather than "applied science"). It is a similar situation with the second and third books (which concern ecology and archaeological/historical work, respectively).
It is true that these are also about applying scientific ideas in some way, but they do not quite fit in with the more common understanding of "applied science." In the present scheme, text AMS would be under "academic: natural science," and AC9 under "non-academic: humanities."

As another example of the classificatory system used here, consider the case of linguistics. Some linguists, including myself, would consider our discipline to be a social science (although others would place us in the humanities). In any case, consider the way the following BNC texts were (inconsistently) classified by the compilers:

| Text ID | Medium | Domain | Details |
| --- | --- | --- | --- |
| B2X | periodical | W_app_science | Journal of semantics. Oxford: OUP, 1990, pp. 321-452. 847 s-units. |
| CGF | book | W_arts | Feminism and linguistic theory. Cameron, Deborah. Basingstoke: Macmillan Pubs Ltd, 1992, pp. 36-128. 1581 s-units. |
| EES | m_unpub | W_app_science | Large vocabulary semantic analysis for text recognition. Rose, Tony Gerard. u.p., n.d., pp. ??. 2109 s-units. |
| FAC | book | W_soc_science | Lexical semantics. Cruse, D A. Cambridge: CUP, 1991, pp. 1-124. 2261 s-units. |
| FAD | book | W_soc_science | Linguistic variation and change. Milroy, J. Oxford: Blackwell, 1992, pp. 48-160. 1339 s-units. |

It may be the case that the actual content/topic of these linguistics-related texts makes them seem less like social science texts than arts or applied science texts (e.g., text EES is a dissertation on computer handwriting recognition by a student from a department of computing). But if so, what does that say about the general public's understanding of domain labels like "linguistics" and "social sciences"? These are important questions when one is seeking to draw conclusions about the distribution of linguistic features found in particular genres. For the present purposes, therefore, one particular stand has been taken on how to classify texts, and readers should bear this in mind.
(In the case of the above example, all were classified as "academic: social science" except EES, which was put under "academic: technology and engineering.")

What About Library Classificatory Codes?

At this point, some people may be wondering if the classification systems used by libraries might be of use in helping us determine the proper genre labels. Atkins et al. (1992, p. 8) note in their discussion of the corpus attribute topic that "It is necessary to draw up a list of major topics and subtopics in the literature. Library science provides a number of approaches to topic classification." This is an area that is beyond my expertise and the scope of this article, but I will make a few brief comments here.27

Several library classification/cataloguing systems are in use all over the world. They are all principally about subject areas (or topic) rather than about genre, although the two are, of course, related in many cases. A familiar scheme, the Dewey Decimal Classification system, is shown in Table 7.

Table 7. Dewey Decimal Classification System

| Classmark | [Broad Area] & Subject Areas |
| --- | --- |
| Class 0 | [GENERALITIES] Generalities; Catalogues; Newspapers; Computing |
| Class 1 | [PHILOSOPHY & PSYCHOLOGY] Philosophy; Psychology; The Mind |
| Class 2 | [RELIGION] Religion |
| Class 3 | [SOCIAL SCIENCES] Social Sciences; Law; Government; Society; Commerce; Education |
| Class 4 | [LANGUAGE] Linguistics; Scientific Study of Language |
| Class 5 | [NATURAL SCIENCES & MATHEMATICS] Pure Sciences; Mathematics; Physics; Chemistry; Biology |
| Class 6 | [TECHNOLOGY (APPLIED SCIENCES)] Applied Sciences; Engineering; Medicine; Manufacturing |
| Class 7 | [THE ARTS] The Arts (Music, Drama etc.); Recreations; Hobbies |
| Class 8 | [LITERATURE & RHETORIC] Literature |
| Class 9 | [GEOGRAPHY & HISTORY] Geography; History; Information about localities |

In addition to the Classmark, however, library materials are also given keywords which generally consist of Library of Congress subject headings (usually related to topic[s]). These are very useful when it comes to finding out what a text is about (or, in the case of fiction texts, what a text is).28 In the case of literary texts, actual genre labels are sometimes given as keywords, and a frighteningly large number of subgenres have been identified by the British Library cataloguers. These may prove useful to those who desire detailed sub-genre information on literary texts. A few examples will suffice here: Adventure stories, Detective and mystery stories, Picaresque literature, Robinsonades, Romantic suspense novels, Sea stories, Spy stories, Thrillers, Allegories, Didactic fiction, Fables, Parables, Alternative histories, Dystopias, Bildungsromane, Arthurian romances, Autobiographical fiction, Historical fiction, Satire, Christmas stories, Medical novels, Folklore, Domestic fiction, Ghost stories, Horror tales, Magic realism, Occult fiction, Feminist fiction, and Tall tales.

In addition to these fascinatingly categorised sub-genres,29 the library also includes "form headings," which are meant to "define a type of fiction in terms of specific presentation, provenance, intended audience, form of publication."30 Examples include Young adult fiction, Children's stories, Readers (Elementary), Plot-your-own stories, Diary fiction, Epistolary fiction, Movie novels, Scented books, Glow-in-the-dark books, Toy and movable books, Graphic novels, Radio and television novels, Sound effects books, Musical books, and Upside-down books.
As can be seen, therefore, library catalogues are a potentially valuable source of information as far as the genre classification of fiction texts and the identification of subject topics in non-fiction texts are concerned. Such information was, in fact, used in the process of creating the BNC Index, during the manual stage of checking and correcting the initial genre classifications I had made.

Using the BNC Index

The BNC Index will be distributed in the Microsoft Excel® spreadsheet format as well as in a tab-delimited format (it will also be incorporated into two custom-built, user-friendly programs: see below).31 On a practical note, the advantage of using the Excel format is that there is a quick way of displaying only the texts which match your chosen criteria through the use of the relatively user-friendly "Autofilter" function (under the "Data" menu in the program, choose "Filter" and then "Autofilter"). With the Autofilter switched on, the top row of every field (column) will have a drop-list which can be used to instantly filter down to the texts you want displayed (clicking on the drop-list button reveals all the possible values for that field (e.g., genre), and you just select the one you want). Fields are combinable, so you can, for example, first restrict the display to only "social science" texts under domain, then further restrict this to only "periodicals" under medium, and end up with social science periodicals. It is also possible to make more advanced searches, by activating the "Custom" filter dialogue box from the relevant drop-list. This will allow you to filter the fields using wildcards.

One caveat needs to be issued to users, however: They should not rely entirely on the genre labels, but should also check the "Alternative Notes" column and scan/browse the files, too.
For example, texts labelled "S_brdcast_discussion" also contain news reportage (in between the broadcast talk shows/programmes). This is unavoidable, since some BNC files combine genres and sub-genres and can only be labelled in terms of the majority type. Some of the BNC-supplied fields are also not entirely accurate. Many of the files which are coded as "monologue" (under the Interaction Type column), for example, actually include some dialogue as well (i.e., they are mostly monologue, but not exclusively). A stand-alone Windows® program, called BNC Indexer®, has been developed by Antonio Moreno Ortiz using the information contained in my spreadsheet.32 A web-based facility, BNC Web Indexer, is also being developed at Lancaster, which does essentially the same thing.33 Both programs are similar in layout and function. They are much easier to use than the Excel spreadsheet since they do not require any knowledge of spreadsheet/database programs and have very simple, intuitive interfaces (perfect for classroom situations). All the information fields (domain, genre, audience age, author sex, etc.) and their values are displayed on screen and users simply select the values they want to use and then press a button to execute the query. A results panel shows all the texts which match the filtering criteria, along with bibliographical and other information. (With BNC Indexer, individual texts can also be deselected from the output list if so desired, and can be browsed first by double-clicking on the relevant line.) Output file lists containing the file IDs of the BNC files which matched the criteria can be generated and fed into concordancers such as WordSmith or MonoConc,34 which can use a list of filenames to specify a subcorpus to which future queries are to be restricted. 
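The same kind of filtering can also be approximated outside a spreadsheet or the Indexer programs. The sketch below, in Python, is offered only as an illustration under stated assumptions: it reads a tab-delimited table, keeps the rows whose genre matches a wildcard (optionally restricted by medium), and writes the matching text IDs one per line. The column names ("Text ID", "Medium", "Genre"), the function names, and the one-ID-per-line output are hypothetical; the actual layout of the distributed Index and the list format expected by a given concordancer should be checked before use. The sample rows use text IDs, media, and genre assignments quoted earlier in this article.

```python
import csv
import io
from fnmatch import fnmatchcase

# Stand-in for a tab-delimited export of the Index. The column names are
# assumptions; the rows reuse text IDs, media and genre assignments that
# are discussed in the article.
SAMPLE_INDEX = (
    "Text ID\tMedium\tGenre\n"
    "B2X\tperiodical\tW_ac_soc_science\n"
    "CGF\tbook\tW_ac_soc_science\n"
    "EES\tm_unpub\tW_ac_tech_engin\n"
    "FAC\tbook\tW_ac_soc_science\n"
    "FYX\tbook\tW_non_ac_nat_science\n"
    "AC9\tbook\tW_non_ac_humanities_arts\n"
)


def select_ids(index_file, genre_pattern, medium=None):
    """Return text IDs whose genre matches the wildcard (and medium, if given)."""
    reader = csv.DictReader(index_file, delimiter="\t")
    return [
        row["Text ID"]
        for row in reader
        if fnmatchcase(row["Genre"], genre_pattern)
        and (medium is None or row["Medium"] == medium)
    ]


def write_file_list(ids, out):
    """Write one text ID per line (this list format is an assumption)."""
    for text_id in ids:
        out.write(text_id + "\n")


# Academic texts published as books, following the "W_ac*" wildcard idea.
ids = select_ids(io.StringIO(SAMPLE_INDEX), "W_ac*", medium="book")
listing = io.StringIO()
write_file_list(ids, listing)
print(listing.getvalue(), end="")
```

With a real export, `io.StringIO(SAMPLE_INDEX)` would be replaced by an opened file, and the resulting list could still be edited by hand before being used to define a subcorpus.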
Note that with both BNC Indexer and BNC Web Indexer, individual files can always be deleted from the output list if so desired, so users do not have to accept the classification decisions wholesale but can vet individual texts before allowing them into a subcorpus.

It is beyond the scope of the present article to give more practical instructions or examples on how to use the BNC Index spreadsheet or the Indexer programs. Users will, in any case, surely find their own favourite ways of doing things, or may visit the relevant web sites for further information.

THE USES OF GENRE

In this paper, I have examined the different usages of the terms genre, text type, register, domain, style, and so forth. Which of these concepts is most useful for researchers, or for teachers to use in the context of classroom concordancing? I suggest that it is fruitful to start by looking at genres (categories of texts), and end up by generalising (through induction) about the existence of registers (linguistic characteristics) or even "text types" in Biber's sense (categories of texts empirically based on linguistic characteristics). The work by Carne (1996), Cope & Kalantzis (1993), Flowerdew (1993), Hopkins & Dudley-Evans (1988), Hyland (1996), Lee (in press), McCarthy (1998a, 1998b), Thompson (in press), and Tribble (1998, 2000), to name but a few, shows how a genre-based approach to analysing texts can yield interesting linguistic insights and may be pedagogically rewarding as well. Thompson's paper, for example, shows how genre-based cross-linguistic analyses of travel brochures and job advertisements can reveal subtle, linguistically-coded differences in culture and point of view. Such genre analyses of relatively small, focussed and manageable sets of texts are now possible with the help of the BNC Index, opening up a rich resource for all kinds of learning and research activities.
By searching for keywords in the various database fields, teachers and researchers can now quickly find even such rare sub-genres as postcards, lecture notes, shopping lists and school essays ("rare" in the sense that they were not included in previous-generation general corpora and are hard to get hold of in machine-readable format even nowadays).

The personal BNC Index project described here is an attempt at classifying the corpus texts into genres or super-genres, and putting this and other types of information about the texts into a single, information-rich, user-friendly resource. This Index may be used to navigate through the mass of texts available. Users can then see at once how many texts there are that match certain criteria, and the total number of words they constitute. In this way, sub-corpora can then be easily created for specialised research or teaching/learning activities (e.g., it is now easy to retrieve BNC texts for ESP lessons to do with law, medicine, physics, engineering, computing, etc.).

Ultimately, one would wish that a deeper understanding of genres (their forms, structures, patterns) would be a "transformative" exercise for all investigators. As Cranny-Francis (1993) says,

Genre is a category which enables the individual to construct critical texts; by manipulating genre conventions to produce texts which engender [critical analysis.] It also enables, therefore, the construction of a new, different consciousness … A concept of genre allows the critic or analyst to explore [the] complex relationships in which a text is involved, relationships which ultimately relate back to what a text means. This is because what a text says and how it says it cannot be separated; this is fundamental to our notion of genre.
Because of this, genre provides the link between text and context; between the formal and semantic properties of texts; between the text and the intertextual, disciplinary and technological practices in which it is embedded. (pp. 111-113)

I hope that the disparate users and potential users of the BNC, whether researchers, teachers or students, will find the genre-enhanced BNC Index useful for all kinds of linguistic enquiry, and that some of the above transformative goals will be realised for them.

APPENDIX A

SPOKEN BNC Sampler: Missing or Unrepresentative Genres
• Consultations: medical (none)
• Consultations: legal (none)
• Classroom discourse (only 3 texts)
• Public debates (only 3 texts)
• Job interviews (none)
• Parliamentary debates (none)
• News broadcasts (none)
• Legal presentations (there are 2 legal cross-examinations, but no presentations, i.e., monologues)
• University lectures (none)
• Telephone conversations (no pure telephone conversations in the BNC as a whole)
• Sermons (only 1 text)
• Live sports discussions (none)
• TV/radio discussions (only 4 texts)
• TV documentaries (only 2 texts)

WRITTEN BNC Sampler: Missing or Unrepresentative Genres
• Academic prose: humanities (none)
• Academic prose: medicine (none)
• Academic prose: politics, law and education (only 2 texts on law, none on politics or education)
• Academic prose: natural sciences (nothing on chemistry, only 1 on biology & 3 on physics)
• Academic prose: social sciences (nothing on the core subject areas of sociology or social work, nor on linguistics, which is arguably a social science, even though it is often treated as a humanities subject)
• Academic prose: technology & engineering (nothing on engineering)
• Administrative prose (only 1 text)
• Advertisements (none)
• Broadsheets: the only broadsheet material included consisted entirely of
foreign news, and only from the Guardian.
• Broadsheets: sports news (none)
• Broadsheets: editorials and letters (none)
• Broadsheets: society/cultural news (none)
• Broadsheets: business & money news (none)
• Broadsheets: reviews (none)
• Biographies (none)
• E-mail discussions (none)
• Essays: university (only 1 text)
• Essays: school (none)
• Fiction: Drama (only 1 text)
• Fiction: Poetry (only 2 texts)
• Fiction: Prose (insufficient texts, and only 1 short story)
• Parliamentary proceedings/Hansard (none)
• Instructional texts (none)
• Personal letters (none)
• Professional letters (none)
• News scripts (only 1 radio sports news script)
• Non-academic: humanities (only 2 texts)
• Non-academic: medicine (none)
• Non-academic: pure sciences (none)
• Non-academic: social sciences (2 rather odd texts, and 1 which possibly could be non-academic)
• Non-academic pure science material (i.e. popularisations of science texts: there were none of these in the Sampler)
• News scripts (classified as 'written-to-be-spoken' in the main BNC. None included in the Sampler)
• Official documents (only 1 text)
• Tabloid newspapers (only Today and East Anglian Daily Times, the latter of which is not really a tabloid, but a regional newspaper)

APPENDIX B

Information Fields and Possible Values in the BNC Index (the abbreviations/codes are in bold)

| Field | Possible Values |
| --- | --- |
| Medium | [Written texts only] book, m_pub (miscellaneous, published), m_unpub (miscellaneous unpublished), periodical (magazines, journals, etc.), to_be_spoken (written-to-be-spoken) |
| Domain | S_cg_business (context-governed, business), S_cg_education (c-g, educational), S_cg_leisure (c-g, leisure), S_cg_public (c-g, public/institutional), S_Dem_AB/C1/C2/DE/Unc (spoken demographic classes for the casual conversation files; 'Unc' = 'unclassified'), W_app_science (applied science), W_arts, W_belief_thought (belief & thought), W_commerce (commerce & finance), W_imaginative (imaginative/creative), W_leisure (leisure), W_nat_science (natural sciences), W_soc_science (social sciences), W_world_affairs (world affairs). |
| Genre (70 in total) | [Spoken texts, 24 genres]: S_brdcast_discussn (TV or radio discussions), S_brdcast_documentary (TV documentaries), S_brdcast_news (TV or radio news broadcasts), S_classroom (non-tertiary classroom discourse), S_consult (mainly medical & legal consultations), S_conv (face-to-face spontaneous conversations), S_courtroom (legal presentations or debates), S_demonstratn ('live' demonstrations), S_interview (job interviews & other types), S_interview_oral_history (oral history interviews/narratives, some broadcast), S_lect_commerce (lectures on economics, commerce & finance), S_lect_humanities_arts (lectures on humanities and arts subjects), S_lect_nat_science (lectures on the natural sciences), S_lect_polit_law_edu (lectures on politics, law or education), S_lect_soc_science (lectures on the social & behavioural sciences), S_meeting (business or committee meetings), S_parliament (BNC-transcribed parliamentary speeches), S_pub_debate (public debates, discussions, meetings), S_sermon (religious sermons), S_speech_scripted (planned speeches), S_speech_unscripted (more or less unprepared speeches), S_sportslive ('live' sports commentaries and discussions), S_tutorial (university-level tutorials), S_unclassified (miscellaneous spoken genres). |
| | [Written texts, 46 genres]: W_ac_humanities_arts (academic prose: humanities), W_ac_medicine (academic prose: medicine), W_ac_nat_science (academic prose: natural sciences), W_ac_polit_law_edu (academic prose: politics, law, education), W_ac_soc_science (academic prose: social & behavioural sciences), W_ac_tech_engin (academic prose: technology, computing, engineering), W_admin (administrative and regulatory texts, in-house use), W_advert (print advertisements), W_biography (biographies/autobiographies), W_commerce (commerce & finance, economics), W_email (e-mail sports discussion list), W_essay_school (school essays), W_essay_univ (university essays), W_fict_drama, W_fict_poetry, W_fict_prose (drama, poetry and novels), W_hansard (Hansard/parliamentary proceedings), W_institut_doc (official/governmental documents/leaflets, company annual reports, etc.; excludes Hansard), W_instructional (instructional texts/DIY), W_letters_personal, W_letters_prof (personal and professional/business letters), W_misc (miscellaneous texts), W_news_script (TV autocue data), W_newsp_brdsht_nat_arts (broadsheet national newspapers: arts/cultural material), W_newsp_brdsht_nat_commerce (broadsheet national newspapers: commerce & finance), W_newsp_brdsht_nat_editorial (broadsheet national newspapers: personal & institutional editorials, & letters-to-the-editor), W_newsp_brdsht_nat_misc (broadsheet national newspapers: miscellaneous material), W_newsp_brdsht_nat_report (broadsheet national newspapers: home & foreign news reportage), W_newsp_brdsht_nat_science (broadsheet national newspapers: science material), W_newsp_brdsht_nat_social (broadsheet national newspapers: material on lifestyle, leisure, belief & thought), W_newsp_brdsht_nat_sports (broadsheet national newspapers: sports material), W_newsp_other_arts (regional and local newspapers), W_newsp_other_commerce, W_newsp_other_report, W_newsp_other_science, W_newsp_other_social, W_newsp_other_sports, W_newsp_tabloid (tabloid newspapers), W_non_ac_humanities_arts (non-academic/non-fiction: humanities), W_non_ac_medicine (non-academic: medical/health matters), W_non_ac_nat_science (non-academic: natural sciences), W_non_ac_polit_law_edu (non-academic: politics, law, education), W_non_ac_soc_science (non-academic: social & behavioural sciences), W_non_ac_tech_engin (non-academic: technology, computing, engineering), W_pop_lore (popular magazines), W_religion (religious texts, excluding philosophy). |
| Mode | W (written), S (spoken) |
| Author age | 0-14 yrs (band 1), 15-24 yrs (band 2), 25-34 yrs (band 3), 35-44 yrs (band 4), 45-59 yrs (band 5), 60+ yrs (band 6), --- (unclassified) |
| Author sex | Male, Female, Mixed, Unknown, --- (not applicable/available) |
| Author type | Corporate, Multiple, Sole, Unknown/unclassified |
| Audience age | child, teen, adult, --- (unclassified) |
| Audience sex | male, female, mixed, --- (unclassified) |
| Audience level | low (level 1), medium (level 2), high (level 3), --- (unclassified) |
| Sampling | whole text (whl), beginning sample (beg), middle sample (mid), end sample (end), composite (cmp), unknown/not applicable (--). |
| Circulation Status | (formerly "reception status"): Low, Medium, High (blank for unclassified texts) |

NOTES

1. In contrast, Nuyts (1988) uses "text type" in a rather idiosyncratic way to mean "a variety of written text" (as opposed to "conversation type" for spoken texts). Many other people similarly use "text type" in a rather loose way to mean "register" or "genre."
2. EAGLES is the Expert Advisory Group on Language Engineering Standards, an initiative set up by the European Union to create common standards for research and development in speech and natural language processing. At present, most EAGLES documents take the form of preliminary guidelines from which it is hoped that standards will later emerge.
3.
In Biber's (1989) article on text typology, the nature of his "internal criteria" is more clearly shown. His "text types" are groupings of texts based on statistical clustering procedures which make use of co-occurrence patterns of surface-level linguistic features.
4. Wikberg (1992, p. 248) calls these rhetorical types "discourse categories" (German Texttyp), as opposed to "text types" (German Textsorte) which is equivalent to what I am here calling genres.
5. The GeM project at Stirling University illustrates an interesting new usage of genre. As it says on their Web site, "The GeM project analyses expert knowledge of page design and layout to see how visual resources are used in the creation of documents, both printed and electronic. The genre of a page -- whether it's an encyclopaedia entry, a set of instructions, or a Web page -- plays a central role in determining what graphical devices are chosen and how they are employed …. The overall aim of the project is to deliver a model of genre [italics added], layout, and their relationship to communicative purpose for the purposes of automatic generation of possible layouts across a range of document types, paper and electronic."
6. This diagram is from Martin (in press), but a similar one may be found in Eggins & Martin (1997, p. 243).
7. On a more speculative note, we could perhaps borrow from the tagmemic/particle physics perspective and talk in terms of particles (registers), waves (styles) and fields (genres). (Mike Hoey, personal communication.)
8. Martin (1993, 121) uses the term "macro-genre" to mean roughly the same thing.
9. Also, face-to-face conversations do not, arguably, form a proper genre as such (cf. Swales, 1990). However, for many research purposes, they form a coherent, useful super-genre.
10. Perhaps "religion" could also be considered a very broad content or topic label (?).
In any case, this exceptional category apparently came about due to the unique nature of the texts: the corpus compilers note that the texts could "embrace any of the stylistic characteristics of [several other LOB categories]," yet they all belonged together in some sense. All "committed religious writing" was therefore put together under "Religion" (cf. Johansson, Leech, & Goodluck, 1978, 16).
11. As the EAGLES (1996) authors say, where there is a division into "factual" (informative) vs. "fictional" (imaginative), then "to avoid controversy, religious works are given a separate category of their own" (p. 8).
12. Available on the Web at ftp://ftp.itri.bton.ac.uk/pub/bnc/bib-dbase. Titles of files in this resource are truncated to the first 80 characters, which limits its usefulness for some purposes.
13. The quote also contains an example of the term text types being used in a non-technical/loose fashion to mean "types/varieties of text."
14. Kilgarriff's list only includes the first 80 characters or so of the title of each file, which means some titles are truncated (thus no good for searching by), and author names (for the written texts) are not included.
15. COPAC is an on-line system for unified access to the (combined) catalogues of some of the largest university research libraries in the UK and Ireland. Keywords were manually copied from the Web catalogue entries and put into a separate column in the BNC Index to allow researchers to search by proper library keywords in addition to the keywords provided by the BNC compilers. These keywords will greatly facilitate the identification of sub-genres, (sub-)topics, etc., by people who wish to have finer sub-classifications for specific research purposes.
16. For an explanation of why only non-fiction works are given keywords, see note 28.
17. Note that for the demographic files (conversations) the Keywords field is empty for almost all the files.
18.
The somewhat confusing term reception status is used in the BNC Users' Reference Guide instead of circulation status. Since it refers to the size of the readership or the circulation level (not the social status of the text), I have changed the label to reflect this. Circulation status should be used with caution, because it is relative to genre: A newspaper with "low" reception status may still have a lot more readers than a "medium-reception" book of poetry or office memo. The field (Target) Audience level, on the other hand, is an estimate (by the compilers) of the level of difficulty of the text, or the amount of background knowledge of its subject matter which is assumed.
19. Note that Genre classifications (assigned by me) do not always agree with the Domain classifications of the BNC compilers (i.e., the official domain classifications as given in the standard distribution of the corpus).
20. This follows the new 4-way classification scheme employed in the BNC World Edition: alltim0 (--[unclassified]); alltim1 (1960-1974); alltim2 (1975-1984); alltim3 (1985-1994).
21. Using "audience level=high" will roughly filter out introductory textbooks and texts written for both an academic and a more general audience.
22. Some of the genre names in the actual spreadsheet are further abbreviated for practical reasons.
23. Note that, in addition, there are four BNC files (EUY, HD6, KA2, KAV) which contain a roughly even mix of poetry and prose. These have been placed under the "W_misc" genre.
24. The LOB corpus already has, of course, a modern-day correlative: the FLOB (Freiburg LOB) corpus. My categorisations will allow the BNC to also be used in comparative studies using these corpora.
25. People who disagree with these classifications may use the "Keywords" and "Title" fields to find the relevant files and re-classify them as desired.
26.
The domain labels in the BNC Index are largely unchanged (i.e., they reflect the decisions of the BNC compilers). Some egregious errors were corrected, however, and reported to the BNC project for fixing in the new release, BNC World Edition.
27. The British Library Web site (http://www.bl.uk) offers some detailed information and links.
28. A British Library "Fiction Indexing Policy" document states, "When indexing non-fiction it is right to attempt to express what the work as a whole is about, since it is usual for non-fiction to focus on one or more specific topics. By contrast, a work of fiction is rarely 'about' a topic at all. Instead, most works of fiction contain within them subjects as themes or settings. What they are 'about' is conveyed in the story as a whole. It is only themes, settings and characters which can be picked out easily by means of subject headings" (see http://www.bl.uk/services/bsds/nbs/marc/655polc.html).
29. As the EAGLES (1996) authors further point out, there are "alarming possibilities of double classification [i.e., mixed genres] -- spy thriller, historical romance, etc."
30. From the document at http://www.bl.uk/services/bsds/nbs/marc/655list2.html, which also gives a full listing of the literary sub-genres identified by the British Library.
31. The BNC Index spreadsheet, when ready, will be distributed initially at http://members.xoom.com/davidlee00/corpus_resources.htm. Suggestions for hosting on other sites are welcome.
32. Available at http://personal5.iddeo.es/tone/BNCIndexer. It is priced at 50 Euros for either an individual or institutional licence (up to 15 users).
33. BNC Web Indexer is the result of a collaboration between Paul Rayson (UCREL, Lancaster University) and myself. The URL will be announced on the CORPORA and CLLT (Corpus Linguistics and Language Teaching) mailing lists when available.
34.
Or using the Web-based concordancer for the BNC developed at Zürich, BNCweb, at http://escorp.unizh.ch (restricted usage). The new version of SARA developed for the BNC World Edition is also expected to have more sophisticated sub-corpus querying facilities.
ABOUT THE AUTHOR
David YW Lee recently completed his doctoral studies at Lancaster University and is currently a visiting researcher and part-time tutor there. His PhD research involved applying Douglas Biber's multidimensional (MD) methodology to fresh spoken and written data from the British National Corpus (BNC) and a consequent critique of that factor-analysis-based methodology. At present, he is working on publishing his findings as a book, and is writing various articles for journals. E-mail: david_lee00@hotmail.com
REFERENCES
Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), 1-16.
Bhatia, V. (1993). Analysing genre: Language use in professional settings. London: Longman.
Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University Press.
Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3-43.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243-257.
Biber, D. (1994). An analytical framework for register studies. In D. Biber & E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 31-56). New York: Oxford University Press.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge, UK: Cambridge University Press.
Biber, D., & Finegan, E. (1986). An initial typology of text-types. In J. Aarts & W. Meijs (Eds.), Corpus linguistics II (pp. 19-46). Amsterdam: Rodopi.
Biber, D., & Finegan, E. (1989). Drift and the evolution of English style: A history of three genres. Language, 65, 487-517.
Burnard, L. (Ed.). (1995, April 25).
The British national corpus users reference guide (SGML version, First release with version 1.0 of BNC). Oxford, UK: Oxford University Computing Services.
Carne, C. (1996). Corpora, genre analysis and dissertation writing: An evaluation of the potential of corpus-based techniques in the study of academic writing. In S. Botley, J. Glass, T. McEnery, & A. Wilson (Eds.), Proceedings of teaching and language corpora 1996, UCREL Technical Papers Vol. 9 (pp. 127-137). Lancaster, UK: Lancaster University.
Cope, B., & Kalantzis, M. (1993). Introduction: How a genre approach to literacy can transform the way writing is taught. In B. Cope & M. Kalantzis (Eds.), The powers of literacy: A genre approach to teaching writing (pp. 1-21). London: Falmer Press.
Cope, B., & Kalantzis, M. (Eds.). (1993). The powers of literacy: A genre approach to teaching writing. London: Falmer Press.
Couture, B. (1986). Effective ideation in written text: A functional approach to clarity and exigence. In B. Couture (Ed.), Functional approaches to writing: Research perspectives (pp. 69-91). Norwood, NJ: Ablex.
Cranny-Francis, A. (1993). Genre and gender: Feminist subversion of genre fiction and its implications for cultural literacy. In B. Cope & M. Kalantzis (Eds.), The powers of literacy: A genre approach to teaching writing (pp. 116-136). London: Falmer Press.
Crombie, W. (1985). Discourse and language learning: A relational approach to syllabus design. Oxford, UK: Oxford University Press.
Crystal, D., & Davy, D. (1969). Investigating English style. London: Longman.
Crystal, D. (1991). A dictionary of linguistics and phonetics. Oxford, UK: Basil Blackwell.
Expert Advisory Group on Language Engineering Standards. (1996, June). Preliminary recommendations on text typology. EAGLES Document EAG-TCWG-TTYP/P. [Available at http://www.ilc.pi.cnr.it/EAGLES96/texttyp/texttyp.html]
Eggins, S., & Martin, J. R.
(1997). Genres and registers of discourse. In T. van Dijk (Ed.), Discourse as structure and process (pp. 230-256). London: Sage.
Faigley, L., & Meyer, P. (1983). Rhetorical theory and readers' classifications of text types. Text, 3, 305-325.
Fairclough, N. (1992). Discourse and social change. Cambridge, UK: Polity Press.
Fairclough, N. (2000). New labour, new language? London: Routledge.
Ferguson, C. (1994). Dialect, register and genre: Working assumptions about conventionalization. In D. Biber & E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 15-30). New York: Oxford University Press.
Finegan, E., & Biber, D. (1994). Register and social dialect variation: An integrated approach. In D. Biber & E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 315-347). New York: Oxford University Press.
Flowerdew, J. (1993). An educational or process approach to the teaching of professional genres. ELT Journal, 47(4), 305-316.
Grishman, R., & Kittredge, R. (Eds.). (1986). Analyzing language in restricted domains: Sublanguage description and processing. Hillsdale, NJ: Lawrence Erlbaum.
Halliday, M. A. K., & Hasan, R. (1985). Language, context and text: Aspects of language in a social-semiotic perspective. Oxford, UK: Oxford University Press.
Hammond, J., Burns, A., Joyce, H., Brosnan, D., & Gerot, L. (1992). English for social purposes: A handbook for teachers of adult literacy. Sydney: National Centre for English Language Teaching and Research, Macquarie University.
Hoey, M. (1983). On the surface of discourse. London: Allen and Unwin.
Hoey, M. (1986). Clause relations and the writer's communicative task. In B. Couture (Ed.), Functional approaches to writing: Research perspectives (pp. 120-141). Norwood, NJ: Ablex.
Hopkins, A., & Dudley-Evans, T. (1988). A genre-based investigation of the discussion sections in articles and dissertations. English for Specific Purposes, 7, 113-121.
Hyland, K. (1996).
Talking to the academy: Forms of hedging in scientific research articles. Written Communication, 13(2), 251-282.
Johansson, S., Leech, G., & Goodluck, H. (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Oslo: Department of English, University of Oslo.
Joos, M. (1961). The five clocks. New York: Harcourt Brace & World.
Kennedy, G. (1998). An introduction to corpus linguistics. London: Longman.
Kittredge, R., & Lehrberger, J. (Eds.). (1982). Sublanguage: Studies of language in restricted semantic domains. Berlin: Walter de Gruyter.
Kress, G. (1993). Genre as social process. In B. Cope & M. Kalantzis (Eds.), The powers of literacy: A genre approach to teaching writing (pp. 22-37). London: Falmer Press.
Kress, G., & Hodge, R. (1979). Language as ideology. London: Routledge & Kegan Paul.
Lee, D. Y. W. (2000). Modelling variation in spoken and written language: The multi-dimensional approach revisited. Unpublished doctoral dissertation, Lancaster University.
Lee, D. Y. W. (in press). Defining core vocabulary and tracking its distribution across spoken and written genres: Evidence of a gradience of variation from the British National Corpus. Journal of English Linguistics.
Martin, J. R. (in press). Cohesion and texture. Manuscript submitted for publication.
Martin, J. R. (1993). A contextual theory of language. In B. Cope & M. Kalantzis (Eds.), The powers of literacy: A genre approach to teaching writing (pp. 116-136). London: Falmer Press.
McCarthy, M. (1998a). Taming the spoken language: Genre theory and pedagogy. The Language Teacher, 22(9). Retrieved June 20, 2000 from the World Wide Web: http://langue.hyper.chubu.ac.jp/jalt/pub/tlt/98/sep/mccarthy.html.
McCarthy, M. (1998b). Spoken language and applied linguistics. Cambridge, UK: Cambridge University Press.
Meyer, B. (1975).
The organisation of prose and its effects on recall. New York: North Holland.
Nakamura, J. (1986). Classification of English texts by means of Hayashi's Quantification Method Type III. Journal of Cultural and Social Science, 21, 71-86.
Nakamura, J. (1987). Notes on the use of Hayashi's Quantification Method Type III for classifying English texts. Journal of Cultural and Social Science, 22, 127-145.
Nakamura, J. (1992). Hayashi's Quantification Method Type III: A tool for determining text typology in large corpora. An annex to a general report on annotation tools of the NERC Report. Unpublished manuscript.
Nakamura, J. (1993). Statistical methods and large corpora: A new tool for describing text types. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology: In honour of John Sinclair (pp. 293-312). London: John Benjamins.
Nuyts, J. (1988). IPrA survey of research in progress. Wilrijk, Belgium: International Pragmatics Association.
Paltridge, B. (1995). Working with genre: A pragmatic perspective. Journal of Pragmatics, 23, 393-406.
Paltridge, B. (1996). Genre, text type, and the language classroom. ELT Journal, 50(3), 237-243.
Paltridge, B. (1997). Genre, frames and writing in research settings. Amsterdam: John Benjamins.
Phillips, M. A. (1983). Lexical macrostructure in science text. Unpublished doctoral dissertation, University of Birmingham, UK.
Rosch, E. (1973a). On the internal structure of perceptual and semantic categories. In T. E. Moore (Ed.), Cognitive development and the acquisition of language (pp. 111-144). New York: Academic Press.
Rosch, E. (1973b). Natural categories. Cognitive Psychology, 4, 328-350.
Rosch, E. (1978). Principles of categorisation. In E. Rosch & B. Lloyd (Eds.), Cognition and categorisation. Hillsdale, NJ: Lawrence Erlbaum.
Sampson, J. (1997). "Genre," "style" and "register". Sources of confusion?
Revue Belge de Philologie et d'Histoire, 75(3), 699-708.
Steen, G. (1999). Genres of discourse and the definition of literature. Discourse Processes, 28, 109-120.
Stubbs, M. (1996). Text and corpus analysis: Computer assisted studies of language and culture. Oxford, UK: Blackwell.
Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge, UK: Cambridge University Press.
Taylor, J. R. (1989). Linguistic categorisation: Prototypes in linguistic theory. Oxford, UK: Clarendon.
Thompson, G. (in press). Corpus, comparison, culture: Doing the same things differently in different cultures. In M. Ghadessy, R. Roseberry, & A. Henry (Eds.), The use of small corpora in language teaching. Manuscript submitted for publication.
Tribble, C. (1998). Writing difficult texts. Unpublished doctoral dissertation, Lancaster University.
Tribble, C. (2000). Genres, keywords, teaching: Towards a pedagogic account of the language of Project Proposals. In L. Burnard & T. McEnery (Eds.), Rethinking language pedagogy from a corpus perspective: Papers from the third international conference on teaching and language corpora (Lodz Studies in Language; pp. 75-90). Hamburg: Peter Lang. Retrieved June 20, 2000 from the World Wide Web: http://ourworld.compuserve.com/homepages/Christopher_Tribble/Genre.htm.
van Dijk, T. (Ed.). (1985). Handbook of discourse analysis. London: Academic Press.
Wikberg, K. (1992). Discourse category and text type classification: Procedural discourse in the Brown and the LOB corpora. In G. Leitner (Ed.), New directions in English language corpora: Methodology, results, software developments (pp. 247-261). Berlin: Mouton de Gruyter.
Language Learning & Technology http://llt.msu.edu/vol5num3/aston/ September 2001, Vol. 5, Num. 3 pp.
73-76
TEXT CATEGORIES AND CORPUS USERS: A RESPONSE TO DAVID LEE
Guy Aston
University of Bologna, Italy
In designing any corpus, it is necessary to decide what types of texts to include, and how many of each type. (I use the term "text type" as a neutral one which does not imply any specific theoretical stance.) The British National Corpus (Burnard, 1995) made an initial division into written texts and spoken ones (i.e., transcripts of recordings), and within each of these macrocategories, employed further categorisations and subcategorisations. For the spoken component, a first distinction was between "demographic" (conversations: 153 texts) versus "context-governed" (speech recorded in particular types of setting: 757 texts), and the "context-governed" component was further divided according to the nature of the setting (educational/informative; business; public/institutional; leisure: from 131 to 262 texts in each), paralleled by a monologue/dialogue distinction (40%/60%). For the written component, two principal parallel categorisations were used: "domain" (i.e., subject matter, divided into nine classes, viz., imaginative; arts; belief and thought; commerce; leisure; natural science; applied science; social science; world affairs: from 146 to 527 texts in each) and "medium" (five classes, viz., book; periodical; miscellaneous published; miscellaneous unpublished; to-be-spoken: from 35 to 1,414 texts in each). All figures refer to the BNC World Edition (2001). Text categorisations, as Lee notes, are generally based on "external" criteria -- where/when the text was produced, by/for whom, what it is about -- rather than "internal" ones based on its linguistic characteristics. The categorisations used in corpus design tend to be broad rather than delicate, since what corpus designers want to do is to enable users to generalize about and compare different categories.
To generalize with any confidence, each category must contain a substantial number of different texts, so that no one text exerts an undue influence on that category (early corpora such as Brown and LOB, which were relatively small, got around this problem by including very short samples from a large number of texts); and each category must contain a wide variety of different texts, so that no one subcategory exerts an undue influence on that category as a whole (Biber, 1993): the greater the variance within a category, the more texts will be needed in order to document that variance. Thus, it may be decided to include roughly equal numbers of texts from different parts of the country, by authors of different sexes/ages or from different types of settings. Within the BNC "context-governed" component, for instance, the "educational/informative" category was designed to include lectures, talks, classroom interaction, and news commentaries, drawing these from different types of institutions in different areas and with a wide range of speakers and topics. Since corpora cannot be infinite, the delicacy of the categorisations to be employed is largely determined by practical considerations. The BNC, which contains just over 4,000 texts, uses a framework which guarantees at least 100 texts in most principal categories. You may or may not like the categories chosen, but the corpus arguably allows you to generalize about these categories -- about spoken and written texts, the nine different domains of written texts, the four different domains of "context-governed" spoken texts, and so forth -- with reasonable certainty that findings will not be unduly biased by any particular text or any particular subcategory of texts. These categories are indicated in the headers to individual texts as attributes of the <catRef> element, using which it is possible to restrict queries to a particular category or combination of categories.
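As a rough illustration of restricting a query to a combination of categories, the sketch below reads category codes from a <catRef> element in each text header and keeps only the texts that carry every requested code. It is not SARA's own mechanism. The header layout, the `target` attribute name, the text identifiers, and all codes other than alltim3 are assumptions made for the example, and the headers are assumed to be available as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Toy headers. Text IDs, the "target" attribute, and the domain/medium
# codes are hypothetical placeholders, not the real BNC values.
HEADERS = {
    "T01": '<header><catRef target="alltim3 domain_imaginative medium_book"/></header>',
    "T02": '<header><catRef target="alltim3 domain_leisure medium_periodical"/></header>',
    "T03": '<header><catRef target="alltim2 domain_imaginative medium_book"/></header>',
}


def categories(header_xml):
    """Return the set of category codes listed on any <catRef> in a header."""
    root = ET.fromstring(header_xml)
    codes = set()
    for ref in root.iter("catRef"):
        codes.update(ref.get("target", "").split())
    return codes


def select(headers, required):
    """Return sorted IDs of texts whose header carries every required code."""
    required = set(required)
    return sorted(tid for tid, xml in headers.items() if required <= categories(xml))


if __name__ == "__main__":
    # Imaginative books from any period, then only those coded alltim3.
    print(select(HEADERS, {"domain_imaginative", "medium_book"}))
    print(select(HEADERS, {"domain_imaginative", "medium_book", "alltim3"}))
```

Requiring a subset match means that adding a code can only narrow the selection, which mirrors the idea of combining categories to define a smaller subcorpus.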
A number of errors of categorisation in the first release of the BNC have been corrected in the World Edition (2001). Users may, however, want to employ different categorisations from those employed by the corpus designers. David Lee provides one such categorisation, and the latest version of the SARA software (SARA98; Dodd, 2000) allows users to create their own subcorpora from the full BNC using his, or other, categories (Aston, in press). Users should, however, be aware that such categories may be poorly represented in the corpus, both numerically and in terms of their variance. The more delicate the categorisation employed, the more likely it is that this will be the case (Sinclair, 1991) -- but even where a categorisation appears relatively broad, not all its members may be adequately documented. Thus Lee divides the BNC's imaginative written texts into novels, poems, and drama. However, there are only two texts in the BNC which fall into his drama category, so it would be pretty unwise to generalize about drama on their evidence. Why aren't there more? Some drama was included in the BNC in order to capture variance within the category of imaginative writing, but the quantity of drama is the result of decisions concerning the relative weight of drama in this category, just as the quantity of imaginative writing in the corpus is the result of decisions concerning the weight of imaginative writing in contemporary text production and reception as a whole. To include more drama would have either meant changing these design decisions or increasing the size of the corpus by an analogous factor.
All this means that if you want to generalize about contemporary British drama (or indeed about many of Lee's other text categories), you would do much better to compile your own specialized corpus (though you may want to compare your findings with the BNC in order to see whether the features you identify are specific to the text type in question). But you can't really complain about the BNC just because it doesn't contain more texts in a particular specialized category you happen to be interested in, whether this be e-mails, lectures, or business letters. That isn't what general mixed reference corpora are designed for, and you would clearly do better to start from a text archive instead, or from the Web. But isn't a categorisation like Lee's what many users would like, and shouldn't the BNC have used such a categorisation to determine its composition? The main problem with Lee's approach, based on what he considers "prototypical" genres, is that it does not consider either the weight of these genres in the culture (in particular their frequency of reception and production), or the variance to be found within them. Lee appears to think that the BNC really ought to have provided representative samples for all 70 of his mutually-exclusive categories. But in order to include a minimum of, say, 50 texts in each category, either the corpus would have to have been very much larger, or else it would have had to weight these categories more or less equally (70 x 50 = 3,500: the BNC contains just over 4,000 texts). Lee's three genres of imaginative writing (novels, poetry, and drama) would hardly seem to have the same frequency and variance within British culture, where much more fiction is read and published than poetry or drama, and, I suspect, of many more different kinds. So why should the corpus include the same amount of each? Or take prayers.
For some reason, prayers aren't one of Lee's genres, though I would have thought them as good a candidate for prototypical status as sermons, which are. There is only one text of prayers in the BNC, falling into the to-be-spoken written medium category (and into the belief and thought domain). The same to-be-spoken category, on the other hand, contains no fewer than 32 texts of television and radio news scripts. This disproportion seems fair enough when judged by production and reception standards -- news broadcasts play a much bigger part in British text production and reception than prayers do, alas. Yet, Lee's argument would suggest that they ought to have similar weighting, insofar as they have similar prototypical status (or else that the corpus should be much, much larger). Lee's criticisms seem particularly unwarranted as far as the BNC Sampler (1999) is concerned (for the record, this contains no prayers, only one drama text, and only one news script). The Sampler -- which, like sampler music CDs, was designed to give a "taste" of the full BNC rather than to mirror its composition in detail -- consists of 184 texts for a total of roughly 2 million words, half speech and half writing. Lee complains that many of his categories are totally absent from the Sampler. But with this total number of texts, there is no way in which the Sampler could have adequately documented 70 different categories while allowing reasonable generalizations at more macroscopic levels, such as speech versus writing. Would Lee really have wanted the number of university lectures on science in the Sampler to equal the number of casual conversations? Only, I think, if he were not interested in spoken texts in general, but particularly interested in science lectures, of which there would still not have been enough to say much about them.
A further problem with Lee's genre labels is that they may not match entire texts anyway. As he himself notes, virtually any single text may be analysed as composed of a number of different subtexts which can be assigned to different genres. For instance, there are 30 texts in the BNC consisting exclusively of poems, which Lee categorises as W_fict_poetry. However, we find much more poetry occurring in texts belonging to other categories (as quotations, or when the hero of a novel breaks forth into song, etc.), 3,048 poems in 410 texts overall. Lee's categorisation is not going to be of much help to the user who wants to study poetry using the BNC. Rather than just those texts classed by Lee as poetry, s/he would be better advised to choose all those parts of the corpus texts which are tagged as <poem> elements in the markup (an easy task using SARA; Aston & Burnard, 1998). With this last caveat in mind, where Lee does have a point is from what Gavioli (2001; Gavioli & Aston, 2001) calls an example rather than a sample perspective. Corpora like the BNC are designed to provide sample data from which to infer generalisations about the language as a whole, or about particular broad categories of texts, concerning frequencies of occurrence and co-occurrence (collocation, colligation, and so on). However, it is also possible to use corpora -- at one's peril -- as text archives (Atkins, Clear, & Ostler, 1992) from which to retrieve examples of a particular text type. If I am a teacher of religious education, and what I need for my lesson tomorrow is some prayers to use with my class -- why not look in the BNC? Since prayers are not a category used in the BNC text categorisation, to find candidate texts I will have to hope that either the text or its header (perhaps the text title, or its keywords) contains a form of the lemma prayer or a related word or phrase (perhaps Amen).
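The kind of hit-or-miss search just described can be sketched in a few lines. The example below is only an approximation under stated assumptions: the records, field names, and text identifiers are invented, and a regular expression over surface forms stands in for a proper lemma search of the kind a concordancer would offer.

```python
import re

# Crude stand-in for "a form of the lemma prayer or a related word":
# matches pray, prays, prayed, praying, prayer, prayers, and amen.
PRAYER_PATTERN = re.compile(r"\b(?:pray(?:s|ed|ing|ers?)?|amen)\b", re.IGNORECASE)

# Invented records; the field names are hypothetical, not the BNC Index columns.
TEXTS = [
    {"id": "T01", "title": "Prayers for morning assembly",
     "keywords": "religion; worship", "body": "Let us give thanks. Amen."},
    {"id": "T02", "title": "Evening news bulletin",
     "keywords": "broadcast news", "body": "Sea spray delayed the ferries."},
    {"id": "T03", "title": "A village diary",
     "keywords": "", "body": "She prayed for rain all summer."},
]


def candidates(texts, pattern=PRAYER_PATTERN):
    """Return (text id, matching field names) for each text with any match."""
    hits = []
    for text in texts:
        fields = [name for name in ("title", "keywords", "body")
                  if pattern.search(text[name])]
        if fields:
            hits.append((text["id"], fields))
    return hits


if __name__ == "__main__":
    for text_id, fields in candidates(TEXTS):
        print(text_id, fields)
```

The third record shows the weakness of the approach: a diary entry that merely mentions praying is returned alongside a genuine collection of prayers, so every candidate still has to be inspected by hand.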
A more detailed categorisation of the corpus texts, particularly one which uses prototypical "folk" genre labels, could be very useful as an aid to find examples of this kind. This could also be a useful approach when we want to investigate a particular "user category" of texts. Not, I repeat, in order to generalize about that category, since the corpus cannot be relied upon to document it adequately, but in order to find examples from which to generate hypotheses. As mentioned earlier, there are 32 texts in the BNC containing radio or television news scripts (W_news_script in Lee's taxonomy). Given their limited number, and the fact that they come from a limited range of sources (two broadcasting stations), it would clearly be unwise to generalize from these to the genre of broadcast news scripts tout court. What they may provide, however, is a source of hypotheses about this genre -- hypotheses which must clearly be tested against a different corpus, one which has been constructed to comprise an adequately-sized sample of texts of this type, and which satisfactorily covers the variance within this category. From an "example" perspective, the more descriptive categorizations that are provided within a corpus the better. For this reason, the incorporation of Lee's categories in the BNC World Edition (2001) is a very welcome development. For each text, his categorization forms the content of a <classCode> element in the header of the text (with the attribute scheme="DLee"), and using SARA98 it is possible to restrict searches to one or more of his categories, and to define corresponding subcorpora -- subcorpora which can of course be adjusted if the user does not agree with Lee's attribution of particular texts to particular categories. 
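The selection and adjustment of a subcorpus by Lee's categories might look roughly like the sketch below, which is not SARA98's interface. It assumes headers available as well-formed XML with a <classCode scheme="DLee"> element as described above. The text identifiers and the header layout are hypothetical, while W_news_script and W_fict_poetry are category names taken from Lee's taxonomy.

```python
import xml.etree.ElementTree as ET

# Toy headers with invented text IDs; only the classCode element and its
# scheme attribute follow the description in the text.
HEADERS = {
    "T01": '<header><classCode scheme="DLee">W_news_script</classCode></header>',
    "T02": '<header><classCode scheme="DLee">W_fict_poetry</classCode></header>',
    "T03": '<header><classCode scheme="DLee">W_news_script</classCode></header>',
    "T04": '<header><classCode scheme="other">W_news_script</classCode></header>',
}


def lee_category(header_xml):
    """Return the category under scheme="DLee", or None if there is none."""
    root = ET.fromstring(header_xml)
    for code in root.iter("classCode"):
        if code.get("scheme") == "DLee":
            return (code.text or "").strip()
    return None


def subcorpus(headers, wanted, exclude=()):
    """Return sorted IDs in the wanted categories, minus hand-picked exclusions."""
    wanted, exclude = set(wanted), set(exclude)
    return sorted(tid for tid, xml in headers.items()
                  if tid not in exclude and lee_category(xml) in wanted)


if __name__ == "__main__":
    print(subcorpus(HEADERS, {"W_news_script"}))
    # Drop a text the user judges to be misclassified or unsuitable.
    print(subcorpus(HEADERS, {"W_news_script"}, exclude={"T03"}))
```

Keeping the exclusions as a separate argument leaves Lee's attributions untouched, so a user's disagreements with them stay explicit and easy to revise.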
I have, for instance, used a subcorpus of lectures from the BNC with a group of trainee conference interpreters who will need to work with academic monologue, selecting all those texts which Lee categorizes as lectures and then discarding two or three which seemed too informal and interactive for my purposes. There are nearly 50 lectures overall, on a wide range of topics and by a fair variety of lecturers, and it has proved a useful collection from which to retrieve examples of particular discourse phenomena for teaching purposes and from which to generate hypotheses about the ways that lectures seem to work. Useful, that is, as long as you don't try to interpret it as a "representative sample" allowing reliable generalizations about lectures as a genre.
ABOUT THE AUTHOR
Guy Aston is Professor of English Linguistics in the School of Modern Languages for Interpreters and Translators at the University of Bologna, Italy. His main research interests concern the uses of corpora in language learning and in translation. E-mail: guy@sslmit.unibo.it
REFERENCES
Aston, G. (in press). The learner as corpus designer. In B. Kettemann (Ed.), Teaching and language corpora 4 (provisional title). Amsterdam: Rodopi.
Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh, UK: Edinburgh University Press.
Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), 1-16.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243-257.
The BNC Sampler. (1999). Oxford, UK: Oxford University Computing Services.
The BNC World Edition. (2001). Oxford, UK: Oxford University Computing Services.
Burnard, L. (1995). Users reference guide for the British National Corpus. Oxford, UK: Oxford University Computing Services.
Dodd, A. (2000). SARA98.
Oxford, UK: Oxford University Computing Services.
Gavioli, L. (2001). The learner as researcher: Introducing corpus concordancing in the classroom. In G. Aston (Ed.), Learning with corpora (pp. 108-137). Houston, TX: Athelstan.
Gavioli, L., & Aston, G. (2001). Enriching reality: Language corpora in language pedagogy. ELT Journal, 55(3), 238-246.
Sinclair, J. McH. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press.
Language Learning & Technology http://llt.msu.edu/vol5num3/kennedymiceli/ September 2001, Vol. 5, Num. 3 pp. 77-90
AN EVALUATION OF INTERMEDIATE STUDENTS' APPROACHES TO CORPUS INVESTIGATION
Claire Kennedy and Tiziana Miceli
Griffith University, Brisbane
ABSTRACT
This paper reports on our experience in using a corpus of our own compilation, Contemporary Written Italian Corpus (CWIC), in teaching intermediate students at Griffith University in Australia. After an overview of the corpus design and the training approach adopted, we focus on our initial evaluation of the effectiveness of the students' investigations. Much has been written on what can be done with corpora in language learning: what kinds of discoveries can be made with different types of corpora. There is relatively little on how learners actually go about investigations. Since we intend for our students to progress from classroom use to independent work as a result of using a Web-based version of CWIC, we have been seeking to understand how successful they are at extracting information from this corpus in the absence of a teacher. Our initial study highlighted the complexity of the process and the specialized skills required. We found that lack of rigor in observation and reasoning contributed greatly to the problems that arose, as did ignorance of common pitfalls and techniques for avoiding them.
We, therefore, conclude the paper with an outline of proposed changes to our apprenticeship program, aimed at better equipping the students as "corpus researchers."

INTRODUCTION

Much of the literature on the use of corpora in language teaching relates to courses for advanced and highly motivated students of English for Specific/Academic Purposes (e.g., Johns, 1988, 1991a, 1991b; Levy, 1992; Mparutsa, Love, & Morrison, 1991; Stevens, 1991; Tribble, 1991) or translation (e.g., Aston, Gavioli, & Zanettin, 1998; Bernardini, 1998; Gavioli, 1996). So, when contemplating the introduction of work with corpora into the undergraduate Italian program at Griffith University in Australia, we were aware of the need to tailor the experience to quite a different target group, for whom Italian is usually a foreign rather than a second language and whose intentions for its use are less ambitious. In other contexts, our students might be regarded as reaching intermediate or higher-intermediate levels.

Our aim was to provide these students with a corpus to use primarily as a reference resource while writing. In view of their proficiency level and the types of writing tasks in which they engage, we sought a corpus that would supply models of personal writing on everyday topics. At the time, the only corpus of contemporary written Italian available to us was a collection of newspaper material,1 so we first resolved to create our own corpus, which we have named CWIC, or Contemporary Written Italian Corpus. Secondly, we decided to initiate the students into corpus use in a gradual and guided manner and, thirdly, to attempt to evaluate the effectiveness of their work with CWIC as soon as possible. This paper discusses the implementation of those three decisions, with the focus on an initial evaluation exercise and its implications for our approach to training students.
The CWIC Project: A Corpus for our Teaching and Learning Context2

Most of our students begin their university experience with no prior knowledge of Italian and can attend a maximum of 400 contact hours during their three years in our program. Additionally, they are not usually able to spend time in Italy during their studies nor are there many local opportunities for immersion. We estimate that, on the average, they graduate with "basic vocational proficiency" in reading and listening, on the scale used in Australia (Wylie & Ingram, 1999), while their ratings in speaking and writing are lower, somewhere between "basic social proficiency" and "basic vocational proficiency."

In their second year, the students begin intensive writing practice with letters and diaries, some creative writing and informative pieces based on their own experience. In their third year, the work is more academic in the sense that they bring their analytical skills to bear on the topics. The writing tasks are defined as commentaries, reviews or short essays, and treat aspects of the novels and films studied or television news items and newspaper articles.

In designing CWIC for this context, we were informed by various reports on the merits of small corpora for language learners, especially Tribble's advice that "the most useful corpus for learners … is the one which offers a collection of expert performances in genres which have relevance to the needs and interests of the learners" (1997, p. 3) and Aston's recommendation of corpora restricted to familiar text types and topics (1997, p. 62). We envisaged CWIC as complementing the newspaper corpus by providing models of texts by non-professional writers, including personal correspondence, although we chose to include some journalistic writing as well.
We refined our general selection criterion of contemporary written usage to the following: short, written texts of specific text types (see Table 1), produced since 1990, by adult native speakers of Italian using non-specialist language.

Table 1. Text types included in CWIC

By non-professional writers:
• private letters
• business and official letters
• private email messages
• business and official email messages
• email messages to mailing lists
• letters to experts in magazine columns

By professional writers:
• experts' responses
• articles in regular magazine columns
• film reviews

Within the constraints of physical access to texts and the feasibility of obtaining permission to use them, our selection has been motivated by the desire to include a range of topics that our students might find interesting or relevant, in texts likely to be comprehensible. The email lists and magazine columns are a valuable source of material on a wide range of themes.3 Our interest in content stems from the expectation that students will come to appreciate the corpus not only as raw material for concordances and frequency lists but also as a database of whole texts, which can be interesting to browse through collectively or read individually.

At the time of writing this article, we have approximately 570,000 words, in 2,200 texts by 930 different authors.4 While we make no claims regarding representativeness of language in general, we can say that CWIC provides models of expert performances in several of the text types that our students encounter and are required to produce, during as well as after their studies. It also offers a wealth of appropriate language that can be used in other writing tasks such as creative writing and essays.

The Students' Apprenticeship

Since Johns (1988, p.
24) raised the need for learners to develop strategies of observation for extracting information from the data, many teachers experimenting with concordancing in the classroom have favored a gradual and guided approach (Johns, 1991b, p. 31; Stevens, 1991, p. 39; Tribble & Jones, 1997, p. 58). Guiding learners through a series of preliminary concordance-based activities has been presented as a way of both familiarizing them with various types of investigations that can be conducted and stimulating this development of appropriate learning strategies through practice (Turnbull & Burston, 1998, p. 18).

We too opted for an "apprenticeship" approach to training the students, intended to promote learning by example and by experience. We began with the second-year cohort, in a subject that includes a weekly two-hour writing workshop. For most of the training we used a sub-corpus of 50,000 words containing texts of each type, so the students could become familiar with the corpus without facing vast arrays of examples. The activities were initially carried out step by step, with the teacher giving directions through a series of leading questions, sometimes calling attention to particular examples. The students worked in pairs or small groups and reported back to the rest of the class.

Interrogation of the corpus was not presented as an end in itself but rather as an integral part of the writing and grammar work being undertaken. There is considerable attention to morphology and syntax in the subject, since it is at intermediate level. We started concordancing activities in that context, by examining verb constructions with direct and indirect objects as well as the behavior and meaning of certain conjunctions and pronouns.

After the first few sessions, we began to encourage the students to use the corpus while revising their own written work.
Periodically, we presented the class with anonymous sample sentences from the previous week's writing and worked with them on ways of using the corpus to make corrections. In this way, they practiced formulating questions, such as "Should we use infine or finalmente here?", and devising appropriate searches. When marking their work, we pointed out where they might be able to make corrections themselves, with reference to the corpus. This meant dedicating some class time to individual problem-solving work, with the teacher circulating to assist as needed. Finally, we presented applications of the corpus in composing and in pre-writing work, for what we call "treasure-hunting": finding models of ways to express things. Several such activities were conducted with a sub-corpus of personal letters. The students first browsed freely through several letters, observing typical opening and closing sequences. Then, they looked for ways of expressing certain functions, such as apologizing for not writing sooner, thanking someone for a previous letter, or giving information on chosen topics such as work, family, or exams. They did this both by skimming sequentially and by searching on words they thought might be present. For example, ricevere produced the expression Non sai che piacere mi ha fatto ricevere tue notizie (You don't know how pleased I was to receive your news) and vita turned up La mia vita sentimentale è veramente uno schifo (My love life is truly lousy). The students also examined frequency lists for combinations of three or four words, which brought to light a host of useful sequences, such as Non vedo l'ora (I can't wait), Ci sono novità? (What's new?) and al più presto (as soon as possible). These proved to be interesting and entertaining to the students, not only as alternatives to overused expressions, but also as triggers for further searches. 
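The frequency lists of three- and four-word combinations described above can be approximated with a few lines of code. The following is a minimal sketch in Python, not the DBT implementation: the function names `tokenize` and `ngram_counts`, the tokenization rule, and the sample letters are all assumptions made for illustration, and the sample text is invented rather than drawn from CWIC.

```python
import re
from collections import Counter

# Assumed tokenization: runs of letters, keeping elided forms such as "l'ora" together.
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(text):
    """Lower-case a text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def ngram_counts(texts, n):
    """Count every sequence of n adjacent tokens, never crossing text boundaries."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Invented sample letters (hypothetical, for illustration only).
letters = [
    "Non vedo l'ora di vederti.",
    "Non vedo l'ora di partire! Scrivimi al più presto.",
    "Ci sono novità? Rispondimi al più presto.",
]

trigrams = ngram_counts(letters, 3)
for ngram, freq in trigrams.most_common(5):
    print(freq, " ".join(ngram))
```

In a list of this kind, sequences that recur across texts rise to the top, while one-off combinations that merely span a sentence break stay at a frequency of 1, which is one reason such lists tend to need a reasonably large sub-corpus before they become informative.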
Neither in problem-solving nor in treasure-hunting work did we seek to engage the students in free exploration without a predetermined aim. There was always a defined goal: to find out how to phrase something specific in a given text. However, some experimented with "serendipity learning" (Johns, 1988, p. 21) during treasure-hunting activities and we encouraged them to continue to do so in their own time.

Overall, we viewed this introductory semester as a time for preparing students for future independent work with larger corpora outside the classroom.

Until now, corpus interrogation has been performed using the text database software DBT3 Database Testuale (Picchi, 1997), which is installed in our laboratories. We believe that DBT has a friendly and intuitive user interface and that it offers an appropriate range of functions, including concordancing on single words or expressions (which can be quite complex), labeling of examples to identify sources, and sorting. Moreover, the length of context displayed for each example is configurable, and clicking on an example expands the context to full screen. There is also a browser for viewing whole texts and the battery of reporting functions includes frequency lists. Soon, however, students will have access to CWIC from home, since we are currently working to transfer it to a Web platform, with its own searching software, offering functionality and tools similar to DBT. In 2001, we will be involving our students in a pilot exercise using CWIC on the Web.

Of the 26 hours in the writing workshop strand of the subject, 7 class contact hours were dedicated to CWIC during the semester. The students also worked with the corpus on assignment tasks for a few hours outside of class time.
Only 3 of the 17 students who completed course evaluation questionnaires said that the amount of time spent was disproportionate to the usefulness of the exercise. In the future, we intend to examine more closely the relationship between the time invested in training in concordancing work and the benefits attributable to the mastery of this type of reference tool.

The questionnaires, combined with class discussions and individual interviews, were intended to draw out students' perceptions of certain aspects of the corpus induction experience. Because these findings are the subject of a separate study, we will only mention some of the main points here. On the positive side, most students reported that work with the corpus helped them to better understand Italian grammatical structure and boosted their confidence in correcting their own writing. Their various definitions of what made the corpus a useful resource can be grouped into three categories: it provides examples of real language; it allows exploration of the various uses of a given word in different contexts; and it illustrates the specific functions of certain words and expressions in particular types of text. On the negative side, some stressed the discouragement felt on not being able to understand all the examples or to identify relevant ones, and most admitted that they had on occasion found searches too time consuming and frustrating. Our first evaluation exercise was concerned with this aspect: what creates a successful investigation and what causes unproductive searches and frustration.

EVALUATION: AIMS AND PROCEDURE

In view of the proficiency level of our students and our intention that they use CWIC and other corpora outside the classroom, we were keen to understand how effectively they were able to use it on their own, specifically the mechanics of their investigations and the difficulties they encountered.

We found little to inform us in developing an approach to such a study. Flowerdew (1996, p.
112) drew attention to "a paucity of critical perspectives in concordancing literature," but his call for more in-depth evaluative work does not appear to have borne fruit. Much has been written on what can be done with corpora in language learning -- what kinds of investigations can be conducted with different types of corpora and what kinds of discoveries are made, usually in a classroom context -- but there is relatively little in the literature on how students actually do this, and especially on how they fare on their own.

Two of the studies we located, however, do reflect an interest in evaluating students' independent work. Turnbull and Burston (1998) analyzed the aims and outcomes of investigations conducted by advanced students after only minimal training with a concordancer, but mainly with the goal of demonstrating the importance of adequate training. Closer to our purposes was Bernardini's (1998) examination of the processes and outcomes of students' exploration of the British National Corpus, as a result of which she outlined suggestions for making this kind of work more systematic. Among the tendencies she noted were ignoring variants, not looking for alternative approaches when faced with an obstacle, and making only a summary analysis. While our students were not involved in free exploration of a large corpus, we anticipated that these kinds of problems were likely to characterize their work, too.

We chose to focus on our students' handling of problem-solving activities while revising a text as the first stage of our evaluation since much of their work done in class had been like this. Essentially by asking them to "show their work," we collected data on how they went about using concordances to answer specific questions while correcting their own or others' work.
The 10 students referred to in the discussion that follows (S1 to S10) came from the top and middle ranks of the cohort in terms of their achievement in our subjects. For the purposes of this paper, we numbered them according to their results in the written Italian subject in which the corpus apprenticeship was conducted as well as in its companion subject in spoken Italian. S1 was the top performer. Of the 10 students, 5 were enrolled in a languages and linguistics degree program, and the other 5 were studying history, law, or psychology. Some of the cases cited are drawn from activities individually carried out by the students during the semester. Evidence comes from their own accounts of how they used CWIC, sometimes in tasks set by us but oftentimes in those they set for themselves while in the process of editing their own compositions on given topics. The majority of the cases come from pair-work sessions held immediately after the end of semester, which were video-recorded and followed immediately by an interview aimed at extracting a retrospective account of the students' work. Eight students participated, and the sole criterion for pairing them was their availability at particular times. They were given two texts to revise. In the first, we set specific tasks by underlining certain words to indicate where there might be a problem. In the second, we invited them to decide what issues to deal with for themselves. We expected that in the investigations they initiated themselves the students would work on relatively familiar language points, approached with some degree of confidence. The set tasks, on the other hand, were intended to force them to address types of problems they might not otherwise tackle. In both the individual and pair-work situations, all the texts we provided for the activities had been selected from work submitted by students in that subject. Dictionaries and grammar books as well as the corpus were available at all times. 
The students were encouraged to use all three resources as they deemed appropriate.

RESULTS

Overview

We found that the students made many successful investigations, demonstrating a general appreciation of the types of questions that can be posed, a certain ability to work by analogy, and a preparedness to review their strategies when a search was leading nowhere. However, our concern was to identify what went wrong or could be done more efficiently, in order to gain insight into how to improve the apprenticeship. Our observations suggested that, while knowledge and experience of the language undoubtedly played a part in how productive the students' work with the corpus was, lack of rigor in observation and reasoning contributed greatly to their difficulties, as did apparent ignorance of common pitfalls and techniques for avoiding them. We concluded that our training had not adequately equipped them as "corpus researchers."

Our Analysis of Learner Investigations

In order to understand what happens in a corpus investigation, we approached it as a four-step process: (a) formulating the question; (b) devising a search strategy; (c) observing the examples found and selecting relevant ones; and (d) drawing conclusions. This schema is illustrated below with reference to one of the set tasks from the pair-work sessions. The sentence concerned was Sto cercando l'orario per il corso LAL3093 (I'm looking for the timetable for the subject LAL3093). The problem is the choice of preposition: per is often used where for is used in English, but not in this context, where the pattern is orario di. An appropriate and efficient way of dealing with the task is described in Table 2.

Table 2. Steps in a Corpus Investigation

1. Formulate the question: "Which preposition can be used after orario when speaking of a timetable for something?"
2. Devise a search strategy: Search on orario, with a view to checking what follows it.
3. Observe the examples and select relevant ones: Look for examples in which the idea timetable for something is expressed.
4. Draw conclusions: Check which word(s) are used with orario in those examples. Identify the combination orario di and insert it into the target sentence, making any necessary adaptations.

Occasionally, the students' investigations did not conform exactly to this pattern, as they had no clear question in mind at the outset of their search. This happened if they were working on a set task and had no idea what the issue might be, so they performed a preliminary search on the underlined word or neighboring ones, just to see what came up. If nothing attracted their attention, they abandoned the task, but if they did notice something they formulated a question and then proceeded through the remaining steps as outlined above.

The discussion that follows examines students' work on each step in some detail. We frequently use specific cases to illustrate the types of problems that led to an unsuccessful outcome. Despite this focus on what goes wrong, our intention is to convey how complex a corpus investigation is, rather than to present the students' performance as unsatisfactory. We trust the analysis serves to highlight the specialized skills the learners employed and the variety of factors they are required to bear in mind.

We have not included cases in which an unsuccessful outcome was caused by lack of linguistic knowledge, although we recognize that proficiency is important, especially in Step 1 and Step 3. In Step 1, for instance, appreciation of whether it makes sense to ask a given question depends to some extent on familiarity with the target language. In Step 3, of course, not understanding the examples can undermine even an impeccably conducted investigation.
However, our interest here is in identifying problems that did not appear to result from inadequate proficiency and that could perhaps be overcome by appropriate training. We, therefore, sum up the discussion of each step in the form of a list of tips for learners. We do not present these as rules ready to be imparted to future groups of trainees, but envisage drawing them up together with the students, through collective reflection on investigations carried out in class.

Step 1: Formulating the Question

Before examining what goes on in this step, it is important to note what types of questions were being dealt with in the investigations. They were not free exploration questions such as "What can I find out about x?" nor treasure-hunting questions like "In what ways can I express this function?" Instead, the questions were aimed at checking or correcting a given sentence. Those we encountered in the students' work were of just three types: (a) "What is/are the correct word(s) in this context to render this meaning?"; (b) "What construction do I need around this word (or these words) in this context?"; and (c) "What order should these words be in, in this context?" Each type can be expressed in open form, as above, or in closed form. For example, two closed forms of the first type of question are: "Can x be used to render this meaning in this context?" (yes/no form) and "Is x or y the correct word for this meaning in this context?" (multiple choice). Clearly, we use the words "yes/no" here as shorthand for "there is / is not evidence for this."

We found that some of the questions the students formulated suggested that they might have misconceptions about the types of questions it is logical to ask, the kinds of information that can be obtained from a corpus, and the ways clusters of words behave.
We grouped the problems we identified into five categories.

First, there was sometimes insufficient attention to how specific or general a question should be. When dealing with Sto cercando l'orario per il corso, S3 and S6 asked "Can you say per il corso?" This is too general a question; while the answer is "yes," this does not help decide whether these words can be used in the given sentence. It seemed that the students needed to be more conscious of the fact that the actual combinations of words used in a language are only a subset of the potential combinations (Gavioli, 1996, p. 124). Reflecting on the implications of general or specific questions in their native language could help students appreciate this problem (provided they do not assume answers can be transposed). For example, "Can you say for the course?" is not a useful question in English either. We say prerequisites for the course but aims of the course.

An unnecessarily general question may well eventually lead to a successful outcome, but the investigation is likely to be inefficient, due to detours to deal with evidence in contexts not relevant to the case at hand. We observed this in students' handling of one of the individual set tasks, that of choosing between Il lunedì scorso siamo andati all'università and Lunedì scorso siamo andati all'università for Last Monday we went to university. The issue is whether the definite article is used with lunedì scorso (last Monday) and the answer is "no." Some asked the question "How does scorso (last) behave?" rather than "How does lunedì scorso behave?" This meant dealing with scorso in several contexts, some with an article and some without.

Second, the students often did not seem to consciously choose whether to frame their questions in open or closed form. Primarily, they did not take into consideration that a closed question could lead them to a dead end and the need for a follow-up question.
This happened to S1 and S7: after dealing with their question "Do you say orario per?" they found that they needed a second investigation aimed at answering "So what do you use after orario?"

The third type of problem was apparent when a question arose only after looking at some examples. In this situation, students sometimes failed to formulate the question explicitly. One of the set tasks in the pair work was to check the sentence Auguri per il weekend, with which the writer had intended to say something like Have a good weekend. Here, auguri is out of place: it usually corresponds more to best wishes and is used for birthdays and other special occasions. S10 had an idea along those lines, suggesting, "Maybe they don't say wishes for the weekend, maybe they mean wishes like congratulations," but she and S8 did not turn that into the question, "So how do you say Have a good weekend?" which might have led them to search on other words. They just continued to muse upon the examples of auguri and eventually gave up.

Fourth, there was the fatal lure of prepositions. The students' attention was often attracted to a preposition itself rather than to the words around it, on which it depended. In some cases, they treated a preposition as having a meaning in isolation, or as being in one-to-one correspondence with an English preposition, such as when S4 said to S9 "Doesn't da usually mean from?" Very common indeed was the habit of treating a preposition as linked only to the words following it. For example, when correcting her own sentence Il cane è troppo stanco … continuare il gioco (The dog is too tired to continue the game), S5 asked "What preposition do I want before continuare?" rather than "How do I construct too <adjective> to do something?"
Fifth, there was a tendency to neglect lexical considerations in favor of grammatical ones, to focus on how to combine words rather than whether they could be used at all in the given context. For example, when presented with Non mi sorprenderebbe imparare che ho fatto molti errori (It wouldn't surprise me to learn I've made many mistakes), nearly all the students considered only the construction of the sentence. They did not question whether imparare could be used for learn in this sense.

On the basis of our observations, an initial set of tips for Step 1 might be as shown in Table 3.

Table 3. Examples of Tips: Step 1

• Try to state your question precisely.
• Ensure it is specific enough for the situation you are dealing with.
• If it is in yes/no or multiple choice form, consider whether an open question would be more appropriate. For example, rather than asking "Does y come after x?" you might want to ask "What comes after x?"
• Keep in mind both lexical and grammatical issues.
• In your dealings with prepositions:
  • When considering what word(s) a preposition might be linked to, look both to the right and to the left, and to a distance of a few words.
  • If you are trying to choose a preposition for a particular context, remember the possibility that no preposition is required there.

Step 2: Devising a Search Strategy for a Given Question

We identified the components in the definition of a strategy as (a) choosing the word(s) to search on and (b) deciding whether and how to use other options such as sorting examples or consulting a dictionary or grammar book.

Choosing the word(s) to search on is not necessarily just a matter of deciding which are the key words in the question.
It may entail picking words that can be substituted for these, such as different forms of a lemma or words that belong to the same set (like days of the week, colors, possessive pronouns). Students did not always pay sufficient attention to exactly defining the construction they were dealing with and therefore distinguishing its fixed and variable parts. Often this coincided with a certain difficulty in framing the question. One example of many was the treatment of Non mi sorprenderebbe imparare (It wouldn't surprise me to learn) by S1 and S7. They wondered whether a preposition is required between the conjugated verb sorprenderebbe and the infinitive imparare. Their strategy was to search on imparare. It did not seem to occur to them that it was the behavior of sorprendere that mattered, that the construction is a variant of Non mi sorprenderebbe <infinitive> or, more generally, Non <object pronoun> <conjugated form of sorprendere> <infinitive>. Nor did students seem very concerned that a strategy be efficient. That is, they did not direct effort at obtaining a workable number of examples -- not too many -- with as many as possible of them likely to be relevant to the problem at hand. This means including as much as possible in a search combination without, of course, prejudicing a successful result by making it too restrictive. During the pair work, S4 and S9 set themselves the task of deciding between niente da fare and niente di fare for nothing to do. They searched on fare (to do) and sorted the examples so as to check on the left for di fare and da fare. Since fare is present in a myriad of idiomatic expressions, it provided a host of irrelevant examples to sort and scroll through. 
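The trade-off between a broad and a restrictive search can be illustrated with a small sketch. The Python code below is not DBT and makes no claim about its query syntax: the helper names `find_sequence` and `kwic`, the shell-style wildcards borrowed from Python's `fnmatch` module, and the invented sample text are all assumptions for illustration. It compares a search on fare alone with a search on the fuller pattern niente + d? + fare, which still leaves the preposition open.

```python
import re
from fnmatch import fnmatchcase

# Assumed tokenization: runs of letters, keeping elided forms such as "c'è" together.
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def tokenize(text):
    """Lower-case a text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def find_sequence(tokens, patterns):
    """Return start indexes where adjacent tokens match the wildcard patterns in order."""
    n = len(patterns)
    return [
        i for i in range(len(tokens) - n + 1)
        if all(fnmatchcase(tok, pat) for tok, pat in zip(tokens[i:i + n], patterns))
    ]

def kwic(tokens, start, length, width=3):
    """Format one hit as a keyword-in-context line with `width` tokens on each side."""
    left = " ".join(tokens[max(0, start - width):start])
    hit = " ".join(tokens[start:start + length])
    right = " ".join(tokens[start + length:start + length + width])
    return f"{left} [{hit}] {right}".strip()

# Invented sample text (hypothetical, not taken from CWIC).
sample = ("Oggi non ho niente da fare. Devo fare la spesa e poi fare una "
          "telefonata. Non c'è niente da fare, fa troppo caldo.")
tokens = tokenize(sample)

broad = find_sequence(tokens, ["fare"])                  # every occurrence of fare
narrow = find_sequence(tokens, ["niente", "d?", "fare"])  # niente + da/di + fare

print(len(broad), "hits for the broad search")
for start in narrow:
    print(kwic(tokens, start, 3))
```

The narrower pattern returns only the examples that bear on the question, while the single-character wildcard keeps both da and di in play so that the search does not presuppose its own answer.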
Additionally, there were several cases of students overlooking the option of trying other forms of the key word (or substituting another word for it altogether) in the event of not finding any examples. When dealing with Italian nouns, verbs, and adjectives, if a first search fails to produce sufficient evidence, it should be automatic to check different inflected forms. In one instance, S1 posed the question of whether the adjective estrema (extreme, singular feminine) should precede or follow the noun and made her decision on the basis of only one example. Had she searched on the masculine and plural forms as well (or used the stem together with a wildcard character: estrem*), she would have had several examples to consider, and she would have been able to detect the mobility of this adjective, with the choice of position reflecting degree of emphasis. Another aspect of this issue is that if students did think to search on another form of a verb, they tended to only try the infinitive, apparently transferring dictionary practice to corpus use.

Clearly, there are many factors to take into account when devising a strategy, and it is not surprising that the students did not always think of all possible ways of fine-tuning their approaches. There were several occasions in which they neglected to use certain options to their best advantage. For example, when searching for a combination of words, students sometimes forgot to specify if they were interested in the words only when they were adjacent. This is quite simply achieved by setting a maximum-distance-apart parameter to 1. We noticed the converse problem too, of setting this parameter to 1 automatically, without considering whether the search words were likely to be separated in the examples by intervening words, phrases or even clauses. Sorting features were also used somewhat indiscriminately.
The words linked to a keyword may well not be adjacent to it, and looking at sorted output sometimes distracted the students' attention from useful examples.

Finally, there were times when the students were apparently so engrossed in the corpus that they forgot to use the dictionary. This was noticeable at moments when they realized they were dealing with a word that did not have the desired meaning in a certain context. They got as far as checking the wrong word in the corpus and establishing that there was no evidence for using it in the target sentence, but then simply relied on their own memory or imagination in determining what to use instead, rather than reaching for the English-Italian dictionary.

In light of these observations, we drew up a basic set of tips for Step 2, shown in Table 4.

Table 4. Examples of Tips: Step 2
• Think about how efficient your strategy will be. Is it likely to generate many irrelevant examples alongside the useful ones? If so, maybe you should restrict your search further.
• Check if you are dealing with a variant of a general pattern, with a fixed part and a variable part, as you may want to search only on the fixed part.
• If you are not satisfied with the examples found, think about using wildcards or substituting something else for one of the search words: another form of the same lemma or a word that may be equivalent in the context that interests you.
• Remember the English-Italian dictionary if you are looking for potentially appropriate words.

Step 3: Observing the Data and Selecting Examples

Surprisingly often, students lost sight of the importance of selecting examples with a view to matching form and meaning closely to the requirements of the target sentence. For example, while editing a sentence using the adverb ancora to mean "again," S8 investigated the behavior of ancora, saying that she was interested in its position with respect to the verb it modified.
Her eventual construction was fine, but in none of the four examples she cited as her models did ancora have the meaning again nor did it always modify a verb.

Most of the time, the students did check the meaning and structure of examples, but they were not always bent on finding a close match, even when excellent evidence was readily available. It became clear to us that there were specific traps for those who were anything less than rigorous in the selection of examples. One was the distraction offered by a majority of examples being of one kind. In this situation, useful examples belonging to a minority category were easily ignored. In the case of lunedì scorso, most students who searched on scorso were attracted by the examples of il mese scorso (last month) and l'anno scorso (last year), which include the definite article, and used these as their model. This was despite the fact that 2 of the 15 examples found illustrated exactly what was needed, in the form of last Monday and last Thursday, without the article. Another trap for students who did not attend closely to meaning was the way some combinations turn up due to the chance juxtaposition of phrases, not because they form a lexical phrase themselves. A frequent problem was that of students not noticing something if it was not what they were looking for. This was the case when S4 and S9 were trying to establish whether they should use cercare a for to look for. They simply did not see the several examples on the screen of cercare used transitively to mean to look for, because they were intent on choosing a preposition. The problem of not noticing all the information given could be observed also at the moment of applying an example as a model in the target sentence.
In an individual task, S9 wanted to see what verb construction to use in After bringing the stick back, and her first attempt included dopo restituendo (after <gerund>). She then looked up dopo and found an example, which included dopo aver chiuso (after having closed), or dopo aver <past participle>. However, she appeared to notice only the pattern dopo aver, and so she just inserted aver into her first guess, producing the hybrid dopo aver restituendo.

Some tips for step 3 are shown in Table 5.

Table 5. Examples of Tips: Step 3
• Remember to check the meaning of examples you want to use as evidence, and seek out those that most closely match the requirements of your target sentence.
• Try not to be influenced by assumptions about what you will see in the examples. Look to the left and right of keywords to see which words are linked to them. The words you are expecting to find may not be present, and vice versa.
• Try not to be attracted only to the types of usage of a word that occur most frequently. The type you are interested in may be a less common case.

Step 4: Drawing Conclusions

The observations we made on this step primarily concern problems in reasoning, particularly the implications students drew from the number of examples found by a search. When only one or very few examples were found, the students tended to lack confidence in the result, evidently assuming that many illustrative examples are necessary to establish a case. This reflected a lack of appreciation for the fact that what matters is the quality rather than the quantity of examples. Depending on the type of question addressed, one example that is suitably analogous to the target can be sufficient for a "watertight case." On the other hand, of course, if only a few examples are found, they may be the result of chance juxtapositions, the reality being that no relevant examples are present.
This suggests a more general issue about numbers of examples: If many turn up when few are expected, or vice versa, the significance of this should be considered. The students sometimes expressed perplexity (for example, S10 said at one point, "Wouldn't you think there'd be a lot more examples?") but failed to act on this dilemma.

Various invalid conclusions were drawn at times when no examples were found. These included, "The phenomenon does not exist" in place of "There is no evidence for it in this corpus"; "The answer is not x so it must be y" when it was only a matter of supposition that x and y should be the only options; and "The search didn't work" or "We didn't find out anything," because the search had not produced the expected results.

Some tips based on these observations are shown in Table 6.

Table 6. Examples of Tips: Step 4
• Even if you have only one example as evidence, it may be enough on which to base your case. Remember that what matters is how good your evidence is, not how much of it there is.
• If you have found only a few examples when you were expecting many, or vice versa, you may need to think about what this means. Why were you expecting to find many or only a few? What has affected the result?
• If you have found no examples, think carefully about what conclusion you can draw. Make sure you relate your conclusion to the question that you initially posed.

WHERE TO FROM HERE?

In the investigations we analyzed, difficulty in understanding examples was very rarely the sole or even the primary cause of invalid results. In fact, the above discussion is based entirely on cases in which it was unlikely that the examples dealt with were hard for the students to understand.
Furthermore, in the pair work it was often the student with lower proficiency (as far as that can be measured by results in our subjects) who appeared more competent in using the corpus to tackle a problem. There were many instances in the pair work where S7, S9 and S10 led the way for S1, S4, and S8 respectively, by showing insight in formulating a question, using clear reasoning in devising a strategy, or paying attention to examples. By this we do not mean to suggest that language proficiency is irrelevant nor to deny how daunting arrays of examples can be. We simply intend to underline that, in each of the four steps, we identified specific problems that seemed to be due to inadequate corpus-investigation skills. These were accompanied by an evident lack of awareness on the students' part of how easy it is for an investigation to be derailed.

So the apprenticeship now appears far more complex than we had thought. The evaluation has highlighted the need to focus on treating the students as trainee researchers. As in any other field of research, it is necessary for novices to acquire certain attitudes and habits of reasoning. They need to become acquainted with underlying principles and to master specific techniques, which are not necessarily intuitive. We are, therefore, reviewing our approach in two main areas.

First, we are looking for ways to encourage students to distinguish between observation and interpretation of data so as to try to free the observation phase of assumptions. To prepare the students to exploit the "direct access to the data" that a corpus provides (Johns, 1991b, p. 30), we must convey to them that rigorous observation has to precede interpretation of what is observed.
This means work aimed at raising consciousness of the idea that language is made up of "lexical phrases" rather than single words and that putting a sentence together is a matter of arranging patterns of words, attending to "the ways they can be pieced together, along with the ways they vary and the situations in which they occur" (Nattinger, 1980, p. 341). Careful observation in relation to these aspects can be expected to help overcome assumptions.

In addition to including explicit observation exercises in the training program, we are also making a much more general change. In order to "market" the benefits of observation to the students, we have decided to entirely reverse the order of our approach so as to start with treasure-hunting and borrowing chunks of appealing language while composing texts. Subsequently, we will move on to the use of concordances to solve specific problems regarding word use, while revising texts. In this way, we mean to highlight, from the outset, the value of a corpus as a database of whole texts and of models of complete utterances and set phrases. We hope that by beginning with treasure-hunting we can encourage students to appreciate exploration of the corpus without prior assumptions about the data that will be found and cultivate in them a more open mind towards the ways strings of words belong together.

The second key aspect of our new approach, as foreshadowed in the preceding section, will be that of engaging the students in reflection on the processes of their problem-solving investigations. Once they have some experience in using the corpus to answer questions, we will introduce exercises -- perhaps presented in the form of "spot what goes wrong" -- aimed at collectively deriving a checklist of tips along the lines of those drafted above.
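Two of the search options discussed under Step 2, a wildcard on the stem (estrem*) and a maximum-distance-apart parameter, can be made concrete with a small sketch. It is an illustration only and does not describe how CWIC or any particular concordancer is implemented: the function names, the tokenizer, and the toy Italian text are all assumptions introduced here, with the text built from word forms mentioned in the discussion rather than taken from the corpus.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a concordancer-style pattern such as 'estrem*' into a whole-word regex."""
    body = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(rf"^{body}$", re.IGNORECASE)

def tokenize(text):
    """Very rough tokenizer (an assumption): lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())

def search(tokens, first, second=None, max_distance=1):
    """Return positions of `first`; if `second` is given, keep only hits where it
    occurs within `max_distance` tokens to the right (1 means adjacent)."""
    first_re = wildcard_to_regex(first)
    second_re = wildcard_to_regex(second) if second else None
    hits = []
    for i, token in enumerate(tokens):
        if not first_re.match(token):
            continue
        window = tokens[i + 1 : i + 1 + max_distance]
        if second_re is None or any(second_re.match(t) for t in window):
            hits.append(i)
    return hits

# Invented toy text, for illustration only (not corpus data).
toy = "Una misura estrema, con estremo rigore, per casi estremi. Dopo aver chiuso la porta."
tokens = tokenize(toy)

print(search(tokens, "estrema"))                         # one inflected form only
print(search(tokens, "estrem*"))                         # stem plus wildcard
print(search(tokens, "dopo", "chiuso", max_distance=1))  # adjacency required
print(search(tokens, "dopo", "chiuso", max_distance=2))  # one intervening word allowed
```

On this toy text the single form finds one position while the wildcard finds three, and the pair dopo ... chiuso is missed when the distance is left at 1 but found at 2, which mirrors the two converse problems described for Step 2.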
CONCLUSION

We recognize that during corpus investigations by language learners, there is considerable room for error due to lack of knowledge of the target language. However, we propose that the development of appropriate research habits -- incorporating observation and logical reasoning as well as techniques in corpus searching -- could reduce other causes of error to a minimum. Although we do not go so far as to suggest that learners need formal training in logic in preparation for corpus work, our evaluation of the ways students go about problem-solving with CWIC has convinced us of the importance of an awareness of logical principles applicable to this kind of operation. The plan outlined above to revise our approach to training reflects this conviction. We expect that an apprenticeship oriented toward the development of "corpus research" skills will not only help students make the most of corpora but will also benefit other areas of their language learning as well, enhancing their capabilities with other reference tools in particular. Our next step will be to examine the effectiveness or lack thereof of the new approach, especially in work with CWIC on the Web.

NOTES

1. The Corpus of Italian Newspapers, available from the Oxford Text Archive at http://ota.ahds.ac.uk, contains 1,200,000 words from four dailies.
2. A more detailed description of the corpus and compilation process is in a paper submitted for the proceedings of the conference Teaching and Language Corpora 2000.
3. Some of the themes of magazine columns selected so far are health, education, personal problems, young people's issues, pet care, home computing, current events, social issues, science, and spiritual and theological questions.
We have explored email lists belonging to groups of women, gays and lesbians, animal liberationists, translators and interpreters, vegetarians, mountain climbers, Italians overseas and fans of Totò, and on issues to do with politics, entertainment, current events, and personal problems.
4. The composition of the corpus is roughly 50% email, 5% letters, 40% magazine material (including letters from the public), and 5% film reviews. Non-professional writers account for over 75% of the content. The number of texts by a single author ranges from 1-10 for most of these to 30-40 for magazine column hosts.

ACKNOWLEDGMENTS

We thank Dr. Mike Levy and the anonymous reviewers for valuable feedback on an earlier version of this paper. CWIC was developed with the assistance of grants from Griffith University and the Australian Government's Committee for University Teaching and Staff Development.

ABOUT THE AUTHORS

Claire Kennedy and Tiziana Miceli are lecturers in Italian at Griffith University in Brisbane.
Email: C.Kennedy@mailbox.gu.edu.au, T.Miceli@mailbox.gu.edu.au

REFERENCES

Aston, G. (1997). Enriching the learning environment: Corpora in ELT. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 51-64). London: Longman.
Aston, G., Gavioli, L., & Zanettin, F. (Eds.). (1998). Proceedings of corpus use and learning to translate conference, University of Bologna, Bertinoro. Retrieved November 8, 2000, from the World Wide Web: http://www.sslmit.unibo.it/cultpaps.
Bernardini, S. (1998). Systematising serendipity: Proposals for large-corpora concordancing with language learners. Proceedings of TALC98 (pp. 12-16). Oxford, UK: Seacourt Press.
Flowerdew, J. (1996). Concordancing in language learning. In M. Pennington (Ed.), The power of CALL (pp. 97-113). Houston, TX: Athelstan.
Gavioli, L. (1996).
Corpus di testi e concordanze: Un nuovo strumento nella didattica delle lingue straniere [Text corpora and concordances: A new tool for foreign language teaching]. Rassegna Italiana di Linguistica Applicata, 2, 121-146.
Johns, T. (1988). Whence and whither classroom concordancing. In T. Bongaerts, P. De Haan, S. Lobbe, & H. Wekker (Eds.), Computer applications in language learning (pp. 9-27). Dordrecht, The Netherlands: Foris.
Johns, T. (1991a). Should you be persuaded: Two samples of data-driven learning materials. English Language Research Journal, 4, 1-16.
Johns, T. (1991b). From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. English Language Research Journal, 4, 27-45.
Levy, M. (1992). Integrating computer-assisted language learning into a writing course. CAELL Journal, 3(1), 17-27.
Mparutsa, C., Love, A., & Morrison, A. (1991). Bringing concord to the ESP classroom. English Language Research Journal, 4, 115-133.
Nattinger, J. (1980). A lexical phrase grammar for ESL. TESOL Quarterly, 14(3), 337-344.
Picchi, E. (1997). DBT3 Database Testuale. Consiglio Nazionale delle Ricerche, Italy. Distributed by Lexis Progetti Editoriali s.r.l. See http://www.lexis.it.
Stevens, V. (1991). Classroom concordancing: Vocabulary materials derived from relevant, authentic text. English for Special Purposes Journal, 10, 35-46.
Tribble, C. (1991). Concordancing and an EAP writing program. CAELL Journal, 1(2), 10-15.
Tribble, C. (1997). Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for language teaching. Paper presented at the first international conference "Practical Applications in Language Corpora," University of Lodz, Poland. Retrieved November 8, 2000, from the World Wide Web: http://ourworld.compuserve.com/homepages/Christopher_Tribble/Palc.htm#Top.
Tribble, C., & Jones, G. (1997). Concordances in the classroom: Using corpora in language education. Houston, TX: Athelstan.
Turnbull, J., & Burston, J. (1998). Towards independent concordance work for students: Lessons from a case study. ON-CALL, 12(2), 10-21.
Wylie, E., & Ingram, D. (1999). International second language proficiency ratings: Master general proficiency version (English examples). Brisbane, Australia: Centre for Applied Linguistics and Languages, Griffith University.

Language Learning & Technology
http://llt.msu.edu/vol5num3/thompson/
September 2001, Vol. 5, Num. 3
pp. 91-105

LOOKING AT CITATIONS: USING CORPORA IN ENGLISH FOR ACADEMIC PURPOSES

Paul Thompson
Reading University

Chris Tribble
King's College London University & Reading University

ABSTRACT

Appropriate reference to other texts is an essential feature of most academic writing, and we should expect courses in academic writing to sensitize students to the choices that are available to them when they decide to refer to other texts. A brief review of popular EAP writing textbooks finds, however, that attention is given mainly to surface features of citation, focusing on quotation, summary, and paraphrase. Analysis of a purpose-built corpus of academic text can reveal much about what writers actually do, and can also generate rich speculation on why writers do what they do. Extending Swales' (1990) division of citation forms into integral or non-integral, we present a classification scheme and the results of applying this scheme to the coding of academic texts in a corpus. The texts are doctoral theses, written in two departments: Agricultural Botany and Agricultural Economics. The results lead into a comparison of the citation practices of writers in different disciplines and the different rhetorical practices of these disciplines. Comparison with Hyland (1999), which looks at citation types in research articles, also indicates differences between genres.
We then look at examples of EAP student writing and apply the same analysis to these texts. The results show that the novice writers use a limited range of citation types, and we suggest that teaching should focus on extending the range of choices available to students. Lastly, we introduce a number of class activities in which students conduct their own analyses of citation practices in small corpora, to develop genre awareness, and we evaluate these activities.

INTRODUCTION

The growing interest in the application of corpus tools in language education and the spread of "data-driven learning" (Tim Johns' coinage) are evidenced by the papers in this edition of Language Learning & Technology, and in recent publications (e.g., Burnard & McEnery, 2000). We will not, therefore, review the development of classroom concordancing in this short article or argue for its relevance to English language teaching, but will look immediately at an area of possible application for specific corpora.1 In this article, we report work that has used corpora to research a particular aspect of academic writing (citation practices, across the disciplines), how current ELT materials address the language features that were the focus of this research, and how corpus tools can be used to supplement published materials to give learners in EAP writing classes opportunities to extend their understanding of this central aspect of academic discourse.

Copyright © 2001, ISSN 1094-3501

CORPUS-BASED RESEARCH INTO CITATION PRACTICES

Making references to the literature is an essential part of most academic writing, and it is also a source of considerable difficulty for most novice writers (Borg, 2000; Campbell, 1990).
Some of the reasons that academic writers are expected to make references are to integrate the ideas of others into their arguments, to indicate what is known about the subject of study already, or to point out the weaknesses in others' arguments, aligning themselves with a particular camp/school/grouping. Novice writers may face problems because they are not at the appropriate stage of cognitive or intellectual development (Britton, Burgess, Martin, McLeod, & Rosen, 1975; Pennycook, 1996), or because of cultural factors (Connor, 1996; Fox, 1994). Failure to acknowledge the source of ideas can lead to charges of plagiarism, whereas inexpert phrasing of reporting statements can lead to confused or misleading indication of both the writer's, and the cited author's, stance (Groom, 2000). Swales (1981, 1986, 1990) has pioneered the study of citation analysis from an applied linguistic perspective. He created clear formal distinctions between non-integral and integral citation forms: The former are citations that are outside the sentence, usually placed within brackets, and which play no explicit grammatical role in the sentence, while the latter are those that play an explicit grammatical role within a sentence. The citation at the beginning of this paragraph is an integral citation. He also used the terms "short" and "extensive," to describe citations that are at a single sentence level and those that encompass more than one sentence. These distinctions provide useful starting points but they do not provide insights that will help student writers understand which citation type to use in which context. Alongside Swales' work, there has been substantial research into the correlation of verb tense and voice in reporting verbs with function (most notably Shaw, 1992, but also worthy of mention are Hanania & Akhtar, 1985, and Malcolm, 1987). 
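Swales' formal distinction lends itself to a rough automatic first pass, which may help when preparing corpus material for class use. The sketch below is a heuristic under stated assumptions, not the coding procedure used in any of the studies discussed here: the regular expressions and the function name are hypothetical, they assume author-date citations with a four-digit year, and they will miss forms such as "et al." or names mentioned without a year.

```python
import re

YEAR = r"\d{4}[a-z]?"

# Non-integral: name and year both sit inside one pair of parentheses,
# e.g. "(Gilbert, 1976)" or "(see DFID, 1998)".
NON_INTEGRAL = re.compile(rf"\([^()]*\b[A-Z][A-Za-z'-]+[^()]*?,?\s{YEAR}[^()]*\)")

# Integral: the name is part of the sentence and only the year is bracketed,
# e.g. "Davis and Olson (1985)".
INTEGRAL = re.compile(
    rf"\b[A-Z][A-Za-z'-]+(?:\s(?:and|&)\s[A-Z][A-Za-z'-]+)?\s[\(\[]{YEAR}[\)\]]"
)

def find_citations(text):
    """Return (kind, matched_text) pairs in order of appearance."""
    hits = [(m.start(), "non-integral", m.group()) for m in NON_INTEGRAL.finditer(text)]
    hits += [(m.start(), "integral", m.group()) for m in INTEGRAL.finditer(text)]
    return [(kind, span) for _, kind, span in sorted(hits)]

examples = [
    "Citation is central because it can provide justification for arguments (Gilbert, 1976)",
    "Davis and Olson (1985) define a management information system more precisely as...",
    "DFID has changed its policy recently with regard to ELT (see DFID, 1998).",
]
for sentence in examples:
    print(find_citations(sentence))
```

Any output from a pattern-based pass like this would still need checking by hand, since the distinction rests on whether the citation plays a grammatical role in the sentence, which brackets alone only approximate.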
Analysis of academic text corpora has the potential to inform our knowledge about the different forms and functions of citations in academic writing. Pickard (1995) used a small corpus of applied linguistics articles to investigate the citation practices of "expert" writers. On the premise that novice writers tend to overuse particular items in their references, such as "say," she investigated citation practices in the corpus to find out what expert writers do. Using concordancing software, she was able to produce statistical information to identify preferences among her writers for integral or for non-integral citation forms, and to identify the different grammatical forms of integral citations (subject, agent, genitive noun phrase, etc.). This was a useful preliminary study. The limitations were that the corpus was small, and there was little discussion of the reasons why writers choose one form rather than any other; the categories are based on syntactic rather than functional distinctions. More importantly, however, it is not clear whether her discoveries about the practices of a small number of applied linguistics writers can be generalized to "expert" writers across all the disciplines. It seems likely that writers in different disciplines follow different rhetorical conventions and have different preferences.

Two recent studies of citation practices in academic texts that test this assumption are Hyland (1999) and Thompson (2000). These two studies were based on the analysis of more substantially sized corpora, each investigating a different genre of academic writing. Hyland looked at citations in a corpus of 80 research articles, made up of 10 journal articles from each of eight disciplines (see Table 1 below for details), while Thompson (2000) examined differences in citation practices in a corpus of doctoral theses.
The latter corpus contains 16 theses written in two departments at the University of Reading, 8 theses from the Department of Agricultural Botany, and 8 from the Department of Agricultural and Food Economics.

Table 1. Number of Citations in Hyland (1999) and Thompson (2000) Corpora

Discipline                     Av. per text    Per 1,000 words
Research articles (av. per paper)
  Mechanical Engineering           27.5             7.3
  Physics                          24.8             7.4
  Electronic Engineering           42.8             8.4
  Marketing                        94.9            10.1
  Philosophy                       85.2            10.8
  Applied Linguistics              75.3            10.8
  Sociology                       104.0            12.5
  Biology                          82.7            15.5
Doctoral theses (av. per thesis)
  Agricultural Botany             248.8             9.04
  Agricultural Economics          333.5             5.25

Both studies investigated variation in practice in disciplinary discourses and made use of frequency and concordance data to investigate dispersion, frequency, and patterning across large quantities of text. Table 1 shows the figures for instances of citation in the two corpora, with the middle column showing the average number of citations per text, and the right column showing the average number of citations per 1,000 words of running text. The lower density of citations amongst the science and technology articles (7.3-8.4) contrasted with higher incidence among the social science articles (10.1-12.5) while Biology stood out as exceptional with 15.5. Hyland postulated a difference in practice here between "hard" and "soft" disciplines, using terminology drawn from Becher (1989), and speculated that Biology stood out from the other sciences because it is a relatively new discipline. The distinction between "hard" and "soft" disciplines may, however, prove to be reductive; Becher himself prefers a multi-dimensional model with added axes of "applied" and "pure," "rural" and "urban." The fact that the Biology texts are so markedly different from the Physics and Engineering texts is evidence that the simple distinction between "hard" and "soft" is inadequate.
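The two numeric columns of Table 1 can be related to each other with simple arithmetic, which is useful when comparing genres of very different length. The sketch below uses only figures copied from Table 1; the function names are hypothetical, and the back-calculated lengths are rough indications only, since an average of per-text densities need not equal the ratio of the two averages.

```python
# Figures copied from Table 1: (average citations per text, citations per 1,000 words).
TABLE_1 = {
    "Biology (articles)": (82.7, 15.5),
    "Agricultural Botany (theses)": (248.8, 9.04),
    "Agricultural Economics (theses)": (333.5, 5.25),
}

def implied_length(per_text, per_1000_words):
    """Back-calculate a rough text length in words from the two Table 1 columns."""
    return per_text / per_1000_words * 1000

def relative_density(density, baseline):
    """Express one citation density as a fraction of another."""
    return density / baseline

for name, (per_text, per_k) in TABLE_1.items():
    print(f"{name}: about {implied_length(per_text, per_k):,.0f} words per text")

ratio = relative_density(9.04, 15.5)
print(f"Agricultural Botany density is {ratio:.0%} of the Biology figure")
```

On these figures the Agricultural Botany density comes out at a little under three fifths of the Biology density, and the implied thesis lengths are several times the implied article lengths.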
As can be seen in Table 1, the density of citations in the doctoral theses is much lower. If we presume that the Agricultural Botany theses should be roughly comparable to the Biology articles, the density in the theses is only about three fifths of that in the articles (9.04 against 15.5 citations per 1,000 words). And while there is no easy comparison between the Agricultural Economics and any of the disciplines in Hyland's study, the figure of 5.25 is substantially lower than any of the figures for the research articles. It can be seen, therefore, that the two genres are marked by different degrees of use of citations. One explanation for this is that the types of texts produced in these two genres are of different lengths: Articles usually average between 2,000 and 5,000 words, while in Thompson's study, the average length of an Agricultural Botany thesis was 31,000 words, and the average length of an Agricultural Economics thesis was 63,000 words. As articles are shorter texts, there is presumably a need for a more condensed style of writing.

Table 2 shows the relative percentages of the two types of citation, integral and non-integral, in Hyland (1999) and Thompson (2000). These figures show firstly that there is considerable variation in citation practice between the different disciplines, with Philosophy being the only discipline that prefers the integral form over the non-integral, greater emphasis being placed on the arguments of different individuals. Secondly, it is interesting to note that in the case of the Agricultural Economics thesis writers, the integral type was also preferred. Although no direct comparison can be made between Agricultural Economics and the disciplines in Hyland's study, one would not expect Agricultural Economics to be closest to Philosophy.
A more plausible explanation is that citation practices in the two genres are different: Thesis writers in Agricultural Economics make greater use of integral citations for reasons that become clear from closer reading of the texts. One obvious point is that the length of texts in the two genres is markedly different: the articles in Hyland's corpus range from 3 to 31 pages in length, whereas the Agricultural Economics theses in Thompson's corpus are around 200 pages long. In long texts, such as the Agricultural Economics theses, or in book length treatments of research, there is a higher likelihood that references to leading researchers in the field will be elaborated and give greater prominence to the author(s).2

Table 2. Ratios of Non-Integral to Integral Citations by Discipline in Hyland (1999) and Thompson (2000)

Discipline                     Non-integral    Integral
Research articles
  Biology                          90.2           9.8
  Electronic Engineering           84.3          15.7
  Physics                          83.1          16.9
  Mechanical Engineering           71.3          28.7
  Marketing                        70.3          29.7
  Applied Linguistics              65.6          34.4
  Sociology                        64.6          35.4
  Philosophy                       35.4          64.6
Doctoral theses
  Agricultural Botany              66.5          33.5
  Agricultural Economics           38.1          61.9

Table 3 below shows the percentage of citations in the two corpora that incorporate direct quotation from the source text. It is clear from these figures that quotation is a relatively common feature in the social science and humanities texts but that it is scarcely used in the science texts. Where quotation is used in the science texts (viz. the 0.8% figure in the Agricultural Botany column), the citation is a definition, while many of the Agricultural Economics quotations are evaluative comments.

Table 3. Sample Percentages of Citations in Two Corpora That Include Direct Quotation

Discipline                     % with direct quotation
Articles (Hyland, 1999)
  Biology                           0
  Electronic Engineering            0
  Sociology                        13
  Applied Linguistics              10
Doctoral theses (Thompson, 2000)
  Agricultural Botany               0.8
  Agricultural Economics            8

The statistics reported here suggest that there are clear divergences in the citation practices of writers in different disciplines, and also between genres of academic writing. The level of analysis at this stage, however, restricts the kinds of questions that one can ask, and it is necessary to develop a more sensitive set of categorisations. In the next section, we describe a set of categories drawn from Thompson (2000), and we then proceed to pose further questions.

NON-INTEGRAL CITATION

Source

Non-integral citations perform a range of functions. The first function is to attribute a proposition to another author. The proposition might be a statement of what is known to be true, such as in the factive report of findings in other research, or the attribution of an idea to another, as in this example:

    Citation is central ... because it can provide justification for arguments (Gilbert, 1976)

The citation provides evidence for a proposition which can remain unchallenged if the writer is in agreement with it, or can be countered by the ensuing argument. Let us call this type of citation source because it indicates where the idea comes from.

Identification

The second type of non-integral citation identifies an agent within the sentence it refers to. An example of this is

    A simulation model has therefore been developed to incorporate all the important features in the population dynamics (Potts, 1980)3

where the information within the parentheses identifies the author of the study referred to. Instead of including the name of the author within the sentence ("Potts [1980] has developed..."
or "A simulation model has been developed by Potts [1980]..."), the writer has chosen to focus attention on the information (Weissberg & Buker, 1990, differentiate between author- and information-prominent citations).

Reference

This type of citation is usually signalled by the inclusion of the directive "see" as in

    DFID has changed its policy recently with regard to ELT (see DFID, 1998).

This type of citation is often similar to a source citation in that it can provide support for the proposition made, but it also functions as a shorthand device: Rather than provide the information in the present text, the writer refers the reader to another text. This type is particularly common in reference to procedures or to detailed proofs of arguments which are considered too lengthy to be repeated.

Origin

An example of this type is

    The software package used was Wordsmith Tools (Scott, 1996).

Where Source citations attribute a proposition to a source, Origin citations indicate the originator of a concept or a product - in this case the creator of the Wordsmith Tools programme.4

INTEGRAL CITATIONS

A clear distinction can be made between integral citations which control a lexical verb (Verb controlling) and those that do not (Naming). A third type is the reference to a person that is not a full citation -- this has been called a Non-citation form.

Verb Controlling

The citation acts as the agent that controls a verb, in active or passive voice, as in

    Davis and Olson (1985) define a management information system more precisely as...

Naming

In Naming citations, the citation is a noun phrase or a part of a noun phrase.
The distinction here is primarily grammatical but the form also implies a reification, such as when the noun phrase signifies a text, rather than a human agent:

    Typical price elasticities of demand for poultry products in Canada, Germany and the UK are shown in Harling and Thompson (1983)

Another example of reification is when the naming citation identifies a particular equation, method, formulation or similar construct with individual researchers, as in

    In this paper, the management information system (MIS) definition of Davis and Olson (1985) has been used.

An alternative type of naming citation is that which refers generally to the work or findings of particular researchers:

    Work by Samuel and East (1990) demonstrated that variety and seed rate had considerable effects on yield and quality aspects

In this case, the naming citation is similar to a verb-controlling citation in that it reports work done by particular researchers.

Non-citation

There is a reference to another writer but the name is given without a year reference. It is most commonly used when the reference has been supplied earlier in the text and the writer does not want to repeat it. For example

    The "classical" form of the disease, described by Marek, causes significant mortality losses.

However, instances where a person was invoked through reference to the thinking associated with them in general, rather than with reference to a specific work or set of works (for example, "Marxist" or "Darwinian") are not included.

FURTHER EXPLORATION

Employing these categories, it is possible to explore a number of questions about the theses examined in Thompson (2000):

Q1. Are there differences in the types of non-integral, or integral, citations used by writers in different disciplines?

Figure 1.
Proportion of citation types used in the two disciplines

As shown in Figure 1, writers in Agricultural Botany use the non-integral Source and Identification types much more frequently, while the Agricultural Economists make far greater use of integral Naming citations (reasons for which become apparent in Q4 below) and also make more mentions of names without giving full citation information.

Q2. Are there differences in the practices of writers within the same discipline?

Figure 2. The average number of different citation types per 1,000 words of text found in the eight Agricultural Botany theses

As can be seen in Figure 2, the density of citations in the individual Agricultural Botany theses varies from just under 5 per 1,000 words (TAB5) to around 13 (TAB2 and TAB6). TAB7 uses Verb-controlling citation types far more than any of the other writers, and far fewer non-integral citation types. Examination of this thesis reveals that the writer makes frequent reference to individual studies and compares their findings to his own experiments (X found this, and Y reported this. My findings were ...). TAB6, by contrast, uses predominantly non-integral citation forms, and prefers to make information prominent through use of the Identification citation rather than the integral Verb controlling type. TAB6 is a report of a laboratory-based investigation of innovative techniques for isolation of vacuoles, and therefore the emphasis is on the techniques, and the subject of study, that is, the vacuoles. Different writers within one discipline, then, take different approaches to research, and their rhetorical choices are, to a degree, determined by the nature of the research that they conduct.

Q3. Are different types of citation used in different rhetorical sections?
In the Agricultural Botany theses, it was possible to divide the texts into four types of rhetorical section, following the conventions that are common in most scientific reports: Introduction, Methods, Results, Discussion. As can be seen in Table 4, there is considerable variation in the different sections of the theses, with relatively low use of citations in the Methods and Results sections of the thesis, and a markedly different set of citation types in the case of the Methods sections. To understand these variations, it is helpful to think of the hourglass model proposed by Hill, Soppelsa, and West (1982): the Introduction and Discussion sections of an article take a broad view, relating what is known in the field at large, while the Methods and Results sections are narrow, focussing on the research itself. While the Introduction and Discussion sections contain many references to other studies to establish the current state of knowledge and where the current report fits in, the Methods section contains mainly references to the methods and techniques of others.

Table 4. Citation Types in Different Rhetorical Sections of AB Theses

Section         Density (per 1,000 words)    Most common types of citation
Introduction    15.6                         Source, Identification, Verb controlling
Methods         2.3                          Refer, Origin, Naming
Results         2.4                          Source (52%)
Discussion      10.1                         Source, Identification, Verb controlling

This data shows that there is, then, variation in the density and type of citations used in different rhetorical sections of a thesis, and similar variation has been found across rhetorical sections in Physics, Chemistry and Biology masters' theses (Hanania & Akhtar, 1985).

Table 5. The Number of Occurrences of Naming Citations in the Two Disciplinary Groupings

RI Naming            AB     AE
Total occurrences    116    484

Q4. Are there differences in patterns of language around particular citation types?
Close inspection of the different kinds of Naming citation in the theses revealed interesting differences in the discourses of the two disciplines. Firstly, in terms of simple frequency, it can be seen from Table 5 that this citation type is much more commonly used (by more than four times) in the Agricultural Economics texts. In order to find out why this might be the case, concordance lines of the Naming citation type were examined. It was observed that certain patterns were regularly used, such as the three shown in Table 6.

Table 6. The Number of Occurrences of a Pattern in the Thesis Corpus of Preposition + Naming Citation

Pattern           Agricultural Botany    Agricultural Economics
...in X (1991)    3 (12)*                58
...of X (1991)    37 (154)               70
...by X (1991)    25 (104)               29

* In the middle column, the figure in brackets is an adjusted figure, scaled by the ratio of total Naming citations in the two groupings (n*484/116), which makes the amount comparable with the figure in the right column.

The pattern "in X (1991)" is clearly much more commonly used in the Agricultural Economics theses. The use of the preposition "in" indicates that the citation is a reference to a book, and this is supported by the examples given in Table 7. In the Agricultural Botany theses on the other hand, "of" and "by" are more commonly used and these tend to refer to the research actions, findings, methods, and techniques of other researchers. Where Agricultural Economics thesis writers use "of," it is noticeable that this also includes discourse nouns, such as views and suggestions, which are not found in the Agricultural Botany texts. The Agricultural Economics writers, therefore, appear to be concerned with the texts and concepts of others, while the Agricultural Botany writers make reference to the research activities and techniques of other scientists.

Table 7.
Frequent Patterns Involving "in," "by," and "of" in the Theses

REASONS FOR VARIATION

We have seen from the quantitative data that there are substantial differences in citation practices between disciplines and between genres. The types of research work undertaken, the epistemological bases upon which this research is founded, the conventions of the discipline, and the purposes for which texts are created all influence the forms of citation made.

Looking at citation from a micro-perspective, however, one might naturally ask, "What is it that leads a writer to choose one citation form over another?" Why, for example, did we choose, earlier in this paper, to write "Two recent studies of citation practices in academic texts that test this assumption are Hyland (1999) and Thompson (2000)," rather than "Hyland (1999) and Thompson (2000) are two recent studies of citation practices in academic texts that test this assumption"? Our reason in this case was that we wanted to place the noun phrase beginning "two recent studies" in theme position within the sentence. Shaw (1992) has observed that this is commonly the factor that determines voice (active/passive) in reporting verbs in sentence construction.

The choice between a non-integral Identification type ("A simulation model has therefore been developed to incorporate all the important features in the population dynamics [Potts 1980]") and a Verb-controlling type ("A simulation model has been developed by Potts [1980] ...") is often governed by decisions as to how much prominence to give to the people involved (cf. Weissberg & Buker, 1990). To a certain extent, disciplinary convention plays a part here; it is conventional in scientific writing to de-emphasize the role of the researchers, particularly in controlled experiments, where the claim is that the human factor is not consequential (Dr. Philip John, School of Plant Sciences, University of Reading, personal communication).
WHAT DO EAP TEXTBOOKS SAY ABOUT CITATIONS?

In the previous sections we have outlined a number of research findings regarding both the kinds of citations that are used in "expert performances" (Bazerman, 1994, p. 131) and the reasons for their use. Our next task is to review what kinds of advice or models are provided in published materials for EAP students, and to assess the extent to which these might need complementing. Three widely used EAP course books were selected for this purpose: Jordan (1992), Trzeciak and Mackay (1994), and Swales and Feak (1994).

In summary, the course books provided surprisingly little advice or guidance to learners. Jordan (1992) offers little explicit advice and depends mainly on quotations to provide models for learners to work from. However, as Jordan exemplifies only three kinds of citation, it cannot be considered a sufficient treatment of the subject:

1. non-integral
   "...(Seers, 1979, pp. 27-28) a further dimension is added - 'development now implies, inter alia, that...'"
2. integral - naming
   "...For Seers, 'Development is inevitably a normative concept' ... (Seers, 1972, p 22)"
3. integral - verb controlling
   "... Hicks and Streeten (1979, p 568) identify and review four different approaches..."

Similarly, Trzeciak and Mackay (1994) comment on only three kinds of citation:

1. integral - verb controlling ...Reporting using paraphrase
2. non-integral - identification ...Reference to source
3. integral - other ...Direct quotation...

But again, they offer little in the way of clear guidance to the apprentice writer and do not draw their attention to disciplinary differences.

Swales and Feak (1994) give a relatively fuller range of advice and examples, and discuss the contrast between non-integral/integral and footnote styles.
However, they make no comment on the implications of using contrasting forms, and fall back on references to APA and MLA style guides. They do, however, usefully comment on the role of citation in abstracts.

In conclusion, it is possible to say that little explicit advice is given in major teaching materials on how to manage citations in specific disciplines. Instead, there is an emphasis on summary, paraphrase, and quotation, and on a small set of the mechanical features associated with citation. How then can students learn more about citation practices in their own subject area?

USING MICRO-CORPORA TO COMPLEMENT EAP WRITING PROGRAMMES

Arguments have been made for the development of micro-corpora as resources for use in EAP programmes (Hyland, 2000; Tribble, 2001), and a corpus-informed approach appears to have much to recommend it so long as relevant data are available. The need for such support is reinforced when student use of citations is investigated. In the preparation of this paper we reviewed a small collection of student assignments written at Reading University,5 and identified the following problems:

• Lack of variety of citation types within single texts (e.g., the repeated use of "According to...")
• Lack of linguistic variety + inappropriate selection of verb (e.g., inappropriate use of "claims")
• Absence of certain categories (e.g., Non-integral reference)
• Over-use of non-citational references to authors / authorities

These findings (supported by extensive experience of teaching EAP students) indicate that two kinds of resource will be of benefit to learners: firstly, a collection of their own writing, or the writing of their peers -- a "learner corpus" (Granger, 1998), and, secondly, a collection of examples of writing from the target discourse community (e.g., research articles/dissertations, etc., from the students' own field of study), or texts as
closely analogous to this kind of writing as possible (e.g., student examination scripts, which are notoriously difficult to get hold of). The collection of such data banks used to be difficult and time consuming. With the use of word-processors by students, the growing availability of electronic texts from the WWW, and low cost scanners and accurate optical character recognition (OCR) programs,6 these restrictions no longer really apply.

With appropriate text resources to hand, it is relatively easy for teachers and students to begin a systematic investigation of citation practice in genres that are relevant to their own needs or interests. This need not require the use of a concordancing program; setting the search function in a word-processor such as Microsoft Word® to look for "(19" or "(20" with the "Find whole words only" un-checked will provide rapid access to the dated citations in a text, as will a search for a list of names based on the bibliography in an article. Obviously, more powerful searching and analysis of the results will be possible with a dedicated concordancing program.7 An appropriate procedure will be

Stage 1    learners are introduced to a range of citation forms appropriate to their level of study
Stage 2    learners investigate actual practice in relevant texts, reporting back on the form and purpose of citations they identify
Stage 3    learners investigate the practices of their peers in writing assignments
Stage 4    learners review their own writing and revise in the light of these investigations.
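For readers who prefer a script to a word-processor, the "(19" / "(20" search described above can be approximated in a few lines of Python. This is only a minimal sketch: the function name, the context width, and the sample text (two example sentences taken from earlier in this article) are our own illustrative choices, not part of any concordancing package.

```python
import re

# Literal "(19" or "(20", mirroring the word-processor search described above.
DATED = re.compile(r"\((?:19|20)")

def find_dated_citations(text, width=30):
    """Return (left context, match plus right context) pairs, KWIC style."""
    hits = []
    for m in DATED.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.start():m.end() + width]
        hits.append((left, right))
    return hits

# Hypothetical sample text built from two example sentences in this article.
sample = (
    "Davis and Olson (1985) define a management information system. "
    "The software package used was Wordsmith Tools (Scott, 1996)."
)

for left, right in find_dated_citations(sample):
    print(f"{left:>30} | {right}")
```

Like the word-processor search, this matches only years that directly follow an opening bracket, so the integral citation "Davis and Olson (1985)" is found while the non-integral "(Scott, 1996)" is not; the search for a list of names drawn from the bibliography would be needed to catch forms of that kind.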
As an example of how such a procedure can be used in an EAP programme we have drawn on the British National Corpus8 -- making use of Dave Lee's BNC Index (see Lee article in this issue) to make a microcorpus of 22 extracts from one academic journal, Language and Literature.9 The assumption in this case has been that the texts will be of interest to post-graduate humanities EAP students who are (a) interested in extending the ways in which they word citations, and who (b) wish to ensure that they are writing in a way that is appropriate for their field of study.

It is possible for a teacher to use the four stage procedure outlined above to develop a set of learning materials which will achieve this end. In Stage 1, students will be familiarised with the citation categories we have discussed in this article. In Stage 2, they will work in different groups to complete a task such as the one given below. The worksheet was prepared using Wordsmith Tools (Scott, 1996) to find the citations in each article, so that students can be asked to compare citational practice across comparable texts in a narrow focus disciplinary context. In this instance we used a simple "catch-all" search string 19??)/??), that is, search for any five character string beginning with 19 -- remember this is a pre-21st century corpus -- and ending with a closing bracket, and any three character string ending with a closing bracket to catch other forms. Using this method, we located 112 citations in the 22 texts.

Table 8. Citation Worksheet Example

1. iety of possible surface realisations of that type of isomorphic relation we know as textual metaphor. Christine Brooke-Rose (1958), in her A grammar of metaphor, provides a classification of forms of metaphor. However, her categorisation is unprinci
   Text: j7f. Cit. Cat: Integral Verb Controlling

2. information on the author from an asserted privileged position (much of the methodology and work of F.R. Leavis (e.g. 1936, 1967) and his followers is characterised by this approach). </p> <p> The two (realistic) perspectives remaining are positi
   Text: j7f. Cit. Cat: Non-Integral Refer

3. hur C. Clarke's 2001: a space odyssey, a novelisation of the film screenplay written by Stanley Kubrick and Arthur C. Clarke (1968). </p> <p> This passage cannot be characterised as prototypical SF. It does not deal with aliens, space, technology,
   Text: j7f. Cit. Cat: Integral Naming

4. ism and claims to objectivity that have been increasingly questioned in the past twenty or so years (by writers from Derrida (1975) to Lakoff (1987)). </p> <p> The middle way seeks to formalise, or at least make explicit, normative patterns in the
   Text: j7f. Cit. Cat: Integral Naming

5. of the overall communicative process involving an utterer and a receiver, very much in the implied spirit of Grice's (1971, 1975) projection of that communicative situation. However, the ordered pair of functions (f1 and f2) that are associated with
   Text: j7f. Cit. Cat: Integral Naming

6. intuitions. The value of the model is not only in recasting the traditional notion of the Co-operative Principle (from Grice 1975), but also in describing the resolution of meaning as a principled negotiation between text and reader. The resolving st
   Text: j7f. Cit. Cat: Non-Integral Refer

7. though a holistic perspective is taken on literary stylistics in addressing science fiction. This approach follows van Dijk (1977) in regarding not only sentences but also textuality as the proper study of linguistics. In this, continental European
   Text: j7f. Cit. Cat: Integral Naming

8. ext-world to their cognitive universe (based on their previous familiarity with the patterns typical of the genre). Eikmeyer (1989), in a paper from a conference on coherence, points out that reader interpretation depends on the depth of understandin
   Text: j7f. Cit. Cat: Integral Verb Controlling

9. cott, who filmed his reading of the novel as Blade runner in the early 1980s. Though David Newnham, in the Guardian (24 July 1990), calls the film post-modernism: the movie, Scott's version is little more than a violent adventure story. Indiana Jone
   Text: j7f. Cit. Cat: Integral Verb Controlling

10. for example, involves judgements based on textual factors such as the narrative point of view (Fowler 1986: 127-46; Simpson 1990), the presentation of verbalisation (Leech and Short 1981: 318-51), the degree of non-actualised propositions (Leech 198
    Text: j7f. Cit. Cat: Non-Integral Source

11. as often in literary discourse, author and reader are considerably separated by space and time. </p> <p> Eikmeyer (1989: 27) concludes his paper by introducing a subjectivity condition by which the participants judge the values of each parameter.
    Text: j7f. Cit. Cat: Integral Verb Controlling

12. the point of view of the reader's judgement of parameters is taken. Deriving from (and slightly correcting) Eikmeyer (1989: 27), the prototypically co-operative parameters for the reader are: where J is the judgement or subjectivity condition. This
    Text: j7f. Cit. Cat: Integral Naming

The task for each group in Stage 2 is, therefore, to categorise the citations identified in each article, and then to pool results and present a summary of the range, purpose and forms of the citations that occur in this micro-corpus (the categorisations have been provided in this worked-up example). In Stage 3 students will review their own citational practices (either using Wordsmith Tools to extract examples for analysis, or being provided with materials prepared by their teacher). Stage 4 will be ongoing and will involve a cycle of check-list supported peer review and self evaluation, supported by tutor comment on writing assignments or departmental work.

CONCLUSION

In this paper we have described a range of citational practices in academic writing along with their linguistic realisations.
We have also reviewed the extent to which published teaching materials provide learners with opportunities to develop their understanding of, and capacity to form, appropriate citations in their own writing, and found that, at the moment, these offer relatively little constructive support to apprentice writers. The need for such support has been underscored by a survey of a small number of EAP texts written by students on a pre-sessional course at a UK university. If teachers of English for Academic Purposes are to be able to help learners develop a better control of this essential academic writing knowledge/skill, we would recommend the accumulation of relevant collections of field specific texts as a resource for teachers and students of academic writing. By analysing these texts with word processing software or dedicated corpus tools (or by working with the results of teacher led analysis), students will be able to develop a fuller understanding of the cultural and linguistic role of citation in their fields of study and be much better placed to write well formed and appropriate academic texts.

NOTES

1. Readers who wish to explore this area further may wish to start off with Aston (1996) or Tribble & Jones (1997).

2. We are grateful to one of our anonymous reviewers for the report that their undergraduate class had analysed the uses of citations in Hyland's (2000) book and found an above-average use of author-as-subject integral citations!

3. Potts is also the author of the article that this example is drawn from. In other words, this is a self-citation.

4. The categories presented here are a reduced set. The categories of Example (non-integral) and the three types of Verb-controlling (integral) citations, Research/Discourse/Other, in Thompson (2000) have been removed to make the explanation clearer.

5. Pre-sessional assignments written by postgraduate students on the following themes: EFL in Korea / Testing in ELT in Pakistan / Project implementation / Food industry / Agroforestry / International management

6. E.g., Caere corporation's Omnipage Pro® or ABBYY's Fine Reader®

7. E.g., WordSmith Tools (Scott, 1996) or MonoConc Pro (Barlow, 1999)

8. A text resource of 100 million words of late C20 British English that is now available internationally (contact http://info.ox.ac.uk/bnc/ for more information).

9. The BNC file identifiers for the texts selected are J7F / J7G / J7H / J7J / J7K / J7L / J7M / J7R / J7S / J7T / J7U / J7V / J7W / J7X / J7Y / J80 / J81 / J82 / J83 / J84 / J85 / J86 / J87 / J88 / J89.

ABOUT THE AUTHORS

Paul Thompson is a Research Fellow at the School of Linguistics and Applied Language Studies, Reading. He is currently studying towards a PhD, examining the language and organization of PhD theses in different disciplines.
e-mail: p.a.thompson@reading.ac.uk

Chris Tribble is the author of Writing in the OUP teacher education series and has a long-term interest in the use of computers in text analysis and language description, written communication, and evaluation in education. He lives in Sri Lanka and lectures on the MA at King's College, London University.
email: ctribble@sri.lanka.net

REFERENCES

Aston, G. (1996). Corpora in language pedagogy: matching theory and practice. In G. Cook & B. Seidlhofer (Eds.), Principle and practice in applied linguistics: Studies in honour of HG Widdowson (pp. 257-270). Oxford, UK: Oxford University Press.

Barlow, M. (1999). MonoConc Pro [Computer software]. Houston, TX: Athelstan.

Bazerman, C. (1994). Constructing experience. Carbondale: Southern Illinois University Press.

Becher, A. (1989). Academic tribes and territories. Milton Keynes, UK: Open University Press.

Borg, E. (2000). Citation practices in academic writing. In P. Thompson (Ed.), Patterns and perspectives: Insights for EAP writing practice (pp. 14-25). Reading, UK: CALS, The University of Reading.

Britton, J., Burgess, T., Martin, N., McLeod, A., & Rosen, H. (1975). The development of writing abilities, 11-18. London: Macmillan.

Burnard, L., & McEnery, T. (Eds.). (2000). Rethinking language pedagogy from a corpus perspective: Papers from the third international conference on teaching and language corpora (Lodz Studies in Language). Hamburg, Germany: Peter Lang.

Campbell, C. (1990). Writing with others' words: Using background reading text in academic compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom. Cambridge, UK: Cambridge University Press.

Connor, U. (1996). Contrastive rhetoric: Cross-cultural aspects of second language writing. Cambridge, UK: Cambridge University Press.

Fox, H. (1994). Listening to the world: Cultural issues in academic writing. Urbana, IL: National Council of Teachers of English.

Granger, S. (Ed.). (1998). Learner language on computer. Harlow, UK: Longman.

Groom, N. (2000). Attribution and averral revisited: Three perspectives on manifest intertextuality in academic writing. In P. Thompson (Ed.), Patterns and perspectives: Insights for EAP writing practice (pp. 15-26). Reading, UK: CALS, The University of Reading.

Hanania, E., & Akhtar, K. (1985). Verb form and rhetorical function in science writing: A study of MS theses in Biology, Chemistry and Physics. ESP Journal, 4(1), 45-58.

Hill, S., Soppelsa, B., & West, G. (1982). Teaching ESL students to read and write experimental research papers. TESOL Quarterly, 16, 333-347.

Hyland, K. (1999). Academic attribution: Citation and the construction of disciplinary knowledge. Applied Linguistics, 20(3), 341-367.

Hyland, K. (2000). Disciplinary discourses: Social interactions in academic writing. Harlow, UK: Longman.

Johns, A. (1997). Text, role and context. Cambridge, UK: Cambridge University Press.

Jordan, R. R. (1992). Academic writing course. London: Nelson.

Malcolm, L. (1987). What rules govern tense usage in scientific articles? English for Specific Purposes, 6, 31-44.

Pennycook, A. (1996). Borrowing others' words: Text, ownership, memory, and plagiarism. TESOL Quarterly, 30(2), 201-230.

Pickard, V. (1995). Citing previous writers: what can we say instead of "say"? Hongkong Papers in Linguistics and Language Teaching, 18, 89-102.

Scott, M. (1996). WordSmith Tools. Oxford, UK: Oxford University Press.

Shaw, P. (1992). Reasons for the correlation of voice, tense, and sentence function in reporting verbs. Applied Linguistics, 13(3), 302-319.

Swales, J. M. (1981). Aspects of article introductions. Birmingham, UK: Aston University Languages Study Unit.

Swales, J. M. (1986). Citation analysis and discourse analysis. Applied Linguistics, 7(1), 39-56.

Swales, J. M. (1990). Genre analysis: English in academic and research settings. Cambridge, UK: Cambridge University Press.

Swales, J., & Feak, C. (1994). Academic writing for graduate students. Ann Arbor, MI: University of Michigan Press.

Thompson, P. (2000). Citation practices in PhD theses. In L. Burnard & T. McEnery (Eds.), Rethinking language pedagogy from a corpus perspective. Frankfurt: Peter Lang.

Tribble, C., & Jones, G. (1997). Concordances in the classroom: A resource book for teachers. Houston, TX: Athelstan.

Tribble, C. (2001). Corpora and corpus analysis: New windows on academic writing. In J. Flowerdew (Ed.), Academic discourse. Harlow, UK: Addison Wesley Longman.

Trzeciak, J., & Mackay, S. (1994). Study skills for academic writing. Hemel Hempstead, UK: Phoenix ELT.

Weissberg, R., & Buker, S. (1990). Writing up research: Experimental research report writing for students of English. Englewood Cliffs, NJ: Prentice Hall Regents.
Language Learning & Technology
http://llt.msu.edu/vol5num3/curado/
September 2001, Vol. 5, Num. 3
pp. 106-129

LEXICAL BEHAVIOUR IN ACADEMIC AND TECHNICAL CORPORA: IMPLICATIONS FOR ESP DEVELOPMENT

Alejandro Curado Fuentes
University of Extremadura, Spain

ABSTRACT

Lexical approaches to Academic and Technical English have been well documented by scholars from as early as Cowie (1978). More recent work demonstrates how computer technology can assist in the effective analysis of corpus-based data (Cowie, 1998; Pedersen, 1995; Scott, 2000). For teaching purposes, this recent research has shown that the distinction between common coreness and diversity is a crucial issue. This paper outlines a way of dealing with vocabulary in English for Academic Purposes (EAP) instruction in the light of insights provided by empirical observation. Focusing mainly on collocation in the context of English for Specific Purposes (ESP), and, more precisely, within English for Information Science and Technology, we show how the results of the contrastive study of lexical items in small specific corpora can become the basis for teaching / learning ESP at the tertiary level. In the process of this study, an account is given of the functions of academic and technical lexis, aspects of keywords and word frequency are defined, and the value of corpus-derived collocation information is demonstrated for the specific textual environment.

INTRODUCTION

The areas of English for Specific Purposes (ESP) and corpus-based lexical studies seem to converge in the study of terminology (cf. Pedersen, 1995). The main aim in terminology studies is to create specialised dictionaries that reflect knowledge fields and concepts where these are related to the property of lexical use restriction.1 In the textual collections, collocations play an essential role in the description of this specific language usage (Pedersen, 1995, p. 61).
In this sense, word combinations work as building blocks that increase the learner's potential to command special languages. However, the results of technical collocation studies have little to offer students for academic performance and achievement: that is, they do not help learners meet the "stylistic expectations of the academic community" (Cowie, 1998, p. 12). This is because, in addition to the specialised terminology, there are other types of combinations that greatly influence the ESP learning context: for example, seek the objective, consider my suggestion, the theory is canvassed, argue rather less vehemently, and many other examples of academic discourse (Cowie, 1978, p. 132).

Our approach is based precisely on the distinction between technical and academic word behaviour. We are influenced by lexicography in which this double perspective is exploited (e.g., Lozano Palacios, 1999), where general academic vocabulary is distinguished from more specific word use. Lexical levels or categories are fostered and described through the application of corpus-based studies. The design of a suitable corpus is of prime importance so that lexical profiles can be developed effectively. This means that aspects such as size, type, balance, and integration of texts must be defined from the scope of ESP. In this line of work, small representative corpora are favoured for specific purposes (Tribble, 1997, p. 116).

In addition, an electronic concordancer such as WordSmith Tools (Scott, 1996) is particularly useful for handling reduced text collections (Tribble, 2000). This includes dealing with differences between one given genre and the reference corpus, or between one specific theme and the overall body of subject texts (Scott, 1997).
The results obtained are Keywords, which signal the "aboutness" of the texts (Scott, 2000), and thus receive primary observation in restricted language measurement. General word usage, in contrast, is derived from lexical surveys across subject boundaries. These are examined through critical concordance data, also known as KWIC -- Key Word In Context. With these notions in mind, particular subject areas are represented by specific corpora. The size and type of the sources can vary, depending on how similar or different the topics are. For instance, related disciplines within the broad domain of Health Sciences can be grouped together (e.g., Nursing, Occupational Therapy, Medicine), because they share knowledge fields. Yet, their organisation and distribution in a specific corpus may present thematic variables due to emphasis on a given branch alone (e.g., Sports Medicine). These selection principles are conceived according to interests and priorities in university programs and syllabi. In this respect, the domain selected in our research includes some current Information Science and Technology areas, such as Computer Science and Engineering, Optical and Radio Communications, Librarianship and Information Management, and Audio-visual Communications. These degrees are the main headings of our subject area sub-corpora; they are also majors that have been recently incorporated at our university (1995 - 2001).2 Due to the fact that changes take place very rapidly in these disciplines, the texts in the corpus should be regularly updated. A five-year time margin is recommended by some of our colleagues as a suitable renewal period. This suggests that we select, for instance, academic textbooks and research articles that have been published recently. In addition, information obtained from the Internet is favoured, since such feedback also tends to be up-to-date. 
This technical material is assessed conveniently, not only for university studies, but also for future careers where instructions are mostly read in English. As a result, the selection of the sources is made according to two chief principles: the importance of academic readings for tertiary level education and the consideration of technical material for both college and work situations. The principal objective of this paper is the classification of different lexical categories in English for Information Science and Technology. In this respect, the basis or point of departure is a lexical common core, described in contrast with the diversity of word use. Keywords and word frequency constitute the basic tools for working with this language variation. Collocation information is the main means for observing these linguistic traits in our context. The notion of collocation pervades this analysis of technical and academic constructions in ESP development. METHODOLOGICAL ISSUES From our viewpoint, the examination of lexical data in small corpora is related to the analysis of specific purpose languages. This relationship motivates our selection and arrangement of sources according to two main factors: ESP focal points (internal) and contextual conditions (external). Under the first parameter, texts are updated in terms of subject matter. Dudley-Evans and St. Johns (1998, p. 99) claim that this search for novelty is crucial in ESP; the aim is that language reflects current issues in Science and Technology, where the tendency is for "carrier content" to "date rapidly" (p. 174). A second internal factor is that material be authentic. This means that texts should be required or recommended in university courses (James, 1994), and that different genres should be included (Conrad, 1996). For instance, a primary or introductory stage involves textbook discourse -- aimed at fulfilling learning demands in first and second year university studies (Johns, 1997, p. 46). 
Then, technical writing in reports contains appropriate language for intermediate levels (Bergenholtz & Tarp, 1995, p. 19). Likewise, research papers and articles tend to meet the advanced needs of research students (Brennan & van Naerssen, 1989, p. 202).

The third internal point in our methodology considers textual availability. A priority is that texts are managed electronically. Documentation in electronic format is needed for concordance procedures. Therefore, with the increased production of texts in this manner, students can become genuine users of corpus resources for learning purposes (Johns, 1993). ESP instructors can then work as supervisors of learner-centred reading tasks. The design of the corpus can be carried out by both instructor and learner alike; the former directs operations according to language interests, and the latter contributes special interest topics in the subject area.

Criteria outside the ESP learning context are also applied. These external conditions influence corpus selection because study programs and syllabi must be accounted for as relevant to subject matter. In our institution and similar centres in Spain, they offer guidance for the arrangement of the sources. A contrastive examination of university curricula is encouraged to identify common subjects, taught in more than one of the four disciplines mentioned: Computer Science and Engineering, Optical and Radio Communications, Librarianship and Information Management, and Audio-visual Communication. Shared fields are labeled as subject categories in Table 1, according to the data derived from Information Science and Technology study programs.3

Table 1.
Subjects Shared by Disciplines

A    Computer Science/Engineering and Optical/Radio Communications
     A1   History of computers, Hardware, Software
     A2   Computer engineering and architecture, Data communications and Client-server architecture
B    Librarianship/Information Management, Computer Science/Engineering and Optical/Radio Communications
     B1   Information units management
     B2   Online database systems, Computer systems
     B3   Automated Knowledge-based systems
C    Librarianship/Information Management and Audio-Visual Communication
     C1   Content analysis
     C2   Media documentation
     C3   Documentation Legislation
D    Optical/Radio Communications and Audio-Visual Communication
     D1   Media technology
     D2   Media theory
E    Librarianship/Information Management, Optical/Radio Communications, and Audio-Visual Communication
     E1   Communication Theory
F    All Four Disciplines
     F1   Perspectives on Information
     F2   UNIX / Internet
     F3   HTML, SGML, TEI
     F4   Hypertext technology
     F5   Electronic publishing
     F6   Information infrastructure

Texts are selected according to their relevance in the subjects -- A1 to F6 (Table 1). The reading material is either offered in the courses, or recommended by content instructors. For example, textbook chapters on the history of computers, hardware, and software (label A1) are part of the book Computer Language (Díaz & Jones, 1999), suggested as further reading in the named introductory course for Optical/Radio Communication students. In contrast, a research article like "The Audience as Reader" (Callev, 2000), belongs as reference material in Content Analysis (subject C1); it provides technical reading that is helpful for project reports in both Audio-Visual Communication and Librarianship/Information Management studies. The inclusion of different academic genres balances the corpus.
The goal is to provide a representative collection of the subject areas in the learning context of our institution, where specific language competence is mainly demanded for reading and writing. Thus, questions about lexical features are addressed by using the right corpus (Biber, Conrad, & Reppen, 1998). Figure 1 illustrates how our sources can be balanced according to genre and subject area synchronisation.

Figure 1. Distribution of sources in corpus
C.S. = Computer Science / Engineering
I.S. = Information Science (Librarianship / Information management)
Tel = Telecommunications (Optical / Radio Communications)
A.Com = Audio-visual Communication
All = All four disciplines
RAs = Research Articles
TXs = Textbooks
RPs = Technical Reports

The disciplines serve as reference for textual selection. According to this notion, each of the four areas includes an equal number of sources in each genre. Ten research articles, for example, deal with Computer Science / Engineering topics, drawn from bibliography lists in this discipline. However, the concepts are also examined in other study programs, such as Optical / Radio Communications. The same applies to the other cases, where university curricula provide feedback about reading requirements; these are double-checked by following programs and consulting colleagues in the subject areas.4

Figure 1 presents an additional set of sources: five textbook excerpts and six technical reports. These deal with the field of Business Technology, which appears as common core in all the subject areas. It cannot be singled out as predominant within one domain; on the contrary, it is a complementary part of all the different areas. Its importance is derived from not only study programs, but also the current Spanish job market.
For instance, a report entitled "The Do's and Don'ts of Technology Planning" (FECT & NECC Conference, 1999) summarises Information Infrastructure issues (category F6, Table 1), which are commonplace in careers related to Information Science and Technology.5

The overall corpus does not exceed one million words. The purpose of this limit is to attain specificity. In this sense, a reduced size demands a precise representation of the specialised language. Figure 2 shows the number of running words (tokens) and distinctive items (types) for each of our genre sub-corpora. Standardised ratios (types per 1,000 words) are also contrasted.

Figure 2. Word distribution

WordSmith Tools provides the basic functions -- Keywords and Collocates -- which perform likelihood tests and Mutual Information measurements. These are made on the corpus to generate a quantitative view of lexical behaviour (cf. Ooi, 1998). Wordlists, another main feature, constitute the cornerstone on which to start the gathering of data. By cross-tabulating wordlists, keywords are obtained. A given sub-corpus (e.g., a subject category in Table 1) is contrasted with the overall reference corpus. The resulting group of words tends to be rather descriptive of the context aimed at. In this respect, the relationship between lexical items and text seems to be bi-directional, as words serve to identify context, and this, in turn, influences the particular bonding of elements.

The results derived from this type of analysis are offered in the following section. The measurement of the data is carried out to observe lexical patterns, and, thus, a convenient classification of words can be made. Then, in the discussion, the significance of the data is assessed for ESP development.

LEXICAL RESULTS

Lexical findings are examined in context.
This means that linguistic input is obtained by observing word combinations that are meaningful in the subject and genre domains. We use concordances to reflect the significance of lexical patterns in specific contexts (Firth, 1957; Halliday, 1966); this implementation constitutes the basis of our work. The contrastive view of the data provides the necessary conditions to check lexical diversity and uniformity in the corpus. The aim is to describe genre and subject matter variables.

For this analysis, the operations are ordered as follows: observing, measuring, and classifying lexical data. To illustrate this analytical procedure, an example is provided with the cluster, provide access to, used extensively throughout the corpus. We observe this presence in all the genres and in several subjects, and thus measure its frequency and dispersion in the whole corpus. This is done to make sure that it occurs significantly as a general expression. In this respect, the assertion made about its classification is based on empirical factors.

Unbiased data are those based on lexical behaviour in context (KWIC), which can reveal how common a given expression really is. We determine that the relationship between concordance lines and the number of different sources contained in those lines informs about the type of lexical item. In this sense, the example provide access to is analysed according to a 0.3% cut-off point, meaning that, above that level, its occurrence is considered common core in our texts, and, consequently, free and general. This margin higher than 0.3 % refers to the number of sources in the concordance lines: For every 10 lines of concordance text, at least three different sources must be involved. In addition, we consider that the three genres should be present in the total concordance, and that at least six different subjects must be included.
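The three conditions just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure as run in WordSmith Tools: the `Hit` record, its field names, and the function names are hypothetical, the cut-off is read as the ratio of distinct sources to concordance lines (three sources per ten lines, i.e., 0.3), and "at least three" is taken to mean that the threshold itself is included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One concordance line for an expression (hypothetical record layout)."""
    source: str   # identifier of the text the line comes from
    genre: str    # e.g. "TX", "RP", "RA"
    subject: str  # a Table 1 code such as "A1" or "F6"


def source_ratio(hits):
    """Distinct sources divided by the number of concordance lines."""
    if not hits:
        return 0.0
    return len({h.source for h in hits}) / len(hits)


def is_common_core(hits, min_ratio=0.3, n_genres=3, min_subjects=6):
    """Check the three conditions: source ratio, genre coverage, subject spread.

    The default thresholds mirror the figures given in the text; treating
    the ratio threshold as inclusive is an assumption.
    """
    genres = {h.genre for h in hits}
    subjects = {h.subject for h in hits}
    return (
        source_ratio(hits) >= min_ratio
        and len(genres) >= n_genres
        and len(subjects) >= min_subjects
    )
```

An expression whose concordance lines come from few texts, one genre, or a narrow band of subjects fails the check and would be treated as restricted rather than common core.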
These numbers are considered reliable, since our corpus is not very large. We believe, in fact, that 95 texts, 17 subjects, and three genres are low numbers in comparison with bigger corpora, and that 0.3 % is an appropriate measurement as a result. As an example of this computation, the collocation directory service operations is observed. It is recorded as key within subject domain A -- belonging to the areas of Computer Science and Engineering and Optical/Radio Communication. It appears 70 times but only in three texts, yielding a 0.04% contextual margin. We thus regard this lexical manifestation as specific and restricted, in contrast to the free and general case of provide access to. Directory service operations actually behaves as a specialised collocation, in agreement with Pedersen (1995), and, as such, tends to form complex nominal compounds (see also Varantola, 1984). Finally, lexical elements that have a high frequency in the corpus, but are predominant within one single genre, also deserve attention. They tend to operate as restricted word combinations, but do not denote technical or specialised meanings. Instead, they form compounds of a semi-technical type. An example is your program directory, appearing in subjects A1, A2, B1, D1, F1, and F4 (see Table 1), and in 14 different sources. However, only the genre of technical reports contains these instances. This specificity makes these constructions genre-based. Three main lexical sets thus constitute the object of our study in the results: general elements, specific items in defined contexts, and genre-based constructions. General Elements Detailed Consistency Lists (DCL) are made available through the concordancer. These are wordlists arranged according to the contrast of frequencies in different domains (e.g., in genres). For the listing of general academic items, they prove to be rather useful. In our corpus of Information Science and Technology sources, the DCL is considered an academic word list. 
It is similar to Coxhead's (1998), since it presents input that can become quite relevant for English for General Academic Purposes (EGAP). Most of the lexical data in the DCL includes verbs and nouns, followed, to a lesser degree, by adjectives and adverbs. An important feature of academic language is that there are more verbs in the past tense and past participle (e.g., defined, conceived, designed) than there are present or gerund forms. The same happens, in fact, in Coxhead (1998). In the case of nouns, many correspond to common scientific-technical instances, such as information, data, Web, HTML, and computer.

Free word combinations result from examining the DCL. For example, the forms associated with the noun information are widespread throughout the corpus. They are analysed as free collocations, appearing in contexts that vary significantly in terms of subject matter. They are thus considered semi-technical elements. Some examples are information system, information technology, digital information, and information about.

In addition to these frequent items, lexical elements found at the bottom of the DCL are likewise important. Despite having a low frequency, they exhibit contextual significance in their behaviour. An example is the inflected form coined. All six occurrences of this item denote academic use. The pattern of Verb + Noun in the expression coin + the term surfaces in this sense. It is declared academic because of its high degree of dispersion, as it shows up in different genres and subjects. These contexts are a textbook and a research article on information management units, another article on perspectives on information, and two technical reports on information infrastructure.
The term coined is judged as important in the DCL, as a result, not only due to its wide range of use, but also to its collocational strength -- denoting a great degree of idiomaticity (Stubbs, 1995). The diversity of contexts in which it is included makes it idiomatic. In addition, it is the only form in its lexical family appearing in all three genres and in more than one subject area. From this perspective, general academic terms can be either frequent or sparse, but they must always present a noteworthy dispersion.

Other examples of low frequency items are the following: where this technology excels, imported into, select / edit / paste, cable hooked into the, was instrumental in + verb-ing, first and foremost, diskette drive, compounded by the, relative autonomy, and ticket booth. They all share the property of general academic vocabulary. A string like was instrumental in + verb-ing, or the cluster compounded by the, functions as common core in our setting. The same applies to a noun collocation such as diskette drive or ticket booth. They receive the same treatment, in this respect, as frequent word combinations, and match in importance the ones mentioned above, for example, information system, information technology, digital information, and so forth.

Among the examples of low frequency words, however, a distinct type of item emerges, and an alternative approach is inferred in its classification within the general academic vocabulary. These are the so-called lexical phrases -- for example, first and foremost. They tend to behave as procedural items in our context, being closely related to academic use (Stotsky, 1983; Thurstun & Candlin, 1998).6 They also appear in a wide variety of contexts, functioning as grammar and discourse markers. Their procedural status derives from the effect of signposting which they demonstrate in the texts. This characteristic is analysed as a rhetorical marking of functions and techniques.
For example, they may indicate interaction with the reader, a reference to the text itself or to the investigation carried out. How these items manifest different rhetorical uses can be checked, for instance, with the behaviour of the preposition by. Its variation is made plain by a contrastive view at the corpus. The preposition is seen to denote conventional agent utilisation in many passive clauses, for example, claims made by the text, but it can also serve as a highly frequent instrumentalisation device, for example, by means of. In addition, it is commonly used in classification statements such as used by location and used by subject. Finally, it is often included in descriptive phrases like characterized by and defined by. This wide range of rhetorical expressions also affects content words. Nouns, for instance, are used in common clauses like make use of. These noun expressions appear in all three genres. Adverbs can also function in this way. Some examples are more likely and more appropriately, extensively produced in our corpus. Despite this inclusion of content words, grammar items such as the combinations mentioned above with by, prove to be the most extended type of rhetorical devices. Specific Items in Defined Contexts The function of Keywords in the electronic concordancer provides the means needed to describe terminology according to specific textual segments. This procedure is carried out with a given group of sources selected by subject. For example, topic A1 (History of Computers, Hardware and Software; see Table 1) is compared with the entire corpus, and word frequencies in both subject and reference collections are cross-tabulated. The resulting keywords are relevant not only in terms of frequency, but also textual dispersion. Thus, items like Multics, segment, Minix, bit, ring, segments, and ATM, appeal to the thematic essence of category A1. 
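The cross-tabulation of subject and reference word frequencies can be illustrated with a short Python sketch. It uses one common formulation of the log-likelihood statistic for keyword analysis; whether WordSmith Tools computes its scores in exactly this way is not stated in the text, so the formula, the function names, the `min_freq` parameter, and the convention of giving a negative sign to words that are relatively rarer in the subject texts should all be read as assumptions.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for one word.

    a, b: frequency of the word in the study and reference corpora.
    c, d: total number of tokens in the study and reference corpora.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, min_freq=2):
    """Rank study-corpus words by signed keyness, highest first.

    Each row is (word, study frequency, reference frequency, score).
    """
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = sum(study.values()), sum(ref.values())
    rows = []
    for word, a in study.items():
        if a < min_freq:
            continue
        b = ref[word]
        score = log_likelihood(a, b, c, d)
        # Negative sign marks words under-represented in the study corpus.
        if a / c < b / d:
            score = -score
        rows.append((word, a, b, score))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Called with the tokens of one subject category as `study_tokens` and the tokens of the reference collection as `reference_tokens`, the top rows play the role of positive keywords and the bottom rows that of negative ones.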
This list of keywords demonstrates the importance of "key-ness" scores in WordSmith Tools. The percentages in this measurement indicate positive and negative keywords. A "key-ness" level above 25%, in this sense, contains words that are pivotal as subject items ("positive keywords" [Scott, 1997]). Their identification is made in defined contexts such as thematic sub-corpora, as these terms concentrate the "aboutness of the texts" (Scott, 2000). Most of these words are nouns, combining as compounds that weigh heavily on field specialism, and they operate in restricted domains as subject descriptors.

Reference to specific notions is observed in system project manager revision, automation project manager acceptance, or project manager report. They are examples of key combinations derived from the collocation project manager, which has a high "key-ness" score in the topic of Automated Knowledge-Based Systems (heading B3 in Table 1). There are several instances of these long nominal compounds in our subject texts, which leads us to consider that the longer the noun compound, the more restrictively it tends to operate in the subject area (Pedersen, 1995).

Specific lexical structures thus reflect technical use, although this is not clear in all cases. For instance, the noun library is the top keyword in texts about Librarianship. It collocates with nouns that do not present any semantic complexity, as the instances virtual library staff, connectivity on the library, and public library community prove. Within subject F6 on Information Infrastructure, in fact, these elements specify the procedure of electronic information organisation, but do not offer much comprehension difficulty. The generality is that most keywords tend to sum up the thematic content of the texts.
Several are actually quite descriptive at first sight, such as images and media in the area of Document Content Analysis (C1), or copyright and contractor in Document Legislation (C3).

Keywords can also be obtained in the contrastive analysis of two or more disciplines. In this case, they originate from two lists of subjects, for example, A1 (History of Computers, Hardware and Software) and A2 (Computer Engineering and Architecture, Data Communications, and Client-server Architecture). The findings are then identified as broader in scope (for example, applicable to both Computer Science and Optical / Radio Communications), presenting, as a result, a less restricted subject-based pattern. Some examples in this thematic group (A1, A2) include bits, hardware, directory, IP, software, and PC. These items result from contrasting the smaller, theme-restricted context with the overall corpus, as pointed out above. The data prove to be crucial for the constitution of lexical profiles in the texts.

A similar deduction is made by working with the four separate areas, that is, Computer Science / Engineering, Optical / Radio Communications, Librarianship / Information Management, and Audio-visual Communication. The lexical information analysed in this case can be valuable as a guide to specialised language, much like dictionaries and other lexicographic material are (e.g., Collin's Dictionary of Computing, 1999, or the TERMITE Database of Telecommunications, 1999). In this respect, our data can be contrasted with authoritative sources to check similarity / variation features. For instance, in the case of the form abandoned, examined in Collin's Dictionary of Information Technology (1997), the phrase abandoned the spreadsheet is given. In our sources, the clause the code had to be abandoned is similar to that example. This contrastive view is highly recommended from our perspective, because the dictionaries and glossaries handled are recent.
They therefore provide updated material for linguistic analysis. Table 2 displays the top three words surveyed in this way. The feedback conveys meaningful discipline-based content, but also diversity of lexical use.

Table 2. Main Items in Discipline-Based Sub-Corpora

Computer Science                        Program, Data, System
Librarianship/Information Management    Library, Information, Use
Optical/Radio Communications            Network, System, Data
Audio-Visual Communication              Digital, Media, Server

Two large corpora are included in this cross-examination: The first two columns (from left to right) correspond to James (1994) and Lozano Palacios (1999). The two sources offer ranked positions of words, which is quite useful for academic purposes, since, as in Coxhead (1998) above, the most frequent verbs and nouns combine critically. For instance, Lozano Palacios (1999) reports on Verb + Noun and Noun + Noun patterns. The items are deemed as essential academic data. The clusters provide + access to, and data collection + techniques, instruments, methods are two relevant examples. The former is considered a general academic expression in our corpus, according to the description of the section "General Elements." The latter is specific and subject-based, appearing more frequently within the F domain. However, the compound data collection + Noun is not evaluated as strictly technical, mainly due to the common coreness aspect that characterises setting F (see Table 1).

The two main aspects revised, academic and subject-based language, thus seem to merge in discipline-driven vocabulary. The effect produced by such words seems neither common nor restricted. These items would be found at a middle position between general and specialised vocabulary in our study.

Genre-Based Constructions

Constructions that occur across different subjects, but only in one single genre, are also accounted for.
These elements are rather frequent in various texts, much like the discipline-based items examined above. Nevertheless, these genre-based combinations are namely treated as specific academic language. Some examples common in technical reports include information object, networked information services, and information on the Web. As mentioned in General Elements, the DCL forms the bulk of contrasted vocabulary. The three genres are compared, and their word frequencies serve to establish measurement references. The items identified as relevant in these word lists have a high frequency in one genre alone, and, in contrast, very low usage in the other two contexts. In addition, a Keyword analysis is carried out on the top words of the genre lists in order to check that the lexical items are actually distinctive in their genre categories. The results are classified from most to least typical in terms of genre description; the former operate as positive keywords, and the latter as negative. An example of a highly positive keyword in the genre of textbooks is requirement. Another is Semiotics. They represent the two chief types of keywords in this environment: Widely extended across subject areas in the first case (for example, the following requirement), and restricted to particular subjects in the case of Semiotics within Audio-Visual Communication. In a different genre, technical reports, the top keyword library appears quite frequently. It combines within clusters and compounds more or less familiar, as observed in units like Cable Book Library, the library's clientele, library program, networking the library, inter-library lending, inter-library loan, and so forth. In contrast, the noun protection illustrates low frequency items in these reports. Despite its fewer occurrences, important grammatical forms such as protection from and protection for, can be pinpointed. 
Other significant lexical items which occur with protection include fire protection, copyright protection, protection criteria, and protection levels.

In research articles, the top item is project, apparently uncharacteristic and yet, intimately linked to the research activity. Some frequent collocates show endemic traits in this respect: project deadline, project work, project milestone, project manager, project revision, and so on.

With the analysis of this data, effective samples are drawn to back up our claims for the next section. Lexical information is then assessed, and implications for ESP development are reflected upon.

DISCUSSION

In the survey of results, three main divisions of lexical behaviour have been found: General academic vocabulary that occurs widely across the corpus, with presence in a diversity of topics; elements drawn from subject matter scrutiny, considered specialised or technical; and genre-driven findings occurring restrictively, thus characteristic of one single genre. In this section, our main aim is to assess this lexical co-occurrence in its context to determine validity for teaching ESP. The process of language acquisition, in this respect, should be evaluated according to the contextual variables analysed.

We approach the description of academic and technical constructions in our environment by evaluating their use in either a great or small number of texts. In both cases, lexical units are influenced by subject matter and academic discourse. Word combination significance is mainly determined through language task application. This means that effectiveness of data is judged by specific language instruction: "To teach language for the subject specialism," and "teaching tasks based on the specialized content" (Edwards, 1996, p. 13).
This evaluation leads to categorising lexical data as priority items for ESP and EAP courses (Jordan, 1997). Eight types of lexical units are consequently devised. They result from the detailed revision of the data in the previous section, and from how such items serve to fulfil specific language learning demands in our context. They are classified as follows: common core collocations, rhetorical academic elements, technical collocations, thematic combinations, area-based general words, area-based specific words, genre-based academic vocabulary, and genre-based thematic words. Common Core Collocations A main group of lexical elements is first inferred by focusing on those words that occur commonly. This level is measured across subject areas, which constitute a common core foreground where constructions are used by "authors writing on similar topics" (Stotsky, 1983, p. 438). The items receive a semi-technical treatment primarily because they are content words conceived, in agreement with Ewer (1983, p. 10), as a "number of language items which are common to the subjects," or as the "core language." In our scientific-technical context, this semi-technical degree derives from word behaviour registered at a general academic stage. The elements become core combinations related to the academic context. In this sense, they are viewed as formal, context-independent words with a high frequency and/or wide range of occurrence across scientific disciplines, not usually found in basic general English courses; words with high frequency across scientific disciplines. (Farrell, 1990, p. 11) Academic combinations function as lexical extensions of General English vocabulary in our specialised corpus. In other words, their meaning is familiar in academic discourse and common in the Information Science and Technology domain, since the expressions denote events and concepts that characterise this area. 
This language is more general than specific because it describes notions and ideas that are customary in the whole corpus. As described in the Results above, common core academic elements have high frequency and dispersion rates across sources, such as in the case of the collocations information technology and digital information. In contrast, lexical items can show low frequencies, but the number of texts included then is also high by comparison, and the offer of topics is diverse, for example, the aforementioned combination coined the term. The cut-off point that distinguishes both planes is 0.3 %, as mentioned previously (three texts for every 10 concordance lines). Table 3 is an example of a general academic entry. The lemma is address (representing both verb and noun), and its derived forms are addressed and addresses, which are also considered common academic words in our corpus. The number of instances is provided near the entry and divided into the three genre sub-corpus frequencies. Table 3 is organised according to frequency, from highest repetition to least; the most repeated combination is labelled with the times it occurs (shown in brackets).

Table 3. Example of General Academic Entry in our Corpus

Entry       TXs   RPs   RAs   Combinations
ADDRESS     128   317    63   ___ space (137) the same ___ whose ___ must ___ does not ___ the network ___ to ___ this ___ the issue
ADDRESSED    47    15    36   to be ___ (14) ___ in is ___ by should be ___ ___ here has ___ can be ___
ADDRESSES    42    33    14   IP ___ (17)

TXs = Textbooks; RPs = Technical reports; RAs = Research articles
BOLD = lemma (most frequent item in its lexical family)
UNDERLINE = word forms (less frequent) derived from the lemma

As can be deduced, the collocations in Table 3 are based on verb forms (e.g., address + the issue) and nouns (e.g., address + space).
These are content word associations, similar to the ones that the BBI Dictionary of English Word Combinations (Benson, Benson, & Ilson, 1997) describes. This source actually includes common academic items from the world of Information Technology: access data, browse the web, and so forth (Benson, Benson, & Ilson, 1997, p. vii). The combination of grammar items and verb forms is also evaluated at this level of general academic use. Some examples are shown in the section of addressed (Table 3) -- for example, addressed in, and addressed by. These are regarded as general collocations, as a result, since they are found in many different texts. Grammar constructions are academic collocations in this respect. The BBI Dictionary of English Word Combinations (Benson et al., 1997) distinguishes grammar from content collocations, but, in our case, this is not so. In our analysis, grammar combinations work at the same common core plane as academic collocations. We deduce our claim from the management of word lists as academic input for ESP development. Learners are encouraged to carry out lexical profiles when coping with academic reading. Such a chore implies, as a matter of fact, coming to terms with the DCL in the different genres, pinpointing constructions that are common core and typical in the texts. The collocations examined in Table 3 are classified in the form of lexical charts, where the most frequent items are contrasted with less common constructions. In this line of work, most learners do not differentiate grammar from lexical combinations. This turns out positively in our context of science and technology, since undergraduates are not used to making syntactic observations. Actually, students tend to carry out an integrated approach, in which content items function as main units that may or may not keep company with grammar words.
For example, given a common collocation such as address the issue (Table 3), the lack of a preposition is learned by building collocation charts. These are exploited from both the readings and the DCL, taking the whole corpus as reference. In a related exercise, synonyms are explored. For instance, the academic verb cope with is examined in combination with the noun the issue. In such a comparison, students value the collocational strength of the preposition with for the construction, in opposition to address, which does not demand the colligation. This drill is performed similarly with low frequency items. The main difference is that patterns are recognised by working with a small amount of lexical information. Useful combinations are then easily perceived, as different subjects are encompassed by that occurrence. A previous example, coined the term, illustrates this case. Its free distribution is observed when we can detect that the phrase actually occurs in various texts. In addition, its common coreness is reinforced by contrasting synonyms such as built the term or constructed the term, since these are used in fewer contexts. Students can then be guided to value the more fixed meaning of the verb coin, given the fact that synonymous combinations show a lower dispersion rate.

Rhetorical Academic Elements

Rhetorical items also demonstrate common core relevance due to their high frequencies and distribution. They are used as markers of cohesion in the texts, according to the Results section. They tend to convey procedural usage, a feature that relates them to academic elements. Some of the examples mentioned are by means of, indicating instrumentalization, and more likely, operating as a token of clarification in the sentences. These constructions are classified at the same level as general academic language. Their procedural status defines them as common core, in agreement with McCarthy (1990, p. 51), and with Hutchinson and Waters (1981, p. 65): They serve as instruments of coherence and cohesion throughout discourse. Some procedural nouns functioning this way are the use of and the device which. This language is analysed under the EAP umbrella, which includes EST (English for Science and Technology). In this learning framework, comprehension activities are favoured, as they challenge learners to cope with markers of discourse structure (Flowerdew & Miller, 1997). For academic lectures, in fact, main and secondary ideas are discerned in the texts by exploiting these markers appropriately. For Science and Technology discourse, learners demonstrate their comprehension of content by conceiving appropriate rhetorical boxes (e.g., classifications, explanations, descriptions; Bygate, 1987). These are often built as a result of the adequate interpretation of rhetorical elements. A suitable exercise is based on the search for lexical formations containing a common grammar word. This type of work allows for the exploitation of procedural language in our corpus. It aims to identify vocabulary that co-occurs typically at the general academic plane. For example, measuring the occurrences of the preposition by, as examined in the Results above, provides different semantic features of scientific-technical discourse, for example, denoting functions as agent, instrument, and so forth. Figure 3 demonstrates another example with the preposition within. The word is analysed in different contexts so that learners can contrast its different meanings. This task of inducing sense depends on the main contextual conditions found; some authors refer to this activity as semantic prosody analysis (Stubbs, 1995). It results from the qualitative observation of common core collocations. In this case, the expressions and word combinations convey a strong procedural meaning, since they signal the type of discourse function being used.
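The search for lexical formations around a common grammar word can be approximated outside a dedicated concordancer. The following is only a minimal sketch, assuming plain-text sources held as Python strings; the sample sentences, the function names kwic and right_collocates, and the small list of skipped determiners are hypothetical illustrations, not part of the study or of WordSmith Tools.

```python
import re
from collections import Counter

def kwic(texts, node, width=4):
    """Return (left context, node, right context) triples for every hit of `node`."""
    lines = []
    for text in texts:
        # Simple lowercase word tokenisation (an assumption of this sketch).
        tokens = re.findall(r"[A-Za-z][A-Za-z'-]*", text.lower())
        for i, tok in enumerate(tokens):
            if tok == node:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, tok, right))
    return lines

def right_collocates(lines, skip=("the", "a", "an")):
    """Count the first word to the right of the node that is not a determiner."""
    counts = Counter()
    for _, _, right in lines:
        for word in right.split():
            if word not in skip:
                counts[word] += 1
                break
    return counts

# Hypothetical sample sentences, not taken from the corpus described here.
sample = [
    "The report must be delivered within the scheduled time.",
    "Data is stored within the project and within a commercial context.",
    "Links appear within headings of the document.",
]

lines = kwic(sample, "within")
for left, node, right in lines:
    print(f"{left:>30} [{node}] {right}")
print(right_collocates(lines).most_common())
```

Learners could sort the resulting right-hand collocates into groups such as time, location, or category by hand; the grouping itself remains a qualitative step that the sketch does not attempt.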
Figure 3. Inferring procedural meaning in discourse
(The figure sets the elements found with within, namely ___ information; ___ software / ___ the project / ___ a commercial context; ___ (+ a scheduled time); and ___ headings / ___ (+ document), against the semantic prosody labels WITHIN + CATEGORY, WITHIN + LOCATION, WITHIN + TIME, and WITHIN (INSIDE).)

Technical Collocations

The level of technicality in word behaviour is closely related to subject domain. The salient condition is that elements function uniquely in their corresponding field, describing the restricted setting. An example is the range of specific combinations identified with the noun network in U-network, access network, local area network, and so forth. This is examined within the subject of Client-Server Communication (category A2 in Table 1). The items thus allude to concepts and developments in specialised areas, and their interpretation demands conceptual knowledge. In addition, abbreviations are often key in this context, which is also evidence of the specific understanding that is required at this learning stage, for example, bit ASCII, LAN distribution, and GIF and JPEG files. Conceptually restrained, technical vocabulary is formed by collocations that introduce specialised knowledge in ESP. The identification of this special language is made by inferring idiomatic constructions from concordance samples. The aim is to perceive the fixation of long compounds, and to appreciate the value of this lexical restriction in the subjects. Figure 4 displays the technical collocations of object, a critical noun in the setting of Data Communications (category A2 in Table 1). An important collocation like object-oriented is first underlined by focusing on the most frequent word that goes with object in this context. Then, object-oriented features is also marked as important, given its high co-occurrence probability.
Finally, according to the Mutual Information scores, the phrase use of object-oriented features is recorded in this technical scope.

Figure 4. Concordance sample for the noun object in a technical setting

We can guide learners to explore technical terminology by encouraging this data classification. An example is the relation of the collocations hierarchically. This can be achieved as follows:

OBJECT (subject: Data Communications)
Object
Object-oriented
Object-oriented features
Use of object-oriented features

In order to determine which combinations are productive, students operate with restrictive lexical charts. For instance, a useful type of activity is a filling-in-the-gap exercise where the ability to specify technical items is fostered. Learners pinpoint the restricted elements of subject texts by building tables such as the one shown in Figure 5. In this case, coming to terms with the central collocate in four different combinations demonstrates technical language command.

Figure 5. Fill-in-the-gap exercise with technical collocations

In Figure 5, learners take subject B3 as reference for lexical work (texts about Automated Knowledge-Based Systems). The answer to the central node can be found by working with the language of these sources. This means that students must revise concordance material and context as indicated in Figure 4. The word in the blank -- management -- can be realised after key technical input is correctly sifted.

Thematic Combinations

Semantic features are examined in technical words by inspecting the subject context. However, exploring the field of knowledge does not always lead to the description of specialised combinations, according to our data. There are forms of lexical behaviour, in fact, which occur critically in the thematic environment but do not classify as technical collocations.
These are content words with a less complex level of comprehension, namely due to their greater familiarity in the world of Information Science and Technology. Some examples given in the Results are virtual library staff, connectivity on the library, and public library community, included in subject F6 (Information Infrastructure). Other examples reflect the register of a subject in a clear way. For instance, the legislation language of category C3 is clearly revealed through key clauses such as the contractor shall and copyright law (see Results). The constructions are either specific clusters or multi-word units that identify the subject under analysis. Other elements can be located in the area of Content Analysis (C1), where mass media and of the mass media operate as typical constructions within their thematic setting. In addition, like technical collocations, these items are almost exclusive of their domain and thus seldom found in any other part of our corpus. At this level of thematic combinations, we also find lexical data that is characteristic of a related group of subjects, that is, within a major heading from A to F in Table 1. As a result, the language items described in this case are not as precise as technical collocations. For example, key combinations are analysed in the space where Computer Science and Optical/Radio Communications meet (category A). The results refer to computer and network issues, but do not pose much technical difficulty. Some of these items are computer program, hardware and software, bits per second, and the interface shall provide. The analysis is therefore based on language that is segmented according to main subject categories. The keywords that emerge from this task are evaluated in terms of technical use (as was done in the section Technical Collocations).
In this observation, we notice that restrictive collocates are not detected. In contrast, a wider possibility of combinations is offered. For example, the synonym computer application is found alongside computer program in subject domain A, although the former is used less frequently. This aspect of thematic combinations suggests activities where learners can make lexical decisions. These are based on choices by which synonymous thematic word combinations are explored. Students are given the freedom to decide which structures best fit the topic areas. Some possibilities are offered in Table 4 for the shared background of Computer Science and Optical/Radio Communications (letter A in Table 1). Table 4. Investigating Synonymous Thematic Combinations COMPUTER & TELECOMMUNICATIONS WORDS Computer program The interface shall provide A string of bits Possible links Multiple processes Piece of software = = = = = = The method presented in Table 4 is a constrastive view with synonyms located anywhere in the corpus. Thematic combinations are thus distinguished from common core items, such as computer program versus computer application or the interface shall provide versus the interface provides. Lexical evaluation is then possible by contrasting thematic constructions with general use. In this manner, the level of specification of the former can be appreciated. The same is applied to other textual segments, for example, to subject items that are not technical. The F6 category examples, for example, virtual library staff or connectivity on the library, are assessed in this manner, being replaced by common core options like library personnel and connecting virtual libraries. The purpose of this work with thematic vocabulary is to value concordance data in different textual positions. In other words, the goal is to train learners in the aspect of lexical variation, which encourages operation according to context; this is a consistent position from our viewpoint at all levels. 
Area-Based General Words

The goal at this stage is to describe how language develops within a single discipline of Information Science and Technology. The items are familiar in all four areas, but expound a characteristic tone in a particular one. An example is provided by the cluster provide + access to. This structure appears freely throughout the whole corpus, but it receives greater emphasis in Librarianship and Information Management, where its semantic prosody is revealed as provide + access to + documentation; this behaviour is confirmed in Lozano Palacios (1999). Area-based lexical features contribute to enhancing academic word usage. As described in the section Common Core Collocations, common core academic elements are exploited in EGAP (English for General Academic Purposes). The same can be done in the case of area-based general constructions, since this input may be similarly used in EGAP courses. Such similarities at both common core and area-based levels are contrasted by means of specialised dictionaries, glossaries, and corpora about the academic disciplines. These sources can supply linguistic information that allows learners to pinpoint similarities and differences. Managing and making sense of dispersion plots, available through WordSmith Tools (Figure 6), can also be enriching for learners. The plots signal where certain items crop up in the texts, thus challenging students to cope with visual data. For example, the noun access manifests a high concentration in sources dealing with the area of Librarianship. Figure 6 shows this lexical clustering in rows 1, 2, and 3, corresponding to text files of reports (row 1) and textbooks (rows 2 and 3). The concordancer can then disclose whether access is, in fact, used as a noun in these contexts.
This activity should enable students to check, on their own, the high frequency of provide access to in our corpus. The dispersion plots help them to clarify that it is actually emphasised in Librarianship; concordance feedback does the same by allowing learners to examine the semantic prosody + documentation. The expression is consequently conceived as a general academic expression due to its common coreness; yet, students notice that it is more heavily used in the context of Librarianship Studies, denoting a special meaning. The DCL (Detailed Consistency List) of the four disciplines included in our corpus also makes area-focused lexical use easy to perceive. Through this list, the frequency of access is seen as higher in Librarianship / Information Management texts (see Table 5).

Figure 6. Use of dispersion plots by learners for visualisation of lexical concentration

Table 5. Availability of Frequencies in Discipline-Based DCL

N    Word     Files   Audio   Computer   Library   Telecom
68   Access   4       158     145        415       345

N = position occupied by word in DCL according to frequency and distribution in files

Area-Based Specific Words

The approach described in the section Area-Based General Words is geared towards practical work at the academic level of ESP. The goal is to foster guidance through particular lexical fields in area-based language. This turns out to be particularly useful in EGAP, where getting familiarised with reading material, for instance, becomes a powerful resource. Inspection of word use at this level also demands knowledge of specific concepts in the areas. In this sense, our study includes attention to particular lexical groups in the areas, where learners should thus specialise. An example of these items, examined in the Results, is the unit data collection, co-occurring with specific nouns such as techniques, instruments, and methods.
These prove to be fixed word combinations, given their high co-occurrence rate. The focus is then placed on English for Specific Academic Purposes (ESAP) learning (cf. Jordan, 1997). For ESP development, constructions involve a view into concept from this perspective. The approach is made as a response to specific queries regarding a subject area. For example, data collection techniques refers to the standard means of gathering data in Librarianship. The underlying fact is that we investigate context, in this respect, to explore concepts. The activity demands learners examine conceptual paragraphs (cf. Trimble, 1985) that explain notions and clarify technicalities alluded to by the terminology. This contextual information can be exploited for task development; it constitutes support material, for example, for research preparation, that is, doing project reports in English. Figure 7 presents a set of conceptual paragraphs taken from our corpus. Learners may use them for task-based research. The excerpts are assessed according to specific learning needs. They can then serve as complementary or illustrative material for project reports (e.g., as examples/passages to give in oral presentations).

A range of methods were employed to analyze data from the various data collection instruments. Quantitative data from the questionnaires, logs, training assessment, etc. were coded and entered in a spreadsheet for analysis. The techniques used to analyze these data relied primarily on computing averages and frequencies. Develop and test a range of data collection instruments related to measuring the impact of Internet connectivity. Ultimately, the evaluation aspect of the project became the means by which a final report was developed for use by other public librarians and policymakers. Phase 2 was intended to direct the development and administration of the various data collection instruments: What is the value of network connectivity for rural libraries? How does the installation and use of a network connection have impact on library staff, organization, and service provision? What groups in the user community benefit from the network connection?

Figure 7. Instances of specific concept development in an area

Genre-Based Academic Vocabulary

This group is determined from academic discourse study. However, unlike general words in Common Core Collocations, or area-related elements in Area-Based Specific Words, the identity of this level is based on the conception of genre. Awareness of genre features, in this respect (cf. Jordan, 1997), is the prerogative. This is confirmed, for instance, in the case of advanced learners who are required to perform well in writing assignments. An example of a lexical item in the genre focus is project, as mentioned in Results. This noun is often used in research papers to refer to the investigation being described. In a Computer Science setting, for instance, project is stressed in project deadline, project manager, project work, and so forth (see Results). The items become common in the genre of research articles. A relevant activity is to contrast genre-based instances such as these with general academic elements. The comparison aims to refine the view into genre-focused words, while general academic items are explored in the whole corpus. Table 6 provides an illustration of this comparative task in research articles.

Table 6. Contrastive View of Research Article Items with Common Core Elements in Our Corpus

GENRE-BASED (ARTICLES)   GENERAL ACADEMIC
project members          for the project
design process           the design of
search test              of the search

Lexical variation is thus visualised in the genre. The recognition is beneficial for learners aiming to develop effective writing skills. In particular, technical composition within the genre is enhanced by means of specific genre features.
Coping with this should enable the ability to adapt to the conventions of an area like Information Science and Technology. In this regard, we favour an ESAP methodology for genre-based academic items, while the focus is also placed on scientific-technical writing.

Genre-Based Thematic Words

This last category also considers genre awareness as the main scope. The procedure by which this set of items is established falls under the ESAP application. This means that specific language is exploited in tasks designed to make genre features familiar. In addition, thematic influence fortifies the genre-based lexical focus on academic and technical purposes. An example mentioned in the Results is the noun Semiotics. It surfaced in textbook chapters about Content Analysis (heading C1). Academic lectures on this subject offer language greatly influenced by theme. A course in our institution integrates these lessons on Semiotics in Audio-visual Communication and Librarianship/Information Management studies. The lectures encourage the elaboration of summaries and reports, for which familiarisation with typical collocations and structures in the setting becomes beneficial. Learners apply their note-taking skills to listening and writing activities derived from the lectures. Figure 8 reproduces a short extract of a lecture on semiotic elements, given by an American visiting professor at our school in 1997. Content comprehension is then tested in activities (Figure 9).

Today's topic deals with the fundamentals of all visual communication …. These are basic elements, [pointing to the slide] … these are the compositional source of all kinds of visual materials, … for example, the messages, the objects and … the experiences as well … In this way, … we have that the most basic element is the dot, … which can be defined as a pointer, a marker … a marker of space … the other element is the line …. This is an articulator of form, … that is, a design item for making a technical plan, … so it designs the form intended, … ok; … another element that we can think of is the shape, which is the basic outline, ….

Figure 8. Lecture excerpt on semiotic elements

1st element: Definition:
3rd element: Examples:
4th element: Example:
5th element: Contrast:
6th element: Reference:
8th element: Classification:
11th element: Exemplification:
General Field of elements
General function of elements in field
Concept of understanding elements
Example

Figure 9. Example of an activity with specific subject lecture

The textbook and lecture genres are thus exploited in this course of second year Audio-visual Communication and Librarianship students. Key lexical items are pinpointed as traits of genre-based thematic language. Learners have the option to experience this language by both textbook reading and lecture note-taking. Some examples are visual elements, Semiotics components, signs and codes, and the basic element (see Figure 9). We must clarify that these items are not restricted to one given genre. In other words, not only general elements but also specific items may appear in other genres. However, the words are more descriptive of the context being dealt with. For instance, the data explored in the course (Figures 8 and 9) reflects the typical language of the Semiotics subject, expounded through lectures and textbooks. The concepts needed in that content lead to seeking these specific items, belonging to the mentioned topic of Semiotics and to no other. The integration of genre and subject fosters content-based instruction, an important point in ESP learning. The approach focuses on corpus material, developed with different educational levels in mind.
An example is that of learners in second year courses of Librarianship and Audio-visual Communication having to cope with the mentioned genres of textbooks and lectures. The assessment of our data is proposed as a practical view of ESP from an academic and subject scope. In this sense, it is not an exhaustive view of word behaviour, but an applied one for subject area courses. In the following section, we revise this and other relevant claims made.

CONCLUSIONS

The principal aim in this paper has been to provide evidence that supports the distinction of common core language from restricted lexical behaviour. A central assumption is that two separate levels exist in our sources: academic and technical. Nevertheless, inferred from lexical classification in our specific corpus, both planes are divided into further categories of word use. These encompass regions of lexical use where academic and technical elements apparently coalesce (however, just in appearance, as has been observed, since specialised use is finely specified). In a corpus that is representative of both academic and technical material in our selected areas, seeking lexical behaviour patterns is primarily done according to contextual parameters. This is achieved by applying genre and subject variables. The aid of study programs and university curricula becomes essential in this respect, while applying ESP principles is required for consistency. The chief purpose is to collect texts that meet language and content demands in our setting. Our approach to the data includes empirical observation, classification, and assessment of lexical patterns. In this process, measurement is carried out quantitatively, that is, in the form of absolute and relative word frequencies. These are essential reference statistics used for contrastive analysis: They serve as point of departure in the contextual study.
Keywords then play a decisive role for lexical profiles, which demand a qualitative treatment of the data. This means classification of patterns based on frequency and dispersion. As a result, in the analysis data is assessed as either occurring broadly across texts, or more narrowly within certain sources. The results propose three main types of lexical behaviour based on this: Common core elements in the whole collection, specific words in themes and topics, and elements that are characteristic of only one genre. The three are surveyed through analytical steps related to ESP notions: Settings are defined and described according to specific learning needs. In the evaluation of lexical information, academic and technical word behaviour is discussed. Eight categories are induced by investigating the relationship between concordance data and context. The way in which these language peculiarities are developed affects our approach to ESP courses. Common core elements are divided into general academic items and procedural words. Both demonstrate a widespread distribution throughout the corpus, and are subject-independent. This makes us consider them as semi-technical vocabulary. They include content and grammar items that have either a high or low frequency in the corpus. Their function is inferred for EGAP (English for General Academic Purposes) teaching, mainly through the application of academic tasks, for example, using wordlists to point out lexical data in readings. EAP (English for Academic Purposes) thus motivates our work with EST (English for Science and Technology). Procedural items are common core constructions that mark cohesion in academic discourse. This is a main characteristic in general academic writing as well as in lectures. Their organisation in discourse facilitates comprehension. In contrast with general academic collocations, procedural elements include grammar combinations that have a semantic prosody related to the organisation of discourse. 
Regarding subject-based formations, the degree of restriction in the collocations influences the lexical divisions made. In the case of technical vocabulary, combinations are quite fixed. The elements are highly restricted in their behaviour, meaning that they exhibit a consolidated use in the thematic setting where they are identified as key. Through detailed revision of concordance lines, technical compounds are examined within longer phrases. This description is done in a manner resembling specialised dictionary-making, where key constructions function as descriptors for the subject area. Concordance observation is also useful for underlining thematic influence on those collocations that are not strictly technical. These are valued as significant feedback in the subject area, but denote a less fixed behaviour. This means that they can be replaced by synonymous expressions without making a significant change. However, their use is characteristic in certain subjects, and not in others. In this sense, even though they tend to be easy to understand, they are also considered specific of the subject area. Discipline-based elements are also distinctive in the subject area. They would be found in the middle ground between general academic expressions and specific language. In this respect, they are treated as common lexical items, identified in different areas, but prevailing in only one. They are conventional within the discipline, referring to aspects that are frequent and widespread. In EGAP development, work with these elements enhances the use of academic language for particular areas. A different case is the lexical data that refers to concepts exclusive of only one discipline. In that situation, ESAP is favoured: Tasks challenge learners to cope with specific content in their studies.
Knowledge of technical issues is fostered through activities that demand exploitation of conceptual paragraphs, for example, by elaborating oral reports. Finally, lexical features are analysed in the genre context. In addition, as emphasis is placed on subjects, the genre setting includes thematic items. Both subject and academic elements raise genre awareness in this context. This is especially useful for ESAP writing performance. Genre-based items can present restricted patterns of lexical behaviour, developed within one single genre, or even topic. The elements behave as descriptive items, but the difference is that they may do so in the overall genre, regardless of topic influence, or in a specific subject conveyed through particular genre conventions. The information obtained and described in this article is therefore assessed for ESP development. However, it is not intended as theory on lexical behaviour in academic and technical contexts. On the contrary, its validity highly depends on practical factors which lead to the design of specific corpora. Large textual collections can serve as reference for the analysis of our data, but do not meet specific learning demands as fittingly as one's own corpus can. In fact, we believe that none but representative material in the teaching environment can really fulfil specific language requirements. NOTES 1. The sources may either include one major discipline, such as the Dictionary of Computing (Collin, 1999), or more than one area, as is the case with the Dictionary of New Media: Film, Television, Print, Digital, Internet, Multimedia (1999). 2. The Spanish titles are "Informática técnica" and "Ingeniería Informática," "Sonido e Imagen," "Biblioteconomía y Documentación," and "Comunicación Audio-visual" (see University of Extremadura Web page at http://www.unex.es/). 3. 
The university curricula consulted in Spain (in addition to our own institution) are as follows: For Computer Science and Optical and Radio Communications, Facultad de Informática, Universidad Politécnica (Madrid), Universidad Politécnica de Valencia, and Universidad de Vigo (Departamento de Teoría de la Señal y Comunicaciones). For Librarianship and Information Management, Facultad de Biblioteconomía y Documentación (Universidad de Granada). For Audio-visual Communication, Instituto Universitario del Audio-visual (Universitat Pompeu Fabra, Barcelona) and Facultad de Ciencias de la Información (Universidad Complutense de Madrid). 4. Guidance offered by content instructors is highly valued in the process of textual selection. In addition, as mentioned above, advanced learner's knowledge can produce positive results. Internal (ESP) approaches can thus benefit from these external factors provided by the institution. 5. In fact, the elaboration of a broader corpus that incorporates business texts leads us in such a direction: to integrate material that is generally useful for information technology majors as well as business students, as they cope with common issues and concepts. 6. Stotsky (1983, p. 438) refers to "words that contribute to cohesive ties in academic discourse ... usually the content words generated by authors writing on similar topics." These words are also common core, offering greater difficulty to non-native or overseas students because they are "often abstract and / or complex." ABOUT THE AUTHOR Alejandro Curado teaches English for Computer Science and Telecommunications at the Polytechnic School at University of Extremadura (Spain). His doctoral thesis (2000) presents lexical findings according to genre and subject in specific settings. His research aims to integrate both discourse and corpus-based lexical approaches to teaching ESP. 
E-mail: acurado@unex.es REFERENCES Benson, M., Benson, E., & Ilson, R. (1997). The BBI dictionary of English word combinations. Amsterdam: John Benjamins. Bergenholtz, H., & Tarp, S. (1995). Manual of specialised lexicography. Amsterdam: John Benjamins. Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics. Investigating language structure and use. Cambridge, UK: Cambridge University Press. Brennan, M., & van Naerssen, M. (1989). Language and content in ESP. ELT Journal, 43(3), 196-205. Bygate, M. (1987). Speaking. Oxford, UK: Oxford University Press. Callev, H. (2000). The stream of consciousness. Film-Philosophy, 4(11). Retrieved August 15, 2001, from the World Wide Web: http://www.film-philosophy.com/vol4-2000/n11callev. Collin, S. (1997). Dictionary of information technology. London: HarperCollins. Collin, S. (1999). Dictionary of computing. London: HarperCollins. Conrad, S. (1996). Investigating academic texts with corpus-based techniques: An example from biology. Linguistics and Education, 8, 299-326. Cowie, A. P. (1978). The place of illustrative material and collocations in the design of a learner's dictionary. In honour of A.S. Hornby. Oxford, UK: Oxford University Press. Cowie, A. P. (1998). Introduction. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications (pp. 1-38). Oxford, UK: Clarendon Press. Coxhead, A. (1998). An academic word list. English Language Institute Occasional Publication No 18. New Zealand: Victoria University of Wellington. Díaz, J. C., & Jones, M. (1999). Computer language. Madrid: UNED. Dictionary of new media: Film, television, print, digital, Internet, multimedia. (1999). New York: Readfilm. Dudley-Evans, T., & St John, M. J. (1998). Developments in ESP: A multidisciplinary approach. Cambridge, UK: Cambridge University Press. Edwards, P. (1996). The LSP teacher: To be or not to be? That is the question. AELFE (Asociación española de lenguas para fines específicos), 9-25. Ewer, J. (1983). 
Teacher training for EST: Problems and methods. The ESP Journal, 2, 9-31. Farrell, P. (1990). A lexical analysis of the English of electronics and a study of semi-technical vocabulary. Dublin: Trinity College. FECT & NECC Conference (1999) Excerpts of Paper “The Do’s and Dont’s of Technology Planning” Retrieved August 15, 2001 from the World Wide Web: http://fetc.state.fl.us/. Firth, J. R. (1957). A synopsis of linguistic theory. 1930-1955. In J. R. Firth (Ed.), Studies in linguistic analysis (pp. 1-55). Oxford, UK: Basil Blackwell. Flowerdew, J., & Miller, L. (1997). The teaching of academic listening comprehension and the question of authenticity. English for Specific Purposes, 16(1), 27-46. Halliday, M. A. K. (1966). Lexis as a linguistic level. In C. E. Bazell, J. C. Catford, M. A. K. Halliday, & R. H. Robins (Eds.), In memory of J. R. Firth (pp. 148-162). London: Longman. Hutchinson, T., & Waters, A. (1981). Performance and competence in ESP. Applied Linguistics, 2(1), 56-69. James, G. (1994). English in computer science. A corpus-based lexical analysis. Hong Kong: Longman. Johns, A. M. (1997). Text, role and context. Cambridge, UK: Cambridge University Press. Johns, T. (1993). Data-driven learning: An update. TELL & CALL, 3, 23-32. Jordan, R. R. (1997). English for academic purposes. Cambridge, UK: Cambridge University Press. Lozano Palacios, A. (1999). Vocabulario para los estudios de Biblio-documentación [Vocabulary for library science and documentation studies]. Granada: Servicio de publicaciones, Universidad de Granada, Facultad de Biblioteconomía y Documentación. McCarthy, M. (1990). Vocabulary. Oxford, UK: Oxford University Press. Ooi, V. B. Y. (1998). Computer corpus lexicography. Edinburgh: Edinburgh University Press. Pedersen, J. (1995). The identification and selection of collocations in technical dictionaries. Lexicographia, 11, 60-73. 
Scott, M. (1996). WordSmith. Oxford, UK: Oxford University Press. Scott, M. (1997). PC analysis of key words and key key words. System, 25(1), 1-13. Scott, M. (2000). Reverberations of an echo. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora. Frankfurt: Peter Lang. Stotsky, S. (1983). Types of lexical cohesion in expository writing: Implications for developing the vocabulary of academic discourse. College Composition and Communication, 34(4), 430-446. Stubbs, M. (1995). Collocations and semantic profiles: On the cause of the trouble with quantitative studies. Functions of Language, 2, 23-55. Termite Database (1999). ITU Global Directory Telecommunication Terminology. Thurstun, J., & Candlin, C. (1998). Concordancing and the teaching of the vocabulary of academic English. English for Specific Purposes, 17(3), 267-280. Tribble, C. (1997). Improvising corpora for ELT: Quick and dirty ways of developing corpora for language teaching. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora (pp. 106-118). Lodz, Poland: Lodz University Press. Tribble, C. (2000). Genres, keywords, teaching: towards a pedagogic account of the language of project proposals. In L. Burnard & T. McEnery (Eds.), Rethinking language pedagogy from a corpus perspective (pp. 75-90). Frankfurt: Peter Lang. Retrieved August 15, 2001 from the World Wide Web: http://ourworld.compuserve.com/homepages/Christopher_Tribble/Genre.htm. Trimble, L. (1985). English for science and technology: A discourse approach. Cambridge, UK: Cambridge University Press. Varantola, K. (1984). On noun phrase structures in engineering English. Turku: Annales Universitatis Turkuensis. Language Learning & Technology http://llt.msu.edu/vol5num3/mollering/ September 2001, Vol. 5, Num. 3 pp. 
130-151 TEACHING GERMAN MODAL PARTICLES: A CORPUS-BASED APPROACH Martina Möllering Macquarie University, Sydney ABSTRACT The comprehension and correct use of German modal particles poses manifold problems for learners of German as a foreign language since the meaning of these particles is complex and highly dependent on contextual features which can be linguistic as well as situational. Following the premise that German modal particles occur with greater frequency in the spoken language, the article presents an analysis which is based on corpora representing spoken German. The concept "spoken language" is discussed critically with regard to the corpora chosen for analysis and narrowed down in relation to the use of modal particles. The analysis is based on the following corpora: Freiburger Korpus, Dialogstrukturenkorpus, and Pfeffer-Korpus. In addition, a collection of telephone conversations (Brons-Albert, 1984) was scanned into computer-readable files and analysed using MicroConcord (Scott & Johns, 1993). A quantitative analysis was carried out on all corpora. The qualitative analysis was limited to the telephone conversations and looks at the constraints on and functions of the different occurrences of the form eben. INTRODUCTION Discourse particles occur in a variety of languages and have been analysed in great detail for the English language by Schiffrin (1987). Particles of the modal particle type are prevalent in West-Germanic languages: Dutch, Frisian, and German (e.g., de Vriendt, Vandeweghe, & Van de Craen, 1991; Abraham, 1991a for the link between German, Frisian, and Dutch; Aijmer, 1997, for Swedish). Research interest in German modal particles arose in the late 1960s with the advent of a more pragmatically oriented approach to linguistics. They started to shed their image as superfluous, stylistically dubious "fillers" that had to be avoided in "proper German" (Busse, 1992). 
Since Kriwonossow's (1963, first published in 1977) and Weydt's (1969) seminal studies on German modal particles, a large body of work on the subject has emerged. In those publications, different terms are used for the words that are here described as "modal particles." Thus, we find for example, "flavouring words" [Würzwörter] (Paneth, 1981), "intentional particles" [Intentionale Partikeln] (Rall, 1981), "pragmatic particles" (Held, 1983), "discourse particles" (Abraham, 1991b) and "toning particles" [Abtönungspartikeln] (Helbig, 1994), the term which together with the German "Modalpartikel" (Thurmair, 1989) is the most commonly used. In a number of publications (Dalmas, 1990, 1992; Rudolph, 1991), however, the word particle is used without further specification. The term particle stems from a structural approach to categorising the various parts of speech into word classes based on the inflexional properties of words. In accordance with this morphological criterion, the term particle is often used to refer to "non-declinables," that is, in German, the large group of words that cannot be considered as part of the word classes noun, adjective, verb, article, or pronoun. In this sense, particles may be adverbs, conjunctions, prepositions, interjections (Helbig, 1994), sentence adverbs (Thurmair, 1989), and particles in a narrower sense:

Copyright © 2001, ISSN 1094-3501

Particles as Word Class

A word like aber, for example, which is a particle in the broader sense as it cannot be inflected, can be categorised as a member of the word class conjunction as well as of the class particles in a narrower sense, specifically, as modal particle (e.g., Bublitz, 1977) depending on the linguistic context in which it occurs. 
Thus, in a word class definition, the words considered as modal particles all have at least one homonym in another class or subclass, depending on the model of categorisation (for a critical discussion see, e.g., Helbig, 1989). In the research literature the term particle is commonly used in its narrower sense, excluding the other groups of non-declinables. The word class particle in the narrower sense is then seen to include subcategories, modal particles being one of them. The following subcategories have been described (Helbig, 1994, p.31): A plethora of publications within different theoretical frameworks have dealt with the pragmatic and discursive functions fulfilled by modal particles. These functions are described, for example, in terms of the management of interaction (Franck, 1979), as constituting consensus (Lütten, 1979), as a guidance for the hearer (Rehbein, 1979) and as playing a part in establishing text coherence (Rudolph, 1989). There is agreement, though, on the fact that the function of German modal particles is illocutionary and interpersonal rather than propositional. In very general terms, modal particles indicate the speaker's attitude towards the utterance as well as the intended perception on the part of the hearer. Modal particles may point to the interlocutors' common knowledge, to the speaker's or listener's suppositions and expectations, and they may create cohesion with previous utterances or mark the speaker's evaluation of the importance of an utterance (e.g., Abraham, 1991a, 1991b; Helbig, 1994; Thurmair, 1989). However, foreign language learners of German do not properly understand modal particles and rarely use them (Möllering & Nunan, 1995). This reflects a lack of sensitivity to an important feature of German communication, which might lead to misunderstandings and/or misinterpretations. 
Modal Particles in Second Language Acquisition Research findings (Husso, 1981; Rall, 1981; Steinmüller, 1981; Weydt, 1981) provide an ambiguous picture of the relationship between language acquisition in general and the acquisition of modal particles, but there is agreement on a much lower frequency of use by non-native speakers. Learners who received instruction in German as a foreign language did not perceive the communicative value of particles as very high (Harden & Rösler, 1981; Möllering & Nunan, 1995). Research findings on the acquisition of modal particles in uninstructed contexts (Kutsch, 1985; Cheon-Kostrzewa & Kostrzewa, 1997a, 1997b) have shown that the acquisition process is influenced by the fact that each particle is used in a variety of functions. Particle functions are acquired in an accumulative manner over a long period of time. The distinction between modal particles and their homonyms is therefore a major teaching objective (see also Busse, 1992). Research findings on the teaching of pragmatic language features in general (see Kasper, 2000, for an overview) have provided promising results which allow for the hypothesis that explicit instruction of different particle functions could accelerate and enhance the acquisition process. The approach to teaching modal particles I would like to propose here is concerned with learners' comprehension of modal particle meanings in context. Research in interlanguage pragmatics has shown that teaching pragmatic features of language is facilitative and necessary when input is lacking or less salient and that explicit instruction is particularly effective in the area of consciousness raising (Kasper & Rose, 1999, pp. 96-97). 
The concept of "consciousness-raising" (e.g., Rutherford, 1986) refers to the refinement of learners' metacommunicative awareness, that is, their ability to judge the relationship between a form and its meaning in context. It is this type of awareness that needs to be honed for a learner to comprehend the intricacies of particle meanings. With McCarthy and Carter (1994), I would like to argue that language awareness is not necessarily best taught by direct input language teaching: That is to say the normal presentation-practise-production cycles should not be seen as binding for all features of discourse, and in the case of [discourse] markers, these would seem to be a feature best handled by other types of activity: language-observation activities, problem-solving, perhaps cross-linguistic comparisons. (p. 68) The approach I would like to propose is based on authentic language data as collected in a number of corpora of spoken German. Rather than providing the learner with a list of grammatical particle functions supplemented by examples on the sentence level (e.g., Helbig & Helbig, 1995), an analysis of such corpora yields examples of particles in context. With the use of concordancing procedures, patterns of collocation can be established and made salient for learners of German. Non-native speakers might perceive German speech acts such as "request" or "voicing of opinion" as very direct (Rall, 1981) if they merely look at the syntactic mode of the encoding of a particular speech act without perceiving the modifications brought about by the use of modal particles (House & Kasper, 1981). The following example might illustrate this: a) Es ist nicht einfach, dieses Problem zu lösen. [It is not easy this problem to solve] This problem is not easily solved. b) Es ist ja nicht einfach, dieses Problem zu lösen. [It is (ja) not easy this problem to solve] This problem is not easily solved (as you know). c) Es ist doch nicht einfach, dieses Problem zu lösen. 
[It is (doch) not easy this problem to solve] (But you will agree that) this problem is not easily solved. Whereas native speakers might perceive (a), as a turn in a discussion, to be quite abrupt, (b) and (c) involve the hearer's anticipated point of view. In (b), a shared opinion is assumed, while (c) expresses the wish to overcome a perceived difference of opinion (Weydt, 1983). Modal particles create "conversational cohesion" (Schiffrin, 1987), in the case of doch and ja by reference to shared knowledge. One reason why the comprehension of modal particles is difficult for non-native speakers is the fact that all modal particles have at least one homonym. As many particles occur in a variety of functions, criteria such as position within the sentence play a role in determining whether a particle occurs as modal particle, as connective, adverb of time, and so forth. The following sample of natural language data, which is an excerpt from a discussion between secondary students and a well known German author, illustrates the point. It provides an example of particles in use in authentic spoken German.1 Amongst others, the particle aber occurs frequently: A: Ich nehm' Ihnen das ehrlich gesagt gar nich' ab. Ich hab' den Verdacht, ich meine, natürlich werd' ich mich wahrscheinlich sogar irren, ABER (1) daß Sie die Sache so geschrieben haben, daß Sie eben sagen "na schön," dann haben Sie sich das überlegt, und dann haben Sie die Stelle gelesen und haben sich gesagt "na Donnerwetter, das wird ABER (2) ziehen, die werden ABER (3) staunen, was ich mich so, was ich mir so alles traue..." B: ja ja, . . . (students laughing) wenn für mich als Autor der Begriff 'lieber Gott' etwas genau so Banales und Liebenswertes und Unbestimmtes ist wie der Begriff 'Mädchen' (...) dann kann ich das ohne weiteres in einer Reihe nennen. 
ABER (4) daß Sie den lieben Gott für so leicht zu beleidigen halten, also das wundert mich. In (1) and (4) aber is used as a connective. It connects the clause it appears in to the preceding one and thus creates cohesion (Halliday & Hasan, 1976) at the textual level. This function can be realised in English by using the conjunction "but". ABER (1) ABER daß Sie die Sache so geschrieben haben .... BUT [the fact] that you've written it in that particular way... ABER (4) ABER daß Sie den lieben Gott für so leicht zu beleidigen halten... BUT [the fact] that you think our Lord could be insulted as easily as that... As a connective, aber occurs mainly at the beginning of a clause. Its reference is anaphoric; it expresses contrast in its immediate context, that is, to the preceding proposition or propositions. In (2) and (3) aber appears as a modal particle. Here, it is not as easily translated into English. ABER (2) ... das wird ABER ziehen... that will [ABER] be a success ABER (3) ...die werden ABER staunen... they will [ABER] be surprised In these instances, aber expresses surprise and an approximation would be the following translations: ABER (2) ... das wird ABER ziehen... "boy, what a success that is going to be" or "boy, that'll / will that ever go down well" ABER (3) ...die werden ABER staunen... "they're going to be surprised, I can tell you" or "they're gonna be absolutely baffled/astonished" Language learners are regularly faced with the task of distinguishing between the different meanings of a particle like aber. It is the contention of this paper that they may be aided in this by an analysis of real-language data which unveils structures, patterns, and predictable features regarding a particle's different usages. The exploitation of language corpora is proposed here in order to arrive at authentic teaching 
materials which facilitate the comprehension of German modal particles. The association patterns which were of particular interest in this investigation are linguistic features in terms of lexical and grammatical associations (Biber, Conrad, & Reppen, 1998, p. 6). Non-linguistic associations like the distribution of modal particles across registers have been dealt with to some degree through the selection of corpora for the analysis, while distribution across dialects or across time periods was not examined. Occurrences of Modal Particles in Different Text Types Following the definition that a text is "either spoken or written discourse, so that for example the words used in a conversation (or their written transcription) constitute a text" (Fairclough, 1995, p. 4), modal particles occur more frequently in spoken than in written texts. Rudolph (1991) found that in conversation, particles and conjunctions are used almost three times as frequently as in journalistic and literary texts, but she does not provide a specific analysis of words in modal particle function, as her definition of particles is a very wide one. She classified text types according to the supposed dichotomies of oral/written and fictional/non-fictional and investigated the text types everyday conversations (oral/non-fictional), newspaper articles (written/non-fictional), and (sections from) narrative texts (written/fictional) for the occurrence of particles. The assumption of a distinction between spoken and written texts as a dichotomy has been challenged. Biber (1988), for instance, proposes no such dichotomy of dimensions across texts, no clear cut distinction between spoken and written texts, but multidimensional distinctions. McCarthy (1993) uses the terminology "spoken and written medium" but also describes complexities and mixing. 
He proposes as a useful distinction the terminology of medium which "is concerned with how the message is transmitted to its receivers" and mode which "is concerned with how it is composed stylistically, that is, with reference to sociolinguistically grounded norms of archetypical speech and archetypical writing. These norms are norms of appropriacy, culturally conditioned on a cline of 'writtenness' and 'spokenness'." (McCarthy, 1993, p. 171) Following this distinction, the database chosen for this study consists of four corpora of spoken German in the sense of "medium: spoken." Three of the corpora are held at the German Language Institute (Institut für Deutsche Sprache, IDS), namely the "Freiburger Korpus (FKO)," "Dialogstrukturenkorpus (DSK)," and "PFEFFER-Korpus (PFE)." The fourth corpus consists of a collection of telephone conversations published by Brons-Albert (1984). Freiburger Korpus (FKO). The corpus consists of 224 texts with a total of 700,000 words. It was compiled mainly between 1966 and 1972 as part of a project at the IDS that aimed at describing "grammatical and stylistic" features of spoken German. Audiorecordings from radio and television broadcasts as well as other recordings of private and public speech events were collected. Speakers were either not aware of being recorded or recording was a natural part of the speech event (as in the radio and television broadcasts), and they did not know that their productions were to be linguistically analysed. The recordings have been transcribed and categorised into discussions, interviews, talks, reports, and narrations. Dialogstrukturenkorpus (DSK). This corpus contains 72 texts with about 200,000 words. It was compiled by a group of researchers of the German department at Freiburg University in conjunction with the IDS in the periods 1968 - 1972 and 1974 - 1977 in order to further analyse the organisation of natural conversation (see FKO). 
It consists mainly of interviews (radio and television broadcasts) and discussions. Pfeffer-Korpus (PFE). Compiled by A. Pfeffer and W. Lohnes at Stanford University, California, in the early 1960s, the corpus comprises 398 texts with a total of 650,000 words. Recordings were made in 56 different areas of Germany, Austria, and Switzerland with a total of 400 different speakers. Each recording is about 12 minutes in length (about 1,500 words) on 1 of 25 topics. The subjects (with a spread of age, sex, education, and profession following a statistical analysis) were interviewed on those topics in 397 of the texts; text 398 is a group discussion between four speakers. All three corpora can be accessed via a data retrieval system, COSMAS (Institut für Deutsche Sprache, 1999), developed at the IDS. It allows an analysis of the data through frequency counts and concordancing procedures which makes it possible to search all three corpora of transcribed spoken German -- with a total of about 1.5 million words -- for occurrences of particles in context. An update of the PFEFFER-Korpus (Jones, 1997) was not yet accessible (personal communication with Jones) at the time of data analysis. Telephone conversations (BRO; Brons-Albert, 1984). This collection is made up of 35 texts and includes a total of about 44,000 words. The data were arrived at by recording telephone conversations which the researcher, Brons-Albert, had on her private phone over a period of 10 months. Callers were unaware of being recorded. With permission of the individual speakers, a selection of conversations was transcribed and published. For each dialogue, information on the speakers' age, profession and/or education, dialect, and the relationship between the speakers is provided. 
For the purpose of the present study, the printed texts were scanned into computer-readable files to make them accessible for concordancing. QUANTITATIVE ANALYSIS The first step in the process of data analysis was to establish the frequency of particles which could potentially function as modal particles2 in the four corpora. Frequency of occurrence has been advanced as one grading criterion (Busse, 1992; Vorderwülbecke, 1981) for the teaching of modal particles. Taking into account the multifunctionality of particles and learners' difficulties with distinguishing different particle functions, the term particle frequency can be seen as ambivalent. The term frequency might, on the one hand, refer to the occurrence of a word in modal particle function, or it might refer to all occurrences of a word, of which only some might be occurrences in modal particle function. In the present study, particle frequency is addressed in two steps: first, the overall frequency of particles in the corpora of spoken German is established in order to determine how salient each particle would be for a learner of German. A subset of the occurrences of the particle eben is then analysed qualitatively. The qualitative analysis provides a distinction between frequency of occurrences in modal particle function and other functions. The three corpora held at the IDS (DSK, FKO, PFE) were searched with the help of COSMAS (Institut für deutsche Sprache, 1999); the fourth corpus (BRO) was searched using Microconcord (Scott & Johns, 1993). The total number of occurrences of each word in each of the corpora was established. As the different corpora vary considerably in size, raw counts of frequency were normalised to make counts comparable. 
Frequency per 1,000 words of text was chosen as a basis of comparison.3 The following table provides an overview of particle frequency in all four corpora, that is, over a total of nearly 1,600,000 words:

Table 1. Frequency of Word Occurrence per 1,000 Words in the Four Corpora

Most striking is the frequency of ja with 19.5 occurrences per 1,000 words overall, which is more than double the frequency of the next word in line, auch, with 8.9 occurrences, followed by aber with 5.9 occurrences per 1,000 words. Then follow mal with 4.4 occurrences down to eben with 2.1. More than half the particles analysed occur with an average frequency of less than 2 (vielleicht 1.5, down to eh and ruhig with 0.1). The following table presents the frequency of occurrence per 1,000 words in the four different corpora:

Table 2. Frequency per 1,000 Words, All

Both tables show clearly that ja, auch, and aber are the most salient, followed by a second group made up of mal, doch, schon, denn, nur, and eben. Again ja provides the most striking pattern with an enormous variation of frequency between the four corpora. It is most frequent in the BRO corpus with 33.7 occurrences per 1,000 words, followed by 19.8 in DSK, 13.4 in FKO, and 10.9 in PFE. The most frequently occurring words with a potential for modal particle function (ja down to denn) occur with a particularly high frequency in BRO. The existing corpora of spoken German are relatively small in comparison to the corpora available for spoken English, for example, the British National Corpus with a spoken component of about 10 million words (see Berglund, 1999). The composition of the different corpora indicates that although they can be broadly classified as "spoken German," there are significant differences with regard to "mode" (McCarthy, 1993). 
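The normalisation of raw counts to a rate per 1,000 words can be sketched in a few lines of Python. The corpus sizes below are the approximate word totals given in the corpus descriptions; the raw counts are hypothetical placeholders chosen for easy arithmetic, not figures from the study, and the function name is likewise an assumption of this sketch.

```python
def per_thousand(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per 1,000 words, to one decimal."""
    return round(raw_count / corpus_size * 1000, 1)

# Approximate corpus sizes in words, as stated in the corpus descriptions.
CORPUS_SIZES = {"FKO": 700_000, "DSK": 200_000, "PFE": 650_000, "BRO": 44_000}

# Hypothetical raw counts of one word form in each corpus (illustration only).
hypothetical_counts = {"FKO": 1400, "DSK": 400, "PFE": 650, "BRO": 88}

rates = {
    name: per_thousand(hypothetical_counts[name], size)
    for name, size in CORPUS_SIZES.items()
}
```

With these invented counts, a word occurring 1,400 times in FKO and only 88 times in BRO has the same normalised rate in both, which is why raw counts alone would mislead when corpora differ this much in size.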
German modal particles have been found to occur most frequently in texts which are informal, personal, associative, and with a high level of familiarity (Hentschel, 1986). In particular, the level of informality and familiarity of speakers with one another varies considerably between the four corpora. The predefined corpus text categories provided in the description of the corpora held at the IDS are rather broad. Although all the corpora comprise dialogues, the nature of these dialogues in FKO and DSK is rarely personal. The dialogues in PFE are determined by the method of data collection: an interviewer talking to a person s/he is not familiar with. The ensuing dialogues are in fact largely monologic as the interviewer's brief questions prompt long stretches of narrative on the part of the person interviewed. The highest level of informality and familiarity between speakers can be found in the compilation of texts by Brons-Albert (1984) which, for this study, led to the decision to concentrate on those texts in the qualitative analysis of the data (for a discussion of text categories with regard to formality, see Sigley, 1997).

Qualitative Data Analysis: EBEN

The second stage of the analysis investigated modal particles in context in order to establish patterns of collocation in terms of lexical co-occurrence as well as co-occurrence with certain grammatical choices (Sinclair, 1991). To this end KWIC (Key Word In Context) concordances were compiled of the BRO data using the concordancing software package MicroConcord (Scott & Johns, 1993). The concordancing software used in analysing the corpus data lists the occurrences of the word under investigation in context, but is not able to distinguish between different functions of the word in question. "Tagging," where researchers have marked words in a corpus as belonging to categories like verb, noun, subjunctor (for a more detailed discussion of tagging see Biber, Conrad, & Reppen, 1998, p.
261f) is not available for particle functions (Jones in Wichmann, Fligelstone, McEnery, & Knowles, 1997, p. 152), and a qualitative analysis was necessary to distinguish between modal particle function and others. The BRO corpus was searched for the word in question and the ensuing concordances were categorised by making use of the program's classification feature. Moving the cursor to the concordance line to be categorised and entering a number allows subsequent sorting of lines according to categories (Witton, 1994). The categorization of occurrences was in the first instance based on native speaker intuition. It had to be carried out in many instances by looking at larger stretches of the text, as the information provided in the KWIC concordance was often not sufficient to distinguish between different usages of the word under investigation. In order to distinguish use in modal particle function from other possible functions of the words in question, it was necessary to manually disambiguate each occurrence of the word to establish patterns which language learners could be made aware of to help them distinguish modal particle functions from others.

As one example of the qualitative analysis, an investigation of the concordance data on eben is detailed below. Eben was chosen as it belongs to the group of more frequently occurring particles (see Quantitative Analysis) without yielding too many occurrences for the scope of this article.

The particle eben occurs in 20 different texts in BRO, in the functions of modal particle, answering particle, and adverb of time.4

Answering Particle

As an answering particle (27 occurrences), eben is easily recognised within the concordance data, as it can be found in the initial position of an utterance. In some instances, it appears as a complete utterance and its representation is capitalised.
This is, of course, a channel-specific measure by which the orthographic realisation in the transcription tries to interpret the intonation patterns of the original spoken text:5

1  höne Bücher, die man lesen kann. A:   Eben!   B: und so viele schöne Sachen, di
2  kann man so nie wissen. B: (lacht)   Eben!   C: Ja, aber hättsde das direkt ge
3  sen müssen, was wer noch kaufen. A:   Eben.   B: Würd ich sagen, dann geh ich d

It can function as the opening of an utterance, but separate from the following proposition:

4  Doppelte als die ganze Zeit, ne. B:   Eben.   Is auch schön. A: Undie Arbeit ma
5  ja nichts Schlimmes! A: Ja, ne? B:   Eben.   Solang se sich dabei wohlfühlt, si
6  alles, wozu man jetz nich kommt! B:   Eben.   Du, meine Mutter, die hatte ne gan
7  n die elf Kilo abgenommen hätte! D:   Eben,   ich denk, die is doch gar ni mehr
8  er Frau auch Frau Doktor Sounso. B:   Eben,   dann bisde ooch Herr Dokta! A: Ri
9  ort "werden" nich, anscheinend . B:   Eben,   siehsde, un, stimmt auch wirklich,

It also occurs in combination with a second answering particle "ja" (yes), "nein" (here pronounced and transcribed as "nee"; no) or "hm":

11  , ich helf ihr, soweit's geht B: ja,   eben   A: und son bißchen Telefongespräch
12  uns ja nun wieder auch nich B: Nee,   eben   ( ) Hast du mal deinen Pullover aus
13  m Telefon merkt es ja keiner. B: Ja,   eben!   A: Und dann kriegt der hinterher
14  .. Hauptsache, es klappt! B: Och ja,   eben,   wenn et so janz jut weiterläuft, s
15  ann alles, wenn ich will, ne. D: Ja,   eben,   eben. Versteh ich. B: Naja! Zwei
16  h kann Sie nur beglückwünschen. Nee,   eben,   das war, wie C, ich war der Meinun
17  tung, das Rauchen einstellen! D: Ja,   eben,   nee / B: Das versucht manch einer
18  der Effekt ja oh ni mehr da! A: Ja,   eben,   genau! Vor allen Dingen, es geht j
19  eder mal die Eßbremse ziehen. D: Ja,   eben,   kann ich verstehn, dat kann ich ve
20  alle 8 Tage da . losjehn, ne. A: Ja,   eben.   B: Das geht nich. Könn Se noch ma
21  einer allein schuld is, ne. B: Ja,   eben.   A: Irgendwie en ganz kleinen Grun
22  irekt am 31. feiern, abends. B: Ja,   eben.   dass ja wirklich Klasse! A: Hm. J
23  les, wenn ich will, ne. D: Ja, eben,   eben.   Versteh ich. B: Naja! Zwei Kilo n
24  jetz niet, wann se kommt. A: Jaja,   eben.   So lange dauert die Fahrt ja nich.
25  esigen Vertrag beim Notar ab. B: Hm,   eben.   Nee, ganz davon abgesehen, nem. I
26  , ne? A: Jo, is ja ejal, ne. B: Ja,   eben.   (lacht) A: (lacht) B: Bis ja noc
27  uern zu sparen, zu heiraten. B: Jo,   eben.   klar, und außerdem ist das total
28  wann un wie oft er Lust hat, ne? A:   Eben,   ja. B: Paar Würste dazu oder irge

In all these occurrences it serves to confirm the previous speaker's contribution.

Adverb of Time

In its occurrences as an adverb of time (13), eben is a short form of soeben (just, a moment ago). In this particular use it is harder to distinguish from modal particle function as its position within the clause is similar to that of modal particles.

1  ja B: Telefon so anders. Em, ich hab   eben   mit der Frau X von der Verwaltung
2  C: C A: Guten Tag, Herr C, ich hab   eben   mit einem Kollegen von Ihnen gespr
3  Ja, hörma, wat sachsde dazu, was ich   eben   . der A erzählt hab? B: Nee, was
4  n dann? . Meine Mutter hat dich zwar   eben   schon den ganzen Quark gefragt, abe
5  u, ich wollt dir nur sagen, der Z war   eben   hier, die Schreibmaschine is also
6  a, ich hab der A eben gesacht, daß se   eben   vorbeigebracht wurde B: Ach so! C:
7  . den Abend ruhig gestalten, die war   eben   einkaufen und mußte sich danach hi
8  onsequent wär, ich war zwar sachtich   eben   noch zu C, ich bin jetz noch stolz,
9  hen das über'n DeAEs, und der meinte   eben,   ja, ich solle auf jeden Fall nen
10  ppelt belegt hat, ich seh nämlich da   eben,   Fenelon, Lettres a l'Academie hab
11  erzählt hab? B: Nee, was hasde denn   eben,   ich . hab das nich / C: Ja, von w
12  a, der X hat sich gewundert, weil ihr   eben,   als ihr ihn aus dem Auto ließt, g
13  acht? Mit der A? C: Ja, ich hab der A   eben   gesacht, daß se eben vorbeigebracht

A contextual clue, however, is its collocation with one of the German tenses expressing reference to the past: Simple Past, Present Perfect, Past Perfect. Investigating a larger stretch of the dialogue reveals that this is the case in nearly all occurrences:

line 1:  hab...gesprochen (Perfekt; "hab" = habe)
line 2:  hab...gesprochen (Perfekt)
line 3:  hab...erzählt (Perfekt)
line 4:  hat...gefragt (Perfekt)
line 5:  war (Imperfekt)
line 6:  wurde...vorbeigebracht (Imperfekt, passive voice)
line 7:  war (einkaufen) (Imperfekt)
line 8:  sachtich (sagte ich, Imperfekt)
line 9:  meinte (Imperfekt)
line 11:  hasde... (Perfekt: hast du ...ellipsis of past participle)
line 12:  habt gesagt (Perfekt)
line 13:  hab...gesacht (habe gesagt; Perfekt)

What can be established from the evidence is a strong correlation between eben in its function as soeben (just, a moment ago) and verb forms expressing the past. For a native speaker familiar with all the functions of eben this is quite obvious, but for a learner of German, recognizing this collocational pattern is helpful in distinguishing the different meanings of the word.

A particular meaning of eben in its temporal function comes about when it collocates with ma(l) (12 occurrences):

1  eben / B: Ja, Augenblick, ich hör   ma eben,   Frau A: Hm. B: Ja? ((Stimme im Hin
2  onntag oder bis Montag, Momentchen   ma eben,   ja? ((20s)) Ne, das is bis zum 9. A
3  rade, das könnt nich sein, Moment   ma eben!   (lacht) Ich gebn dir ma. D: Ja, Mom
4  r Messe! A: Ah! 69 B: Da müßtich   ma eben   nachgucken, das is entweder nur bis
5  ame) C: (Straßenname)? Da muß ich   ma eben   nachguggen, nech. A: ja. ((59s)) C:
6  her ein Bier getrunken/ B: Moment   ma eben!   (zu ihrer Mutter) Ja, ich komm gleic
7  Ich mein, wenn der schon   mal eben   . dieses Knöllchen da ausgestellt ha
8  kommen? B: Ja, kommen Se morgen   mal eben,   ja? A: Is gut. Hm, danke. B: Ne? B
9  llt mir grade ein, kannst du mir   mal eben   mit kurzen Worten sagen, wie man ein
10  orz. B: Warte mal, kann ich noch   mal eben   sehen? Das is Porz, ja achthundertzw
11  Sie vielleicht freundlicherweise   mal eben   so durchrufen, wann der Herr U da Fr
12  ng an / einschalten, daß er dann   mal eben   so tickt, das hat ja nicht zu sagen,

In these instances, eben does not refer to the past, but together with ma(l) functions to point to the short duration of an event. This is particularly apparent in lines 2, 3, and 6:

2  onntag oder bis Montag,   Momentchen ma eben,   ja? ((20s)) Ne, das is bis zum 9. A
3  erade, das könnt nich sein,   Moment ma eben!   (lacht) Ich gebn dir ma. D: Ja, Mom
6  her ein Bier getrunken/ B:   Moment ma eben!   (zu ihrer Mutter) Ja, ich komm gleic

The collocation with "Moment" and especially with its diminutive form "Momentchen" (just a moment/wait a minute) stresses the temporal aspect as well as the short duration of the wait. In a number of instances there is a further aspect to the combination of mal and eben:

1  eben / B: Ja, Augenblick, ich hör   ma eben,   Frau A: Hm. B: Ja? ((Stimme im Hin
4  r Messe! A: Ah! 69 B: Da müßtich   ma eben   nachgucken, das is entweder nur bis
5  ame) C: (Straßenname)? Da muß ich   ma eben   nachguggen, nech. A: ja. ((59s)) C:
8  kommen? B: Ja, kommen Se morgen   mal eben,   ja? A: Is gut. Hm, danke. B: Ne? B
9  llt mir grade ein, kannst du mir   mal eben   mit kurzen Worten sagen, wie man ein
10  orz. B: Warte mal, kann ich noch   mal eben   sehen? Das is Porz, ja achthundertzw
11  Sie vielleicht freundlicherweise   mal eben   so durchrufen, wann der Herr U da Fr

Here, the temporal aspect "it doesn't take long" also has a pragmatic function: If something does not take long to do, then it is not much of an imposition to ask for it to be done.
In lines 1, 4, and 5 the speaker wants to assure his/her interlocutor that what is being done for him/her is not too much of an inconvenience:

1  ich hör ma eben, Frau [I'll quickly find out]
4  Da müßt ich ma eben nachgucken [I would have to have a quick look]
5  Da muß ich ma eben nachguggen [I'll have to have a quick look]

In 8, 9, 10, and 11 the interlocutor is being assured that the imposition posed on him/her is minor:

8  kommen Se morgen mal eben, ja [why don't you quickly come by tomorrow -> why don't you drop round tomorrow]
9  kannst du mir mal eben mit kurzen Worten sagen [could you quickly tell me in a few words]
10  kann ich noch mal eben sehen? [could I have another quick look]
11  könnten Sie vielleicht freundlicherweise mal eben so durchrufen [would you be so kind as to give us a quick call]

Modal Particle

In the following 32 occurrences eben functions as a modal particle:

1  agst, wie das gewesen wär, du hättest   eben   damals die Bankgeschichte nich wei
2  ich. Und . da muß ich jetz am Montag   eben   . meinen Widerspruch begründen. A:
3  ne, bloß / bloß B: ah! A: muß sie   eben   für die Doktorarbeit muß sie das Ga
4  und so weiter un alles dafür. Und da   eben   möglichst, öh, viel von, ne, damit
5  zwanzig, jo. A: Aber montags geht   eben   / B: Ja, Augenblick, ich hör ma ebe
6  ch schicken will, dann schicke ik se   eben   nich. A: Doch, schick sie ruhig! I
7  dagegen Widerspruch eingelegt und muß   eben   jetz vor's Amtsgericht. A: Ja, un
8  A: Ich mein, sie hat   eben   bloß bessere Changsen, da weiterzu
9  möglichst, öh, viel von, ne, damit se   eben   auch sagen kann, das sind nich blo
10  jetz von der Schule bringen, weil er   eben   zu autoritär is, ne, auf der Schule
11  Verrücktheit und so. Aber as Eine is   eben   doch ne Geschichte, die e
12  albe Stelle abtreten und inoffiziell   eben   nur ne Viertel. B: nur ne Viertel!
13  r du tust Milch und Zucker rein, mußt   eben   auch wieder Süßstoff nehmen und au
14  ht möglich sein, ja, weil 8.000 Mark   eben   viel Geld sind, dann, em, müßten w
15  n auch zu uns kommen, A. Ihr könnt .   eben   . einfach ma nur so vorbei / brauch
16  st nie damit gerechnet, daß die Bank   eben   so'n Mist macht oder die . entsprec
17  machen. A: Jaja, klar. B: Das is ja   eben   dat Doofe. ((Räuspern)) A: Sicher.
18  en, mit/mit Ananas drin, so un/ mußt   eben   Süßstoff nehmen, darfs kein/ keine
19  Wanzen kommen? A: Ja, das weiß ich   eben   auch nich! B: Meistens sitzen die
20  es aus B: mal alles aus, weil wer ja   eben   wissen müssen, was wer noch kaufen
21  ißig Parteien A: hmhm B: un wenn da   eben   größere, öh, Reparaturen notwendig
22  besichtigt, und so, ne. Bloß, es is   eben   / du kanns schlecht en Fenster aufm
23  ? A: An un für sich, ja, bloß, es is   eben,   . daß ich doch son bißchen / also
24  B: un Schulen, und, em, na, Bücherei   eben,   die Ärzte befinden sich alle in d
25  viduelle . Vergütung und ä, das wird   eben   dadurch erleichtert, daß es nur hal
26  ätten wer hinterher aufgegeben, weil   eben,   öh, sie mit ihrem Bekannten dann.
27  Ah so! B: da is eine Kiesgrube und   eben,   öh, A: ja B: sons nix, ne.. Un b
28  Na, ich ja im Grunde auch. Da ham wer   eben/   C: Ja, ich hab da bestimmt en unh
29  n, ich hatte aber . grad zu der Zeit   eben   keine Zeit, öh, "lernschwache Milli
30  Vielleicht wollt ich mir auch selbst   eben   bloß beweisen, ich kann alles, wen
31  das nur sagen. B: Hm C: Die wurde mir   eben   als etwas . unterkühlt vorbeigebrac
32  , verzichten müssen. Nein, ich hatte   eben   gedacht, ö, daß . er . offiziell .

In the vast majority of occurrences in modal particle function (26 of 32), eben collocates with a verb in the present tense, as can be seen from the concordance data provided here. The verb forms which are not included in the concordance lines shown here have been established by investigating larger stretches of the respective dialogue.
In two instances, line 1 and line 5, there is a subjunctive form, and only in four instances (lines 29 to 32) does eben in its modal particle function collocate with a verb form indicating past. The concordance data show that eben as a modal particle occurs only in statements; there are no occurrences in interrogatives or imperatives. This is in line with its meaning. As a modal particle eben expresses "unchangeability," "unavoidability," or "irrevocable fact," as a detailed analysis of the following instances will show.

The most obvious examples of stating a "given fact" are those where eben appears as part of an existential clause (e.g., Halliday, 1994), that is, where the main verb is "sein" (to be):

22  besichtigt, und so, ne. Bloß, es is   eben   / du kanns schlecht en Fenster aufm
23  ? A: An un für sich, ja, bloß, es is   eben,   . daß ich doch son bißchen / also

Eben functions interpersonally, expressing that a fact is evident and undeniable. There are two instances of relational clauses (Halliday, 1994) with sein occurring as dependent clauses introduced by weil:

10  jetz von der Schule bringen, weil er   eben   zu autoritär is, ne, auf der Schule
14  ht möglich sein, ja, weil 8.000 Mark   eben   viel Geld sind, dann, em, müßten w

Here, eben works in conjunction with weil to create the impression of uttering an irrevocable fact: The relational clause is posited as a valid argument introduced by weil. The following two excerpts exemplify this in the context of larger stretches of text:

Context: line 10

B: ..., un wenn se frech waren, oder irgendwie was nich richtig gemacht ham, mußten die vor die Klasse, oder aus der Klasse un in der Ecke stehn, un so, under muß so ungefähr ., öh, der hatte also sein erstes Referendarjahr , alson ganz junger noch, ne.
[...and when they were cheeky or somehow did something wrong, they had to stand in front of the class, or leave the classroom and stand in a corner, and the like, and he must roughly. , er, he was doing his first year of teaching, so one of the really young ones still, you know.]
A: Das gibt's gar nich!
[You don't say!]
B: Hm, und . da . ham sich aber die ganzen, öh, öh, Eltern wahnsinnig beschwert, un wollen den jetz von der Schule bringen, weil er eben zu autoritär is, ne, auf der Schule jedenfalls.
[Hm, and . then . all the, er, er, parents complained like mad, and now they want to get him out of the school, because he is simply too authoritarian, you know, at school at least]
A: Ah so!
[I see!]
B: Das is also völlig unnormal, daß sich da einer so benehmen würde, erzählte die Y mir/
[It's really not normal for somebody to behave like that, Y told me]

By using eben, speaker B stresses the unavoidability of the parents' actions: they had to act like they did, because the teacher's behaviour lay outside of what is considered normal behaviour, an argument which is expressed again explicitly in B's next turn.

Context: line 14

B: Ja, die Garage hat uns damals 8.000 Mark extra gekostet und sehr viel drunter wollten wer se auch nich verkaufen, ne.
[Yes, the garage cost us an extra 8000 Mark then and we didn't want to sell it for much less, you know]
A: Ja. Is ja unverschämt, was die für Einstellplätze nehmen!
[Yes. It's outrageous how much they charge for car spaces]
B: Ja, leider. Aber wir warn damals laut Vertrag an den . Kauf der Garage gebunden und müssen auch laut Vertrag die Garage auch mit verkaufen, wenn wer die Wohnung verkaufen. Ich mein, sollte das um alles in der Welt nicht möglich sein, ja, weil 8.000 Mark eben viel Geld sind, dann, em, müßten wer uns versuchen, da ne andere Lösung einfallen zu lassen
[Yes, unfortunately. But at the time we were bound by the contract to buy the garage and according to the contract we also have to sell it when we sell the apartment. I mean, if that's not at all possible, yes, because 8000 Mark simply IS a lot of money, then, er, we would have to try to find some other solution]

Using eben, speaker B presents the proposition "8000 Mark simply is a lot of money" as an irrevocable fact, common knowledge that is generally agreed upon. Within the larger argument "the garage may be difficult to sell" the phrase containing eben is a supportive move, eben providing the necessary emphasis.

In a fairly large proportion of occurrences, eben as a modal particle collocates with modal verbs, namely müssen (have to; lines 2, 3, 7, 13, 18), können (be able to; lines 9, 15), and wollen (want to; lines 26, 30):

2  ich. Und . da muß ich jetz am Montag   eben   . meinen Widerspruch begründen. A:
3  ne, bloß / bloß B: ah! A: muß sie   eben   für die Doktorarbeit muß sie das Ga
7  dagegen Widerspruch eingelegt und muß   eben   jetz vor's Amtsgericht. A: Ja, un
13  du tust Milch und Zucker rein, mußt   eben   auch wieder Süßstoff nehmen und au
18  en, mit/mit Ananas drin, so un/ mußt   eben   Süßstoff nehmen, darfs kein/ keine
9  möglichst, öh, viel von, ne, damit se   eben   auch sagen kann, das sind nich blo
15  n auch zu uns kommen, A. Ihr könnt .   eben   . einfach ma nur so vorbei / brauch
26  ätten wer hinterher aufgegeben, weil   eben,   öh, sie mit ihrem Bekannten dann.6
30  Vielleicht wollt ich mir auch selbst   eben   bloß beweisen, ich kann alles, wen

In collocation with a form of müssen, eben lends emphasis to the obligation of carrying out a particular act. In these clauses, eben serves to express the unavoidability of the obligation, as the following example shows in more detail.

Context: line 3

A: Gut, wenn das dann alles ma fertig ist, les ich's dir ma vor! Wie sich das anhört. Die schreibt nämlich auch Dialekt un sowas genau wortwörtlich ab, da.
[Right, when it's all ready at some stage, I'll read it to you. The way it sounds. You see, she also copies out the dialect and things like that word for word.]
B: Ja?
[Does she?]
A: Ja, in ihrer Examensarbeit hatte se sowas ähnliches gemacht, ne, bloß / bloß
[Yes, in her dissertation she did something similar, you know, but]
B: ah!
A: muß sie eben für die Doktorarbeit muß sie das Ganze en bißchen ausweiten, noch
[for her doctoral thesis she'll simply have to expand the whole thing a bit, still]
B: Hmhm
A: un noch / noch mehr bringen, ne.
[and produce some more, you know. ....]

The English "simply" could here be expressed as "it's as simple as that," that is, no discussion about it is necessary.

The results of the qualitative analysis carried out on eben can be summarised as follows:

Table 3. Summary of Results: EBEN

position in clause | grammatical co-occurrence | lexical collocation | category | meaning
initial | | | answering particle | "exactly"
central/final | indication of past | | adverb of time | "just" / "a moment ago"
central/final | | mal | adverb of time | "quickly"
central/final | tendency for present tense | form of sein | modal particle | "simply" (irrevocable fact)
central/final | tendency for present tense | form of müssen | modal particle | "simply" (unavoidability of action)

Application of the Corpus-Based Analysis to Language Teaching

Over the past decade, corpus-based research has had an increasing influence on language teaching pedagogy, with regard to linguistic content as well as to teaching methodology (Kennedy, 1998). While the majority of studies reporting on corpus-based teaching approaches refer to English (e.g., Biber, Conrad, & Reppen, 1994; Conrad, 2000; Fligelstone, 1993; Wichmann et al., 1997), a number of studies have discussed German (Dodd, 1997, 2000; Jones, 1997). In general terms, Leech (1997) distinguishes between the direct use of corpora in teaching and the use of corpora indirectly applied to teaching.
Teaching about corpora, teaching the exploitation of corpora, and exploiting corpora to teach are said to represent a direct use of corpora, whereas reference publishing, materials development, and language testing are indirect applications (Leech, 1997, pp. 6-7). Thus, the approach proposed here is direct in that it exploits the corpora of spoken German described above to arrive at relevant data. The approach is indirect, though, in the sense that the concordance data are not compiled by the language learners themselves but developed into work sheets that confront the learner with the task of distinguishing particle meanings in context. The adaptation of concordances for language teaching is described informatively and clearly by Tribble and Jones (1990) for English in general and by Thurstun and Candlin (1997) for academic English. The concordance-based creation of teaching materials presented here follows approaches outlined in those publications. Concordance data are used to help learners deduce the meaning of words in context (Tribble & Jones, 1990, p. 35ff). How those teaching materials will be structured and what type of activities they will encourage will obviously depend on the learners' proficiency, learning styles, and so forth, but the sample worksheet contained in the Appendix illustrates how the topic investigated here could be approached. For less advanced learners, samples of larger stretches of dialogue could be provided to aid understanding.

CONCLUSION

The limited ranges of speech events which learners are exposed to in classroom discourse do not provide enough input on modal particles to lead to an understanding of their meaning. An important factor in teaching modal particles is therefore the exposure of learners to particles in various contexts and the focusing of learners' attention on their meaning in those contexts.
Corpus examples are extremely effective as they expose learners to the type of language they will encounter in real communicative situations (McEnery & Wilson, 1996, p. 120). Collocations, involving both grammar and lexis, have an important place in language pedagogy as they can be identified empirically by the methodologies developed in corpus analysis (Kennedy, 1998, p. 289).

The quantitative analysis of the German corpora described above has shown which particles occur most frequently in spoken German and are therefore most salient for a learner of German. A manual disambiguation of particle meaning was carried out on concordance data for the particle eben. Its meaning in modal particle function was differentiated from its meanings in other functions, namely as answering particle and as adverb of time. The analysis of real language data unveiled structures, patterns, and predictable features relating to the various usages of eben and formed the basis for a sample worksheet for learners of German. Similar worksheets aimed at intermediate to advanced learners of German will be developed for the more frequently occurring particles ja, auch, aber, mal, doch, schon, denn, and nur. It is hoped that they will provide a useful extension to the existing teaching materials on modal particles.

APPENDIX

SAMPLE WORK SHEET: EBEN

1. The word EBEN has different meanings which depend on the context of use. Can you find out by looking at the following groups of examples which of the translations given below best reflects the meaning of EBEN in each group?

simply / a moment ago/just / exactly / quickly

group 1 _____________________
group 2 _____________________
group 3 _____________________
group 4 _____________________

GROUP 1
1  schöne Bücher, die man lesen kann. A:   Eben!   B: und so viele schöne Sachen, di
2  doppelte als die ganze Zeit, ne. B:   Eben.   Is auch schön. A: Undie Arbeit ma
3  alles, wozu man jetz nich kommt! B:   Eben.   Du, meine Mutter, die hatte ne gan
4  n die elf Kilo abgenommen hätte! D:   Eben,   ich denk, die is doch gar ni mehr
5  er Frau auch Frau Doktor Sounso. B:   Eben,   dann bisde ooch Herr Dokta! A: Ri
6  ort "werden" nich, anscheinend . B:   Eben,   siehsde, un, stimmt auch wirklich,
7  der Effekt ja oh ni mehr da! A: Ja,   eben,   genau! Vor allen Dingen, es geht j
8  , ich helf ihr, soweit's geht B: ja,   eben   A: und son bißchen Telefongespräch
9  uns ja nun wieder auch nich B: Nee,   eben   ( ) Hast du mal deinen Pullover aus
10  h kann Sie nur beglückwünschen. Nee,   eben,   das war, wie C, ich war der Meinun

GROUP 2
1  Ja, hörma, wat sachsde dazu, was ich   eben   . der A erzählt hab? B: Nee, was
2  n dann? . Meine Mutter hat dich zwar   eben   schon den ganzen Quark gefragt, abe
3  u, ich wollt dir nur sagen, der Z war   eben   hier, die Schreibmaschine is also
4  a, ich hab der A eben gesacht, daß se   eben   vorbeigebracht wurde B: Ach so! C:
5  . den Abend ruhig gestalten, die war   eben   einkaufen und mußte sich danach hi
6  onsequent wär, ich war zwar sachtich   eben   noch zu C, ich bin jetz noch stolz,
7  hen das über'n DeAEs, und der meinte   eben,   ja, ich solle auf jeden Fall nen
8  a, der X hat sich gewundert, weil ihr   eben,   als ihr ihn aus dem Auto ließt, g
9  acht? Mit der A? C: Ja, ich hab der A   eben   gesacht, daß se eben vorbeigebracht

GROUP 3
1  eben / B: Ja, Augenblick, ich hör   ma eben,   Frau A: Hm. B: Ja? ((Stimme im Hin
2  onntag oder bis Montag, Momentchen   ma eben,   ja? ((20s)) Ne, das is bis zum 9. A
3  rade, das könnt nich sein, Moment   ma eben!   (lacht) Ich gebn dir ma. D: Ja, Mom
4  r Messe! A: Ah! B: Da müßtich   ma eben   nachgucken, das is entweder nur bis
5  ame) C: (Straßenname)? Da muß ich   ma eben   nachguggen, nech. A: ja. ((59s)) C:
6  her ein Bier getrunken/ B: Moment   ma eben!   (zu ihrer Mutter) Ja, ich komm gleic
7  Ich mein, wenn der schon   mal eben   . dieses Knöllchen da ausgestellt ha
8  kommen? B: Ja, kommen Se morgen   mal eben,   ja? A: Is gut. Hm, danke. B: Ne? B
9  llt mir grade ein, kannst du mir   mal eben   mit kurzen Worten sagen, wie man ein
10  orz. B: Warte mal, kann ich noch   mal eben   sehen? Das is Porz, ja achthundertzw
11  Sie vielleicht freundlicherweise   mal eben   so durchrufen, wann der Herr U da Fr
12  ng an / einschalten, daß er dann   mal eben   so tickt, das hat ja nicht zu sagen,

GROUP 4
1  besichtigt, und so, ne. Bloß, es is   eben   / du kanns schlecht en Fenster aufm
2  A: An un für sich, ja, bloß, es is   eben,   . daß ich doch son bißchen / also
3  jetz von der Schule bringen, weil er   eben   zu autoritär is, ne, auf der Schule
4  ht möglich sein, ja, weil 8.000 Mark   eben   viel Geld sind, dann, em, müßten w
5  ich. Und . da muß ich jetz am Montag   eben   . meinen Widerspruch begründen. A:
6  ne, bloß / bloß B: ah! A: muß sie   eben   für die Doktorarbeit muß sie das Ga
7  dagegen Widerspruch eingelegt und muß   eben   jetz vor's Amtsgericht. A: Ja, un
8  du tust Milch und Zucker rein, mußt   eben   auch wieder Süßstoff nehmen und au
9  en, mit/mit Ananas drin, so un/ mußt   eben   Süßstoff nehmen, darfs kein/ keine

2. Where is the position of EBEN in the clause? Please circle the correct answer.

group 1   initial position   middle/end position
group 2   initial position   middle/end position
group 3   initial position   middle/end position
group 4   initial position   middle/end position

3. Now look at group 2 again and identify the verb forms in the clauses with EBEN. Write down the verb forms and their tenses.

line 1 ______________________
line 2 ______________________
line 3 ______________________
line 4 ______________________
line 5 ______________________
line 6 ______________________
line 7 ______________________
line 8 ______________________
line 9 ______________________

4. Which word appears in front of EBEN in group 3? ______________________

5. Examine group 4 again. Write down the verb forms.

line 1 ______________________
line 2 ______________________
line 3 ______________________
line 4 ______________________
line 5 ______________________
line 6 ______________________
line 7 ______________________
line 8 ______________________
line 9 ______________________

Which two verbs do you find in these clauses?
verb 1: ___________________
verb 2: ___________________
Which tense is used in these clauses? _________________________

6. Please supply the appropriate translation for EBEN.

Position in clause | Reference to time | Collocation | Type of word | Translation
initial | | | answering particle | ____________
central/final | Past | | adverb of time | ____________
central/final | | MAL | adverb of time | ____________
central/final | Present | form of "sein" | modal particle | ____________
central/final | Present | form of "müssen" | modal particle | ____________

NOTES

1. Freiburger Korpus, Schulklassengespräch mit Günter Grass (FKO/XAM.00000); transcription has been modified to facilitate reading comprehension.

2. The list represents the core particles considered to occur in modal particle function and is based on an evaluation of a substantial part of the literature on modal particles (Helbig, 1994; Thurmair 1989; Weydt, 1979, 1981, 1983, 1989).

3. DSK: 200,000 words; 70 texts; average length of text, 2857 words
FKO: 700,000 words; 220 texts; average length of text, 3182 words
PFE: 650,000 words; 386 texts; average length of text, 1684 words
BRO: 44,000 words; 35 texts; average length of text, 1257 words

4. These categories are based on an evaluation of the literature on eben in different function categories (Hartmann, 1979; Helbig, 1994; Hentschel, 1986; Lütten, 1979; Thurmair, 1989; Trömel-Plötz 1979).

5. The analysis presented here is based on transcripts of spoken language and therefore does not refer to phonological features of the data.

6. The text continues as follows: "...schon auf die Bekanntgabe der Ergebnisse warten wollte."
ACKNOWLEDGEMENTS I would like to thank Nic Witton and three anonymous reviewers for their helpful comments on a previous draft of this article. ABOUT THE AUTHOR Martina Möllering is Head of German Studies in the Department of European Languages at Macquarie University, Sydney, Australia. She is involved in language teaching and teacher training in German as a Foreign Language. Her research interests include the application of computers in language teaching, particularly the use of corpora and on-line communication facilities. E-mail: martina.mollering@mq.edu.au REFERENCES Abraham, W. (1991a). Discourse particles in German: How does their illocutionary force come about? In W. Abraham (Ed.), Discourse Particles. Amsterdam: Benjamin. Abraham, W. (1991b). Modal particle research. The state of the art. Multilingua, 10, 1-2. Aijmer, K. (1997). I think - an English modal particle. In T. Swan & O. Westvik (Eds.), Modality in Germanic languages (pp. 1-47). Berlin: de Gruyter. Berglund, Y. (1999). Exploiting a large spoken corpus: an end-user's way to the BNC. International Journal of Corpus Linguistics, 4(1), 29-52. Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University Press. Biber, D., Conrad, S., & Reppen, R. (1994). Corpus-based approaches to issues in applied linguistics. Applied Linguistics 15, 169-189 Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics. Investigating language structure and use. Cambridge, UK: Cambridge University Press Brons-Albert, R. (1984). Gesprochenes Standarddeutsch: Telefondialoge [Spoken standard German: Telephone conversations]. Tübingen, Germany: Günter Narr. Bublitz, W. (1978). Ausdrucksweisen der Sprechereinstellung [Ways of expressing speaker attitude]. Tübingen, Germany: Niemeyer. Busse, D. (1992). Partikeln im Unterricht Deutsch als Fremdsprache [Particles in teaching German as a foreign language]. Muttersprache, 102(1), 37-59. Cheon-Kostrzewa, B. J., & Kostrzewa, F. (1997a). 
Der Erwerb der deutschen Modalpartikeln. Ergebnisse aus einer Longitudinalstudie (I) [The acquisition of German modal particles. Results from a longitudinal study]. Deutsch als Fremdsprache, 2, 86-92. Cheon-Kostrzewa, B. J., & Kostrzewa, F. (1997b). Der Erwerb der deutschen Modalpartikeln. Ergebnisse aus einer Longitudinalstudie (II). Deutsch als Fremdsprache, 3, 150-155. Conrad, S. (2000). Will corpus linguistics revolutionize grammar teaching in the 21st century? TESOL Quarterly, 3, 548-560. Dalmas, M. (1990). Partikelforschung "konkret" [Research into particles: "concrete"]. Deutsch als Fremdsprache, 27(5), 285-289. de Vriendt, S., Vandeweghe, W., & Van de Craen, P. (1991). Combinatorial aspects of modal particles in Dutch. Multilingua, 10(1/2), 43-59. Dodd, B. (Ed.). (2000). Working with German corpora. Birmingham, UK: Birmingham University Press. Fairclough, N. (1995). Critical discourse analysis. New York: Longman. Fligelstone, S. (1993). Some reflections on the question of teaching, from a corpus linguistics perspective. ICAME Journal, 17, 97-109. Franck, D. (1979). Abtönungspartikel und Interaktionsmanagement [Toning particles and the management of an interaction]. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German language] (pp. 3-13). Berlin: de Gruyter. Halliday, M. A. K. (1994). An introduction to functional grammar (2nd ed.). New York: Arnold. Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman. Harden, T., & Rösler, D. (1981). Partikeln und Emotionen - zwei vernachlässigte Aspekte des gesteuerten Fremdsprachenerwerbs [Particles and emotions - two areas neglected in foreign language instruction]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 67-80). Heidelberg, Germany: Groos. Hartmann, D. (1979).
Syntaktische Funktionen der Partikeln eben, eigentlich, einfach, nämlich, ruhig, vielleicht und wohl [Syntactic functions of the particles “eben,” “eigentlich,” “einfach,” “nämlich,” “ruhig,” “vielleicht” and “wohl”]. Zur Grundlegung einer diachronischen Untersuchung von Satzpartikeln im Deutschen. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German language] (pp. 121-138). Berlin: de Gruyter. Helbig, G. (1989). Die Partikeln - keine Wortklasse, eine Wortklasse oder mehrere Wortklassen? [The particles - no word class, one word class or several word classes]. Germanistisches Jahrbuch DDR-UVR, 8, 194-209. Helbig, G. (1994). Lexikon deutscher Partikeln (2nd ed.). Berlin: Langenscheidt. Helbig, G. & Helbig, A. (1995). Deutsche Partikeln - richtig gebraucht? [German particles – used correctly?]. Berlin: Langenscheidt. Held, G. (1983). "Kommen Sie doch" oder "Venga pure." Bemerkungen zu den pragmatischen Partikeln im Deutschen und Italienischen am Beispiel auffordernder Sprechakte [“Kommen Sie doch” or “Venga pure”. Remarks on pragmatic particles in requests in German and Italian.]. In M. Dardano, W. V. Dressler, & G. Held (Eds.), Parallela (pp. 316-336). Tübingen, Germany: Narr. Hentschel, E. (1986). Funktion und Geschichte deutscher Partikeln. Ja, doch, halt und eben [The function and history of German particles.”Ja," “doch,” “halt” and ”eben”]. Tübingen, Germany: Niemeyer. House, J., & Kasper, G. (1981). Politeness markers in English and German. In: F. Coulmas (Ed.), Conversational routine: Explorations in standardized communication situations and prepatterned speech (pp. 157-186). New York: Mouton. Husso, A. (1981). Zum Gebrauch von Abtönungspartikeln bei Ausländern [ On the use of toning particles by non-native speakers]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 81-99). Heidelberg, Germany: Groos. 
Institut für deutsche Sprache. (1999). COSMAS. Jones, R. (1997). Creating and using a corpus of spoken German. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 146-156). New York: Longman. Kasper, G. (2000, March). Four perspectives on L2 pragmatic development. Revised version of a plenary given at the annual AAAL conference, Vancouver. Kasper, G., & Rose, K. (1999). Pragmatics and SLA. Annual Review of Applied Linguistics, 19, 81-104. Kennedy, G. (1998). An introduction to corpus linguistics. New York: Longman. Kriwonossow, A. (1977). Die modalen Partikeln in der deutschen Gegenwartssprache [Modal particles in contemporary German]. Göppingen, Germany: Kümmerle. Kutsch, S. (1985). Zur Entwicklung des deutschen Partikelsystems im ungesteuerten Zweitspracherwerb ausländischer Kinder [On the development of the German particle system in children's uninstructed second language acquisition]. Deutsche Sprache, 3, 230-257. Leech, G. (1997). Teaching and language corpora - a convergence. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 1-23). New York: Longman. Lütten, J. (1979). Die Rolle der Partikeln doch, eben und ja als Konsensus-Konstitutiva in gesprochener Sprache [The role of the particles "doch," "eben" and "ja" in creating consensus in spoken language]. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German language] (pp. 30-38). Berlin: de Gruyter. McCarthy, M. (1993). Spoken discourse markers in written text. In J. Sinclair, M. Hoey, & G. Fox (Eds.), Techniques of description. Spoken and written discourse (pp. 170-182). New York: Routledge. McCarthy, M., & Carter, R. (1994). Language as discourse. New York: Longman. McEnery, T., & Wilson, A. (1996). Corpus linguistics. Edinburgh: Edinburgh University Press. Möllering, M., & Nunan, D. (1995).
Pragmatics in interlanguage: German modal particles. Applied Language Learning, 6(1/2), 41-64. Paneth, E. (1981). Partikeln im Unterricht - Erfahrungen mit englischen Studenten [Particles in teaching - experiences with English students]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 101-110). Heidelberg, Germany: Groos. Rall, M. (1981). "¿Se puede ensenar la necesidad de emplear particulas intencionales?" Ein Experiment mit spanischen Studenten [Is it possible to teach the necessity of using intentional particles? An experiment with Spanish students]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 123-136). Heidelberg, Germany: Groos. Rehbein, J. (1979). Sprechhandlungsaugmente. Zur Organisation der Hörersteuerung [Speech act enhancers. On the organisation of hearer guidance]. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German language] (pp. 58-74). Berlin: de Gruyter. Rudolph, E. (1989). Partikeln in der Textorganisation [Particles in the organisation of a text]. In H. Weydt (Ed.), Sprechen mit Partikeln [Particles in talk] (pp. 498-510). Berlin: de Gruyter. Rudolph, E. (1991). Relationships between particle occurrence and text types. Multilingua, 10(1/2), 203-223. Rutherford, W. (1987). Second language grammar: Learning and teaching. New York: Longman. Schiffrin, D. (1987). Discourse markers. Cambridge, UK: Cambridge University Press. Scott, M., & Johns, T. (1993). MicroConcord. Oxford, UK: Oxford University Press. Sigley, R. (1997). Text categories and where you can stick them: A crude formality index. International Journal of Corpus Linguistics, 2(2), 199-237. Sinclair, J. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press. Steinmüller, U. (1981). Akzeptabilität und Verständlichkeit - Zum Partikelgebrauch von Ausländern [Acceptability and comprehensibility - On the use of particles by non-native speakers].
In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 137-148). Heidelberg, Germany: Groos. Thurmair, M. (1989). Modalpartikeln und ihre Kombinationen [Modal particles and their combinations]. Tübingen, Germany: Niemeyer. Thurstun, J., & Candlin, C. N. (1997). Exploring academic English: A workbook for student essay writing. Sydney: National Centre for English Language Teaching and Research. Tribble, C., & Jones, G. (1989). Concordances in the classroom. Harlow, UK: Longman. Trömel-Plötz, S. (1979). "Männer sind eben so": Eine linguistische Beschreibung von Modalpartikeln aufgezeigt an der Analyse von dt. eben und engl. just ["Männer sind eben so": A linguistic description of modal particles featuring the analysis of German "eben" and English "just"]. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German language] (pp. 318-334). Berlin: de Gruyter. Vorderwülbecke, K. (1981). Progression, Semantisierung und Übungsformen der Abtönungspartikeln im Unterricht Deutsch als Fremdsprache [Progression, semanticization and exercise forms for teaching toning particles in German as a foreign language]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 149-160). Heidelberg, Germany: Groos. Weydt, H. (1969). Abtönungspartikel. Die deutschen Modalwörter und ihre französischen Entsprechungen [Toning particles. The German modal words and their French equivalents]. Bad Homburg, Germany: Gehlen. Weydt, H. (Ed.). (1979). Die Partikeln der deutschen Sprache [The particles of the German language]. Berlin: de Gruyter. Weydt, H. (Ed.). (1981). Partikeln und Deutschunterricht [Particles and the teaching of German]. Heidelberg, Germany: Groos. Weydt, H. (Ed.). (1983). Partikeln und Interaktion [Particles and interaction]. Tübingen, Germany: Niemeyer. Weydt, H. (Ed.). (1989). Sprechen mit Partikeln [Particles in talk]. Berlin: de Gruyter. Wichmann, A., Fligelstone, S., McEnery, T., & Knowles, G. (Eds.). (1997). Teaching and language corpora.
New York: Longman. Witton, N. (1994). Micro-Concord presented, reviewed and compared with the Mini-Concordancer. OnCall, 8(2), 33-40.

Language Learning & Technology
http://llt.msu.edu/vol5num3/murphy/
September 2001, Vol. 5, Num. 3
pp. 152-173

THE EMERGENCE OF TEXTURE: AN ANALYSIS OF THE FUNCTIONS OF THE NOMINAL DEMONSTRATIVES IN AN ENGLISH INTERLANGUAGE CORPUS

Terry Murphy
Yonsei University, Seoul

ABSTRACT

This study uses the concept of "emergent texture" to analyze the corpus behavior of the four nominal demonstratives -- this, that, these, and those -- in an interlanguage corpus created at Yonsei University in the Fall of 1999. "Emergent texture" refers to the manner in which interlanguage texts gradually develop their use and control of the grammatical and semantic means used to establish textual cohesion. The corpus consists of 109 single paragraphs. The concept of markedness is emphasized as a way of mediating the debate over the issue of interlanguage development, linking this to the extensive description of inter-sentential cohesive relations in Halliday and Hasan's 1976 study, Cohesion in English. The investigation proper begins with the analysis of a single sample paragraph of low-level interlanguage taken from the corpus in order to establish a frame of reference for what follows. It then examines various aspects of interlanguage cohesion within the corpus as a whole, including reiteration, synonyms and near-synonyms, the behavior of the nominal group, and cataphoric reference. The paper concludes with a discussion of future research possibilities in the area of interlanguage cohesion.

THE INVESTIGATION OF SECOND LANGUAGE WRITING

The investigation of the written compositions of second language learners has been a central issue for applied linguists since the mid-1960s.
Although the school of contrastive rhetoric (the study of the cross-cultural aspects of second language writing) remains highly influential, there has been a recent and growing interest in using corpus analysis to understand this area of second language learning (Beaugrande, 1997; Connor, 1996; Freedman, Pringle, & Yalden, 1979; Kaplan, 1966; Kroll, 1990). One central concern for applied linguists interested in corpus analysis has been the problem of how to measure the learner's growing second language sophistication (Laufer & Nation, 1998; Shaw & Liu, 1998). A majority of the applied linguists who have investigated this issue take some definition of lexical richness to be central in any adequate account of measurement. In other words, when approaching the issue of the development of second language writing, applied linguists draw a sharp line between the categories of lexis and grammar in order to focus their attention on the development of lexis. This decision is reflected in the fact that virtually all such measures, including those used by the major available software, rely on the notion of a stable grammatical denominator in their calculations of lexical richness. This paper marks a departure in suggesting that the concept of "emergent texture," which offers itself as a measure of the development of interlanguage grammar and semantics, may prove to be useful for analyzing some central aspects of the development of second language writing. Drawing on the basic framework of Halliday and Hasan's work on first language textual cohesion, the present study demonstrates the usefulness of this concept in a detailed analysis of an interlanguage corpus created at Yonsei University in Seoul, Korea.

Copyright © 2001, ISSN 1094-3501

AIMS OF THE STUDY

This study situates itself within the emerging schools of corpus and textual linguistics.
The research was carried out on an interlanguage corpus created during the Fall 1999 semester, assembled from the various genres of single paragraph compositions written by two undergraduate writing classes at Yonsei University. Utilizing the basic framework of textual cohesion outlined in Halliday and Hasan's Cohesion in English (1976), the study analyses the manner in which certain basic grammatical units, the nominal demonstratives, become progressively integrated into second language writing. An underlying assumption of the study is that the concept of markedness, associated with functional grammar and text linguistics, might be used to shed light on this process of integration (Greenberg, 1966; Halliday, 1994; Jakobson, 1957; Rutherford, 1982). The degree of interlanguage cohesion is a useful measure of the writer's ability to make significant choices among grammatical and semantic elements. The basic approach adopted here to the issue of interlanguage development is dialectical and qualitative. In the words of Lucien Goldmann (1964), the only possible starting-point for research lies in isolated abstract empirical facts; the only valid criterion for deciding on the value of a critical method lies in the possibility which each may offer of understanding these facts, of bringing out their significance and the laws governing their development. … The advance of knowledge is thus to be considered as a perpetual movement to and fro, from the whole to the parts and back to the whole again, a movement in the course of which the whole and the parts throw light upon one another. (p. 5) An initially qualitative approach is necessary to avoid the risk of a probabilistic study flattening out what is most distinctive about interlanguage: its existence as a series of snapshots, highlighting uneven patterns of textual sophistication. 
Second language corpus analysis involves the investigation of a whole series of texts and textual component parts at various stages of development. It is neither possible nor immediately desirable in the study of interlanguage to attempt what Halliday (1992) elsewhere rightly suggests ought to be the approach taken to the study of the first language: "grammar [has] to be studied quantitatively, in probabilistic terms" (p. 61). In spite of this large caveat, this analysis does attempt to make meaningful and potentially verifiable statements regarding interlanguage. Moreover, it accepts that the true measure of second language textual development is what is currently known about the whole of first language textual behavior, including the massive advances in the accuracy of judgments about the English language associated with the development of corpus linguistics in the 1990s. Nevertheless, while corpus linguistics has demonstrated the falseness of many previously held intuitive judgments about language, this does not mean that linguists are free to dismiss previous work merely because that work happens to predate the era of corpus analysis. In the first place, it is possible to make a strong case for Cohesion in English as a significant precursor of corpus linguistic work proper. This is because the work employs actual texts in its analysis of texture, as might be expected from Halliday's commitment to the quantitative and probabilistic study of grammar. More importantly, recent corpus analysis has served to extend the previous work of Halliday and Hasan rather than undermine it, most notably in the case of the nominal demonstratives themselves (McCarthy 1994).1 The chief merit of using the theoretical framework set out in Cohesion in English in a corpus-based analysis of second language texture, however, is the promise that this holds out for rapid progress in a new area of research. 
Naturally, if the empirical results obtained through a corpus analysis begin to diverge widely from the work of Halliday and Hasan, this framework will need modifying or replacing. Until such time, however, it seems safer to employ a widely known framework than to attempt to devise a new one in the course of ongoing second language research. The study of second language development presents such a variety of other complications that it seems wise to reduce the linguistic difficulties where this is possible. Finally, the use of this theoretical framework has the additional merit of encouraging contributions from other scholars, particularly those already working in the fields of functional and textual linguistics.

EMERGENT TEXTURE AND MARKEDNESS

The concept of "emergent texture" refers to the manner in which interlanguage texts gradually extend their use and control of the grammatical and semantic means used to establish textual cohesion. The development of interlanguage texture encompasses the broad range of textual devices for achieving cohesion, including the use of reiteration, synonyms and near-synonyms, the behavior of the nominal group, and cataphoric reference. The study attempts to account for these emergent textual patterns in terms of the concept of markedness. It argues that the concept of markedness helps to explain why the interlanguage texts examined in this study develop in the manner they do. Growing interlanguage textual sophistication is a function of the increased ability of the second language learner to experiment with the marked members of sets. In other words, the emergent texture of interlanguage texts becomes richer because of the increasing ability of the writer to make marked, as opposed to unmarked, grammatical and semantic choices.
For example, low-level interlanguage texts tend to achieve nominal demonstrative cohesion almost exclusively by means of the use of the definite and indefinite articles. In contrast, more sophisticated interlanguage texts deploy a much wider range of nominal demonstrative reference. The study argues that the concept of emergent texture has potential in the analysis of the wider variety of grammatical, semantic, and lexical elements involved in the achievement of cohesion. With this in mind, the paper concludes with a discussion of some possible areas for the future investigation of emergent texture in interlanguage development. A brief discussion of the history of markedness as a linguistic concept will serve to secure its legitimacy for the analysis of corpus texts, including interlanguage ones. Markedness was first utilized by N. S. Trubetzkoy of the Prague linguistic circle in his phonological analysis of the neutralization of distinctive opposites in Grundzüge der Phonologie (1939) (Greenberg, 1966, p. 11). Phonological neutralization is the process in which distinctive phonemes in given environments lose their distinctiveness, resulting in the regular appearance of the one unmarked phoneme. Trubetzkoy was the first linguist to note that in phonemic pairs differing only in a single feature of the same category, such as voiced or unvoiced, aspirated or unaspirated phonemes, it was the unmarked phoneme that regularly appears in neutralized environments. In other words, there is a hierarchical relation between the two pairs of the opposition (Waugh, 1976, p. 89). For example, it is always the unvoiced obstruent phoneme that occurs in final word or sentence position in German. Similarly, in classical Sanskrit, when the opposition between aspirated and unaspirated stops in sentence final position is neutralized, the unaspirated phoneme appears (Greenberg, 1966, p. 13). 
In German, therefore, it is the unvoiced phoneme that is unmarked; in Sanskrit, it is the unvoiced and unaspirated phonemes. Generally speaking, the quality of being unmarked is associated with the absence of a given feature, while markedness is associated with the presence of that same feature.

Roman Jakobson later extended the idea of markedness to the study of grammatical categories and semantics, drawing a basic distinction between phonological distinctive features and lexicogrammatical conceptual features (Waugh, 1976, pp. 89-100). In a study published in 1957, he attempted a general definition of markedness, which allowed for the incorporation of the various levels of phonology, grammar and semantics.

The general meaning of a marked category states the presence of a certain property A; the general meaning of the corresponding unmarked category states nothing about the presence of A and is used chiefly but not exclusively to indicate the absence of A. (quoted in Greenberg, 1966, p. 25)

Jakobson's definition succeeded in substantially widening the concept of markedness beyond the realm of phonological analysis. It also allowed for the analysis of cases where more than one type of markedness functioned simultaneously. A good example of this phenomenon is the simultaneous operation of morphological and semantic unmarkedness in the word actor. In certain environments, actor is to actress as "male thespian" is to "female thespian." However, actor is the semantically unmarked of the two terms since only actor may be predicated of both male and female thespians. Actress is neutralized by the term actor in given environments because actress can only refer to female thespians. Actress is morphologically the more complex of the two terms, requiring the addition of an extra morpheme. Actor is therefore also the unmarked morphological term.
More broadly, in the terms provided by Jakobson's definition, actress indicates the presence of femaleness, while actor may be used indiscriminately in a majority of instances to refer to thespians regardless of gender (Clark & Clark, 1978, p. 231; Greenberg, 1966, pp. 26-27). In the series of scholarly conversations conducted with his wife, Krystyna Pomorska, first published in French in 1980 and later translated into English as Dialogues (1983), Jakobson returned once again to this concept of markedness, suggesting: The conception of binary opposition at any level of the linguistic system as a relation between a mark and the absence of this mark carries to its logical conclusion the idea that a hierarchical order underlies the entire linguistic system in all its ramifications. … On the phonological level, the position of the marked term in any given opposition is determined by the relation of this opposition to the other oppositions in the phonological system -- in other words, to the distinctive features that are either simultaneously or temporally contiguous. In grammatical oppositions, however, the distinction between marked and unmarked terms lies in the area of the general meaning of each of the juxtaposed forms. The general meaning of the marked term is characterized by the conveyance of more precise, specific and additional information than the unmarked term. (p. 97) The close relationship between the notion of markedness in both grammar and lexicon offers a certain degree of assurance that it is the same phenomenon under investigation in both cases. William Rutherford's 1982 essay, "Markedness in Second Language Acquisition," represented an important attempt to extend the concept of markedness to the field of second language acquisition. Although he was interested in attempting to use the concept of markedness "to elucidate essentially two separate aspects of second language acquisition: transfer … and order of acquisition" (Rutherford, 1982, p. 
98), it is only the second aspect that concerns the present study. Rutherford makes the general case for the importance of markedness for interlanguage development in the following way:

There seems to be a lot of interlanguage data that -- whatever the original purpose of their elicitation -- reveal a tendency on the part of all learners to impose on the target language a certain structural clarity, transparency, or … explicitness. Such a tendency can be adduced by the learner's preference for coordination over subordination, by the retention of pronominal reflexes in relative clauses, and by the apparent preference (at least in English) for constructions in which raising has not taken place over those "equivalent" expressions in which it has. (pp. 98-99)

In his essay, Rutherford (1982) went on to suggest the importance of considering "the discourse function of syntactic constructions" in any use of markedness in studies of interlanguage development (p. 101). This suggestion is important because it is necessary to distinguish between choices that are motivated by the constraints of text or discourse development and those that are genuinely instances of interlanguage limitation. Rutherford's essay suggested in conclusion that there was a need to use markedness theory to move beyond "the distributional characteristics of the exponents of formal syntax [to achieve] a greater understanding of more complex language" (p. 103).

The central problem with Rutherford's subsequent study, Second Language Grammar: Learning and Teaching (1987), is that it equivocates on the use of natural format data in order to achieve this greater understanding. According to Rutherford, consciousness-raising, which is one of the main themes of the book, takes place at a point between two extremes.
These two "extremes" are "the natural appearance of a grammatical phenomenon in 'authentic' text on the one hand and its contextless explicit formulation on the other" (p. 153). In other words, Rutherford's earlier insistence on the use of markedness theory has been compromised (p. 103). The central concern with consciousness-raising, which was absent from the 1982 essay, implies a renewed commitment to what Robert de Beaugrande calls "the rewriting of natural language as formal notation" (Beaugrande, 1997, p. 41). Shorn of any theory of language in which to embed markedness theory, Rutherford abandoned the attempt to use the concept as a means to analyze interlanguage text and discourse (personal communication, March 2, 2000). This paper then is an attempt to complete the unfinished work of Rutherford's 1982 essay. It attempts to do this by embracing the functional linguistic concept of markedness of Halliday and his associates within a project committed to the investigation of actual interlanguage corpora. In this way, it may be possible to achieve that "greater understanding of more complex language" promised in Rutherford's essay, by means of an analysis of the function of the nominal demonstratives in the emergence of texture. THE CONCEPT OF TEXTURE Interlanguage texts exhibit only an elementary or emergent texture because of the underdevelopment of the system of directives for creating textual cohesion. Emergent texture is also therefore a measure of the capacity of a given interlanguage text to function as a textual unity. According to Halliday and Hasan, "A text has texture, and this is what distinguishes it from something that is not a text. It derives this texture from the fact that it functions as a unity with respect to its environment" (1976, p. 2). 
In the sense of the term put forward by Halliday and Hasan, the texts of second language learners offer varying degrees of texture, ranging from those produced with virtually no consideration given to the relationship among sentences or particular stretches of text to those which are barely distinguishable from texts produced by native writers. Another way of putting this is to say that low-level interlanguage texts are distinguished by their relative lack of cohesion; low-level interlanguage texts demonstrate a limited range of facility and concern with the significant relations among cohesive ties within the text. As Halliday and Hasan note,

"Cohesion" is defined as the set of possibilities that exist in the language for making text hang together: the potential that the speaker or writer has at his disposal. … Thus, cohesion as a process always involves one item pointing to another; whereas the significant property of the cohesive relation … is the fact that one item provides the source for the interpretation of another. (p. 19)

Cohesion within a text is established by means of the presence of the five major categories of cohesive ties: ties of reference, substitution, ellipsis, conjunction, and lexis (Halliday & Hasan, 1976, p. 4). The class of reference ties functions as a set of directives indicating that information is to be retrieved from elsewhere.

Demonstrative reference is reference by means of location. The writer locates this type of reference along a scale of proximity. This scale is defined in terms of the selective participation and circumstances that define the textual occasion (Halliday & Hasan, 1976, p. 37). Demonstrative reference is therefore distinguished from both personal reference and comparative reference.
Personal reference is defined by its function in the speech situation; comparative reference is a form of indirect reference that is established by means of identity (p. 31). The eight demonstratives that together constitute the grammatical means for establishing demonstrative reference may be divided into two basic sets. The more important of the two sets is the one that selectively locates the text with respect to participant and number: this, that, these, those. The other set, which locates the text with respect to time and place, is less significant: here, there, now, then. The major grammatical unit for analysis for the investigation of this first set of demonstratives is the nominal group. As Halliday and Hasan point out,

What distinguishes reference from other types of cohesion…is that [it] is overwhelmingly nominal in character. With the exception of the demonstratives, here, there, now, and then, and some comparative adverbs, all reference items are found within the nominal group. (p. 43)

It may well be the case that the second set of demonstratives plays the greater, or at least a significantly more prominent, role in the formation of cohesion in spoken and extemporaneous texts. However, interlanguage composition at the university level approximates the stereotypical model of writing outlined by Douglas Biber: students aim to create texts that are structurally complex, unified, abstract, and free from most forms of situation-dependent reference (Biber, 1988, p. 37). The nominal demonstratives alone will be the focus of this corpus investigation of the emergence of cohesion and texture.

THE NOMINAL DEMONSTRATIVES AND EMERGENT TEXTURE

The basic hypothesis of this study is that interlanguage textual development is revealed in an increasingly sophisticated deployment of the nominal demonstratives. Briefly put, the absence or presence of the four nominal demonstratives in a given interlanguage text is a central indicator of its emergent texture.
Patterns of interlanguage cohesive development ought to be consistent with what is known about the complexities involved in the formation of texture. The division of labor among the nominal demonstratives in Standard English is somewhat unusual. As Halliday elaborates in the second edition of An Introduction to Functional Grammar (1994):

Given just two demonstratives, this and that, it is usual for that to be more inclusive; it tends to become the unmarked member of the pair. This happened in English; and in the process a new demonstrative evolved which took over and extended the 'unmarked' feature of that – leaving this and that once more fairly evenly matched. This is the so-called 'definite article' the. (p. 314)

In other words, the relations among the four nominal demonstratives are made somewhat complex in Standard English by the evolution of the lexical item, the. In addition, there is a distinction to make between the unmarked demonstrative when functioning as a Head and when functioning as a Deictic.

Historically, in fact, both it and the are reduced forms of that; and, although it now operates in the system of personals, both can be explained as being the 'neutral' or non-selective type of the nominal demonstratives – as essentially one and the same element, which takes the form it when functioning as Head and the when functioning as Deictic. (Halliday and Hasan, p. 58)

What this implies is that low-level interlanguage texts will rely heavily on the use of the definite article to establish cohesion. The cohesion of low-level interlanguage texts will mostly take the form of strings of anaphorically referenced lexical items introduced by the Deictic the, with further cohesion provided by the use of it as a Head. The four marked nominal demonstratives therefore will be conspicuous mostly by their absence.
In an article building on the work of Halliday and Hasan and other linguists who have examined the functioning of it, this and that in Standard English, Michael McCarthy has suggested a slight refinement of this basic scheme. McCarthy's work is particularly useful since it bases its conclusions on the analysis of a large sample of genuine texts. According to McCarthy:

1. It is used for unmarked reference within a current entity or focus of attention.
2. This signals a shift of entity or focus of attention to a new focus.
3. That refers across from the current focus to entities or foci that are non-current, non-central, marginalizable or other-attributed. (McCarthy, 1994, p. 275)

The table reproduced below helps explicate the distinction between unmarked, or non-selective, and marked reference among the nominal demonstratives. In its stark division between non-selective and selective reference, the table, which has been modified slightly from that presented in Halliday and Hasan's book to stress the primacy of non-selection, highlights why the definite article tends to predominate in low-level interlanguage texts:

Table 1. Demonstrative Reference (modified presentation from Halliday & Hasan, p. 38)

Semantic category      Non-selective   Selective
Grammatical function   Modifier        Modifier / Head   Adjunct
Class                  Determiner      Determiner        Adverb
Proximity: near                        this, these       here, now
Proximity: far                         that, those       there, then
Proximity: neutral     the

Low-level interlanguage texts possess only the most rudimentary system for specifying and identifying chains of lexical items in the text, nothing more. In comparison with these two uses of the unmarked demonstratives, each of the four forms this, that, these, and those is marked. In other words, the theory of markedness can furnish an explanation for why low-level interlanguage texts tend to eschew the use of the nominal demonstratives.
In turn, this helps to explain the fact that low-level interlanguage texts possess only emergent texture, the upshot of their unsophisticated deployment of the devices for achieving suitable levels of cohesion. The basic distinction in the deployment of the marked demonstratives is in relation to the point of view of the writer of the text. Within the text, this is used to make anaphoric reference to something that has just been mentioned by the writer or that is in some other way being taken as "near." The singular demonstrative that is used anaphorically to indicate something that is being taken as "far" from the writer's point of view (Halliday, 1994, pp. 314-315). Similarly, the nominal demonstratives, these and those, differentiate between proximate and remote plural reference from the point of view of the writer. Since "pro-forms save processing time by being shorter than the expressions they replace," the greater frequency of the four demonstratives in a given interlanguage text is usually associated with the writer's ability to create efficient texts (Beaugrande & Dressler, 1981, p. 64). The marked nominal demonstratives are thus important in establishing the coherence or structure of a mature interlanguage text. The distinction that Halliday and Hasan (1976) make in relation to the unmarked nominal demonstratives the and it also applies to the marked nominal demonstratives this, that, these, and those. In general, the demonstrative this will occur as a Modifier in sentences such as this tree is an oak or as Head in sentences such as this is an oak. In low-level interlanguage texts, the presence of this as a Modifier ought to set definite restrictions on the lexical sophistication of the nominal group to which it belongs.
One mark of interlanguage textual development is observed in the gradual elaboration of the linguistic environment in which this is discovered functioning as a Modifier. Nevertheless, the principal function of this in what appears to be a majority of extended English language texts is as an indicator of extended reference (Halliday & Hasan, 1976, p. 66). If Halliday and Hasan are right, growing interlanguage sophistication will be revealed in the gradual reorientation of the demonstrative adverb this away from its function as simple Modifier or more elaborate Head toward its use as an indicator of extended reference within the text. In other words, this will occur more frequently and in a wider variety of contexts in more sophisticated interlanguage texts. Its use will gradually extend to the introduction of nominal groups used to refer to segments of texts as linguistic acts in their own right. Sophisticated interlanguage texts will include nominal groups with the marked demonstratives, the textual function of which are "labels for stages of an argument, developed in and through the discourse itself as the writer presents and assesses his/her own propositions and those of other sources" (Francis, 1994, p. 83). Anaphoric reference tends to predominate in the interlanguage texts exhibiting the least cohesion. One upshot of this is that sophisticated interlanguage texts will exhibit less imbalance in their ratios of anaphoric and cataphoric reference. In other words, examples of cataphoric cohesion, which may involve the use of either this or here, will begin to emerge at higher levels of interlanguage development. In all likelihood, however, the emergence of cataphoric reference will consist largely in instances of what Halliday and Hasan refer to as grammatical cataphoric reference. 
In other words, the majority of instances of structural cataphora -- "the simple realization of a grammatical relationship within the nominal group" -- will be non-cohesive, even in high-level interlanguage texts (Halliday & Hasan, 1976, p. 68). Though highly revealing as examples of collocational fluency, structural cataphora is not an example of a cohesive tie and does not enter into the formation of texture. In contrast, examples of genuine cataphoric reference, though occurring with relative infrequency, may be evidence for the relative sophistication of a given sample of interlanguage. There is a necessary caveat, however. Particular genres appear to offer different possibilities for actualizing lexical and grammatical arrangements. Process paragraphs, for example, are an obvious example of a paragraph genre that allows for the actualization of genuine cataphoric reference. In this sense, it may prove more useful to analyze sub-corpora of particular genres in an effort to isolate more quickly the difference between texts with developed texture and those that employ compensatory strategies for achieving more limited forms of cohesion.

METHODS AND MATERIALS

The interlanguage corpus for this research project was created over the course of the Fall 1999 semester by the students enrolled in my Writing and Beginning Composition classes at Yonsei University. During the first 2 weeks of the new term, diskettes were distributed to all the students who had enrolled. The students were advised that the work that they would submit during the course of the semester would subsequently form part of an interlanguage corpus. They were told to submit all their work on the diskette, together with paper copies of the initial drafts of each assignment, for collection on scheduled dates throughout the semester. By the end of the semester, 109 single paragraphs had been collected from the students.
In terms of genre representation, the corpus consists of 38 samples of illustration, 27 samples of description, 18 samples of comparison/contrast, 11 samples of process, and 11 samples of persuasion. Sample titles from the illustration genre, together with the word count, are as follows: "My Family's Three Values" (253 words), "Painful Experience Often Teaches Valuable Lessons" (301), "Personality Through Clothes" (136), "About My Mother I Most Admire and Love" (383), "Buddha as a Real Egalitarian" (336), and "Kim Ku, The Only Politician Whom I Admire" (296). The description paragraphs include "A Blue Man on the Rainy Day" (280 words), "A Possession I Value" (272), "My Crowded but Comfortable Room" (318), "An interesting person" (245), and "My Favorite Bar or Restaurant" (257). The comparison and contrast paragraphs include "My Best Friends Eun Lang and Hae Won: the N and S Poles of a Magnet" (307), "My Personality: in Childhood and as a University Student" (285), "The Real Face of University Life" (226), "My Two Completely Opposite Friends" (295), and "The Movies; The Christmas in August and A Letter" (233). The process paragraphs include the titles "How To Appear More Intelligent Than You Are" (235), "How to Break up with Your Boy Friend" (434), "How to Break Up With Your Girl Friend" (349), and "How to Care a Hangover" (332). Finally, the persuasion paragraphs include "Suh Kap-sook, a Case for Censorship?" (231), "Globalization: Ideology or Reality" (447), "The Brain Korea 21" (410), and "Views on the Millennium -- Korean Economy" (378). The total running length of the corpus is 31,641 words. The paragraphs vary in length from a low of 123 words in the case of "Pablo Picasso" to a high of 603 words for "My Favorite Coffee Shops I Highly Recommend."
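The composition figures reported above can be tabulated in a short Python sketch. This is only an illustration: the variable names are assumptions introduced here and are not part of the original study. The per-genre counts as listed sum to 105, whereas 109 paragraphs are reported as collected; the sketch simply surfaces that difference and does not attempt to explain it.

```python
# Figures taken directly from the corpus description; names are illustrative.
genre_samples = {
    "illustration": 38,
    "description": 27,
    "comparison/contrast": 18,
    "process": 11,
    "persuasion": 11,
}
reported_paragraphs = 109   # paragraphs reported as collected
total_words = 31_641        # total running length of the corpus

# Sum of the per-genre counts as listed in the text.
listed_total = sum(genre_samples.values())

# Mean paragraph length, using the reported paragraph count.
mean_length = total_words / reported_paragraphs

for genre, n in genre_samples.items():
    print(f"{genre:20s} {n:3d}  {n / listed_total:.1%}")
print(f"listed genre samples: {listed_total} (reported paragraphs: {reported_paragraphs})")
print(f"mean paragraph length: {mean_length:.1f} words")
```

On these figures the mean paragraph length is roughly 290 words, which sits between the reported minimum of 123 and maximum of 603.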
The paragraphs in the corpus written by the Writing class students were all completed by the time of the mid-term examinations. These sets of three paragraphs cover a range of basic paragraph genres including description, illustration, and comparison/contrast. The paragraphs in the corpus written by the Beginning Composition students include all five written assignments required for the course of the semester. The paragraphs include the genres of description, illustration, process, and persuasion.

A SAMPLE OF LOW-LEVEL INTERLANGUAGE WRITING

It is useful to prelude an extensive analysis of the corpus with an examination of a representative corpus sample of low-level interlanguage composition. By examining the elementary cohesive ties within this type of composition, it will become clearer what aspects of the English cohesive system are subject to development. The following paragraph, entitled "My Favorite Bar or Restaurant," was written by a first-year male student in the Writing class as fulfillment of the requirement for a descriptive paragraph. Shinchon is the area of bars and restaurants frequented by students immediately around the front gate of Yonsei University:

1. There are a lot of places in Shinchon I often go. "Backstage" is my favorite bar where we can enjoy music videos on screen with kinds of drink. I will introduce here to you. Descending steep stairs, you can see a filthy door to which diverse ad-posters attach. After opening the door, the air in the inner part is thick with tobacco smoke and it is too dark for here to see the front for a while. To the left of the door, a well-lighted counter is opposite to two large pillar stuck to posters of famous rock bands. The counter is filled with many video tapes and various beverages.
In front of the door or the counter, sofas are put from left to right facing a large screen which displays all sorts of rock music clips. On the screen you can see several genre clips that are from USA, Japan, Europe and even the Third World. Near the screen stand four huge speakers somewhat-broken by careless persons. In several places, there are some TV that is for those who are far from main screen or want to appreciate video clips in detail. Around the wall adhere some pictures, posters and scribbles on the base of grotesque wallpapers. You can feel this place so strange if you are not accustomed to dark atmosphere or rock music. However, Backstage will be your best friend if you pay attention to rock or be your eccentric fellow if you have an eye for the unknown world. There are a number of points that can be made about the texture of this particular paragraph. The most basic point is this: at levels of development represented by texts like this, interlanguage texts rely almost exclusively on the neutral non-selective the to establish textual cohesion. The second point is that interlanguage texts at this level of competence reveal a definitely limited capacity for lexical reiteration. According to Halliday and Hasan (1976), reiteration is a form of lexical cohesion which involves the repetition of a lexical item, at one end of the scale; the use of a general word to refer back to a lexical item, at the other end of the scale; and a number of things in between -- the use of a synonym, near-synonym, or superordinate. (p. 278) Reiteration in texts such as "My Favorite Bar or Restaurant" takes place almost exclusively at the end of the scale marked out by repetition. In other words, reiteration as a form of lexical cohesion in interlanguage texts like this involves simple lexical repetition and the neutral non-selective use of the definite article as an anaphoric device. 
In addition, there is only one use of the word it as a Head: "After opening the door, the air in the inner part is thick with tobacco smoke and it is too dark for here to see the front for a while." This sentence is an example of it as a relational attributive Head, a form of non-cohesive grammatical cataphora (Halliday, 1994, p. 143). In this clause, it could be replaced as Subject by the circumstantial demonstrative, here. There is one other use of the circumstantial demonstrative in the composition: I will introduce here to you. Both of these citations precede the subsequent use of the marked nominal demonstrative, this place. There is thus only a single citation for this. What is more, this citation is in relation to the subject of the description itself and does not occur until the penultimate sentence: "You can feel this place so strange if you are not accustomed to dark atmosphere or rock music." The nominal group this place is an example of what Francis has called a "retrospective label," one that "serves to encapsulate or package a stretch of discourse" (Francis, 1994, p. 85). As Francis suggests, the central defining quality of a retrospective label is that "there is no single nominal group to which it refers: It is not a repetition or a "synonym" of any preceding element. Instead, it is presented as equivalent to the clause or clauses it replaces, while naming them for the first time" (Francis, p. 85). It is a working hypothesis that the first labels to emerge in low-level interlanguage texts are retrospective labels that encapsulate the meaning of the entire text itself, echoing, if they echo anything at all, the title of the composition. At a higher level of interlanguage, advance labels, in which "the label precedes its lexicalization" (Francis, p. 83), will start to emerge.
Once again, it might be expected that advance labels would be used in the first place to indicate the purpose of the entire text. However, the distinction in single paragraphs between advance and retrospective labels that encapsulate the meaning of entire texts and those that encapsulate only a portion of them is fuzzy. In order to demonstrate the correctness or otherwise of these more or less intuitive judgments, a contrastive analysis of the extent of cohesion and labeling in a corpus of five paragraph essays will be necessary. "My Favorite Bar or Restaurant" contains 15 instances of the use of the non-selective definite article the. Among this total there are 7 instances of specific anaphoric reference back to a previously introduced noun: a well-lit counter, a large screen, and a filthy door. Moreover, in each one of these 7 instances, the repetition takes the simplest form of unadorned Modifier and Head. In other words, no premodifying elements are realized; and the Head is a form of cohesion achieved through repetition rather than lexical modification. In addition, a large number of references are explained by the content-free status of the definite article. As Halliday and Hasan write, "the definite article … merely indicates that the item in question IS specific and identifiable; that somewhere the information necessary for identifying it is recoverable" (1976, p. 71). The fact that this text offers a description of the interior of a favorite bar in Shinchon serves to explain the references to the air in the inner part, the left, the front, Around the wall, and the base of grotesque wallpapers. Nevertheless, even this sample of interlanguage bears out Halliday and Hasan's contention that "purely anaphoric reference never accounts for a majority of instances [of cohesive textual reference in any textual sample, written or spoken]" (1976, p. 73).
Of the 15 examples of reference involving the definite article, seven of them -- or approximately one half -- are anaphoric. One possible implication of Halliday and Hasan's work is that samples of low-level interlanguage are characterized by the relative absence of cataphoric reference. Further corpus analysis will reveal whether the relative sophistication of an interlanguage text is measurable in terms of the ratio of anaphoric to non-anaphoric reference. In this particular example of low-level interlanguage, the ratio is approximately 50:50, which seems unusually tilted toward anaphoric reference. As has already been suggested, this interlanguage text makes only limited use of the demonstratives themselves. There are, for example, no textual citations for that or these. There is one example of a structurally cataphoric (but therefore non-cohesive) instance of those in the nominal phrase: for those who are far from main screen or want to appreciate video clips in detail. The dominance of the definite article is a central indication of the linguistic absence of a developed capacity for the type of nominal demonstrative textual pointing. Moreover, where there are nominal demonstratives present, the text has previously established a "right to point" in the form of a series of collocationally significant semantic references (cf. Halliday & Hasan, 1976, pp. 284-288). In the case of the text under consideration, this series of references includes the steep stairs, filthy door, and the air in the inner part of this place. In this way, the interlanguage text indicates that the information necessary for the identification of this place is textually available. From this brief analysis, it is possible to draw out a number of working hypotheses about the emergent texture of low-level interlanguage. 
First, there is the tendency to rely almost exclusively on the definite article to carry the burden of cohesion, even if this means confining properly cohesive relations to anaphoric reference alone. In other words, low-level interlanguage is characterized by an absence of meaningful cataphoric reference. Furthermore, it seems that whatever exophoric reference there is in low-level interlanguage takes the form of well-established collocational items such as the Third World and an eye for the unknown. Thirdly, low-level interlanguage texts such as this one appear to be characterized by the absence of even the simple kind of forward reference in which the definite article refers to a modifying element within the same nominal group (Halliday & Hasan, 1976, p. 72). This is in contrast to "most other varieties of spoken and written English [where the] predominant function [of the] is cataphoric" (Halliday & Hasan, p. 73). Finally, low-level interlanguage texts like "My Favorite Bar or Restaurant" lack a developed capacity for signaling what Michael McCarthy has termed the "topical entity in current focus." The exclusive use of it as the unmarked demonstrative demonstrates an inability to highlight its noun phrase antecedent for signaling shifts in textual content (McCarthy, 1994, p. 273). This is because the use of it simply allows for the continuation of what the text is focusing on; "it does not itself perform the act of focusing" (McCarthy, p. 271). In contrast, the function of this and that is to "operate to signal that focus is either shifting or has shifted" (McCarthy, p. 272). The relative absence of the marked demonstratives therefore indicates an underdeveloped capacity for switching the focus of attention for purposes of textual interest and complexity.
Nevertheless, in spite of these obvious limitations, "My Favorite Bar or Restaurant" does demonstrate the meaningfulness and usefulness of the concept of emergent texture. It is from such elementary beginnings that the capacity for establishing extensive cohesive relations will develop.

THE FUNCTION OF THE NOMINAL GROUP

Relations within the nominal group play a central role in the emergence of texture. An analysis of the nominal demonstratives consists therefore in a detailed study of the emergence of complex relations among elements of the nominal group. The relations among these functions in the nominal group are outlined in the table below:

Table 2. The Structural Analysis of a Nominal Group (from Halliday & Hasan, 1976, p. 40)

Item                 Structure: logical   Structure: experiential   Class
The                  Premodifier          Deictic                   Determiner
two                  Premodifier          Numerative                Numeral
high                 Premodifier          Epithet                   Adjective
stone                Premodifier          Classifier                Noun
walls                Head                 Thing                     Noun
along the roadside   Postmodifier         Qualifier                 [Prepositional Group]

The nominal demonstratives form part of the function of Deictics, which are used for specifying by identity, both non-specific and specific, including forms of identity based on reference (Halliday & Hasan). In turn, the Numerative specifies by quantity or ordination (two trains, next train); the Epithet by reference to a property (long trains); the Classifier by reference to a subclass (express trains, passenger trains); and the Qualifier by reference to some characterizing relation or process (trains for London, train I'm on). (p. 42) The general limits on interlanguage flexibility evident in the use of lexical reiteration were also evident in the behavior of the nominal group. The strongest evidence for the lack of flexibility was the absolute predominance of the four demonstratives with an accompanying unadorned Head in the corpus. This seems a reasonable and even predictable finding.
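A first-pass tally of demonstratives and the words that follow them could be sketched in Python as below. This is a hypothetical illustration, not the procedure used in this study: the function name and the regular expression are assumptions, and the pattern only captures the word immediately after the demonstrative. It cannot by itself separate an unadorned Head from a premodifier, nor the demonstrative that from the conjunction or relative that, so any output would still need manual checking against a concordance.

```python
import re
from collections import Counter

# Hypothetical helper: pairs each nominal demonstrative with the next word.
PATTERN = re.compile(r"\b(this|that|these|those)\s+([A-Za-z][A-Za-z-]*)",
                     re.IGNORECASE)

def demonstrative_pairs(text):
    """Count (demonstrative, following word) pairs, lowercased."""
    counts = Counter()
    for demonstrative, word in PATTERN.findall(text):
        counts[(demonstrative.lower(), word.lower())] += 1
    return counts

# Two sentences quoted from the corpus samples discussed in this article.
sample = (
    "You can feel this place so strange if you are not accustomed "
    "to dark atmosphere or rock music. "
    "At that time, he showed his great interest in mouldering "
    "as well as painting."
)

pairs = demonstrative_pairs(sample)
for (demonstrative, word), n in sorted(pairs.items()):
    print(demonstrative, word, n)
```

Run over a whole corpus file, a tally of this kind would yield candidate lists that an analyst could then prune by hand into demonstrative plus unadorned Head citations.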
There was a similar tendency for the definite article to appear in the company of an unadorned Head in the paragraph chosen to illustrate the more general limits of interlanguage lexis and cohesive relations, "My Favorite Bar or Restaurant." The full list of examples is as follows:

This + Unadorned Head
committee, book, idea, interest, year, allegation, article, bar (3 times), blaze, book (2 times), case, century, era, expectation (2 times), field, house (2 times), information, instance, investigation, job, kitchen, model, ordeal, person, photograph (2 times), place (7 times), plan (2 times), pressure, project (2 times), question, reason (4 times), report, restaurant (2 times), room, rule, self-development, semester, sense, shop, situation, society, stage, summer, truth, university, way (4 times), year (2 times)

That + Unadorned Head
case, man, method, money, point, policy, position, problem, question, reason, slum, time, way

A comparison of the use of both singular demonstratives with an accompanying unadorned Head reveals three lexical items in common. Both demonstratives occur with problem, question, and way. The marked singular demonstrative that also occurs with a number of other abstract nouns: method, point, policy, position, reason, and time. There are three examples where that occurs with a concrete noun: money, slum, and man. Is there a tendency for interlanguage texts to allow this to carry the burden of lexical reiteration and that to carry the burden of the abstract construction of the unfolding argument? The corpus does tend to show a pattern of use for that in relation to the past, from texts of fairly limited sophistication to more obviously complex ones:

2. Pablo Picasso is one of the most creative artists in twentieth century. He was born in 1881, and died in 1973. Though he was Spanish, he played an active part in France.
At first he studied art in Barcelona, and fixed in Paris since 1904. At that time, he showed his great interest in mouldering as well as painting.

3. Last summer, on a heavy rainy day, I was sitting on the bench in front of Lotte Department store, waiting for my friend. At that time, I could see faintly someone coming towards me from a distance.

The obvious question then is under what circumstances that tends to get used with concrete nouns. In each of the three citations in the corpus, the demonstrative that is used in establishing a reference within the past:

4. The Reverend Choi is a leader of Dail Community whom I admire because of his power of love toward neighbor. He was born in Seoul, 1957 and grew up as a Christian. One day in 1988, He met a helpless and sick old man in front of railway station and couldn't pass by him, so he served that man a meal.

5. She was born in the province of Scopeye in Yugoslavia in 1910. After she was called as a nun she was dispatched to Calcutta in India, where she answered God's voice to help the poorest among the poor by establishing "Missionaries of Charity". She had served as she did God in that slum the hurt, poor and sick with whom nobody wanted to contact until the death of heart disease in 1997.

6. In 1975, it was a time that everything looked safe and stable. The restaurant was always crowded and she was six months pregnant her fifth baby. However, grandfather was defrauded of his house and every lands. His friend allured him to invest his money in a new business, but he ran away with that money.

The use of the demonstrative that with more complex nominal groups provides a small amount of evidence for the idea that its major function is to establish past references.
The following paragraph, which is an extended comparison between the 1970s movie Jaws and the 1990s movie Deep Blue Sea, contains three examples of that used in this way:

7. Jaws and Deep Blue Sea have many similarities despite of a long time gap. They have screaming girls in bikinis, floods of bloody water and that ominous gliding fin. In fact, the opening sequence of Deep Blue Sea is almost the same with that of Jaws; It starts with a few young people attacked by an unseen object under water. Even the posters, in which a woman is swimming in the sea and a shark is just behind her with a wide open mouth, are similar. But, these two movies are very different in the way they scare people. In Jaws we had just one, dumb shark with a heavy mechanical equipments inside, but none of the scenes of the existing horror movies have ever been as scary as that moment in Jaws when the shark first lifted its nose out of the water.

However, there are problems with this argument. If there is a tendency to use the singular demonstratives in this manner, with this used to establish concrete and present references and that used to establish abstract and past references, it might be expected that the same tendency would be discovered with the plural demonstratives. The corpus citations for these and those, however, are inconclusive, even if the small number of citations for the use of the latter is taken into account.

These + Unadorned Head
accidents, cooks, days (2 times), dishes, examples, expectations (2 times), facts, governments, instruments, international financial organizations, measures, methods (2 times), pictures, places, products, reasons (3 times), sections, shops, statistics, steps (2 times), stories, things (4 times), values, ways

Those + Unadorned Head
reasons, things

At first glance, there does appear to be the same tendency to use the plural demonstrative those to establish abstract reference.
However, the two nouns discovered in the environment of those are also found more frequently with these: reasons and things. The appearance of things is particularly important since this word in both its singular and plural form is used as an anaphoric reference in the forming of texture. If the pronoun it is temporarily excluded, it is the word thing that is the lexical item used in the establishment of anaphoric reference at the most general of textual levels. The word thing "usually excludes people and animals, as well as qualities, states and relations, and … always excludes facts and reports" (Halliday & Hasan, 1976, p. 279).

The evidence for a consistent pattern involving the marked demonstratives is therefore inconclusive. The possibility that interlanguage texts use this to establish concrete and present references and that to establish abstract and past references will require the analysis of a larger corpus of interlanguage texts. This analysis would be useful in understanding whether these uses of this and that relate to interlanguage development. Specifically, it is of relevance to the issue of the writer's growing ability to distinguish textually a shift of attention to a new focus and one that "refer[s] across from the current focus to entities that are non-current," the latter being the first of McCarthy's criteria for distinguishing between the uses of this and that.

THE DEMONSTRATIVES AND COMPLEX NOMINAL GROUPS

The general restrictions that apply to the use of the nominal group with demonstrative reference can be seen more clearly in the few samples of greater nominal group complexity that occur in the corpus. It is particularly revealing to examine the manner in which these more complex nominal groups emerge during the course of textual development. Normally, the text prepares for the entrance of these nominal groups in significant ways.
The general tendency seems to be that complex nominal groups are "grown" from previously existing textual possibilities (cf. Halliday, 1992, p. 70). Such an explanation accounts for a passage such as this one: 8. In the middle of the coffee shop, you will see an interesting plastic art made of glass. This seems like the glass boxes piled up from the floor to the ceiling. There are seven sections in this glass pillar. These sections are empty excluding second and fifth, which are filled with empty plastic bottles. This special plastic art is associated with simple and modern mood of here. The following example of a place description is unusual in its use of more complex nominal groups to achieve textual reiteration. The description varies its use of accompanying nominal group epithets, moving from the street to Insadong Street to this small street and ending up with this famous street: 9. Insadong is my favorite place in Seoul, located between the Korea Times Building and Pagoda Park. The main part of Insadong runs along the street with the same name. It is located between two east-west running avenues in the downtown area and either avenue can be considered an entrance. The south end of Insadong starts at Pagoda Park on Chongno Street. Chongno is itself a major thoroughfare passing an important business section of Seoul. From Chongno, Insadong Street runs a northwest diagonal until it reaches Yulgongno, another major avenue. This small street is loaded with antique shops selling all sorts of Korean antiques and handcrafts. Many stores are specialty shops featuring items such as chests, other furniture, stationery. But this famous street is not limited to antiques. More typically, interlanguage texts tend to employ the concluding sentence to sum up central aspects of the previous discussion. The following example of the use of a demonstrative with a complex nominal group occurs at the end of a description of a garden located on the campus of Yonsei University: 10. 
Finally, in the winter, the snow covered trees make an awesome scenery. I have not seen the latter scene yet, but the seniors say that it is wonderful. So I can not wait until winter. The birds singing and cute squirrels and Korean magpies running around also make "Chung-Song-Dae" rich in atmosphere. Those who have dreams have many convincing reasons why they should definitely visit this magnificent garden.

The major examples of complex nominal group reference all occurred in extended comparison or contrast paragraphs. Examples of these included these two films (2 times), these two movies (2 times), and these two people. One mark of the lack of sophistication of the majority of texts in the corpus is thus measured in the strict limits placed on the complexity of the nominal group introduced by the demonstratives.

REITERATION AND THE LIMITS OF INTERLANGUAGE LEXIS

It is useful at this point to recall the basic idea that reiteration is a form of textual cohesion that involves a variety of lexical possibilities. In its most basic form, reiteration simply means the repetition of a lexical item. This form of textual cohesion dominates low-level interlanguage texts. Halliday and Hasan note, however, that this type of cohesion may also involve the use of a synonym or near-synonym, a superordinate or the use of a general noun (1976, p. 278).

The general principle behind this is simply that demonstratives, since (like other reference items) they identify semantically and not grammatically, when they are anaphoric require the explicit repetition of the noun, or some form of synonym, if they are to signal exact identity of specific reference; that is, to refer unambiguously to the presupposition at the identical level of particularization. A demonstrative without a following noun may refer to some more general class that includes the presupposed item….
(Halliday & Hasan, 1976, pp. 64-65)

The class of general nouns is defined by Halliday and Hasan as "a small set of nouns having generalized reference within the major noun classes, those such as 'human noun,' 'place noun,' 'fact noun' and the like" (p. 274). The general noun operates on a borderline "between a lexical item (member of an open set) and a grammatical item (member of a closed set)" (p. 274). The list of the class of general nouns follows:

Table 3. The Class of General Nouns

Class                      Examples
human                      people, person, man, woman, child, boy, girl
non-human animate          creature
inanimate concrete noun    thing, object
inanimate concrete mass    stuff
inanimate abstract         business, affair, matter
action                     move
place                      place
fact                       question, idea

As Halliday and Hasan (1976, p. 175) note, "a general noun in cohesive function is almost always accompanied by the reference item the … The most usual alternative to the is a demonstrative…" It is probably of significance that the only class category to be adequately represented in the corpus in conjunction with the definite article is the class of human general nouns. There were nine citations for the people, seven for the person, six for the man (but none for the boy), two for the girl (but none for the woman), and none for the child. There seemed to be a small but real tendency to use the girl as the unmarked general noun for women but the man as the unmarked general noun for men. There was also a single reference to this person in an essay on the subject of admiration for the late Korean nationalist, Kim Ku.

What is striking from the point of view of lexical cohesion is that the corpus contains virtually no citations from the other classes of general nouns listed by Halliday and Hasan. The only category in which they appeared was that of general nouns of fact. The corpus contained one reference to the idea and one to the question. The latter citation, however, occurred within a portion of quoted text from an English language source.
There was one citation for this question and one for this idea, with no citations for the corresponding plural forms. There was also one reference to that question but none to that idea, with no citations for the corresponding plural forms.

This general under-representation of the category of general nouns in the corpus has obvious implications for the ability of these interlanguage texts to achieve the full range of lexical cohesion. What it means is that these interlanguage texts restrict their use of the items that make up the category of general nouns to lexical instances of specified anaphoric reference within the text. In other words, with the partial exception of the category of human nouns, these interlanguage texts do not appear to trade at the level of the abstract general noun. Moreover, a number of the citations relating to the category of human nouns may be the result of the fact that a suggested paragraph topic was "The Famous Person I Most Admire." Although the general nouns occur frequently as lexical items, an entire cohesive level remains almost entirely unrealized.

There are a number of possible explanations for this; two will be considered. The first is that the absence of the full range of abstract general nouns is the result of the lexical constraints of the genres represented in this corpus. In other words, a different choice of genres, regardless of interlanguage considerations, will result in a greater overall representation of the class of abstract general nouns. The second explanation is that this absence represents a significant limitation on the lexical range actualized in interlanguage itself. It is the second explanation that this paper favors. The class of general nouns is absent because of the nature of interlanguage texts themselves.
Beyond their function in achieving textual cohesion by virtue of the reference back to a previous nominal group, general nouns regularly signal the ability of the writer to refer to textual material in an interpersonal manner (Halliday & Hasan, 1976, p. 276). It seems reasonable to suggest that the capacity for referring to textual material in this way emerges only at highly sophisticated levels of interlanguage. Naturally, more extensive corpus analysis will be necessary to support or refute this working hypothesis. In this respect, one fruitful line of enquiry would be the investigation of a corpus of five paragraph essays, weighted toward the genres of argument and persuasion. Other things being equal, such a corpus might be expected to contain examples of inanimate concrete and inanimate abstract general nouns. The absence of these lexical items would offer further evidence as to whether the seeming inability to trade at the level of the general noun observed in this corpus represents a genuine limitation on the achievement of a high level of interlanguage texture.

LABELS, SYNONYMS AND NEAR-SYNONYMS WITH THE DEMONSTRATIVES

In low-level interlanguage texts, there is very little use of synonyms or near-synonyms to achieve lexical reiteration. At this stage in interlanguage development, the writer's lack of substantial lexical depth means that the establishment of a basic overall textual meaning takes absolute precedence. The upshot is that samples of interlanguage offer little evidence of the writer's contemplation of synonymous or near-synonymous lexical items. The overriding importance of establishing overall textual coherence explains the early use of anaphorically cohesive nominal groups as retrospective labels. As Gill Francis explains, a retrospective label "is not a repetition of a 'synonym' of any preceding element. Instead it is presented as equivalent to the clause or clauses it explains, while naming them for the first time" (Francis, 1994, p. 85).
Certain genres, among them process or persuasion paragraphs, tend to encourage the use of retrospective labeling. Favored lexical items found in the corpus to achieve this kind of limited textual coherence in conjunction with the use of the demonstratives include allegation, case, examples, facts, measures, ordeal, plan, project, reason (four times), situation, steps (two times), truth, report, and way (four times). The simplest form of this type of lexical coherence occurs in the following description of a friend of the writer:

11. Moreover, Chang is not afraid of expressing himself, probably because since he was young, he was encouraged to do what he feels is right. He is very straight forward in stating his thoughts, therefore he often ends up hurting others' feeling, although not done on purpose. For this reason, there are many people who love Chang, but there are also many people who hate Chang

The following four paragraph conclusions use virtually identical techniques to achieve textual unity. The paragraphs demonstrate the use of four different labels introduced by the proximate plural demonstrative: examples, facts, measures, and steps respectively:

12. Se-lim is a feminine girl, and her garments shows her feminine personality well. She usually puts on a skirt and a laced blouse and a pair of shoes which have high heels. She likes cute accessories, too. Through these examples, we can see that people's personalities affect the style of clothing.

13. There is also some belief that Korean students are spoiled and spend lots of money on drinking and playing. This idea probably comes from the fact that many commercial districts are being developed near the university areas, but the reality isn't always like that. As a matter of fact, most of my university friends have a part time job to make money for tuition.
Also nowadays books have become really expensive, so more than ever, huge amounts of money go into buying books for class. From these facts, one could see that most Korean students can't afford to be spoiled. 14. You can speak to her on her face or on the telephone. You can also write her when you are scared to tell her the truth directly. It is the most powerful and definite way to make her know your resolution since she would perfectly know what you are thinking and what you are going to do. When you are looking for a reliable way to break up with your girlfriend you can change your attitude toward her, take symbolic actions and tell her what you have in our mind. These measures help you to get separated from your girlfriend without difficulty. 15. In sauna, 20 to 30 minutes of bath and 10 to 20 minutes of sauna will make your body relaxed, and then you take a nap in sleeping room in the sauna. After two or three hours of sleep, you take a shower and come home. All these process in the sauna will finally make you sober. In the evening, you have a regular dinner. However, it is extremely important to have a dinner because if the sulong-tang was the first step to cure a stomach-ache, having a regular meal is the final step. When you finish the dinner your stomach-ache will go away. Follow these steps and you will completely forget about your hangover. A related use of the plural noun reasons in conjunction with a fronted these occurred twice as an alternative way of summarizing and unifying a connected series of arguments or propositions: 16. On the wall, there are all kinds of posters: posters of movie stars, old newspapers, racing cars, scenes from movies, etc. Moreover, the lighting is suitably dim and the music is not too loud to have a conversation. As you take a seat, you'll find a redstripped tablecloth on the table, with the names of the waiter and the cook on one side. 
Just as you decide on what to eat, the waiter, in green shirt and black pants, will be at your side in an instant, kneeling down on the floor as he takes the order. The food is quite good and the service cannot be better. These are a few reasons why I prefer Bennigan's over other restaurants

17. The last and most irritating thing for Kim was the comparison with Pak, who had already won four times in her first year debut in the LPGA tournaments. The mass media only emphasized their scores totally ignoring their situations. In conclusion, her mental strength and continuous effort made it possible for her to surmount her physical weakness, harsh environment and stress from comparison with Pak Seri and these are the reasons why I admire Kim Mi-Hyun.

Genuine examples of the use of synonyms or near-synonyms are quite rare in the corpus. An example of their use, however, is the following text on the recent government plan to reform Korean universities:

18. But this plan has problems in four ways, so this should be reconsidered seriously right now. Firstly, the project is drift in the wrong direction. The scheduled beginning of the project was postponed the day after it was first announced due to a change of the education minister. Many revisions have also been done after the original public announcement, so these are revealing that the project was put together too hastily ("The BK21" The Yonsei Annals). Secondly, this plan meets with most regional universities. They claim that a disproportionate amount of support would go to the prestigious universities in the capital area by this project.

The use of retrospective labels as a means for achieving textual cohesion tends to confirm the working hypothesis that anaphoric reference predominates in interlanguage texts exhibiting basic emergent texture.
In other words, advance labels, in which the label precedes the lexicalization, are uncommon in the corpus. Moreover, it seems plausible to assume that examples of advance labeling in interlanguage texts will tend to be resolved within a sentence or two. The one example of advance labeling in the corpus, which occurs in a paragraph dealing with the subject of how to break up with a boyfriend, takes place within the confines of a sentence, across the space of a colon:

19. Everything would look perfect when you began to go out with your boy friend. As time went by, however, you found many problems in the relationship with him. Finally, you decide to break up with your boy friend. It will be a difficult experience. However, you can break up with your boy friend if you follow these steps: think about the reasons to break up, have a break time, get separate and put memorial things away.

Since it is a working hypothesis that more sophisticated interlanguage texts show a gradual decrease in the frequency imbalance between anaphoric as opposed to cataphoric reference, the gradual emergence of advance labels is also a sign of interlanguage development. The ability to pursue lexical cohesion across larger portions of text is a sign of a progression beyond emergent texture.

THE DEMONSTRATIVES AND CATAPHORIC REFERENCE

Examples of genuine cataphoric reference were rare in the corpus. This is not surprising, given that the unmarked anaphoric reference is still a source of textual cohesive difficulty at this level of interlanguage development. Judging by the example of "My Favorite Bar or Restaurant," the plural far demonstrative those emerges at an apparently very early stage of interlanguage in the formation of cataphoric non-cohesive reference.
Other typical examples of this type of non-cohesive reference included: those of gothic church; those of Shakespeare's; those who are enrolled in the science high school; those who had different political orientations; those who have dreams; those who need help; those within cultural circles; and those who agree with [sic].

The genre of interlanguage writing containing the most examples of cataphoric reference was that of advice to the reader. One example of cataphoric reference involving the demonstratives occurred in a paragraph on the subject of how to cure a hangover:

20. At first, one preventive measure is this: Never drink enough to get really drunk.

The second example occurred in a paragraph dealing with the recent spate of deadly fires in public spaces in Korea in which the writer employs the notion of moral hazard to explain the apparent indifference to safety on the part of many public officials:

21. The word moral hazard is defined like this: "Moral hazard arises when individuals, in possession of private information, take actions which adversely affect the probability of bad outcomes."

The main point to make about these examples is that they tend to confirm the general idea developed in the discussion of synonyms and near-synonyms: These interlanguage samples tend to develop broad rhetorical patterns of textual coherence. It is then within these broadly defined patterns that finer cohesive relations begin to emerge. In each of these examples, the particular genre is important. The genre gives to the interlanguage text abstract rhetorical possibilities for cataphoric reference. Depending on the sophistication of the writer's interlanguage, this abstract rhetorical possibility may be activated. There are two main points to make about this.
The first is that cataphoric reference in the corpus is nonetheless rare, even in those paragraphs dealing with the description of process in which it might be expected. The second point is that when it occurs, the cataphoric reference is resolved quickly, indeed, in each of the three cases cited, intra-sententially.

CONCLUSION

The concept of emergent texture would appear to have a promising future in the ongoing investigation of interlanguage. In particular, it reveals its usefulness in its relative objectivity as a means of analyzing lexical relations above that of the individual sentence. This is so long as the concept of markedness is used in a consistent manner within a project committed openly to the investigation of actual interlanguage corpora. The marriage of functional grammar and text linguistics provides a rich store of useful concepts with which to continue this investigation of interlanguage corpora.

One obvious possibility for future work would be the extension of this study of the nominal demonstratives to related examinations of the emergence and function of the systems of personal and comparative reference within the single paragraph. Within the framework provided by Halliday and Hasan, there is also the possibility of future studies extending this initial study to a corpus of five paragraph essays, taking in the full range of reference including substitution, ellipsis, and conjunction. The most interesting area for future interlanguage research, however, is undoubtedly the range of lexical cohesion. This research would involve crucial issues of relevance to many aspects of second language learning, focusing as it would on the shape and size of interlanguage semantic fields. Concretely, the analysis would involve in-depth studies of the various kinds of interlanguage reiteration, including the use of synonyms and near-synonyms, superordinates, and collocates.
Much of this research would have a useful contribution to make to the new field of interlanguage semantics and second language mental lexicons (cf. Hatch & Brown, 1995).

Naturally enough, the issue of cohesion does not exhaust the complex issues surrounding the evaluation of the relative sophistication of a given interlanguage text. This study has argued for the importance of integrating an analysis of the degree of emergent texture as a means for such evaluation. The analysis has attempted to demonstrate a rough and ready distinction between low-level interlanguage texts that rarely or never employ the nominal demonstratives and interlanguage texts of greater sophistication that do. Naturally, an emergent texture analysis capable of distinguishing among the full range of interlanguage achievement will require much more detailed corpus research.

Moreover, there is an obvious reason why the analysis of interlanguage cohesion ought to form part of a wider investigation of interlanguage textual linguistics. Two samples of interlanguage with relatively similar kinds of cohesive relations may differ widely in terms of the ease, efficiency and appropriateness of the information they convey to the reader (Beaugrande & Dressler, 1981, p. 34). Happily enough, markedness theory, in the shape of default settings for the presentation of argument and the establishment of coherence, also has its role to play at the level of textual coherence (cf. Beaugrande & Dressler, pp. 143-161). In this regard, the role of the corpus and of corpus software will be important as a means for equipping applied linguists with a more refined set of tools for the analysis of texture and textuality.

NOTE

1. Michael McCarthy, in an otherwise excellent essay, states that Halliday and Hasan do "nothing to resolve the difference between it on the one hand and this and that on the other" (1994, p. 267).
This is not entirely accurate. Halliday and Hasan resolve this difference implicitly, stating that "both it and the … can be explained as being the 'neutral' or nonselective type of the nominal demonstratives" (1976, p. 58).

ABOUT THE AUTHOR

Dr. Terry Murphy's interest in second language writing is part of his overall interest in textual linguistics and narrative discourse within the sociology of culture. This essay is a modified version of a thesis he submitted to the University of Birmingham in partial fulfillment of the requirements for an MA in TEFL.

E-mail: tmorpheme@hotmail.com

REFERENCES

Beaugrande, R. de. (1997). New foundations for a science of text and discourse. Norwood, NJ: Ablex Publishing Company.

Beaugrande, R. de, & Dressler, W. (1981). Introduction to text linguistics. New York: Longman.

Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University Press.

Clark, H., & Clark, E. (1978). Universals, relativity and language processing. In J. Greenberg (Ed.), Method and theory, Universals of Human Language Vol. 1 (pp. 235-277). Stanford, CA: Stanford University Press.

Connor, U. (1996). Contrastive rhetoric. Cambridge, UK: Cambridge University Press.

Francis, G. (1994). Labelling discourse: An aspect of nominal-group lexical cohesion. In M. Coulthard (Ed.), Advances in written discourse analysis (pp. 83-101). New York: Routledge.

Freedman, A., Pringle, I., & Yalden, J. (Eds.). (1979). Learning to write: first language/second language. New York: Longman.

Goldmann, L. (1964). The hidden god: A study of tragic vision in the Pensées of Pascal and the tragedies of Racine (P. Thody, Trans.). London: Routledge.

Greenberg, J. (1966). Language universals. The Hague: Mouton.

Halliday, M. (1992, August 4-8). Language as system and language as instance: The corpus as a theoretical construct. In Jan Svartvik (Ed.), Directions in corpus linguistics, Proceedings of Nobel Symposium 82 (pp. 61-77). New York: Mouton De Gruyter.

Halliday, M. (1994). An introduction to functional grammar (2nd ed.). London: Edward Arnold.

Halliday, M., & Hasan, R. (1976). Cohesion in English. New York: Longman.

Hatch, E., & Brown, C. (1995). Vocabulary, semantics, and language education. Cambridge, UK: Cambridge University Press.

Jakobson, R. (1957). Shifters, verbal categories and the Russian verb. Russian Language Project. Department of Slavic Languages and Literature. Cambridge, MA: Harvard University.

Jakobson, R., & Pomorska, K. (1983). Dialogues. Cambridge, MA: The MIT Press.

Kaplan, R. (1966). Cultural thought patterns in intercultural education. Language Learning, 16, 1-20.

Kroll, B. (Ed.). (1990). Second language writing: research insights for the classroom. Cambridge, UK: Cambridge University Press.

Laufer, B., & Nation, P. (1998). Vocabulary size and use: lexical richness in L2 written production. Applied linguistics, 19(2), 225-254.

McCarthy, M. (1994). It, this and that. In M. Coulthard (Ed.), Advances in written discourse analysis (pp. 266-275). New York: Routledge.

Rutherford, W. (1982). Markedness in second language acquisition. Language learning, 32(1), 85-109.

Rutherford, W. (1987). Second language grammar: Learning and teaching. New York: Longman.

Shaw, P., & Liu, E. (1998). What develops in the development of second-language writing? Applied linguistics, 19(2), 225-254.

Waugh, L. R. (1976). Roman Jakobson's science of language. Lisse, The Netherlands: The Peter De Ridder Press.

Language Learning & Technology http://llt.msu.edu/vol5num3/wang/ September 2001, Vol. 5, Num. 3 pp. 174-184

EXPLORING PARALLEL CONCORDANCING IN ENGLISH AND CHINESE

Wang Lixun
The Open University of Hong Kong

ABSTRACT

This paper investigates the value of computer technology as a medium for the delivery of parallel texts in English and Chinese for language learning.
An English-Chinese parallel corpus was created for use in parallel concordancing -- a technique which has been developed to respond to the desire to study language in its natural contexts of use. Specific problems of dealing with Chinese characters in concordancing are discussed. A computer program called English-Chinese Parallel Concordancer was developed for this research. The operation of the program is demonstrated through screen shots. The pedagogical application of parallel concordancing in English and Chinese is illustrated through examples from some teaching and learning experiments, and the Data-Driven Learning approach is applied and explored. It is hoped that parallel concordancing in English and Chinese will become a useful and popular tool for both English and Chinese learners in their second language learning. INTRODUCTION Parallel concordancing is a tool which has been developed to respond to the desire (fuelled by linguists such as Sinclair) to study language in its natural contexts of use. It allows us to place side by side for comparison two contexts produced for a given item -- phrase, word, or morpheme -- one being a translation of the other. It has many uses in translation studies and in translation pedagogy, such as in the compilation of bilingual dictionaries. However, in the present paper it is the pedagogical value of parallel concordancing which will receive attention. The main research interest in this paper is in the use of parallel concordancing in the teaching of languages, specifically in its use as a form of consciousness-raising, of making learners aware of the differences between the target language and their own language (Rutherford, 1987). By comparing the contexts obtained for an item in one language, with the translations of the contexts in the other language, learners can see how the item is rendered according to varying contextual elements (Roussel, 1991). 
This can be useful pedagogically as, for example, it can help to prevent the L2 of more advanced learners from becoming fossilised and settling into the use of cognate but contextually inappropriate structures in the target language. It can help one to look at the way a given structure is used in different styles or registers, or by different age groups, or by native and foreign speakers (King, 1989).

Barlow, who developed the ParaConc (Barlow, 2001) program for parallel concordancing, claims that parallel texts (texts that are translations of each other) are a promising resource for a range of research projects related to language learning. Using parallel texts, as he puts it, "allows language learners to directly investigate (perhaps in response to queries posed by the teacher) the main correspondences between particular words and structures in two languages" (Barlow, 1996a). It helps beginning learners to develop a feel for a second language and also to obtain some concrete knowledge of correspondences. It also helps advanced learners to deepen their knowledge of words and phrases: to understand not just the main meaning or most common meanings of a word, but to understand a range of meanings and to perceive how context in terms of discourse and genre provides clues to the appropriate meaning (Barlow, 1996a, 1996b).

In this paper, some pedagogic applications of parallel concordancing are explored, making use of Barlow's insights and also the Data-driven Learning (DDL) approach (Johns, 1991, 1993, 1994), which will be discussed in the section "Parallel Concordancing for Lexical Learning." To carry out parallel concordancing in English and Chinese, I constructed an English-Chinese parallel corpus and developed a software package, English-Chinese Parallel Concordancer (Wang, 2000).
A concordance example of the word xian4zai4 (now) is discussed in the paper, giving an insight into different uses of the word and showing how the findings can be applied in language learning. (Xian4zai4 is Chinese Pinyin, the Roman transliteration of Chinese characters, which is used throughout this paper for the convenience of English readers. The numbers are tone markers.)

PROBLEMS OF DEALING WITH CHINESE CHARACTERS IN CONCORDANCING

Although parallel concordancing has been carried out between several European languages, it seems not to have been previously extended to non-alphabetic languages such as Chinese. This is due to fundamental differences in the language systems which create complex conceptual and computational problems of alignment. The most immediate differences between Chinese and the European languages are that Chinese is written in ideograms rather than alphabetic characters, and that it lacks the properties of most European grammatical systems. For example, it has no articles, no tenses, no participles or gerunds, no moods, and virtually no inflections. It did not even have punctuation until this was introduced from the West at the end of the 19th century.

Even in a language such as English, the definition of "word" can be problematic. For example, is "crabmeat" one word, or two? Defining "word" is even more problematic in a language such as Chinese. Written Chinese gives no indication of which characters are to be considered as words and which combine with others to form compound words. For example, according to standard Chinese grammar rules, ban4 (half), tu2 (way), er2 (but), and fei4 (give up) are four words, which should be separated by spaces. But most Chinese people consider this four-character combination a single word (give-up-halfway).
This type of combination is very common in Chinese, having a similar function to that of an idiom in English, although the characters in it normally keep their original meanings rather than combine with others to form compound words. Also, unlike English idioms, the compound can function as an adjective, adverb, or verb, which might explain why people usually regard it as a single word.

If we want to take account of the non-correspondence between character and word, we must first develop some way of establishing when a string of characters can be considered a word. Then, in entering the Chinese text on computer, spaces can be inserted between these conceptual words to correspond to the standard graphical indication of a word in English. Thus wo3 (I), qi2 (ride), zi4 xing2 che1 (bicycle) would be entered as wo3 qi2 zi4xing2che1. However, there are a number of technical problems associated with this form of alignment. It seems impractical to design a computer program to insert spaces automatically, since two successive characters may be either one or two words according to the context. This means that the spaces have to be added manually, which is costly in terms of time and money. Furthermore, the end-user searching for a word with the retrieval software may conceive of words differently from the original corpus compiler and may have to make several attempts to match the compiler's input.

Given the technical and conceptual problems associated with non-correspondence alignment, it appeared that the only practical solution was to make an assumption of character-word correspondence and thus treat each Chinese character as a word. Having made this assumption, the inputting task was made easier by the Chinese word processor NJStar, which not only inserts spaces between Chinese characters automatically, but can also convert Chinese characters into Pinyin, which is very important for English-speaking people wanting to learn or pronounce Chinese.
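The character-word assumption described above can be illustrated with a minimal sketch. It is not the procedure used by NJStar or by the author's program; the function names are hypothetical, the test for "Chinese character" is limited to the main CJK Unified Ideographs block, and the characters 我骑自行车 are assumed here as the written form of the Pinyin example wo3 qi2 zi4 xing2 che1.

```python
def is_cjk(ch: str) -> bool:
    """Return True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def space_out_characters(text: str) -> str:
    """Insert spaces so that every Chinese character stands alone as a 'word'."""
    pieces = []
    for ch in text:
        # Pad each Chinese character with spaces; leave other characters alone.
        pieces.append(f" {ch} " if is_cjk(ch) else ch)
    # Collapse the runs of spaces created by the padding.
    return " ".join("".join(pieces).split())


if __name__ == "__main__":
    # Assumed characters for wo3 qi2 zi4 xing2 che1 ("I ride a bicycle").
    print(space_out_characters("我骑自行车"))
```

Under this assumption zi4 xing2 che1 (bicycle) is stored as three separate "words," so a user looking for the compound has to search for its characters individually or as a sequence.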
CREATING AN ENGLISH-CHINESE PARALLEL CORPUS

Unlike other concordancing programs such as Microconcord (Johns, 1986) or Wordsmith (Scott, 2000), which can be used on any collection of texts, a parallel concordancer must be used on a corpus consisting of parallel texts in two or more languages. Before developing the concordancing program, then, it was necessary to select texts in order to set up an English-Chinese parallel corpus. The corpus aims at helping intermediate English or Chinese language learners, such as university students, further improve their second language. Thus, the texts chosen were English or Chinese texts which are fairly easy to understand from the point of view of vocabulary, syntax, and discourse. University students are usually interested in genres such as novels, fables, essays, autobiographies, magazines, and general scientific articles, so these genres were given first consideration. To keep a balance, about half the source texts were in English and half in Chinese. Only written materials were collected, as it was too difficult for the present research to cover transcribed spoken materials. To ensure that the quality of translation was good, only published translations were selected. The corpus now contains about 1 million words in English and 2 million characters in Chinese. Table 1 shows the percentage distribution of genres in the corpus.

Table 1. Percentage Distribution of Genres in the Corpus

Genre                  %
novel                  50
essay                  15
fable                  10
autobiography           5
scientific article      5
political address       5
magazine                5
other                   5

Initially, the method of inputting the texts was to scan in English texts and type in Chinese texts. Subsequently, Chinese texts were scanned with SunmiPage ScanInsert OCR software (Liang, 1997) and then edited. The texts used are either copyright-free or used with permission obtained from the authors.
After editing, the texts needed to be marked up. The purpose of marking up texts is to define sentence and paragraph boundaries so that a sentence in one text can be matched with its translation in the other by the parallel concordancing program. In order to keep the size of the text files as small as possible, minimal marking up was used: The only necessary element is <S> to identify sentence boundaries, as the program was developed in such a way as to recognise paragraph boundaries without special markers. Electronically, each Chinese punctuation mark occupies two bytes, while each English mark occupies only one byte. A program was developed to automatically mark up Chinese text according to Chinese punctuation and English text according to English punctuation.

THE DEVELOPMENT OF THE ENGLISH-CHINESE PARALLEL CONCORDANCER

Since 1997, I have been developing the English-Chinese Parallel Concordancer (E-C Concord), and the first version was successfully completed in 2000. It works in a Windows 95/98 environment, and can carry out sentence-by-sentence parallel concordancing in English, Chinese, and Pinyin.

The main technical problem in developing a program for parallel concordancing related to the alignment method used for identifying equivalent sentences between texts. A major problem in aligning texts arises when the number of sentences in the source language differs from that in the target language. The situation could also arise where the number of sentences in a paragraph is the same, but the divisions between them do not coincide. A program called Multiconcord (Woolls, 1997) had previously been developed at the University of Birmingham, using an algorithm which automatically looks for disturbance between the two texts and re-establishes the matches by joining several short sentences together in one language to match a long one in the other. The algorithm gives satisfactory accuracy in aligning parallel texts in European languages (Woolls, 1998).
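The punctuation-based mark-up step described earlier in this section can be sketched roughly as follows. This is an illustration only, not the author's mark-up program: placing <S> at the start of each sentence, the choice of sentence-final marks, and the GB2312 encoding used to show the one-byte versus two-byte difference are all assumptions, and abbreviations such as "Mr." would need extra handling.

```python
import re

# Assumed sentence-final punctuation: ASCII marks for English,
# full-width marks for Chinese.
ENGLISH_END = r"[.!?]"
CHINESE_END = r"[。！？]"


def mark_sentences(paragraph: str, end_pattern: str) -> str:
    """Prefix every sentence of one paragraph with an <S> marker."""
    # Split just after each sentence-final mark, dropping any spaces that follow.
    parts = re.split(f"(?<={end_pattern})\\s*", paragraph.strip())
    return "".join(f"<S>{part}" for part in parts if part)


def byte_length(mark: str, encoding: str = "gb2312") -> int:
    """Number of bytes a punctuation mark occupies in the assumed encoding."""
    return len(mark.encode(encoding))


if __name__ == "__main__":
    print(mark_sentences("Hold up his head. Quick, now!", ENGLISH_END))
    print(mark_sentences("你好。再见！", CHINESE_END))
    print(byte_length("."), byte_length("。"))
```

Because the two languages use different punctuation sets, the same function is simply called with a different pattern for each side of the corpus.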
However, an adaptation of this program to align texts in English and Chinese only achieved an accuracy of about 60%, based on an accuracy test carried out by Woolls and the author. The decision was then taken that for the present research the texts would be pre-aligned -- which of course gives an accuracy of 100%. That accuracy is achieved at the cost of time-consuming manual pre-editing of the texts.

Figure 1. Screen shot of the search window of E-C Concord

The program allows the user to type in a search item in the "search box," and choose a Search Language and a Target Language. When entering an English or Pinyin search item, wild cards (*) are acceptable, so that "book*" can match "book," "books," "booking," "booked," and so forth, and "wang*" can match "wang1," "wang2," "wang3," or "wang4." Wild cards cannot be used with Chinese characters. The user needs to select one or more text files from the file list: These files contain the corpus data. The program provides three ways of concordancing: (a) Monolingual Concordance, Key-Word-In-Context; (b) Monolingual Concordance, Sentence-by-Sentence; and (c) Parallel Concordance, Sentence-by-Sentence. The user can also control the maximum number of search hits. After making all the necessary choices and pressing the "Search" button, the user will get a result such as shown in Figures 2 and 3.

Figure 2. Parallel concordance of "now": Chinese character output

Figure 3. Parallel concordance of "now": Chinese Pinyin output

The concordance output is in sentence-by-sentence format, which consists of pairs of English and Chinese sentences, one being the translation of the other in the pair. The text can be edited on screen and saved as text files for further studies.
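To make the search behaviour described above concrete, here is a small sketch of wild-card matching, sentence-by-sentence parallel lookup over pre-aligned pairs, and a Key-Word-In-Context display. It is not the E-C Concord implementation: the function names, the tuple-based corpus layout, and the three sample pairs (taken from the paper's own examples) are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical pre-aligned corpus: (English sentence, Pinyin sentence) pairs.
PAIRS: List[Tuple[str, str]] = [
    ("in fact she was now rather more than nine feet high",
     "shi4shi2shang4, ta1 xian4zai4 yi3 yuan3 bu4zhi3 jiu3 ying1chi3 gao1"),
    ("She found that she was now about two feet high",
     "ta1 fa1xian4 ci3ke4 zi4ji3 shen1 gao1 da4yue1 liang3 ying1chi3"),
    ("But I didn't choose to just yet.",
     "wo3 xian4zai4 bu4 chi1 zhi1shi4 wo3 bu4 xiang3 chi1 ta1 ba4 le5."),
]


def wildcard_to_regex(item: str) -> "re.Pattern[str]":
    """Turn a search item such as 'book*' into a whole-word regular expression."""
    body = r"\w*".join(re.escape(part) for part in item.split("*"))
    return re.compile(rf"\b{body}\b", re.IGNORECASE)


def parallel_concordance(item: str, search_side: int,
                         pairs: List[Tuple[str, str]],
                         max_hits: int = 100) -> List[Tuple[str, str]]:
    """Return aligned pairs whose search-language sentence contains the item."""
    pattern = wildcard_to_regex(item)
    hits = []
    for pair in pairs:
        if pattern.search(pair[search_side]):
            hits.append(pair)
            if len(hits) >= max_hits:  # honour the user's hit limit
                break
    return hits


def kwic(item: str, sentences: List[str], width: int = 30) -> List[str]:
    """Monolingual Key-Word-In-Context lines with `width` characters each side."""
    pattern = wildcard_to_regex(item)
    lines = []
    for sentence in sentences:
        for m in pattern.finditer(sentence):
            left = sentence[max(0, m.start() - width):m.start()]
            right = sentence[m.end():m.end() + width]
            lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines


if __name__ == "__main__":
    for english, pinyin in parallel_concordance("now", 0, PAIRS):
        print(english, "|", pinyin)
    for line in kwic("now", [english for english, _ in PAIRS]):
        print(line)
```

Searching from the Pinyin side (`search_side=1`) with "xian4zai4" returns the first and third pairs, including one whose English translation contains no "now" at all, which is the kind of mismatch the tasks in the next section ask learners to notice.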
PARALLEL CONCORDANCING FOR LEXICAL LEARNING More than one and a half centuries ago, von Humboldt (1836/1988) pointed out that "we cannot, properly speaking, teach a foreign language: all we can do is create the conditions under which it can be awakened in the soul" (p. 236). Using Humboldt's insights, and based on the data generated by the concordancer, Johns (1991) proposed a new language-learning approach, which he called Data-Driven Learning (DDL). The DDL approach puts emphasis on the inductive acquisition on the part of students of grammatical rules or regularities through the process of analysing the patterns of language use of specially selected items as revealed through corpora (Johns, 1991; Tribble & Jones, 1990). Johns's remark "Every student a Sherlock Holmes" implies that the role of the learner has changed in DDL: A learner is a researcher, testing hypotheses and revising them in the light of data; a learner is a detective, finding and interpreting linguistic clues. DDL can focus on different aspects of language. This paper focuses on lexical learning using DDL. The following is an example of what a learner can detect by analysing parallel concordance data. The lexical item studied here is the adverb xian4zai4 (now), as it is a very common and important word, but one not satisfactorily covered by bilingual dictionaries. Some differences in the use of xian4zai4 and "now" in the two languages are discussed below. One hundred and twenty-eight examples were found in four different texts (novels). Forty examples were randomly selected from them, and were classified into several groups. The idea was to ask Chinese students at an intermediate English level to identify the linguistic bases of the grouping. In order to compare Chinese characters with English words more clearly, the Pinyin transcription identifies its separate "words." 
The following abbreviations, as used by Li & Thompson (1981), were used in the examples:

Abbreviation   Term
T              translation
O              original
CRS            currently relevant state (le)
PFV            perfective aspect (le)
ASSOC          associative (de)
GEN            genitive (de)
CL             classifier
3sg            third person singular

Some of the above abbreviations were used because certain Chinese characters, such as those for de and le, cannot be translated directly into English words. Furthermore, each of these two has two distinct meanings which depend on the context. Many Chinese classifiers cannot be translated into English, as they simply do not exist in English, where, for example, one speaks of "a herd of cows," but there is no classifier for a single cow. The third person singular pronoun ta1 in Pinyin does not show the gender, so it cannot be automatically translated into "he" or "she."

Eight Chinese students at the University of Birmingham were asked to accomplish the following tasks concerning the adverb xian4zai4 (now).

Task 1

Look at the following data:

1. T: di2que4 shi4 zhe4 yang4: ta1 xian4zai4 zhi1 you3 shi2 ying1cun4 gao1 le5, ......
      truly be like this 3sg now only have ten inch high CRS
   O: And so it was indeed: she was now only ten inches high, …

2. T: shi4shi2shang4, ta1 xian4zai4 yi3 yuan3 bu4zhi3 jiu3 ying1chi3 gao1, ...
      in fact 3sg now already much not less than nine feet high
   O: in fact she was now rather more than nine feet high, …

3. T: ta1 wan2quan2 wang4ji4 le5 ta1 xian4zai4 bi3 tu4zi3 da4 shang4 yi1qian1 bei4,
      3sg completely forget PFV 3sg now compare rabbit big up a thousand times
   O: …quite forgetting that she was now about a thousand times as large as the Rabbit, …

4. O: wo3 xian4zai4 yi3jing1 cheng2 le5 ming2 fu4 qi2 shi2 de5 gong1ren2 ...
      I now already become PFV name agree that fact GEN worker
   T: I was now a bona fide worker …

Question: What underlying pattern can be detected in the above parallel texts?

What the students found was that, in the Chinese examples, xian4zai4 immediately follows the subject, while in the English ones, now follows "subject + be." They were then asked whether this was always the case. They carried out more concordancing and found that there was no such structure as "subject + verb (be) + xian4zai4" in Chinese in the corpus. The conclusion they drew was that Chinese speakers should pay special attention to the structure "subject + verb (be) + now" in English, as this structure does not exist in Chinese. They also suggested that English speakers learning Chinese should avoid adding an unwanted verb (be) to a Chinese sentence.

Task 2

5. T: "xian4zai4 gai1 dao4 hua1yuan2 li3qu4 la5!"
      now should go to garden into !
   O: "And now for the garden!"

6. T: "kuai4dian3, xian4zai4 jiu4 qu4!"
      quick now immediately go
   O: "Quick, now!"

7. T: ba3 ta1de5 tou2 tai2 gao1 -- xian4zai4 na2 bai2lan2di4 lai2 --
      make his head raise high now fetch brandy come
   O: Hold up his head -- Brandy now --

Question: Why are the English versions of the above sentences so much shorter than the Chinese ones?

The students found that in the English sentences various subjects and verbs around now were not present. For example, "And now (I should head) for the garden," "Quick, now (you go there immediately)," and "(You go and fetch some) Brandy now." In the Chinese translation, however, the words omitted from the English (shown in parentheses above) were present, such as "should go to ... into" in Example 5, "immediately go" in Example 6, and "fetch ... come" in Example 7. The students concluded that in Chinese the adverb xian4zai4 could not be used independently, and some words not present in the English sentences were required in the Chinese translation.
They realised that certain structures which are acceptable in English are not acceptable in Chinese, and vice versa. It seems that in the above Chinese sentences, "the law of least effort" was not followed.

Task 3

8. O: wo3 xian4zai4 bu4 chi1 zhi1shi4 wo3 bu4 xiang3 chi1 ta1 ba4 le5.
      I now not eat only I not want eat 3sg CRS
   T: But I didn't choose to just yet.

9. O: wo3 xian4zai4 shi4 "zu3 zhang3" le5, geng4 zhu3yao4 de shi4 ......
      I now be group leader PFV even mainly GEN be
   T: Because I was "group leader" and, even more, …

10. O: wo3 xiang3 ta1 bu4 shi4 sui2 kou3 zhe4yang4 shuo1 de, ke3neng2 shi4 you3yi4shi4di4
       I think 3sg not be casually in this way speak ASSOC may be intentionally
       yao4 rang4 wo3 zhi1dao4 wo3 xian4zai4 bu4 tong2 yu2 guo4qu4 de shen1fen1.
       want let I know I now not same past ASSOC status
    T: I suspected that he said this to let me know my changed status.

11. O: na3me5, wo3 xian4zai4 sheng1huo2 yu2 qi2jian1 de zhe4ge4 xin1 de sheng1cun2
       then I now live in between ASSOC this new ASSOC living
       huan2jing4 shi4 zen3yang4 de5 ne5?
       surroundings be what GEN?
    T: So what about my life in these new surroundings?

Question: What is missing from the English versions of the above sentences? Why?

The students easily found that xian4zai4 occurred in the Chinese text but now did not appear in the English translation. The students observed that the English translation in Example 8 simplified the original Chinese sentence. There were two sentence structures parallel to each other in the Chinese sentence, the first stating the fact that "I now (do) not eat," the second telling the reason "I (do) not want (to) eat."
Having further studied the extended context of the sentence in the original text, the students realised that the narrator of the sentence was in a state of starvation most of the time, so to be able to choose whether to eat or not was very satisfying, and the feeling was expressed through the parallel sentence structure. The English translation used prospective contrast, and it simplified the sentence. The students felt that it was not as expressive as the original Chinese sentence.

In Example 9, the students argued that now was not used in the English translation because the past tense "was" was clear enough and now was not necessary. In the Chinese version, the combination "now ... le (PFV)" served the same purpose as "was."

In Example 10, the students found that the Chinese version used contrastive structures twice: "casually in this way speak" versus "intentionally want let I know" and "now" versus "past," but neither appeared in the English translation. They argued that contrastive structures were frequently used in Chinese to make the meaning of sentences absolutely clear, but in English quite often such structures were not used so as to make sentences simpler.

In Example 11, the students found it logically reasonable that the word now did not appear in the English translation: One could not live in the past in "new surroundings." Although it sounded redundant, the word xian4zai4 should not be omitted from the Chinese sentence.

Having studied examples where xian4zai4 occurred in the Chinese original but now did not appear in the English translation, the students carried out more parallel concordancing looking for examples where now occurred in an English original but xian4zai4 did not appear in the Chinese translation. The following are some examples they found:

12. O: "Now, Dinah, tell me the truth: did you ever eat bat?"
    T: "wei4, dai4na4, gen1 wo3 shuo1 shi2 hua4, ni3 chi1 guo4 bian1fu2 mei2you3?"
       wei (draw attention) Dinah to me say real words you eat PFV bat not

13. O: ...her face brightened up to think that she was now the right size for going through the little door into that lovely garden.
    T: xiang3 dao4 ta1 mu4qian2 de shen1cai2 zheng4hao3 neng2 tong1 guo4 na3 shan4 xiao3
       think 3sg in front of eyes ASSOC size right can go through that CL little
       men2, ke3yi3 jin4ru4 na3 ke3ai4 de hua1yuan2, ta1 xi3 xing2 yu2 se4.
       door can enter that love -ly garden 3sg joy reflect through (face) colour

14. O: She found that she was now about two feet high, ...
    T: ta1 fa1xian4 ci3ke4 zi4ji3 shen1 gao1 da4yue1 liang3 ying1chi3...
       3sg find this moment self body height about two feet

15. O: "Now tell me, Pat, what's that in the window?"
    T: "hao3le, gao4 su4 wo3, pa4 te4, chuang1zi3 li3 na3 dong1xi1 shi4 shen2me?"
       all right tell me Pat window in that thing be what

Having studied the examples, the students realised that xian4zai4 is not the only translation of now: It can also be translated as mu4qian2 ("in front of eyes"), ci3ke4 ("this moment"), and possibly other words. Sometimes now is used as a word for drawing attention rather than for referring to time, in which case it is rendered as wei4 ("well" or "listen") or hao3le ("all right"). Discoveries like this certainly help learners to be more aware of different uses of words in different contexts. Their L2 is less likely to become fossilised, and they will be able to see more of the subtle differences between meanings, and will try to avoid using cognate but contextually inappropriate structures in the target language.

The above discussion shows the possibility of using parallel concordance data as teaching materials for Data-driven Learning purposes. The teacher can either put data into groups for students to study, or ask them to carry out concordancing on a particular lexical item, analyse the data, and submit what they have found through the analysis.
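The kind of tally the students arrived at by hand can be sketched in a few lines. This is only an illustration, not part of E-C Concord: the function names and the candidate list are hypothetical, and the four (English, Pinyin) pairs are shortened from Examples 1, 13, 14, and 15 above.

```python
import re
from collections import Counter
from typing import List, Tuple

# Shortened (English, Pinyin) pairs from Examples 1, 13, 14 and 15.
PAIRS: List[Tuple[str, str]] = [
    ("she was now only ten inches high",
     "ta1 xian4zai4 zhi1 you3 shi2 ying1cun4 gao1 le5"),
    ("she was now the right size",
     "ta1 mu4qian2 de shen1cai2 zheng4hao3"),
    ("She found that she was now about two feet high",
     "ta1 fa1xian4 ci3ke4 zi4ji3 shen1 gao1 da4yue1 liang3 ying1chi3"),
    ("Now tell me, Pat, what's that in the window?",
     "hao3le, gao4 su4 wo3, pa4 te4, chuang1zi3 li3 na3 dong1xi1 shi4 shen2me?"),
]

# Hypothetical list of temporal equivalents a learner might test.
CANDIDATES = ["xian4zai4", "mu4qian2", "ci3ke4"]
NO_MATCH = "(none of the candidates)"


def has_word(word: str, sentence: str) -> bool:
    """Whole-word, case-insensitive test for `word` in `sentence`."""
    return re.search(rf"\b{re.escape(word)}\b", sentence, re.IGNORECASE) is not None


def tally_equivalents(pairs: List[Tuple[str, str]], english_item: str,
                      candidates: List[str]) -> Counter:
    """Count which candidate renderings co-occur with the English item."""
    counts: Counter = Counter()
    for english, pinyin in pairs:
        if not has_word(english_item, english):
            continue  # only look at pairs that contain the English item
        found = [c for c in candidates if has_word(c, pinyin)]
        counts.update(found or [NO_MATCH])
    return counts


if __name__ == "__main__":
    for rendering, n in tally_equivalents(PAIRS, "now", CANDIDATES).items():
        print(rendering, n)
```

Pairs that fall under "(none of the candidates)" are the interesting ones pedagogically: in Example 15, for instance, now is an attention marker rendered as hao3le rather than a time adverb.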
CONCLUSION

Technically, parallel concordancing between English and Chinese has been established successfully, and further tasks can be developed and tried out with students at different levels to increase their, and their teachers', familiarity with the methodology. It is highly possible that the English-Chinese Concordancer (Wang, 2000) can be extended to Japanese and Korean, as like Chinese, they use ideograms rather than alphabetic letters. Experience suggests that the parallel concordancer is one of the most powerful tools that computer science can offer to language researchers. The distinctive feature of the Data-driven Learning approach to inductive language teaching is that the language data are primary, and the teacher does not know in advance exactly what rules or patterns the learner will discover. DDL with the support of parallel concordancing will help the learner to develop in-depth knowledge of lexical meaning and use based on evidence from authentic language.

ABOUT THE AUTHOR

Wang Lixun was born in China. He was awarded a PhD in Computational Linguistics at the University of Birmingham, UK, in 2000. His research interests include computer-assisted language learning, corpus linguistics, and Web-based language learning. He has developed the software English-Chinese Parallel Concordancer, Bilingual Sentence Shuffler, and MatchUp. He has also developed his homepage and the ECLEPT Web site. He currently works in the School of Arts and Social Sciences at The Open University of Hong Kong.

E-mail: lxwang@ouhk.edu.hk

REFERENCES

Barlow, M. (1996a). Parallel texts in language teaching. In S. Botley, J. Glass, A. M. McEnery, & A. Wilson (Eds.), Proceedings of teaching and language corpora 1996 (UCREL Technical Papers Volume 9; pp. 45-56). Lancaster, UK: University Centre for Computer Corpus Research on Language.
Barlow, M. (1996b). Corpora for theory and practice.
International Journal of Corpus Linguistics, 1(1), 1-37.
Barlow, M. (2001). ParaConc [Computer software]. Houston, TX: Athelstan.
Humboldt, W. von. (1836/1988). On language: The diversity of human language-structure and its influence on the mental development of mankind (P. Heath, Trans.). Originally published as the introduction to Über die Kavi-Sprache auf der Insel Java (1836-1840). Cambridge, UK: Cambridge University Press.
Johns, T. F. (1986). Microconcord: A language-learner's research tool. System, 14(2), 151-162.
Johns, T. F. (1991). Should you be persuaded -- two samples of data-driven learning materials. In T. F. Johns & P. King (Eds.), Classroom concordancing (English Language Research Journal 4; pp. 1-13). Birmingham, UK: Birmingham University.
Johns, T. F. (1993). Data-driven learning: An update. TELL & CALL, 1993(2), 4-10.
Johns, T. F. (1994). From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. In T. Odlin (Ed.), Approaches to pedagogic grammar (pp. 293-313). Cambridge, UK: Cambridge University Press.
King, P. (1989). The uncommon core: Some discourse features of student writing. System, 17(1), 13-20.
Li, C., & Thompson, S. (1981). Mandarin Chinese. Berkeley, CA: University of California Press.
Liang, X. M. (1997). SunmiPage ScanInsert OCR [Computer software]. Singapore: Computek Enterprises Pte Ltd.
Roussel, F. (1991). Parallel concordances and tonic auxiliaries. In T. F. Johns & P. King (Eds.), Classroom concordancing (English Language Research Journal 4; pp. 71-103). Birmingham, UK: Birmingham University.
Rutherford, W. E. (1987). Second language grammar: Learning and teaching. London: Longman.
Scott, M. (2000). WordSmith Tools Version 3.0 [Computer software]. Oxford, UK: Oxford University Press.
Tribble, C., & Jones, G. (1990). Concordances in the classroom: A resource book for teachers. London: Longman.
Wang, L. X.
(2000). English-Chinese Parallel Concordancer [Computer software]. Birmingham, UK: University of Birmingham.
Woolls, D. (1997). Multiconcord [Computer software]. Birmingham, UK: CFL Software Development.
Woolls, D. (1998, July 24-27). Multilingual parallel concordancing for pedagogical use. Teaching and Language Corpora 98 (pp. 222-227). Oxford, UK: Keble College.

Language Learning & Technology http://llt.msu.edu/vol5num3/stjohn/ September 2001, Vol. 5, Num. 3 pp. 185-203

A CASE FOR USING A PARALLEL CORPUS AND CONCORDANCER FOR BEGINNERS OF A FOREIGN LANGUAGE

Elke St.John
University of Sheffield, UK

ABSTRACT

This pilot study set out to determine whether a parallel corpus and a concordancer would be appropriate tools to supplement a teaching programme of German at the beginners' level in an unsupervised environment. In this instance, a beginner student of German was asked to find satisfactory answers to questions about unknown vocabulary and to formulate appropriate grammar rules for himself using the parallel corpus and concordancer as the only tools. It is shown that these tools can be of great benefit for beginners.

AIMS AND OBJECTIVES

I describe a pilot study involving a beginner student of German who undertook a supplementary unsupervised programme of learning German using a concordancer and a parallel corpus. I investigate how a beginner student of German fares using a concordancer, Multiconcord (see King & Woolls, 1996; St.John & Chattle, 1998), and a parallel German/English corpus, INTERSECT (Salkie, 1995), consisting of the original German source texts and their English translations. The aim of this study was to determine how this student copes using the parallel corpus and what conclusions he comes to when comparing the two languages, and in particular, when investigating lexical items.
As students at the beginner and intermediate levels are still very dependent on a dictionary, their lack of vocabulary in the new language can often cause problems for them in class. As a consequence, most of the questions set were related to investigating the meaning of words (see Student Tasks). Additionally, using corpora and a concordancer can be motivating and rewarding not only for the learner but also for the teacher. For the teacher, these tools can provide contextualised examples in answer to confounding lexical questions. Moreover, the learner can develop an ability to "learn how to learn" (Johns, 1991a, p. 1) by being allowed to assume the role of an explorer. This study supports Barlow's (1995a, 1996a, p. 2) claim that one of the roles the language learner plays when using corpora is that of a language researcher and explains why "a suitable research environment" must be provided (Barlow, 1996b, p. 45; see also Johns, 1986, p. 151, 1991a, p. 2). Such an environment assists the student in exploring the language in great detail and thereby gaining further insights into its grammar and vocabulary.

The use of concordancing in language teaching is not new. However, this pilot study demonstrates for the first time the potential of concordancing in learning German at the beginner's level.

CONCORDANCER AND CORPORA IN LANGUAGE ENVIRONMENTS

Concordancing is a tool that has been used extensively by linguistic and literary researchers. A concordance is a list of the occurrences of a particular word, a part of a word, or a combination of words, drawn from a text corpus and presented in context. A corpus is a large body of text, often in electronic format. (See Baker, 1995, p. 226; Francis, 1993, p. 138; Johansson, 1995, p. 19; Leech, 1991, p.
8 for more detailed definitions.)

Linguistic and applied linguistic researchers are not the only group who can benefit from the use of concordancing as a tool for language learning (i.e., as a means of exploring the meanings and uses of words in their authentic contexts; see Aston, 1997a; Tribble, 1997). A concordance program enables research into the lexical, syntactic, semantic, and stylistic patterns of a language. Concordancers and monolingual text corpora (comprising only one language) have already been employed by both the language teacher and learner in classroom exercises. Typical exercises using a monolingual English corpus have included vocabulary building and the exploration of the grammatical and discourse features of texts. For specific descriptions of classroom activities (mainly for EFL teaching, however) using a monolingual English corpus, see, for example, Aston (1997a, pp. 51-64), Mindt (1997, pp. 40-50), Minugh (1997, pp. 67-82), Murphy (1996), Flowerdew (1993, 1996), Stevens (1991a, 1991b), Tribble (1990), and Johns (1986, 1991a, 1991b).

In a well-known quote, Johns advocates the DDL (Data-Driven Learning) approach. The advantage of this approach is that, in a classroom situation, it enables the teacher to play a less active role whilst at the same time exposing the student to authentic texts like those found in a monolingual corpus:

What distinguishes the DDL approach is the attempt to cut out the middleman as much as possible and give direct access to the data so that the learner can take part in building his or her own profiles of meanings and uses.
The assumption that underlies this approach is that effective language learning is itself a form of linguistic research, and that the concordance printout offers a unique resource for the stimulation of inductive learning strategies -- in particular, the strategies of perceiving similarities and differences and of hypothesis formation and testing. (Johns, 1991b, p. 30)

Experiments in data-driven learning and corpus-based methods (e.g., Baker, Francis, & Tognini-Bonelli, 1993; Barlow, 1995b, 1996a; Dickens & Salkie, 1996; Lewandowska-Tomaszczyk & Melia, 1997; Salkie, 1995, 1996; Tognini-Bonelli, 1996; Wichmann, Fligelstone, McEnery, & Knowles, 1997) are beginning to bear fruit in a wide range of language environments, although there is as yet only a limited amount of experience on which to draw regarding learning German using a parallel corpus.

Monolingual corpora have already been used to teach German. Dodd (1997) exploits a corpus of written German for advanced language learning. After browsing through a raw corpus, his students compare corpus evidence with reference works. Dodd concludes that a computer-supported investigation of language corpora provides a powerful and simple tool for language learning. Fernández-Villanueva (1996) used a German monolingual corpus of oral language to research the function of German particles. She describes it as a very positive experience because it allows students to investigate the function of the particles, which do not have a direct equivalent in their mother tongue. Wichmann (1995) used a monolingual English corpus for teaching German and sorting out problems of lexical choice. She proposes the use of both corpora and concordancer because dictionaries do not provide enough information about meaning in context (see Barlow, 1996b, p. 54). However, Wichmann's study does not explain what kind of exercises she set her students.
Parallel corpora (sometimes also called translation corpora) have already been successfully used by linguistic researchers for their research into the nature of translation. Zanettin (1994) focuses on the use of concordancing software on bilingual English/Italian parallel subcorpora to design language activities aimed at developing translation skills. As in this pilot study, he emphasises that concordancing programs "can be run by students at any time in a self-access environment, provided that instructional sheets explaining the background for the activity are supplied" (p. 108). Salkie (1996) also employs a parallel corpus to investigate grammar problems but concentrates on epistemic modality in English and French. Dickens and Salkie (1996) compare French/English bilingual dictionaries with a parallel corpus and show, as this study does, how many equivalents one single word can actually have. Barlow (1996a) discusses research based on the analysis of parallel texts (English/Spanish) with particular regard to the translation of reflexive pronouns. He also advocates some uses for parallel texts in the language classroom of the kind carried out in this study. The unifying theme in his article is the notion that the use of corpora and a concordancer allows everyone, from the theoretical linguist to the student learning a second language, to become a researcher (p. 2). Both roles are combined in the present study because the student observed is both a linguist (his major) and a language learner. In the analysis in Meaning of Particles (tasks 2-6), the student discovers that there are many English equivalents for a certain German particle. This reflects Barlow's (1996b, p. 53) observation that a basic search for concordances can make students aware that the French translation of head is not always tête. Barlow (1996b, p. 
54) concludes that a parallel text provides an online contextualised dictionary, which language learners can exploit in a similar way to that demonstrated in the student's tasks 2-6 under Meaning of Particles. Danielsson and Ridings (1996) report on their tool for work in parallel corpora (Scandinavian languages/English) and their efforts to integrate it into an academic programme for training translators. However, parallel corpora have not only been used for research into translation and translator training (see Baker, 1993, 1995; Buyse, 1997; Piotrowska, 1997; Schmied, 1994; Ulrych, 1997); they can also prove very useful to non-advanced language learners, as this pilot study will endeavour to demonstrate.

Finally, McEnery, Wilson, and Baker examine how corpora can meet the needs of grammar teaching at the pre-tertiary level in the UK. In general, they come to the conclusion that a corpus should at least be integrated into teaching. They further conclude that "corpus data present a means by which grammar teaching may be more effective -- and more importantly may be rated more positively by learners" (1997, p. 15).

It can be seen from the literature that parallel corpora have already been successfully employed in a number of studies. However, German/English parallel corpora have not yet formed part of a study. In the present study, the student had to research a set of questions on his own. What is novel about classroom concordancing here is that the student is a beginner working on his own and that a German/English parallel corpus, as opposed to a monolingual corpus, was used. The parallel corpus was used not only for investigating patterns in the language he was learning, but also to compare it with his mother tongue and to draw conclusions from the comparison.

BACKGROUND

Corpus Used in This Study

The German-English INTERSECT corpus (Salkie, 1995), which was used for this study, has about 800,000 words and comprises the following files:

Table 1. 
Composition of the INTERSECT Corpus (parts not used in the present study marked with an asterisk)

File name    Content                                                Comment
Dbank        Annual bank reports                                    Hoechst, BASF, Siemens
newsapr      News reports                                           From the "German News" Web site
newsjan      News reports                                           From the "German News" Web site
Euro         EU texts
UN           United Nations documents
hertzgog     Transcripts of speeches by the President of Germany    Spoken (President Herzog)
Basiclaw*    Constitution texts                                     Germany, Switzerland, & Austria

The student worked with six files only. The constitution texts, which are also part of the INTERSECT Corpus, were not used because of the complexity of German legal language structures.

The corpus includes a variety of text types including spoken language, and it is thus both appropriate and sufficient for this pilot study because tendencies rather than rules are discovered. Corpus size is obviously a matter of considerable discussion and is not the point of this particular paper but the subject of further research. However, the problem with large corpora for language learners, especially beginners and intermediate students, is that concordances of frequent words can easily become too long and meaningless. This can be very demotivating for the beginner student. Aston comments in this respect that "work with small specialised corpora can not only be a valuable activity in its own right, as a means of discovering the characteristics of a particular area of language use, but also an instrument to help and train learners to use larger ones appropriately" (1997b, p. 61).

The use of a small corpus has both advantages and disadvantages. Since the amount of data searched is relatively small, any observations on frequency of occurrence may be ungeneralisable; on the other hand, a small corpus avoids a proliferation of examples, particularly of common words, which would prove too daunting to learners.
When using a small corpus, the obvious strategy to employ is to focus on common words. In comparing the corpus with dictionaries, this is a logical approach in any case: if the corpus gives some clues about which words occur fairly often, this is in itself useful information, as will be shown in the analysis.

Student

As already mentioned, I decided to use only one student for this particular study, for several reasons. The literature review already shows that beginner language students have not previously been involved in corpus-based studies. In my view, it would present too great a risk if several students were included in the very first experiment of this kind. As with other new technologies before it, such as the language laboratory, a step-by-step introduction is probably most effective. As Flowerdew puts it:

There is a danger of the enthusiasm for concordancing being inflated to such an extent that concordancing is seen as a sort of language teaching panacea. (1996, p. 112)

Therefore, carefully conducted evaluative studies will ensure that such an inflated view will not prevail. A study carried out on a small scale such as this will be able to offer proper guidance to large-scale studies using concordance tools.

Furthermore, in a beginners' class, where the students are generally less confident than in an intermediate or advanced class, it is usually more difficult to encourage and motivate them to take part in a project. A project involving new technologies would, in my judgement, present an even greater threat to the students. Just as Stevens (1995, p. 2) divides language teachers into three groups, namely those who have never heard of concordances, those who have not yet taken them seriously, and those who actively use them, students could be divided into the same groups, with beginner students most likely falling into the first group. Therefore, caution needs to be exercised when starting a project involving relatively new technology.
I therefore decided to introduce only one new variable at a time, starting in this study with a beginner with a background in linguistics. I then propose to introduce a second variable (a beginner with no linguistic background, e.g., a student majoring in science) in a future experiment. Out of all the non-specialist language learners I teach, I considered a student with his main subject in linguistics to be most appropriate in this instance, rather than a student majoring, for example, in science. It is generally agreed that in a beginner class, one of the teacher's tasks is to maintain the students' interest in the language concerned. A project of this kind could prove counter-productive and possibly discourage non-linguists.

The student observed in this study had just finished his first year at university studying linguistics with German as a subsidiary subject. At the beginning of the project, he had already completed one year of German at university (3 hours a week) and his level of German was approximately equivalent to basic GCSE level. However, it has to be stressed that this level is achieved within 1 year of intensive study at university in comparison to an average of 4 years at school. It is also important to mention that the student, unlike many other so-called "false beginners," had no knowledge of German before studying the language at university. The student was one of the best students in his year and fond of grammar.

However, there were still doubts about whether his level of German would be good enough to cope with some of the questions set. In particular, the language of the corpus was thought to be too difficult, as it was at a level to which only more advanced learners are normally exposed. Consultation with the student revealed that he actually regarded the project as a challenge. The parallel corpus and the parallel concordancer were the learner's only resource.
In the process of answering his set of questions, he was able to teach himself how to use the concordancer without using a manual and went on to describe the program as very user-friendly.

Student Tasks

Since the reference works most often used by undergraduate students of foreign languages seem to be dictionaries, one of the student's first tasks consisted of word or phrase searching. In this instance, he had to enter the word or phrase he wanted to examine. The software would then browse through the corpus of texts and look for the requested expression in the search language, while the correspondence would be shown in the target language parallel to the search language. Unlike KWIC (Key Word In Context) concordancers, which show the search word centralised in a single line of text, the format for the parallel display is the sentence and paragraph, with the results of each search being given as parallel sentences or paragraphs. This is mainly because, although the context word is known in the search language, there is no way of knowing where in the target language paragraph the relevant correspondence word will appear or, indeed, if it appears at all. There is even the possibility that the required word or words may appear in a preceding or following sentence, rather than the equivalent single sentence of the search language. In this pilot study, the emphasis is on the behaviour of words in context in both German and English.

The student had 17 tasks to choose from. If one question or search produced too many hits, he went on to the next task, which again suggests that too large a corpus would not be appropriate for a non-advanced learner (see Aston, 1997b, p. 61). From the hits of the other tasks, he also only selected sentences he could easily understand.
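The sentence-level parallel display described above can be illustrated with a short sketch. This is not the concordancer used in the study; it is a minimal Python illustration that assumes the corpus is already available as a list of aligned sentence pairs. The names `AlignedPair` and `parallel_search` are hypothetical, and the three sample pairs are sentences quoted in this paper.

```python
import re
from typing import List, NamedTuple


class AlignedPair(NamedTuple):
    """One aligned unit: a German sentence and its English counterpart."""
    file: str
    de: str
    en: str


def parallel_search(pairs: List[AlignedPair], word: str, lang: str = "de") -> List[AlignedPair]:
    """Return every aligned pair whose `lang` side contains `word` as a whole word.

    The whole pair is returned rather than a KWIC line, because the position
    of the corresponding word on the other side is not known in advance.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [pair for pair in pairs if pattern.search(getattr(pair, lang))]


# Three aligned pairs quoted in this paper (assumed to be pre-aligned).
SAMPLE = [
    AlignedPair("dbank", "Wie ist diese Differenz zu bewerten?",
                "How is such a spread to be assessed?"),
    AlignedPair("dbank", "Was ist die EWU?", "What is EMU?"),
    AlignedPair("herzgog", "Wir stehen also nicht ohne Orientierung da.",
                "Thus we do not stand here devoid of orientation."),
]

# Show each hit as a parallel sentence pair, German first.
for hit in parallel_search(SAMPLE, "also"):
    print(f"{hit.file}.de  {hit.de}")
    print(f"{hit.file}.en  {hit.en}")
```

Passing `lang="en"` searches the English side instead, so the same function covers a search in either direction.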
Considering the learner's degree of proficiency, the level of the corpus as a whole was probably too demanding for him, but he correctly employed a strategy of finding his own level in the corpora by searching for shorter sentences. The examples in this paper show this.

ANALYSIS

Introduction

The set of tasks consisted of common lexical and grammar problems usually encountered by beginner students and was therefore considered appropriate for this study. The following results show how the student coped with the given resources and whether he managed to find appropriate answers without the input and guidance of the teacher.

Task 1

The very first question the student was recommended to choose was based on two phrases that are often introduced in the first lesson of a beginner class, when students have to learn phrases of introduction such as Wie ist Ihr Name? (What's your name?) and Wie ist Ihre Telefonnummer? (What's your telephone number?). Both interrogatives in the two questions are translated into English as what, and the student was asked whether it is a pattern that wie always translates as what and not how, as described in dictionaries. Using just wie and was as the search words in the input field of the interface produced too many hits, so the student decided to enter ist as the context word. He subsequently came up with the following data and comments:

dbank.de 1a   Wie ist diese Differenz zu bewerten?
dbank.en 1b   How is such a spread to be assessed?
dbank.de 2a   Wie ist die Option „runde Wechselkurse" zu bewerten?
dbank.en 2b   How is the option "round exchange rates" to be assessed?
dbank.de 3a   Was ist die EWU?
dbank.en 3b   What is EMU?
dbank.de 4a   Was ist die Alternative zur EWU?
dbank.en 4b   What is the alternative to EMU?

In general was translates into English as "what."
However, anyone with a basic knowledge of German knows that there are cases where wie equates to "what" in English. The examples in the question show this. The system did provide examples where wie translates as "how" and from this evidence a student of German would conclude that, in general, wie equals "how" in English except in certain cases.

The above phrases were recommended to the student as the basis of his very first question because they required a simple search for a particular phrase with which the student was very familiar, and because they involved a simple examination of the meaning. It is also worth pointing out that the student felt sufficiently independent to go a step further when there were too many hits for was and wie: he inserted ist into the context field of the interface in order to reduce the number of hits. Even though it was the very first question, the student did not ask for the tutor's assistance but tried to find a solution for himself, which is also very rewarding from a teacher's perspective.

Meaning of Particles

In the next set of tasks, the learner was asked to find out how certain German modal particles and conjunctions translate into English. In this case, all he had to do was search for a particular particle and then examine the correspondence. Doherty (1982, p. 95) stated that the English language has no equivalents for these modal particles, so it was interesting to see what solutions the learner would actually provide.

Task 2

The first search term was wohl, which produced 57 hits altogether (see Table 1 in Appendix A). The particle wohl gives the sentence a sense of uncertainty that is required in these kinds of texts (Helbig, 1994, p. 238). What was striking was that 41 of the 57 hits occurred in the dbank file alone. One would probably expect to find most hits in the dbank file, considering that financial reports contain many forecasts for future years that are based on hypotheses.
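Two steps described above, narrowing a search with a context word and comparing hit counts across files, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the software used in the study: the corpus is assumed to be a list of (file, German, English) tuples, the function names are hypothetical, and the sample sentences are taken from the concordance data quoted in this paper.

```python
import re
from collections import Counter


def contains_word(text, word):
    """True if `word` occurs in `text` as a whole word (case-insensitive)."""
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None


def hits_with_context(pairs, word, context=None):
    """Keep pairs whose German side contains `word` and, if given, `context`."""
    return [
        pair for pair in pairs
        if contains_word(pair[1], word)
        and (context is None or contains_word(pair[1], context))
    ]


def hits_per_file(hits):
    """Count how many hits come from each corpus file."""
    return Counter(file for file, _de, _en in hits)


# (file, German, English) tuples quoted in this paper.
SAMPLE = [
    ("dbank", "Wie ist diese Differenz zu bewerten?",
     "How is such a spread to be assessed?"),
    ("dbank", "Was ist die EWU?", "What is EMU?"),
    ("dbank", "Was ist die Alternative zur EWU?",
     "What is the alternative to EMU?"),
    ("herzgog",
     "Und nach allem, was ich eben über das europäische Erbe gesagt habe, "
     "wäre eine undemokratische Lösung auch eine uneuropäische Lösung.",
     "And after all that I have just said about the European inheritance, "
     "an undemocratic solution would also be an \"un-European\" solution."),
]

all_was = hits_with_context(SAMPLE, "was")            # every hit for "was"
was_ist = hits_with_context(SAMPLE, "was", "ist")     # narrowed by context word
print(len(all_was), len(was_ist))
print(hits_per_file(all_was))
```

On a full corpus, the per-file tally is what would reveal a skew such as the concentration of wohl in the dbank file.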
The student produced many concordances and also categorised them (see Appendix B). He commented as follows:

"Wohl" produced an interesting batch of searches. The general trend was that "wohl" introduced doubt into the sentence/paragraph. These were broken down into: "Wohl erst"; "wohl aber/aber wohl"; "wohl auch/wohl auch nicht"; "werden wohl"; "wohl nicht." When the English translations were read in conjunction with the German, it was noticed that most of the sentences tended to say: "probably"; "will probably"; "may well"; "is likely" etc. The general feeling when reading these sentences/paragraphs is one of doubt or caution and the word "wohl" appears with one of the aforementioned words.

From a teacher's point of view, the student's investigations are more than satisfactory because he managed to deduce the right meaning and quite rightly discovered the uncertainty of wohl. His attention was not, however, drawn to the fact that the majority of the hits were in the dbank file. The comments nevertheless show in what detail the student observed the concordance output. It becomes apparent that he no longer writes about a single translation as in the first search. He probably started to realise that there is not always a one-to-one equivalent available. This can be very rewarding for the teacher, who might find it very frustrating that s/he is not always able to provide the student with one definite answer. The student's comments also show how reading in the foreign language is practised whilst searching through the target language to find patterns. It is moreover interesting to note how the student grouped the different meanings of wohl according to its collocation and meaning.

Task 3

The next search term was the German word also, which can be used either as a particle or an adverb depending on syntax and context.
It gives a sentence a sense of conclusion and is also used as a connective particle between two successive sentences (Helbig, 1994, pp. 86-87). Furthermore, German also belongs to the category of false friends (Pascoe & Pascoe, 1985, p. 12). Beginner students very often translate it into English as also, whilst auch is in fact the correct German word for English also. The student's search produced 74 hits altogether, probably too many for a low-level student to work through (see Table 2 in Appendix A). The student decided to work only on the following output, which he explained afterwards:

dbank.de 5a   Es kann also kaum einen Zweifel daran geben, daß die EWU kommt - wenn der politische Wille stark genug ist und genügend Länder die Aufnahmeprüfung bestehen.
dbank.en 5b   There can, therefore, be little doubt that EMU will come - if the political will is strong enough and a sufficient number of countries pass the convergence examination.
dbank.de 6a   Da sich also der Umstellungskurs an Devisenmarktkursen orientieren wird, ist durch die Festlegung der Umrechnungskurse weder ein Gewinn noch ein Verlust zu erwarten.
dbank.en 6b   Since, therefore, the conversion rate will be geared to forex market rates, fixing the conversion rates should produce neither a profit nor a loss.

The examples, "also" revealed a pattern and the English translation was "therefore." The position in the sentence in German corresponded with the position in English in almost all cases. It would appear from the searches that, when "also" translates as "therefore," the relative position of the word in both languages is the same or very near. Another pattern appeared where "also" translated as "thus." This was deduced because there appeared to be no other function for the word in the sentence. Unlike the translation "therefore," the relative position in each language varied. However, the translation could be worked out by reading the German and then the English.
When the two were then compared, a deduction was made. The examples below demonstrate this.

herzgog.de 7a   Auch in Zukunft muß das Motto also heißen: Freiheit ist das höchste Gut.
herzgog.en 7b   Thus, in the future as well, our motto must be: Freedom is our most precious asset.
herzgog.de 8a   Wir stehen also nicht ohne Orientierung da.
herzgog.en 8b   Thus we do not stand here devoid of orientation.

The learner discovered its correct function as a modifier in at least four examples. Although the question did not ask for a pattern in terms of word order, the learner mainly concentrated on this aspect. This might be due to the fact that the student had a linguistic background and a natural interest in exploring further, but it also shows a very interesting aspect of using concordances with students, namely the experience they gain of how the languages operate. It also demonstrates that he was examining and comparing the languages and developing some insight into both languages simultaneously. This example also shows that the English translations can prove to be very useful to the learner.

Task 4

The next search word was eben, which produced only 15 hits, 11 of them in the herzgog file alone (see Table 3 in Appendix A). Eben is used as an adjective, adverb, or particle; in the latter case its meaning is very difficult to determine (Helbig, 1994, p. 124). This fact was also discovered by the student. The particle use is not found much in written language, which is why most hits occurred in the herzgog file, that is, the transcription of President Herzog's speech. König remarks in this respect that some scalar particles like eben "have a wider use in English than their German 'counterparts,' in other words, some particles in English will have several translational equivalents in German" (1982, p. 79). Thus the exact opposite can apply when working from English to German.
It is interesting to examine the student's findings:

herzgog.de 9a    Und nach allem, was ich eben über das europäische Erbe gesagt habe, wäre eine undemokratische Lösung auch eine uneuropäische Lösung.
herzgog.en 9b    And after all that I have just said about the European inheritance, an undemocratic solution would also be an "un–European" solution.
herzgog.de 10a   man wechselte eben zu anderen.
herzgog.en 10b   one simply changed to others.

"Eben" appeared only 15 times in the corpora. In the searches below, "eben" seems to equate to "just" in English. When not translated exactly as "just," "just" seems to be implied as in example 10 where "eben" equates to "simply": "simply" could easily be replaced with "just" and carry the same meaning. Looking through the other examples from the corpora, there were many interpretations, which could have been made for the translation of "eben."

These data and his comments suggest that the student is becoming aware that a word may not even be lexicalised at all in one language. This is a very important learning process and linguistic insight for a student to grasp when starting to study a foreign language. The fact that he was not taught this but could find it out for himself is one of the most valuable aspects of concordancing and, from the teacher's point of view, very satisfactory. It is not easy for the teacher to tell students that there is just no translation available. It is more rewarding for both sides if students can find this out for themselves.

Task 5

The next search term was the particle doch, which produced 170 hits altogether (see Table 4 in Appendix A). The particle doch has seven different uses as a modal particle (Helbig, 1994, pp. 111-119). Its main use is adversative in contradictions (Helbig, p. 119).
The student carefully chose to work on the following output:

newsjan.de 11a   Schmuggelplutonium stammt angeblich doch aus Moskau.
newsjan.en 11b   Smuggled plutonium indeed from Moscow
newsjan.de 12a   Nichts fuer sensible Gemueter - aber leider doch passiert
newsjan.en 12b   Not for the faint-hearted - but it did happen
dbank.de 13a     Zwar ist es seiner Meinung nach zu früh, um einen Erfolg oder ein mögliches Scheitern der EWU vorauszusagen, doch sieht er die Strukturen, auf denen die EWU aufbaut, als durchaus vernünftig an.
dbank.en 13b     Although it is too early to tell, in his opinion, whether or not EMU will succeed, its design does make sense.
dbank.de 14a     Dies hätte zweifellos negative Auswirkungen auf Spaniens Haushaltsposition, doch wären diese sehr viel geringer als im Falle Italiens.
dbank.en 14b     While a collapse of EMU would undoubtedly have a negative impact on Spain's budget position, the effect would be considerably smaller than in the case of Italy.

He described the output as follows:

What seemed to be evident was that the word "doch" had a modifying effect on the sentence. In the sentences, "doch" seems to refer to words like "indeed" and "did." In other examples, "doch" has many uses: One of which is to add a positive nature to a sentence. In trying to find a trend for its use in German, there was also evidence that it had a positive modifying effect on a sentence. However, this was not the only use for the word. It soon became clear that "doch" is used in a variety of subtle ways to shape a phrase or sentence. Some good examples of the versatility of "doch" can be seen when it is used at the beginning of a sentence. In some of these examples, "doch" translates into "but" in English.
The above example again shows that working with corpora involves not only detailed reading in the target language but also text analysis, simply through going over the data and looking for patterns while analysing the sentences carefully.

Task 6

Strictly speaking, this was not a set task but a search initiated by the learner himself. It shows that the learner adopted a very interesting behaviour pattern, which might be ascribed to the fact that he has a linguistic background. After searching for several German particles, the student spotted the word however several times in the English translations and became curious about the German correspondence. As a result, he carried out a search for however to investigate what translation the system would come up with:

The purpose of this exercise was to test the corpora when the search words found by the system varied. Here, the English word "however" was entered and the search found different German translations. When this happened, it was decided to try and cross-reference the German word in each case. The corpora produced many other examples for each word but the idea here was to test whether the same reference could be found in the corresponding German search. In this way the student can find what he/she wants to know by using a search in either language. This is useful if the student is weak in either language and needs to find a particular answer.

dbank.en 15a   Formal participation in the exchange rate mechanism is, however, a binding condition of the Treaty.
dbank.de 15b   Die formale Teilnahme am Wechselkursverbund ist jedoch im Vertrag zwingend vorgeschrieben.
dbank.en 16a   However, a lasting improvement will probably only occur when Switzerland's economic outlook brightens.
dbank.de 16b   Allerdings sollte eine nachhaltige Stärkung wohl erst einsetzen, wenn auch die konjunkturellen Perspektiven der Schweiz sich verbessern.
dbank.de 17a   Die formale Teilnahme am Wechselkursverbund ist jedoch im Vertrag zwingend vorgeschrieben.
dbank.en 17b   Formal participation in the exchange rate mechanism is, however, a binding condition of the Treaty.

The student came to a constructive conclusion, and the fact that he carried out a cross-reference shows his interest in research and exploring. Here it would be most interesting to see whether a more typical language learner, that is, one without a linguistic background, would behave in the same way. It also demonstrates that concordances allow students to generate and collate the language data needed to work out their own rules of grammar and to develop the most appropriate ways of learning for themselves. This example clearly shows that the learner assumed control of his learning process. Once the student had seen how to use the program, he could, to a certain extent, set his own agenda for its use, as illustrated above with however and the cross-reference research.

Grammatical and Lexical Tasks

Task 7

Another trouble spot for English learners of German is the distinction between aber and sondern, both of which translate into English as "but." For this reason, the student was asked to find a possible semantic and/or syntactic distinction between the two. The concordance below helped him to grasp the difference almost by himself. There were 576 hits for aber, whereas sondern showed only 178 entries. Sondern has a specific use (it only occurs after a preceding negative clause), which can explain the smaller number of entries (see Tables 5 and 6 in Appendix A). Aber is also used as a co-ordinating conjunction and has two different uses as a modal particle (Helbig, 1994, pp. 80-81). This can be another reason why on the whole there are more hits for aber. However, the frequency of a particle obviously also depends on the nature of the text.
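The rule stated above for sondern, that it follows a negative element, is the kind of hypothesis a learner can test against concordance lines. The Python sketch below is only an illustration: the function name `negation_precedes` is hypothetical, the list of negation words is a rough assumption rather than a complete description of German negation, and the test sentences are taken from the concordance data quoted in this section.

```python
import re

# Rough, assumed list of German negation words; not exhaustive.
NEGATION = re.compile(r"\b(?:nicht|nie|niemals|kein\w*)\b", re.IGNORECASE)


def negation_precedes(sentence, conjunction):
    """Check whether a negation word occurs before `conjunction`.

    Returns None if the conjunction does not occur in the sentence,
    otherwise True or False.
    """
    match = re.search(rf"\b{re.escape(conjunction)}\b", sentence, re.IGNORECASE)
    if match is None:
        return None
    return NEGATION.search(sentence[:match.start()]) is not None


# Sentences (or fragments of them) from the student's concordance output.
examples = [
    ("Schuldenstand: Rückläufig, aber immer noch hoch", "aber"),
    ("Aber kann man da sicher sein?", "aber"),
    ("Der Europäische Rat gibt seiner ernsten Besorgnis Ausdruck über die "
     "anhaltende Gewalt im Gebiet der Großen Seen, von der nicht nur "
     "Ost-Zaire, sondern auch Burundi betroffen ist.", "sondern"),
    # Fragment of the dbank sentence in which aber follows nicht.
    ("nicht kurzfristig, möglicherweise aber auf längere Sicht", "aber"),
]

for sentence, conjunction in examples:
    print(conjunction, negation_precedes(sentence, conjunction))
```

A check like this only flags candidates: the last fragment shows aber after nicht, so any apparent exception still has to be read in context.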
The subject came up with the following data and conclusion:

dbank.de 18a   Schuldenstand: Rückläufig, aber immer noch hoch
dbank.en 18b   Public sector debt: falling, but still high
dbank.de 19a   Aber kann man da sicher sein?
dbank.en 19b   But can we be sure here?

"Aber" translates into English as "but" when the sentence uses "but" as a straight-forward conjunction linking two main clauses of the sentence. The examples show the use of "aber" and the search produced many more examples. English also uses the word "but" when the German uses the word "sondern." "Sondern" is used in a different way to "aber" although it still translates as "but." "Sondern" is used when the sentence has a negative preceding the word.

euro.de 20a   Deshalb strebten die Gründerväter der Europäischen Gemeinschaften nach einer gemeinsamen Energiepolitik: nicht etwa als Selbstzweck, sondern als Motor für die politische Integration.
euro.en 20b   Appreciating its importance, the founding fathers of the European Community desired an energy policy not only for itself but also as a motor for political integration.
euro.de 21a   Der Europäische Rat gibt seiner ernsten Besorgnis Ausdruck über die anhaltende Gewalt im Gebiet der Großen Seen, von der nicht nur Ost-Zaire, sondern auch Burundi betroffen ist.
euro.en 21b   The European Council expresses grave concern about the continuing violence in the Great Lakes Region, not only in Eastern Zaire but also in Burundi.

As with "aber," there were many examples in the files which could also have been shown here. The system did throw up what looks like an exception.
However, this could be correct in the context of this particular sentence and without the time to explore more of the files, it will have to remain an exception in this project:

dbank.de 22a   Um die -- nicht kurzfristig, möglicherweise aber auf längere Sicht -- bestehenden Risiken von Sanktionen im Rahmen des Stabilitätspakts zu minimieren…
dbank.en 22b   In order to minimize the risks of sanctions in the framework of the stability pact -- not so much on a short-term horizon but possibly over the longer term…

The student observed quite rightly that aber and not sondern occurred after nicht. In a classroom situation, students typically react negatively to the introduction of an exception to a rule but, by taking over control of his own learning, the student even analyses the exception he found. From the teacher's point of view, too, it is a better outcome to let students search for exceptions rather than merely presenting exceptions to them. The reaction of the students will be more positive, and learners should in turn be motivated if they can find such things for themselves, though it could be argued that, for this particular question, a monolingual context would be sufficient. The student, however, bearing in mind his level, always found it helpful to have the translation available. He also mentioned in his feedback that he learned new words by reading both the German and the English translation.

With regard to distinctions made in a target language that do not exist in the learner's language, Barlow comments on a distinction in Spanish that does not exist in English:

By studying the context of instances of English for that correspond to Spanish por, compared with those that correspond to para, it is possible to form hypotheses about which of the meanings of for match up with por and which with para. (1996b, p. 
54)

As can be seen above, the strategy Barlow describes is exactly what the learner practised: he managed to work out the distinction for himself.

Task 8

The student had to find out a meaning for denn, a word many beginner students tend to equate with then, especially when it occurs at the beginning of a sentence. Denn occurred 75 times in all files together (see Table 7 in Appendix A). It has seven different uses as a modal particle and is also used as a causal conjunction and an adverb (Helbig, 1994, pp. 105-110). The learner's comments regarding his data were that denn at the beginning of the sentence translates as for. However, he also discovered that it occurs within commas with the words es sei. He concluded quite rightly that it then always translates as unless (see Appendix C).

In the searches here "denn" at the beginning of the sentence translates as "for," however "denn" occurs within commas with the words: "es sei." This translates on all occasions as "unless."

It is very interesting that the student discovered that denn collocates with es sei, that is, es sei denn. This demonstrates that concordancing makes hidden structures visible, and enhances the imagination.

Task 9

The last question the student chose was more challenging and complex in my judgement. He was asked to find possible meanings for man. With the word man, German has a very useful all-purpose impersonal pronoun, as the 242 hits in all files reflect (see Table 8 in Appendix A). The student produced the following data:

herzgog.de 23a: Aus der eigenen Geschichte lernt man immer noch am besten.
herzgog.en 23b: One's own history teaches one the best lesson.
herzgog.de 24a: Man sah weg, als jüdischen Ärzten und Rechtsanwälten die Zulassung entzogen wurde;
herzgog.en 24b: One looked away when Jewish doctors and lawyers lost their licences;
newsjan.de 25a: Ausserdem koenne man den Menschen nicht anlasten, dass sie keine Arbeit faenden.
newsjan.en 25b: She added that no one could be held liable for not being able to find a job.

The subject wrote afterwards:

"Man" generally translated into English as the pronoun "one." It appeared in different places in the sentence including the beginning. The examples demonstrate this very clearly. The examples also show that "man" does not always have an apparent translation but when the sentence is read as a whole, it would appear that "man" is being used to refer to the general idea, i.e. "it" or the situation in general. Furthermore "man" tends to refer to "people," "nobody," "we" (the people). There were many examples in the corpora like these when "man" was used to refer to someone or something.

His comments show once more how concerned he was about word order. They further indicate that he again looked for a translation but in the end accepted that there is not always a translation for one particular word. This analysis provides an illustration of how the common content of parallel corpora can be exploited to gain linguistic insights into the structure and function of languages.

However, it must also be stressed again that only one student used the corpora and concordancer on a self-access basis. Multiconcord was installed on a computer in the Self-Access Centre where the student could use it as and when he wanted during open hours. Given that there was no tutor observation during the project period, even of the data that the learner ultimately produced, it is remarkable to see that a beginner student of German can actually discover and learn on his own. In answer to my initial question, all his answers can be regarded as fully satisfactory and appropriate with regard to the language learning process.
For most questions, the student's conclusions were the only correct answers. However, since his main subject of study was linguistics, the student may have had a natural interest in exploring the data in more detail, so any generalisations drawn from this study need confirmation. The next step would be to include students from other subjects like engineering or science and to see whether they come up with the same or similar conclusions before expanding the experiment to a whole beginners' class.

Student Observation and Feedback

The student was interviewed after the pilot study; there was no student/teacher interaction during the project time. The learner found the concordancer very user-friendly and he did not use any tools other than the corpora and the concordancer. He later said that he ignored sentences that were too difficult because of long and complicated word order, that is, he selected the sentences he wanted to use for the data, which in itself is a very important "help yourself" learning strategy. Indeed, the data used consisted only of sentences without complex structures. This obviously means that the student's analysis is incomplete because, in order to reach reliable conclusions, all data should be considered and analysed. However, it was certainly a step forward in the learning process of a beginner student as it enabled him to draw certain conclusions about the language based on short and simple sentences. It was interesting to see that the student used the corpus in two ways: to answer the set questions and to look up things that were not directly related to the questions, for example his search for however. He spent on average 2 hours on each question but he noticed that he became more efficient after each question.
His explanation for this was that he became more confident in the course of time, that he knew what to do, and also that he became more used to the system. He also knew what to look for because he became more selective by choosing shorter sentences.

The learner adopted the following procedure: After selecting a question, he first tested how many hits it produced. If there were enough hits but not too many to cope with, he assembled the concordanced evidence in both languages. He then tried to find prominent features and classified them into up to four categories. The student then saved the sentences and/or printed them out. He tried to discover a pattern in the language and, by generalising, found the rules that governed those patterns (see Johns, 1991a, p. 4). The student's work became more exploratory and thus motivating and highly experimental.

In addressing the theme of this study, that is, whether corpora and concordancer are appropriate tools at beginners' level, it can be said that the student not only found the meaning of the search words (i.e., learned new vocabulary), but also had the satisfying feeling of having achieved something.

CONCLUSION

In this paper, I have shown the use of parallel corpora and concordance software, in particular its usefulness in the very early stages of language acquisition for teacher and learner alike. Learners often pose questions and arrive at answers that teachers cannot predict. A corpus and a concordancer can supplement the teaching. As Johns put it, "we simply provide the evidence needed to answer the learner's questions, and rely on the learner's intelligence to find answers" (1991a, p. 2). In view of the degree of proficiency in German this student had, it was the correct decision to concentrate mainly on lexical questions. These were indeed neither easy nor straightforward. This pilot study indicates that, when the translation is available, even beginner students can make use of concordancing.
German was in most cases (except in the search for however) the search language and English was used to help understand the German. In this pilot study, the selected student might be regarded as a rather untypical learner and therefore further research must involve more typical language learners to find out whether low-level language students can generally cope with corpus work. Nevertheless, when carrying out a study on a bigger scale, the two groups of typical and untypical learners have to be clearly distinguished. It was important, however, first to carry out a pilot study of this kind with one student to avoid any possible failures, which could have led to demotivation among the students.

This experiment must be seen as a pilot study for designing more carefully prepared, objective, large-scale experiments. For that reason, I would like to address the following issues:

Firstly, the data and subsequently the answers obtained here are relevant and appropriate for this particular pilot study. The data represents language that has been used in authentic and naturally occurring communicative situations.

Secondly, the conclusions cannot be generalised because of the nature of the student and also because the student did not consider all data. The choice of student has also affected the outcome, and a study on a bigger scale will be needed to show how far the findings hold more generally.

Finally, this study supports Zanettin's (1994, p. 108) claim that the interactive concordancer is a potential learning resource, which can be used freely and on their own initiative by all students from beginner to advanced in a self-access centre. The role of the teacher/language adviser is to suggest points at which the interactive concordancer may help to solve learning difficulties or, with instructional sheets, to explain the background for the activity and to give operational directions.
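
The search procedure followed in this pilot study (count the hits for a search word, then read each hit next to its aligned translation) can be approximated in a few lines of Python. The sketch below is an illustration only and is not a description of how Multiconcord works. The names PAIRS, concordance, and count_equivalents are hypothetical, the sentence pairs are copied from examples quoted in this paper, and the texts are assumed to be already aligned sentence by sentence.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: (German, English) sentence pairs, assumed to be
# already aligned. The pairs are taken from examples quoted in this paper.
PAIRS = [
    ("Aber kann man da sicher sein?", "But can we be sure here?"),
    ("Schuldenstand: Rückläufig, aber immer noch hoch",
     "Public sector debt: falling, but still high"),
    ("Daran wird die EWU aber wohl nicht scheitern.",
     "But EMU is unlikely to fail because of this."),
    ("Aus der eigenen Geschichte lernt man immer noch am besten.",
     "One's own history teaches one the best lesson."),
]


def concordance(pairs, word):
    """Return the aligned pairs whose German sentence contains `word`.

    The match is case-insensitive and on whole words only, so a search
    for "man" does not match inside "kann".
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [(de, en) for de, en in pairs if pattern.search(de)]


def count_equivalents(hits, candidates):
    """Count the hits whose English side contains each candidate word."""
    counts = Counter()
    for _, en in hits:
        for cand in candidates:
            if re.search(r"\b" + re.escape(cand) + r"\b", en, re.IGNORECASE):
                counts[cand] += 1
    return counts


if __name__ == "__main__":
    hits = concordance(PAIRS, "aber")
    print(f"{len(hits)} hits for 'aber'")
    for de, en in hits:
        print("DE:", de)
        print("EN:", en)
    print(count_equivalents(hits, ["but", "however"]))
```

On a full aligned corpus, the hit count from a search like this is what tells the learner whether there are enough examples to work with, or too many to read through without narrowing the search.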
The use of a parallel corpus and concordancing in the early stages of a German learning programme can add to grammar teaching and certainly make the work with new vocabulary more interesting and rewarding. As already stated, the study should preferably be repeated with a larger number of students and with other types of students before conclusions are drawn as to whether a non-advanced learner of German can actually benefit from using the concordancer and a parallel corpus. I, however, strongly believe that corpora and concordancing are of great potential value in the very early stages of a language learning programme and I am positive that further studies will reinforce my claim.

APPENDIX A

Table 1. WORD: WOHL
FILE      HITS
all       57
dbank     41

Table 2. WORD: ALSO
FILE      HITS
all       74

Table 3. WORD: EBEN
FILE      HITS
all       15
herzgog   11

Table 4. WORD: DOCH
FILE      HITS
all       170

Table 5. WORD: ABER
FILE      HITS
all       576

Table 6. WORD: SONDERN
FILE      HITS
all       178

Table 7. WORD: DENN
FILE      HITS
all       75

Table 8. WORD: MAN
FILE      HITS
all       242

APPENDIX B

Wohl erst
dbank.de 1a: Der Großteil des privaten Bankgeschäfts wird aber wohl erst umgestellt, wenn auch die Euro-Banknoten und Münzen eingeführt werden.
dbank.en 1b: The bulk of retail banking business will, however, probably not make the switch until euro notes and coins are introduced.

Wohl aber / aber wohl
dbank.de 2a: Daran wird die EWU aber wohl nicht scheitern.
dbank.en 2b: But EMU is unlikely to fail because of this.
dbank.de 3a: 1996 nicht signifikant, wohl aber in Relation zum IEP.
dbank.en 3b: this does not apply to the IEP.

Wohl auch
dbank.de 4a: Beide Staaten werden wohl auch hohe D-Mark-Anteile in der Reservehaltung aufweisen. Dies wird vor allem für Österreich vermutet.
dbank.en 4b: Both countries, particularly Austria, probably hold a large proportion of their reserves in DEM.

Wohl auch nicht
dbank.de 5a: Die Stärke eines Finanzplatzes hängt allerdings nicht nur von der Marktgröße ab, also von der Höhe der Staatsverschuldung eines Landes, sie sollte es wohl auch nicht.
dbank.en 5b: However, a financial centre's strength and attractiveness does not (and should not!) solely depend on the amount of government paper available, i.e. on the size of the public debt.

Werden wohl
dbank.de 6a: Der Anteil an den offiziellen Devisenreserven der Welt wird wohl über das Niveau der jetzigen Währungen des Wechselkursmechanismus, das bei etwa achtzehn Prozent liegt, hinaus anwachsen.
dbank.en 6b: Its share in world foreign exchange reserves may well rise to a level above the combined 18 per cent of the major ERM currencies today.

Wohl nicht
dbank.de 7a: Zweifel an der Erfüllung des Maastrichter Zinskriteriums bestehen wohl nicht mehr; die weit weniger als in den EWS-Kernländern vorangeschrittene Zinskonvergenz könnte vielmehr in absehbarer Zukunft eine treibende Kraft der irischen Kapitalmarktbewegungen bleiben.
dbank.en 7b: Doubts about Ireland meeting the Maastricht interest rate criterion appear to have vanished: interest rate convergence, which has not progressed in Ireland nearly as far as in the EMS core countries, could well remain a driving force in the Irish capital market in the foreseeable future.

APPENDIX C

herzgog.de 8a: Denn die Zukunft gestaltete sich anders, als es die meisten am 8. Mai 1945 erwarteten, auch anders, als es dem soeben zitierten Dichterwort eigentlich entsprochen hätte.
herzgog.en 8b: For the future turned out differently from most people's expectations on 8 May 1945 and from the image conveyed by the prayer I have just quoted.
un.de 9a: Denn es mag für unseren Planeten, der nunmehr aus anderen Gründen nach wie vor in Gefahr schwebt, nicht noch eine dritte Chance geben.
un.en 9b: For there may not be a third opportunity for our planet which, now for different reasons, remains endangered.
dbank.de 10a: ob das Verhältnis des geplanten oder tatsächlichen öffentlichen Defizits zum Bruttoinlandsprodukt einen bestimmten Referenzwert überschreitet, es sei denn, daß entweder das Verhältnis erheblich zurückgegangen ist.
dbank.en 10b: whether the ratio of the planned or actual government deficit to gross domestic product exceeds a specified reference value, unless either the ratio has declined substantially.
dbank.de 11a: Die schwedische Regierung dürfte 1997 mit einem „sanften Nein" gegen die EWU stimmen, es sei denn, es gelingt ihr, die schwedischen Wähler umzustimmen.
dbank.en 11b: The Swedish government is likely to opt for a "soft no" to EMU in 1997, unless it is able to reverse public opposition to the single currency.

ABOUT THE AUTHOR

Elke St.John is German Co-ordinator at the Modern Languages Teaching Centre at the University of Sheffield in the United Kingdom. Her research interests include corpus-based translation studies, corpus-based learning, and legal translation.

E-mail: E.StJohn@sheffield.ac.uk

REFERENCES

Aston, G. (1997a). Enriching the learning environment: Corpora in ELT. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 51-64). New York: Longman.
Aston, G. (1997b). Small and large corpora in language learning. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora (pp. 51-62). Lodz, Poland: Lodz University Press.
Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology (pp. 233-250). Philadelphia: John Benjamins.
Baker, M. (1995). Corpora in translation studies: An overview and some suggestions for future research. Target, 7(2), 223-243.
Baker, M., Francis, G., & Tognini-Bonelli, E. (Eds.). (1993). Text and technology. Philadelphia: John Benjamins.
Barlow, M. (1995a). A guide to ParaConc.
Houston, TX: Athelstan.
Barlow, M. (1995b). A concordancer for parallel texts. Computers and Texts, 10, 14-16.
Barlow, M. (1996a). Corpora for theory and practice. International Journal of Corpus Linguistics, 1(1), 1-37.
Barlow, M. (1996b). Parallel texts in language teaching. In S. Botley, J. Glass, T. McEnery, & A. Wilson (Eds.), Proceedings of teaching and language corpora 1996 (Technical Papers Volume 9; pp. 45-56). Lancaster, UK: UCREL.
Buyse, K. (1997). The study of multi- and unilingual corpora as a tool for the development of translation studies: A case study. Unpublished doctoral dissertation, Katholieke Universiteit Leuven, Belgium.
Danielsson, P., & Ridings, D. (1996). Corpus and terminology: Software for the translation program at Göteborgs Universitet or getting students to do the work. In S. Botley, J. Glass, T. McEnery, & A. Wilson (Eds.), Proceedings of teaching and language corpora 1996 (Technical Papers Volume 9; pp. 57-67). Lancaster, UK: UCREL.
Dickens, A., & Salkie, R. (1996). Comparing bilingual dictionaries with a parallel corpus. In M. Gellerstam, J. Järborg, S. G. Malgren, K. Norén, L. Rogström, & C. Röjder Papmehl (Eds.), EUROLEX '96 proceedings I-II (pp. 551-559). Göteborg, Sweden: Göteborg University Department of Swedish.
Doherty, M. (1982). Epistemische Ausdrucksmittel im Deutschen und Englischen [Epistemic means of expressions in German and English]. Fremdsprachen, 26, 92-97.
Dodd, B. (1997). Exploiting a corpus of written German for advanced language learning. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 131-145). New York: Longman.
Fernández-Villanueva, M. (1996). Research into the functions of German modal particles in a corpus. In S. Botley, J. Glass, T. McEnery, & A. Wilson (Eds.), Proceedings of teaching and language corpora 1996 (Technical Papers Volume 9; pp. 83-93).
Lancaster, UK: UCREL.
Flowerdew, J. (1993). Concordancing as a tool in course design. System, 21(2), 231-244.
Flowerdew, J. (1996). Concordancing in language learning. In M. Pennington (Ed.), The power of CALL (pp. 97-113). Houston, TX: Athelstan.
Francis, G. (1993). A corpus driven approach to grammar: Principles, methods and examples. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology (pp. 137-156). Amsterdam/Philadelphia: Benjamins.
Helbig, G. (1994). Lexikon deutscher Partikeln [Encyclopedia of German particles]. München, Germany: Langenscheidt.
Johansson, S. (1995). Mens sana in corpore sano: On the role of corpora in linguistic research. The European English Messenger, 4(2), 19-25.
Johns, T. (1986). Micro-concord: A language learner's research tool. System, 4(2), 151-162.
Johns, T. (1991a). Should you be persuaded: Two examples of data driven. ELR Journal, 4, 1-16, University of Birmingham.
Johns, T. (1991b). From printout to handout: Grammar and vocabulary learning in the context of data-driven learning. ELR Journal, 4, 27-45.
King, P., & Woolls, D. (1996). Creating and using a multilingual parallel concordancer. Translation and Meaning, 4, 459-466.
König, E. (1982). Scalar particles in German and their English equivalents. In W. F. W. Lohnes & E. A. Hopkins (Eds.), The contrastive grammar of English and German (pp. 76-101). Ann Arbor, MI: Karoma Publishers.
Leech, G. (1991). The state of the art in corpus linguistics. In K. Aijmer & B. Altenberg (Eds.), English corpus linguistics: Studies in hon
Lewandowska-Tomaszczyk, B., & Melia, P. J. (Eds.). (1997). Practical applications in language corpora. Lodz, Poland: Lodz University Press.
McEnery, T., Wilson, A., & Baker, P. (1997). Teaching grammar again after twenty years: Corpus-based help for teaching grammar. ReCALL, 9(2), 8-16.
Mindt, D. (1997). Corpora and the teaching of English in Germany. In A. Wichmann, S.
Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 40-50). New York: Longman.
Minugh, D. (1997). All the language that's fit to print: Using British and American newspaper CD-ROMs as corpora. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 67-82). New York: Longman.
Murphy, B. (1996). Computer, corpora and vocabulary study. Language Learning Journal, 14, 53-57.
Pascoe, G., & Pascoe, H. (1985). Sprachfallen im Englischen. Wörterbuch der falschen Freunde [Difficulties in English. Dictionary of false friends]. München, Germany: Hueber.
Piotrowska, M. (1997). Criteria for selecting parallel texts in teaching a translation course. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora (pp. 411-420). Lodz, Poland: Lodz University Press.
Salkie, R. (1995, May). INTERSECT: A parallel corpus project at Brighton University. Computers & Texts, 9, 4-5.
Salkie, R. (1996). Modality in English and French: A corpus-based approach. Language Sciences, 18(1-2), 381-392.
Schmied, J. (1994). Translation and cognitive structures. Hermes, Journal of Linguistics, 13, 169-181.
Stevens, V. (1991a). Classroom concordancing: Vocabulary materials derived from relevant, authentic text. English for Specific Purposes Journal, 10, 35-46.
Stevens, V. (1991b). Concordance-based vocabulary exercises: A viable alternative to gap-filling. ELR Journal, 4, 47-61.
Stevens, V. (1995). Concordancing with language learners: Why? When? What? CAELL Journal, 6(2), 2-10.
St.John, E., & Chattle, M. (1998). Multiconcord: The Lingua Multilingual Parallel Concordancer for Windows. ReCALL Newsletter, 13, 7-9.
Tognini-Bonelli, E. (1996). Towards translation equivalence from a corpus linguistics perspective. International Journal of Lexicography, 9(3), 197-217.
Tribble, C. (1990). Concordancing in an EAP writing program. CAELL Journal, 1(2), 10-15.
Tribble, C. (1997).
Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for language teaching. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora (pp. 106-117). Lodz, Poland: Lodz University Press.
Ulrych, M. (1997). The impact of multilingual parallel concordancing on translation. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora (pp. 421-435). Lodz, Poland: Lodz University Press.
Wichmann, A. (1995). Using concordances for the teaching of modern languages in higher education. Language Learning Journal, 11, 61-63.
Wichmann, A., Fligelstone, S., McEnery, T., & Knowles, G. (Eds.). (1997). Teaching and language corpora. New York: Longman.
Zanettin, F. (1994). Parallel words: Designing a bilingual database for translation activities. In A. Wilson & T. McEnery (Eds.), Corpora in language education and research: A selection of papers from Talc94 (Technical Papers Volume 4; pp. 99-111). Lancaster, UK: UCREL.

Language Learning & Technology http://llt.msu.edu/call_for_papers.html September 2001, Vol. 5, Num. 3 p. 204

Call for Papers for Special Issue of LLT

Theme: Distance Learning
Guest Editor: Margo Glew

This special issue of Language Learning and Technology will focus on all aspects relating to distance teaching and learning of languages and how both processes are best facilitated in distance education courses. Articles must report on original empirical research in this area, or address issues in the theory and practice of implementing distance education language courses.
Suggested topics include, but are not limited to

• the educational context for distance learning of languages
• pedagogically effective practices for distance education
• crucial elements of effective distance language courses
• issues of student assessment and program evaluation in distance education
• new technologies in distance language learning

Please note that all articles published in LLT, including in this special issue, should either report on original research or present an original framework that links previous research, educational theory, and teaching practices.

Please send an e-mail of intent with a 250-word abstract by January 31, 2001 to Margo Glew (glewmarg@msu.edu).

Language Learning & Technology is published exclusively on the World Wide Web. You may see current or back issues, and take out your free subscription, at http://llt.msu.edu.

Copyright © 2001, ISSN 1094-3501